Netdev List


* Re: [RFC] virtio-net: share receive_*() and add_recvbuf_*() with virtio-vsock
From: Michael S. Tsirkin @ 2019-07-16 10:01 UTC
  To: Stefano Garzarella; +Cc: Jason Wang, Stefan Hajnoczi, virtualization, netdev
In-Reply-To: <20190716094024.ob43g5lxga5uwb7z@steredhat>

On Tue, Jul 16, 2019 at 11:40:24AM +0200, Stefano Garzarella wrote:
> On Mon, Jul 15, 2019 at 01:50:28PM -0400, Michael S. Tsirkin wrote:
> > On Mon, Jul 15, 2019 at 09:44:16AM +0200, Stefano Garzarella wrote:
> > > On Fri, Jul 12, 2019 at 06:14:39PM +0800, Jason Wang wrote:
> 
> [...]
> 
> > > > 
> > > > 
> > > > I think it's just a branch, for ethernet, go for networking stack. otherwise
> > > > go for vsock core?
> > > > 
> > > 
> > > Yes, that should work.
> > > 
> > > So, I should refactor the functions that can be called also from the vsock
> > > core, in order to remove "struct net_device *dev" parameter.
> > > Maybe creating some wrappers for the network stack.
> > > 
> > > Otherwise I should create a fake net_device for vsock_core.
> > > 
> > > What do you suggest?
> > 
> > Neither.
> > 
> > I think what Jason was saying all along is this:
> > 
> > virtio net doesn't actually lose packets, at least most
> > of the time. And it actually most of the time
> > passes all packets to host. So it's possible to use a virtio net
> > device (possibly with a feature flag that says "does not lose packets,
> > all packets go to host") and build vsock on top.
> 
> Yes, I got it after Jason's latest reply.
> 
> > 
> > and all of this is nice, but don't expect anything easy,
> > or any quick results.
> 
> I expected this... :-(
> 
> > 
> > Also, in a sense it's a missed opportunity: we could cut out a lot
> > of fat and see just how fast can a protocol that is completely
> > new and separate from networking stack go.
> 
> In this case, if we try to do a PoC, which do you think is better?
>     1. new AF_VSOCK + network-stack + virtio-net modified
>         Maybe it allows us to reuse a lot of stuff already written,
>         but we will go through the network stack
> 
>     2. new AF_VSOCK + glue + virtio-net modified
>         Intermediate approach, similar to Jason's proposal
> 
>     3. new AF_VSOCK + new virtio-vsock
>         Can be the thinnest, but we have to rewrite many things, with the risk
>         of making the same mistakes as the current implementation.
> 

1 or 3 imho. I wouldn't expect a lot from 2.  I slightly favor 3 and
Jason 1. So take your pick :)

> > Instead vsock implementation carries so much baggage from both
> > networking stack - such as softirq processing - and itself such as
> > workqueues, global state and crude locking - to the point where
> > it's actually slower than TCP.
> 
> I agree, and I'm finding new issues while I'm trying to support nested
> VMs, allowing multiple vsock transports (virtio-vsock and vhost-vsock in
> the KVM case) at runtime.
> 
> > 
> 
> [...]
> 
> > > > 
> > > > I suggest to do this step by step:
> > > > 
> > > > 1) use virtio-net but keep some protocol logic
> > > > 
> > > > 2) separate protocol logic and merge it into the existing Linux networking stack
> > > 
> > > Makes sense, thanks for the suggestions, I'll try to do these steps!
> > > 
> > > Thanks,
> > > Stefano
> > 
> > 
> > An alternative is to look at sources of overhead in vsock and get rid of
> > them, or rewrite it from scratch focusing on performance.
> 
> I started looking at virtio-vsock and vhost-vsock trying to do very
> simple changes [1] to increase the performance. I should send a v4 of that
> series in the very short term, then I'd like to have a deeper look to understand
> whether it is better to optimize it or rewrite it from scratch.
> 
> 
> Thanks,
> Stefano
> 
> [1] https://patchwork.kernel.org/cover/10970145/


* Re: [RFC] virtio-net: share receive_*() and add_recvbuf_*() with virtio-vsock
From: Stefano Garzarella @ 2019-07-16  9:40 UTC
  To: Michael S. Tsirkin; +Cc: Jason Wang, Stefan Hajnoczi, virtualization, netdev
In-Reply-To: <20190715134115-mutt-send-email-mst@kernel.org>

On Mon, Jul 15, 2019 at 01:50:28PM -0400, Michael S. Tsirkin wrote:
> On Mon, Jul 15, 2019 at 09:44:16AM +0200, Stefano Garzarella wrote:
> > On Fri, Jul 12, 2019 at 06:14:39PM +0800, Jason Wang wrote:

[...]

> > > 
> > > 
> > > I think it's just a branch, for ethernet, go for networking stack. otherwise
> > > go for vsock core?
> > > 
> > 
> > Yes, that should work.
> > 
> > So, I should refactor the functions that can be called also from the vsock
> > core, in order to remove "struct net_device *dev" parameter.
> > Maybe creating some wrappers for the network stack.
> > 
> > Otherwise I should create a fake net_device for vsock_core.
> > 
> > What do you suggest?
> 
> Neither.
> 
> I think what Jason was saying all along is this:
> 
> virtio net doesn't actually lose packets, at least most
> of the time. And it actually most of the time
> passes all packets to host. So it's possible to use a virtio net
> device (possibly with a feature flag that says "does not lose packets,
> all packets go to host") and build vsock on top.

Yes, I got it after Jason's latest reply.

> 
> and all of this is nice, but don't expect anything easy,
> or any quick results.

I expected this... :-(

> 
> Also, in a sense it's a missed opportunity: we could cut out a lot
> of fat and see just how fast can a protocol that is completely
> new and separate from networking stack go.

In this case, if we try to do a PoC, which do you think is better?
    1. new AF_VSOCK + network-stack + virtio-net modified
        Maybe it allows us to reuse a lot of stuff already written,
        but we will go through the network stack

    2. new AF_VSOCK + glue + virtio-net modified
        Intermediate approach, similar to Jason's proposal

    3. new AF_VSOCK + new virtio-vsock
        Can be the thinnest, but we have to rewrite many things, with the risk
        of making the same mistakes as the current implementation.


> Instead vsock implementation carries so much baggage from both
> networking stack - such as softirq processing - and itself such as
> workqueues, global state and crude locking - to the point where
> it's actually slower than TCP.

I agree, and I'm finding new issues while I'm trying to support nested
VMs, allowing multiple vsock transports (virtio-vsock and vhost-vsock in
the KVM case) at runtime.

> 

[...]

> > > 
> > > I suggest to do this step by step:
> > > 
> > > 1) use virtio-net but keep some protocol logic
> > > 
> > > 2) separate protocol logic and merge it into the existing Linux networking stack
> > 
> > Makes sense, thanks for the suggestions, I'll try to do these steps!
> > 
> > Thanks,
> > Stefano
> 
> 
> An alternative is to look at sources of overhead in vsock and get rid of
> them, or rewrite it from scratch focusing on performance.

I started looking at virtio-vsock and vhost-vsock trying to do very
simple changes [1] to increase the performance. I should send a v4 of that
series in the very short term, then I'd like to have a deeper look to understand
whether it is better to optimize it or rewrite it from scratch.


Thanks,
Stefano

[1] https://patchwork.kernel.org/cover/10970145/



* Re: [RFC bpf-next 0/8] bpf: accelerate insn patching speed
From: Jiong Wang @ 2019-07-16  8:50 UTC
  To: Andrii Nakryiko
  Cc: Jiong Wang, Alexei Starovoitov, Daniel Borkmann, Edward Cree,
	Naveen N. Rao, Andrii Nakryiko, Jakub Kicinski, bpf, Networking,
	oss-drivers, Yonghong Song
In-Reply-To: <CAEf4BzYDAVUgajz4=dRTu5xQDddp5pi2s=T1BdFmRLZjOwGypQ@mail.gmail.com>


Andrii Nakryiko writes:

> On Mon, Jul 15, 2019 at 2:21 AM Jiong Wang <jiong.wang@netronome.com> wrote:
>>
>>
>> Andrii Nakryiko writes:
>>
>> > On Thu, Jul 11, 2019 at 4:22 AM Jiong Wang <jiong.wang@netronome.com> wrote:
>> >>
>> >>
>> >> Andrii Nakryiko writes:
>> >>
>> >> > On Thu, Jul 4, 2019 at 2:31 PM Jiong Wang <jiong.wang@netronome.com> wrote:
>> >> >>
>> >> >> This is an RFC based on latest bpf-next about accelerating insn patching
>> >> >> speed. It is now near the shape of the final PATCH set, and we can see the
>> >> >> changes that migrating to list patching would bring, so I am sending it out
>> >> >> for comments. Most of the info is in the cover letter. I split the code in a
>> >> >> way that shows the API migration more easily.
>> >> >
>> >> >
>> >> > Hey Jiong,
>> >> >
>> >> >
>> >> > Sorry, took me a while to get to this and learn more about instruction
>> >> > patching. Overall this looks good and I think is a good direction.
>> >> > I'll post high-level feedback here, and some more
>> >> > implementation-specific ones in corresponding patches.
>> >>
>> >> Great, thanks very much for the feedback. Most of it hits exactly
>> >> the pain points I had run into. For some of them, I
>> >> thought of solutions similar to yours, but failed due to various
>> >> reasons. Let's go through them again, I could have missed some important
>> >> things.
>> >>
>> >> Please see my replies below.
>> >
>> > Thanks for thoughtful reply :)
>> >
>> >>
>> >> >>
>> >> >> Test Results
>> >> >> ===
>> >> >>   - Full pass on test_verifier/test_prog/test_prog_32 under all three
>> >> >>     modes (interpreter, JIT, JIT with blinding).
>> >> >>
>> >> >>   - Benchmarking shows 10 ~ 15x faster on medium sized prog, and reduces
>> >> >>     patching time from 5100s (nearly one and a half hours) to less than
>> >> >>     0.5s for 1M insn patching.
>> >> >>
>> >> >> Known Issues
>> >> >> ===
>> >> >>   - The following warning is triggered when running scale test which
>> >> >>     contains 1M insns and patching:
>> >> >>       warning of mm/page_alloc.c:4639 __alloc_pages_nodemask+0x29e/0x330
>> >> >>
>> >> >>     This is caused by existing code; it can be reproduced on bpf-next
>> >> >>     master with jit blinding enabled by running the scale unit test, and it
>> >> >>     will show up after half an hour. After this set, patching is very fast, so
>> >> >>     it shows up quickly.
>> >> >>
>> >> >>   - No line info adjustment support when doing insn delete, and subprog
>> >> >>     adjustment is also buggy when doing insn delete. Generally, removal of
>> >> >>     insns could cause removal of an entire line or subprog, therefore
>> >> >>     entries of prog->aux->linfo or env->subprog need to be deleted. I
>> >> >>     don't have a good idea or clean code for integrating this into the
>> >> >>     linearization code at the moment, will do more experimenting,
>> >> >>     appreciate ideas and suggestions on this.
>> >> >
>> >> > Is there any specific problem to detect which line info to delete? Or
>> >> > what am I missing besides careful implementation?
>> >>
>> >> Mostly line info and subprog info are range info which covers a range of
>> >> insns. Deleting insns could require adjusting a range or removing a
>> >> range entirely. Subprog info could be fully recalculated during
>> >> linearization, while line info needs some careful implementation, and I
>> >> failed to get clean code for this during linearization. Also, as said, there
>> >> are no unit tests to help me understand whether the code is correct or not.
>> >>
>> >
>> > Ok, that's good that it's just about clean implementation. Try to
>> > implement it as clearly as possible. Then post it here, and if it can
>> > be improved someone (me?) will try to help to clean it up further.
>> >
>> > Not a big expert on line info, so can't comment on that,
>> > unfortunately. Maybe Yonghong can chime in (cc'ed)
>> >
>> >
>> >> I will describe this later, I spent too much time writing the following
>> >> reply. It might be worth a separate discussion thread.
>> >>
>> >> >>
>> >> >>     Insn delete doesn't happen on normal programs, for example Cilium
>> >> >>     benchmarks, and happens rarely on test_progs, so the test coverage is
>> >> >>     not good. That's also why this RFC has a full pass on selftest with
>> >> >>     this known issue.
>> >> >
>> >> > I hope you'll add test for deletion (and w/ corresponding line info)
>> >> > in final patch set :)
>> >>
>> >> Will try. Need to spend some time on BTF format.
>> >> >
>> >> >>
>> >> >>   - Could further use mem pool to accelerate the speed, changes are trivial
>> >> >>     on top of this RFC, and could be 2x extra faster. Not included in this
>> >> >>     RFC as reducing the algo complexity from quadratic to linear of insn
>> >> >>     number is the first step.
>> >> >
>> >> > Honestly, I think that would add more complexity than necessary, and I
>> >> > think we can further speed up performance without that, see below.
>> >> >
>> >> >>
>> >> >> Background
>> >> >> ===
>> >> >> This RFC aims to accelerate BPF insn patching speed, patching means expand
>> >> >> one bpf insn at any offset inside bpf prog into a set of new insns, or
>> >> >> remove insns.
>> >> >>
>> >> >> At the moment, insn patching is quadratic in the number of insns. This is
>> >> >> because branch targets of jump insns need to be adjusted, and the algo used is:
>> >> >>
>> >> >>   for insn inside prog
>> >> >>     patch insn + regenerate bpf prog
>> >> >>     for insn inside new prog
>> >> >>       adjust jump target
>> >> >>
>> >> >> This takes significant time when a bpf prog requires a large
>> >> >> amount of patching on different insns. Benchmarking shows it could take
>> >> >> more than half a minute to finish patching when the patching number is more
>> >> >> than 50K, and the time spent could be more than one hour when the patching
>> >> >> number is around 1M.
>> >> >>
>> >> >>   15000   :    3s
>> >> >>   45000   :   29s
>> >> >>   95000   :  125s
>> >> >>   195000  :  712s
>> >> >>   1000000 : 5100s
>> >> >>
>> >> >> This RFC introduces new patching infrastructure. Before doing insn
>> >> >> patching, insns in the bpf prog are turned into a singly linked list; inserting
>> >> >> new insns just inserts new list nodes, and deleting insns just sets a delete flag.
>> >> >> Finally, the list is linearized back into an array, and branch target
>> >> >> adjustment is done for all jump insns during linearization. This algo
>> >> >> brings the time complexity from quadratic to linear in the number of insns.
>> >> >>
>> >> >> Benchmarking shows the new patching infrastructure could be 10 ~ 15x faster
>> >> >> on medium sized prog, and for a 1M patching it reduces the time from 5100s
>> >> >> to less than 0.5s.
>> >> >>
>> >> >> Patching API
>> >> >> ===
>> >> >> Insn patching could happen on two layers inside BPF. One is "core layer"
>> >> >> where only BPF insns are patched. The other is "verification layer" where
>> >> >> insns have corresponding aux info as well as high level subprog info, so
>> >> >> insn patching means aux info needs to be patched as well, and subprog info
>> >> >> needs to be adjusted. BPF prog also has debug info associated, so line info
>> >> >> should always be updated after insn patching.
>> >> >>
>> >> >> So, list creation, destroy, insert, delete are the same for both layers,
>> >> >> but linearization is different. "verification layer" patching requires extra
>> >> >> work. Therefore the patch APIs are:
>> >> >>
>> >> >>    list creation:                bpf_create_list_insn
>> >> >>    list patch:                   bpf_patch_list_insn
>> >> >>    list pre-patch:               bpf_prepatch_list_insn
>> >> >
>> >> > I think pre-patch name is very confusing, until I read full
>> >> > description I couldn't understand what it's supposed to be used for.
>> >> > Speaking of bpf_patch_list_insn, patch is also generic enough to leave
>> >> > me wondering whether instruction buffer is inserted after instruction,
>> >> > or instruction is replaced with a bunch of instructions.
>> >> >
>> >> > So how about two more specific names:
>> >> > bpf_patch_list_insn -> bpf_list_insn_replace (meaning replace given
>> >> > instruction with a list of patch instructions)
>> >> > bpf_prepatch_list_insn -> bpf_list_insn_prepend (well, I think this
>> >> > one is pretty clear).
>> >>
>> >> My sense for English words is not great, will switch to the above, which indeed
>> >> reads more clearly.
>> >>
>> >> >>    list linearization (core layer): prog = bpf_linearize_list_insn(prog, list)
>> >> >>    list linearization (veri layer): env = verifier_linearize_list_insn(env, list)
>> >> >
>> >> > These two functions are both quite involved, as well as share a lot of
>> >> > common code. I'd rather have one linearize instruction, that takes env
>> >> > as an optional parameter. If env is specified (which is the case for
>> >> > all cases except for constant blinding pass), then adjust aux_data and
>> >> > subprogs along the way.
>> >>
>> >> Having two versions of linearization, and how to unify them, was a pain point
>> >> for me. I thought about factoring out some of the common code, but it actually
>> >> doesn't amount to much: the final size counting + insnsi resize parts are the
>> >> same, then things start to diverge from the "Copy over insn" loop.
>> >>
>> >> verifier layer needs to copy and initialize aux data etc. And jump
>> >> relocation is different. At core layer, the use case is JIT blinding which
>> >> could expand an jump_imm insn into a and/or/jump_reg sequence, and the
>> >
>> > Sorry, I didn't get what "could expand an jump_imm insn into a
>> > and/or/jump_reg sequence", maybe you can clarify if I'm missing
>> > something.
>> >
>> > But from your cover letter description, core layer has no jumps at
>> > all, while verifier has jumps inside patch buffer. So, if you support
>> > jumps inside of patch buffer, it will automatically work for core
>> > layer. Or what am I missing?
>>
>> I meant in core layer (JIT blinding), there is the following patching:
>>
>> input:
>>   insn 0             insn 0
>>   insn 1             insn 1
>>   jmp_imm   >>       mov_imm  \
>>   insn 2             xor_imm    insn seq expanded from jmp_imm
>>   insn 3             jmp_reg  /
>>                      insn 2
>>                      insn 3
>>
>>
>> jmp_imm is the insn that will be patched, and the actual transformation
>> is to expand it into mov_imm/xor_imm/jmp_reg sequence. "jmp_reg", sitting
>> at the end of the patch buffer, must jump to the same destination as the
>> original jmp_imm, so "jmp_reg" is an insn inside patch buffer but should
>> be relocated, and the jump destination is outside of patch buffer.
>
>
> Ok, great, thanks for explaining, yeah it's definitely something that
> we should be able to support. BUT. It got me thinking a bit more and I
> think I have a simpler and more elegant solution now, again, supporting
> both core-layer and verifier-layer operations.
>
> struct bpf_patchable_insn {
>    struct bpf_patchable_insn *next;
>    struct bpf_insn insn;
>    int orig_idx; /* original non-patched index */
>    int new_idx;  /* new index, will be filled only during linearization */
> };
>
> struct bpf_patcher {
>     /* dummy head node of a chain of patchable instructions */
>     struct bpf_patchable_insn insn_head;
>     /* dynamic array of size(original instruction count)
>      * this is a map from original instruction index to a first
>      * patchable instruction that replaced that instruction (or
>      * just original instruction as bpf_patchable_insn).
>      */
>     struct bpf_patchable_insn **orig_idx_to_patchable_insn;
>     int cnt;
> };
>
> Few points, but it should be pretty clear just from comments and definitions:
> 1. When you create bpf_patcher, you create the patchable_insn list and fill the
> orig_idx_to_patchable_insn map to store proper pointers. This array is
> NEVER changed after that.
> 2. When replacing an instruction, you re-use struct bpf_patchable_insn
> for the first patched instruction, then append after that (not prepend to the
> next instruction, so as not to disrupt the orig_idx -> patchable_insn mapping).
> 3. During linearization, you first traverse the chain of instructions
> and trivially assign new_idxs.
> 4. No need for patchable_insn->target anymore. All jumps use relative
> instruction offsets, right?

Yes, all jumps are pc-relative.

> So when you need to determine new
> instruction index during linearization, you just do (after you
> calculated new instruction indices):
>
> func adjust_jmp(struct bpf_patcher* patcher, struct bpf_patchable_insn *insn) {
>    int old_jmp_idx = insn->orig_idx + jmp_offset_of(insn->insn);
>    int new_jmp_idx = patcher->orig_idx_to_patchable_insn[old_jmp_idx]->new_idx;
>    adjust_jmp_offset(insn->insn, new_jmp_idx - insn->new_idx);
> }

Hmm, this algo is kind of the same as this RFC, only we have organized "new_index"
as "idx_map". And in this RFC, only the new_idx of original insns matters;
no space is allocated for patched insns. (As mentioned, JIT blinding
requires the last insn inside the patch buffer to be relocated to the original jump
offset, so there was a little special handling in the relocation loop in
the core layer linearization code.)
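
To make the scheme discussed above concrete, here is a minimal user-space
sketch, not kernel code. All names (toy_insn, list_insn, patcher, patcher_*)
are made up for illustration. It assumes every jump, including one at the end
of a patch buffer, carries an offset relative to the original insn it came
from (the JIT blinding case), and that every jump target is a valid original
index. Jumps that stay inside a patch buffer and insn deletion are left out.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified insn: just enough to show jump fix-up. */
struct toy_insn {
	int is_jmp; /* non-zero for a pc-relative jump */
	int off;    /* jump target index = own index + 1 + off */
};

struct list_insn {
	struct list_insn *next;
	struct toy_insn insn;
	int orig_idx; /* original insn this node is, or was expanded from */
	int new_idx;  /* filled in during linearization */
};

struct patcher {
	struct list_insn *head;
	struct list_insn **by_orig; /* original index -> first node, fixed */
	int orig_cnt;
};

/* Turn the insn array into a singly linked list plus the lookup map. */
static int patcher_init(struct patcher *p, const struct toy_insn *insns, int cnt)
{
	struct list_insn **link = &p->head;

	p->head = NULL;
	p->orig_cnt = cnt;
	p->by_orig = calloc(cnt, sizeof(*p->by_orig));
	if (!p->by_orig)
		return -1;
	for (int i = 0; i < cnt; i++) {
		struct list_insn *n = calloc(1, sizeof(*n));
		if (!n)
			return -1; /* caller cleans up with patcher_free() */
		n->insn = insns[i];
		n->orig_idx = i;
		p->by_orig[i] = n;
		*link = n;
		link = &n->next;
	}
	return 0;
}

/* Replace original insn idx by buf[0..len): reuse its node for buf[0] and
 * append the rest right after it, so by_orig[] never has to change. */
static int patcher_replace(struct patcher *p, int idx,
			   const struct toy_insn *buf, int len)
{
	struct list_insn *tail = p->by_orig[idx];

	tail->insn = buf[0];
	for (int i = 1; i < len; i++) {
		struct list_insn *n = calloc(1, sizeof(*n));
		if (!n)
			return -1;
		n->insn = buf[i];
		n->orig_idx = idx;
		n->next = tail->next;
		tail->next = n;
		tail = n;
	}
	return 0;
}

/* Pass 1 assigns new indices; pass 2 copies insns and re-targets jumps. */
static struct toy_insn *patcher_linearize(struct patcher *p, int *out_cnt)
{
	struct toy_insn *out;
	struct list_insn *n;
	int cnt = 0;

	for (n = p->head; n; n = n->next)
		n->new_idx = cnt++;
	out = calloc(cnt, sizeof(*out));
	if (!out)
		return NULL;
	for (n = p->head; n; n = n->next) {
		out[n->new_idx] = n->insn;
		if (n->insn.is_jmp) {
			int old_tgt = n->orig_idx + 1 + n->insn.off;
			assert(old_tgt >= 0 && old_tgt < p->orig_cnt);
			out[n->new_idx].off =
				p->by_orig[old_tgt]->new_idx - (n->new_idx + 1);
		}
	}
	*out_cnt = cnt;
	return out;
}

static void patcher_free(struct patcher *p)
{
	for (struct list_insn *n = p->head, *nx; n; n = nx) {
		nx = n->next;
		free(n);
	}
	free(p->by_orig);
}
```

In this sketch, a jump that crosses a three-insn expansion of a single insn
has its offset grow by two, while a relocated jump at the end of the patch
buffer still lands on the target of the insn it replaced. Both fall out of the
same lookup through by_orig[], which is the point of keeping that map fixed.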

> The idea is that we want to support quick look-up by original
> instruction index. That's what orig_idx_to_patchable_insn provides. On
> the other hand, no existing instruction is ever referencing newly
> patched instruction by its new offset, so with careful implementation,
> you can transparently support all the cases, regardless if it's in
> core layer or verifier layer (so, e.g., verifier layer patched
> instructions now will be able to jump out of patched buffer, if
> necessary, neat, right?).
>
> It is cleaner than everything we've discussed so far. Unless I missed
> something critical (it's all quite convoluted, so I might have
> forgotten some parts already). Let me know what you think.

Let me digest a little bit and do some coding, then I will come back. Some
issues only show up during in-depth coding. I kind of feel handling
aux references in the verifier layer is the part that will still introduce some
unclean code.

<snip>
>> If there is no dead insn elimination opt, then we could just adjust
>> offsets. When there is insn deletion, I feel the logic becomes more
>> complex. One subprog could be completely deleted or partially deleted, so
>> I feel just recalculating the whole subprog info as a side-product is
>> much simpler.
>
> What's the situation where entirety of subprog can be deleted?

Suppose you have a conditional jmp_imm, where the true path calls one subprog and
the false path calls the other. If the insn walker later finds the condition is
always true, then the subprog on the false path won't be marked as "seen", so it
is entirely deleted.

I actually think that, in theory, one subprog could be deleted entirely,
so if we support insn deletion inside the verifier, then range info like
line_info/subprog_info needs to handle the case where one range is deleted.
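
The range handling can be sketched the same way. This is only an illustration
under assumptions: struct insn_range and remap_ranges() are hypothetical,
ranges are taken to be sorted, non-overlapping [start, end) spans of original
insn indices, and deleted[] marks removed insns. It is not the kernel's
line_info or subprog_info layout.

```c
#include <assert.h>

/* Hypothetical range record covering original insns [start, end). */
struct insn_range {
	int start;
	int end;
};

/*
 * Rewrite ranges in place to post-deletion indices. deleted[i] != 0 means
 * original insn i was removed. A range with no surviving insn is dropped.
 * Returns the number of ranges kept.
 */
static int remap_ranges(struct insn_range *r, int nr,
			const unsigned char *deleted, int insn_cnt)
{
	int kept = 0, new_idx = 0, i = 0;

	for (int k = 0; k < nr; k++) {
		int live = 0, new_start;

		assert(r[k].start >= i && r[k].end <= insn_cnt);
		/* insns between ranges still shift the indices that follow */
		for (; i < r[k].start; i++)
			new_idx += !deleted[i];
		new_start = new_idx;
		for (; i < r[k].end; i++)
			live += !deleted[i];
		new_idx += live;
		if (!live)
			continue; /* whole range deleted */
		r[kept].start = new_start;
		r[kept].end = new_start + live;
		kept++;
	}
	return kept;
}
```

A partially deleted range only shrinks, while a fully deleted one disappears
from the array, which is the case the subprog example above runs into.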

Thanks.
Regards,
Jiong


* Re: memory leak in new_inode_pseudo (2)
From: syzbot @ 2019-07-16  8:28 UTC
  To: davem, linux-kernel, netdev, syzkaller-bugs
In-Reply-To: <000000000000111cbe058dc7754d@google.com>

syzbot has found a reproducer for the following crash on:

HEAD commit:    be8454af Merge tag 'drm-next-2019-07-16' of git://anongit...
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=13d5f750600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=d23a1a7bf85c5250
dashboard link: https://syzkaller.appspot.com/bug?extid=e682cca30bc101a4d9d9
compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=155c5800600000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=1738f800600000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+e682cca30bc101a4d9d9@syzkaller.appspotmail.com

executing program
executing program
executing program
executing program
BUG: memory leak
unreferenced object 0xffff888128ea0980 (size 768):
   comm "syz-executor303", pid 7044, jiffies 4294943526 (age 13.490s)
   hex dump (first 32 bytes):
     01 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00  ................
     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
   backtrace:
     [<000000005ba542b8>] kmemleak_alloc_recursive include/linux/kmemleak.h:43 [inline]
     [<000000005ba542b8>] slab_post_alloc_hook mm/slab.h:522 [inline]
     [<000000005ba542b8>] slab_alloc mm/slab.c:3319 [inline]
     [<000000005ba542b8>] kmem_cache_alloc+0x13f/0x2c0 mm/slab.c:3483
     [<000000006532a1e9>] sock_alloc_inode+0x1c/0xa0 net/socket.c:238
     [<0000000014ddc967>] alloc_inode+0x2c/0xe0 fs/inode.c:227
     [<0000000056541455>] new_inode_pseudo+0x18/0x70 fs/inode.c:916
     [<000000003b5b5444>] sock_alloc+0x1c/0x90 net/socket.c:554
     [<00000000e623b353>] __sock_create+0x8f/0x250 net/socket.c:1378
     [<000000000e094708>] sock_create_kern+0x3b/0x50 net/socket.c:1483
     [<000000009fe4f64f>] smc_create+0xae/0x160 net/smc/af_smc.c:1975
     [<0000000056be84a7>] __sock_create+0x164/0x250 net/socket.c:1414
     [<000000005915e5fe>] sock_create net/socket.c:1465 [inline]
     [<000000005915e5fe>] __sys_socket+0x69/0x110 net/socket.c:1507
     [<00000000afa837b2>] __do_sys_socket net/socket.c:1516 [inline]
     [<00000000afa837b2>] __se_sys_socket net/socket.c:1514 [inline]
     [<00000000afa837b2>] __x64_sys_socket+0x1e/0x30 net/socket.c:1514
     [<00000000d0addad1>] do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:296
     [<000000004e8e7c22>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

BUG: memory leak
unreferenced object 0xffff88811faeeab8 (size 56):
   comm "syz-executor303", pid 7044, jiffies 4294943526 (age 13.490s)
   hex dump (first 32 bytes):
     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     00 0a ea 28 81 88 ff ff d0 ea ae 1f 81 88 ff ff  ...(............
   backtrace:
     [<000000005ba542b8>] kmemleak_alloc_recursive include/linux/kmemleak.h:43 [inline]
     [<000000005ba542b8>] slab_post_alloc_hook mm/slab.h:522 [inline]
     [<000000005ba542b8>] slab_alloc mm/slab.c:3319 [inline]
     [<000000005ba542b8>] kmem_cache_alloc+0x13f/0x2c0 mm/slab.c:3483
     [<000000008ca63096>] kmem_cache_zalloc include/linux/slab.h:738 [inline]
     [<000000008ca63096>] lsm_inode_alloc security/security.c:522 [inline]
     [<000000008ca63096>] security_inode_alloc+0x33/0xb0 security/security.c:875
     [<00000000b335d930>] inode_init_always+0x108/0x200 fs/inode.c:169
     [<0000000015dcffb3>] alloc_inode+0x49/0xe0 fs/inode.c:234
     [<0000000056541455>] new_inode_pseudo+0x18/0x70 fs/inode.c:916
     [<000000003b5b5444>] sock_alloc+0x1c/0x90 net/socket.c:554
     [<00000000e623b353>] __sock_create+0x8f/0x250 net/socket.c:1378
     [<000000000e094708>] sock_create_kern+0x3b/0x50 net/socket.c:1483
     [<000000009fe4f64f>] smc_create+0xae/0x160 net/smc/af_smc.c:1975
     [<0000000056be84a7>] __sock_create+0x164/0x250 net/socket.c:1414
     [<000000005915e5fe>] sock_create net/socket.c:1465 [inline]
     [<000000005915e5fe>] __sys_socket+0x69/0x110 net/socket.c:1507
     [<00000000afa837b2>] __do_sys_socket net/socket.c:1516 [inline]
     [<00000000afa837b2>] __se_sys_socket net/socket.c:1514 [inline]
     [<00000000afa837b2>] __x64_sys_socket+0x1e/0x30 net/socket.c:1514
     [<00000000d0addad1>] do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:296
     [<000000004e8e7c22>] entry_SYSCALL_64_after_hwframe+0x44/0xa9




* Re: [PATCH iproute2-rc 8/8] rdma: Document counter statistic
From: Gal Pressman @ 2019-07-16  8:19 UTC
  To: Leon Romanovsky, Stephen Hemminger
  Cc: Leon Romanovsky, netdev, David Ahern, Mark Zhang,
	RDMA mailing list
In-Reply-To: <20190710072455.9125-9-leon@kernel.org>

On 10/07/2019 10:24, Leon Romanovsky wrote:
> +.SH "EXAMPLES"
> +.PP
> +rdma statistic show
> +.RS 4
> +Shows the state of the default counter of all RDMA devices on the system.
> +.RE
> +.PP
> +rdma statistic show link mlx5_2/1
> +.RS 4
> +Shows the state of the default counter of specified RDMA port
> +.RE
> +.PP
> +rdma statistic qp show
> +.RS 4
> +Shows the state of all qp counters of all RDMA devices on the system.
> +.RE
> +.PP
> +rdma statistic qp show link mlx5_2/1
> +.RS 4
> +Shows the state of all qp counters of specified RDMA port.
> +.RE
> +.PP
> +rdma statistic qp show link mlx5_2 pid 30489
> +.RS 4
> +Shows the state of all qp counters of specified RDMA port and belonging to pid 30489
> +.RE
> +.PP
> +rdma statistic qp mode
> +.RS 4
> +List current counter mode on all deivces

"deivces" -> "devices".


* [PATCH net] be2net: Signal that the device cannot transmit during reconfiguration
From: Benjamin Poirier @ 2019-07-16  8:16 UTC
  To: David Miller
  Cc: Sathya Perla, Ajit Khaparde, Sriharsha Basavapatna, Somnath Kotur,
	Firo Yang, Saeed Mahameed, netdev

While changing the number of interrupt channels, be2net stops adapter
operation (including netif_tx_disable()) but it doesn't signal that it
cannot transmit. This may lead dev_watchdog() to falsely trigger during
that time.

Add the missing call to netif_carrier_off(), following the pattern used in
many other drivers. netif_carrier_on() is already taken care of in
be_open().

Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
---
 drivers/net/ethernet/emulex/benet/be_main.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/emulex/benet/be_main.c b/drivers/net/ethernet/emulex/benet/be_main.c
index 82015c8a5ed7..b7a246b33599 100644
--- a/drivers/net/ethernet/emulex/benet/be_main.c
+++ b/drivers/net/ethernet/emulex/benet/be_main.c
@@ -4697,8 +4697,12 @@ int be_update_queues(struct be_adapter *adapter)
 	struct net_device *netdev = adapter->netdev;
 	int status;
 
-	if (netif_running(netdev))
+	if (netif_running(netdev)) {
+		/* device cannot transmit now, avoid dev_watchdog timeouts */
+		netif_carrier_off(netdev);
+
 		be_close(netdev);
+	}
 
 	be_cancel_worker(adapter);
 
-- 
2.22.0



* Re: [oss-drivers] Re: [RFC bpf-next 2/8] bpf: extend list based insn patching infra to verification layer
From: Jiong Wang @ 2019-07-16  8:12 UTC
  To: Andrii Nakryiko
  Cc: Jiong Wang, Andrii Nakryiko, Alexei Starovoitov, Daniel Borkmann,
	Edward Cree, Naveen N. Rao, Jakub Kicinski, bpf, Networking,
	oss-drivers
In-Reply-To: <CAEf4BzYzSuwVL9W+LRbGJXcv8AszxLJ0EBTH-FXxTzcW6CCU7Q@mail.gmail.com>


Andrii Nakryiko writes:

> On Mon, Jul 15, 2019 at 3:02 AM Jiong Wang <jiong.wang@netronome.com> wrote:
>>
>>
>> Andrii Nakryiko writes:
>>
>> > On Thu, Jul 11, 2019 at 5:20 AM Jiong Wang <jiong.wang@netronome.com> wrote:
>> >>
>> >>
>> >> Jiong Wang writes:
>> >>
>> >> > Andrii Nakryiko writes:
>> >> >
>> >> >> On Thu, Jul 4, 2019 at 2:32 PM Jiong Wang <jiong.wang@netronome.com> wrote:
>> >> >>>
>> >> >>> Verification layer also needs to handle auxiliary info as well as adjusting
>> >> >>> subprog start.
>> >> >>>
>> >> >>> At this layer, insns inside patch buffer could be jump, but they should
>> >> >>> have been resolved, meaning they shouldn't jump to insn outside of the
>> >> >>> patch buffer. Linearization function for this layer won't touch insns inside
>> >> >>> patch buffer.
>> >> >>>
>> >> >>> Adjusting subprog is finished along with adjusting jump target when the
>> >> >>> input will cover bpf to bpf call insn, re-register subprog start is cheap.
>> >> >>> But adjustment when there is insn deletion is not considered yet.
>> >> >>>
>> >> >>> Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
>> >> >>> ---
>> >> >>>  kernel/bpf/verifier.c | 150 ++++++++++++++++++++++++++++++++++++++++++++++++++
>> >> >>>  1 file changed, 150 insertions(+)
>> >> >>>
>> >> >>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> >> >>> index a2e7637..2026d64 100644
>> >> >>> --- a/kernel/bpf/verifier.c
>> >> >>> +++ b/kernel/bpf/verifier.c
>> >> >>> @@ -8350,6 +8350,156 @@ static void opt_hard_wire_dead_code_branches(struct bpf_verifier_env *env)
>> >> >>>         }
>> >> >>>  }
>> >> >>>
>> >> >>> +/* Linearize bpf list insn to array (verifier layer). */
>> >> >>> +static struct bpf_verifier_env *
>> >> >>> +verifier_linearize_list_insn(struct bpf_verifier_env *env,
>> >> >>> +                            struct bpf_list_insn *list)
>> >> >>
>> >> >> It's unclear why this returns env back? It's not allocating a new env,
>> >> >> so it's weird and unnecessary. Just return error code.
>> >> >
>> >> > The reason is I was thinking we have two layers in BPF, the core and the
>> >> > verifier.
>> >> >
>> >> > For core layer (the relevant file is core.c), when doing patching, the
>> >> > input is insn list and bpf_prog, the linearization should linearize the
>> >> > insn list into insn array, and also whatever others affect inside bpf_prog
>> >> > due to changing on insns, for example line info inside prog->aux. So the
>> >> > return value is bpf_prog for core layer linearization hook.
>> >> >
>> >> > For verifier layer, it is similar, but the context is bpf_verifier_env, the
>> >> > linearization hook should linearize the insn list, and also those affected
>> >> > inside env, for example bpf_insn_aux_data, so the return value is
>> >> > bpf_verifier_env, meaning returning an updated verifier context
>> >> > (bpf_verifier_env) after insn list linearization.
>> >>
>> >> Realized your point is no new env is allocated, so just return error
>> >> code. Yes, the env pointer is not changed, just internal data is
>> >> updated. Return bpf_verifier_env mostly is trying to make the hook more
>> >> clear that it returns an updated "context" where the linearization happens,
>> >> for verifier layer, it is bpf_verifier_env, and for core layer, it is
>> >> bpf_prog, so return value was designed to return these two types.
>> >
>> > Oh, I missed that core layer returns bpf_prog*. I think this is
>> > confusing as hell and is very contrary to what one would expect. If
>> > the function doesn't allocate those objects, it shouldn't return them,
>> > except for rare cases of some accessor functions. Me reading this,
>> > I'll always be surprised and will have to go skim code just to check
>> > whether those functions really return new bpf_prog or
>> > bpf_verifier_env, respectively.
>>
>> bpf_prog_realloc does return new bpf_prog, so we will need to return bpf_prog
>> * for core layer.
>
> Ah, I see, then it would make sense for core layer, but still is very
> confusing for verifier_linearize_list_insn.
> I still hope for unified solution, so it shouldn't matter. But it
> pointed me to a bug in your code, see below.

Yeah, thanks!

>
>>
>> >
>> > Please change them both to just return error code.
>> >
>> >>
>> >> >
>> >> > Make sense?
>> >> >
>> >> > Regards,
>> >> > Jiong
>> >> >
>> >> >>
>> >> >>> +{
>> >> >>> +       u32 *idx_map, idx, orig_cnt, fini_cnt = 0;
>> >> >>> +       struct bpf_subprog_info *new_subinfo;
>> >> >>> +       struct bpf_insn_aux_data *new_data;
>> >> >>> +       struct bpf_prog *prog = env->prog;
>> >> >>> +       struct bpf_verifier_env *ret_env;
>> >> >>> +       struct bpf_insn *insns, *insn;
>> >> >>> +       struct bpf_list_insn *elem;
>> >> >>> +       int ret;
>> >> >>> +
>> >> >>> +       /* Calculate final size. */
>> >> >>> +       for (elem = list; elem; elem = elem->next)
>> >> >>> +               if (!(elem->flag & LIST_INSN_FLAG_REMOVED))
>> >> >>> +                       fini_cnt++;
>> >> >>> +
>> >> >>> +       orig_cnt = prog->len;
>> >> >>> +       insns = prog->insnsi;
>> >> >>> +       /* If prog length remains same, nothing else to do. */
>> >> >>> +       if (fini_cnt == orig_cnt) {
>> >> >>> +               for (insn = insns, elem = list; elem; elem = elem->next, insn++)
>> >> >>> +                       *insn = elem->insn;
>> >> >>> +               return env;
>> >> >>> +       }
>> >> >>> +       /* Realloc insn buffer when necessary. */
>> >> >>> +       if (fini_cnt > orig_cnt)
>> >> >>> +               prog = bpf_prog_realloc(prog, bpf_prog_size(fini_cnt),
>> >> >>> +                                       GFP_USER);
>> >> >>> +       if (!prog)
>> >> >>> +               return ERR_PTR(-ENOMEM);
>
> On realloc failure, prog will be non-NULL, so you need to handle error
> properly (and propagate it, instead of returning -ENOMEM):
>
> if (IS_ERR(prog))
>     return ERR_PTR(prog);
>
>
>> >> >>> +       insns = prog->insnsi;
>> >> >>> +       prog->len = fini_cnt;
>> >> >>> +       ret_env = env;
>> >> >>> +
>> >> >>> +       /* idx_map[OLD_IDX] = NEW_IDX */
>> >> >>> +       idx_map = kvmalloc(orig_cnt * sizeof(u32), GFP_KERNEL);
>> >> >>> +       if (!idx_map)
>> >> >>> +               return ERR_PTR(-ENOMEM);
>> >> >>> +       memset(idx_map, 0xff, orig_cnt * sizeof(u32));
>> >> >>> +
>> >> >>> +       /* Use the same alloc method used when allocating env->insn_aux_data. */
>> >> >>> +       new_data = vzalloc(array_size(sizeof(*new_data), fini_cnt));
>> >> >>> +       if (!new_data) {
>> >> >>> +               kvfree(idx_map);
>> >> >>> +               return ERR_PTR(-ENOMEM);
>> >> >>> +       }
>> >> >>> +
>> >> >>> +       /* Copy over insn + calculate idx_map. */
>> >> >>> +       for (idx = 0, elem = list; elem; elem = elem->next) {
>> >> >>> +               int orig_idx = elem->orig_idx - 1;
>> >> >>> +
>> >> >>> +               if (orig_idx >= 0) {
>> >> >>> +                       idx_map[orig_idx] = idx;
>> >> >>> +
>> >> >>> +                       if (elem->flag & LIST_INSN_FLAG_REMOVED)
>> >> >>> +                               continue;
>> >> >>> +
>> >> >>> +                       new_data[idx] = env->insn_aux_data[orig_idx];
>> >> >>> +
>> >> >>> +                       if (elem->flag & LIST_INSN_FLAG_PATCHED)
>> >> >>> +                               new_data[idx].zext_dst =
>> >> >>> +                                       insn_has_def32(env, &elem->insn);
>> >> >>> +               } else {
>> >> >>> +                       new_data[idx].seen = true;
>> >> >>> +                       new_data[idx].zext_dst = insn_has_def32(env,
>> >> >>> +                                                               &elem->insn);
>> >> >>> +               }
>> >> >>> +               insns[idx++] = elem->insn;
>> >> >>> +       }
>> >> >>> +
>> >> >>> +       new_subinfo = kvzalloc(sizeof(env->subprog_info), GFP_KERNEL);
>> >> >>> +       if (!new_subinfo) {
>> >> >>> +               kvfree(idx_map);
>> >> >>> +               vfree(new_data);
>> >> >>> +               return ERR_PTR(-ENOMEM);
>> >> >>> +       }
>> >> >>> +       memcpy(new_subinfo, env->subprog_info, sizeof(env->subprog_info));
>> >> >>> +       memset(env->subprog_info, 0, sizeof(env->subprog_info));
>> >> >>> +       env->subprog_cnt = 0;
>> >> >>> +       env->prog = prog;
>> >> >>> +       ret = add_subprog(env, 0);
>> >> >>> +       if (ret < 0) {
>> >> >>> +               ret_env = ERR_PTR(ret);
>> >> >>> +               goto free_all_ret;
>> >> >>> +       }
>> >> >>> +       /* Relocate jumps using idx_map.
>> >> >>> +        *   old_dst = jmp_insn.old_target + old_pc + 1;
>> >> >>> +        *   new_dst = idx_map[old_dst] = jmp_insn.new_target + new_pc + 1;
>> >> >>> +        *   jmp_insn.new_target = new_dst - new_pc - 1;
>> >> >>> +        */
>> >> >>> +       for (idx = 0, elem = list; elem; elem = elem->next) {
>> >> >>> +               int orig_idx = elem->orig_idx;
>> >> >>> +
>> >> >>> +               if (elem->flag & LIST_INSN_FLAG_REMOVED)
>> >> >>> +                       continue;
>> >> >>> +               if ((elem->flag & LIST_INSN_FLAG_PATCHED) || !orig_idx) {
>> >> >>> +                       idx++;
>> >> >>> +                       continue;
>> >> >>> +               }
>> >> >>> +
>> >> >>> +               ret = bpf_jit_adj_imm_off(&insns[idx], orig_idx - 1, idx,
>> >> >>> +                                         idx_map);
>> >> >>> +               if (ret < 0) {
>> >> >>> +                       ret_env = ERR_PTR(ret);
>> >> >>> +                       goto free_all_ret;
>> >> >>> +               }
>> >> >>> +               /* Recalculate subprog start as we are at bpf2bpf call insn. */
>> >> >>> +               if (ret > 0) {
>> >> >>> +                       ret = add_subprog(env, idx + insns[idx].imm + 1);
>> >> >>> +                       if (ret < 0) {
>> >> >>> +                               ret_env = ERR_PTR(ret);
>> >> >>> +                               goto free_all_ret;
>> >> >>> +                       }
>> >> >>> +               }
>> >> >>> +               idx++;
>> >> >>> +       }
>> >> >>> +       if (ret < 0) {
>> >> >>> +               ret_env = ERR_PTR(ret);
>> >> >>> +               goto free_all_ret;
>> >> >>> +       }
>> >> >>> +
>> >> >>> +       env->subprog_info[env->subprog_cnt].start = fini_cnt;
>> >> >>> +       for (idx = 0; idx <= env->subprog_cnt; idx++)
>> >> >>> +               new_subinfo[idx].start = env->subprog_info[idx].start;
>> >> >>> +       memcpy(env->subprog_info, new_subinfo, sizeof(env->subprog_info));
>> >> >>> +
>> >> >>> +       /* Adjust linfo.
>> >> >>> +        * FIXME: no support for insn removal at the moment.
>> >> >>> +        */
>> >> >>> +       if (prog->aux->nr_linfo) {
>> >> >>> +               struct bpf_line_info *linfo = prog->aux->linfo;
>> >> >>> +               u32 nr_linfo = prog->aux->nr_linfo;
>> >> >>> +
>> >> >>> +               for (idx = 0; idx < nr_linfo; idx++)
>> >> >>> +                       linfo[idx].insn_off = idx_map[linfo[idx].insn_off];
>> >> >>> +       }
>> >> >>> +       vfree(env->insn_aux_data);
>> >> >>> +       env->insn_aux_data = new_data;
>> >> >>> +       goto free_mem_list_ret;
>> >> >>> +free_all_ret:
>> >> >>> +       vfree(new_data);
>> >> >>> +free_mem_list_ret:
>> >> >>> +       kvfree(new_subinfo);
>> >> >>> +       kvfree(idx_map);
>> >> >>> +       return ret_env;
>> >> >>> +}
>> >> >>> +
>> >> >>>  static int opt_remove_dead_code(struct bpf_verifier_env *env)
>> >> >>>  {
>> >> >>>         struct bpf_insn_aux_data *aux_data = env->insn_aux_data;
>> >> >>> --
>> >> >>> 2.7.4
>> >> >>>
>> >>
>>
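
For readers following the idx_map discussion: below is a minimal userspace
sketch of the relocation scheme described in the patch comment above
(idx_map[OLD_IDX] = NEW_IDX, new_target = new_dst - new_pc - 1). It is not
the kernel code. struct toy_insn, struct toy_elem, FLAG_REMOVED and
toy_linearize() are hypothetical stand-ins, and the sketch assumes every
original insn is still present in the list (possibly flagged as removed).

```c
#include <assert.h>
#include <stddef.h>

#define FLAG_REMOVED 0x1u  /* hypothetical stand-in for LIST_INSN_FLAG_REMOVED */

/* Toy instruction: only a relative jump offset is modelled. */
struct toy_insn {
	int is_jmp;  /* non-zero if this insn is a jump */
	int off;     /* jump target = pc + off + 1 */
};

/* Toy list element, loosely modelled on struct bpf_list_insn. */
struct toy_elem {
	struct toy_insn insn;
	int orig_idx;           /* 1-based original index, 0 = newly inserted */
	unsigned int flag;
	struct toy_elem *next;
};

/*
 * Flatten the list into out[], fill idx_map[OLD_IDX] = NEW_IDX and fix up
 * jump offsets. idx_map needs orig_cnt + 1 slots; the extra slot maps
 * "one past the end". Returns the new insn count, or -1 on bad input.
 */
static int toy_linearize(const struct toy_elem *list, struct toy_insn *out,
			 int out_cap, int *idx_map, int orig_cnt)
{
	const struct toy_elem *e;
	int idx = 0, new_pc = 0;

	assert(out && idx_map && orig_cnt >= 0);

	/* Pass 1: copy survivors; a removed insn maps to the next survivor. */
	for (e = list; e; e = e->next) {
		if (e->orig_idx > orig_cnt)
			return -1;
		if (e->orig_idx > 0)
			idx_map[e->orig_idx - 1] = idx;
		if (e->flag & FLAG_REMOVED)
			continue;
		if (idx >= out_cap)
			return -1;
		out[idx++] = e->insn;
	}
	idx_map[orig_cnt] = idx;

	/* Pass 2: relocate jumps of original insns; inserted ones are skipped. */
	for (e = list; e; e = e->next) {
		if (e->flag & FLAG_REMOVED)
			continue;
		if (e->orig_idx > 0 && e->insn.is_jmp) {
			int old_pc = e->orig_idx - 1;
			int old_dst = old_pc + e->insn.off + 1;

			if (old_dst < 0 || old_dst > orig_cnt)
				return -1;
			out[new_pc].off = idx_map[old_dst] - new_pc - 1;
		}
		new_pc++;
	}
	return idx;
}
```

Inserted insns (orig_idx == 0) are left alone in pass 2, matching the
statement in the commit message that jumps inside the patch buffer are
already resolved.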


^ permalink raw reply

* Re: [bpf-next RFC 3/6] bpf: add bpf_tcp_gen_syncookie helper
From: Eric Dumazet @ 2019-07-16  7:59 UTC (permalink / raw)
  To: Petar Penkov, netdev, bpf
  Cc: davem, ast, daniel, edumazet, lmb, sdf, Petar Penkov
In-Reply-To: <20190716002650.154729-4-ppenkov.kernel@gmail.com>



On 7/16/19 2:26 AM, Petar Penkov wrote:
> From: Petar Penkov <ppenkov@google.com>
> 
> This helper function allows BPF programs to try to generate SYN
> cookies, given a reference to a listener socket. The function works
> from XDP and with an skb context since bpf_skc_lookup_tcp can look up a
> socket in both cases.
> 
...
>  
> +BPF_CALL_5(bpf_tcp_gen_syncookie, struct sock *, sk, void *, iph, u32, iph_len,
> +	   struct tcphdr *, th, u32, th_len)
> +{
> +#ifdef CONFIG_SYN_COOKIES
> +	u32 cookie;
> +	u16 mss;
> +
> +	if (unlikely(th_len < sizeof(*th)))


You probably need to check that th_len == th->doff * 4

> +		return -EINVAL;
> +
> +	if (sk->sk_protocol != IPPROTO_TCP || sk->sk_state != TCP_LISTEN)
> +		return -EINVAL;
> +
> +	if (!sock_net(sk)->ipv4.sysctl_tcp_syncookies)
> +		return -EINVAL;
> +
> +	if (!th->syn || th->ack || th->fin || th->rst)
> +		return -EINVAL;
> +
> +	switch (sk->sk_family) {

This is strange, because a dual stack listener will have sk->sk_family set to AF_INET6.

What really matters here is if the packet is IPv4 or IPv6.

So you need to look at iph->version instead.

Then look if the socket family allows this packet to be processed
(For example AF_INET6 sockets might prevent IPv4 packets, see sk->sk_ipv6only )

> +	case AF_INET:
> +		if (unlikely(iph_len < sizeof(struct iphdr)))
> +			return -EINVAL;
> +		mss = tcp_v4_get_syncookie(sk, iph, th, &cookie);
> +		break;
> +
> +#if IS_BUILTIN(CONFIG_IPV6)
> +	case AF_INET6:
> +		if (unlikely(iph_len < sizeof(struct ipv6hdr)))
> +			return -EINVAL;
> +		mss = tcp_v6_get_syncookie(sk, iph, th, &cookie);
> +		break;
> +#endif /* CONFIG_IPV6 */
> +
> +	default:
> +		return -EPROTONOSUPPORT;
> +	}
> +	if (mss <= 0)
> +		return -ENOENT;
> +
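
To illustrate the suggestion above (dispatch on the packet's IP version
first, then ask whether the listener may handle it), here is a minimal
userspace sketch. It is only a model under stated assumptions:
classify_ip_header(), listener_accepts() and the PKT_* names are
hypothetical, the header is treated as raw bytes, and the v6only argument
merely stands in for sk->sk_ipv6only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum pkt_family { PKT_INVALID = 0, PKT_IPV4 = 4, PKT_IPV6 = 6 };

#define TOY_IPV4_MIN_HDR 20  /* size of an IPv4 header without options */
#define TOY_IPV6_HDR     40  /* size of the fixed IPv6 header */

/* The version lives in the top nibble of the first byte for both IPv4/IPv6. */
static enum pkt_family classify_ip_header(const uint8_t *hdr, size_t len)
{
	if (!hdr || len < 1)
		return PKT_INVALID;

	switch (hdr[0] >> 4) {
	case 4:
		return len >= TOY_IPV4_MIN_HDR ? PKT_IPV4 : PKT_INVALID;
	case 6:
		return len >= TOY_IPV6_HDR ? PKT_IPV6 : PKT_INVALID;
	default:
		return PKT_INVALID;
	}
}

/*
 * listener_is_v6: the listener's family is AF_INET6 (possibly dual stack).
 * v6only: models sk->sk_ipv6only; a v6-only listener rejects IPv4 packets.
 */
static int listener_accepts(int listener_is_v6, int v6only,
			    enum pkt_family fam)
{
	assert(fam == PKT_INVALID || fam == PKT_IPV4 || fam == PKT_IPV6);

	if (fam == PKT_IPV4)
		return !listener_is_v6 || !v6only;
	if (fam == PKT_IPV6)
		return listener_is_v6;
	return 0;
}
```

With this split, a dual-stack AF_INET6 listener receiving an IPv4 SYN
would take the IPv4 cookie path instead of the IPv6 one.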



^ permalink raw reply

* Re: [RFC PATCH 1/5] x86: tsc: add tsc to art helpers
From: Thomas Gleixner @ 2019-07-16  7:57 UTC (permalink / raw)
  To: Felipe Balbi
  Cc: Richard Cochran, netdev, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, x86, linux-kernel, Christopher S . Hall
In-Reply-To: <20190716072038.8408-2-felipe.balbi@linux.intel.com>

Felipe,

On Tue, 16 Jul 2019, Felipe Balbi wrote:

-ENOCHANGELOG

As you said in the cover letter:

>  (3) The change in arch/x86/kernel/tsc.c needs to be reviewed at length
>      before going in.

So some information what those interfaces are used for and why they are
needed would be really helpful.

> +void get_tsc_ns(struct system_counterval_t *tsc_counterval, u64 *tsc_ns)
> +{
> +	u64 tmp, res, rem;
> +	u64 cycles;
> +
> +	tsc_counterval->cycles = clocksource_tsc.read(NULL);
> +	cycles = tsc_counterval->cycles;
> +	tsc_counterval->cs = art_related_clocksource;
> +
> +	rem = do_div(cycles, tsc_khz);
> +
> +	res = cycles * USEC_PER_SEC;
> +	tmp = rem * USEC_PER_SEC;
> +
> +	do_div(tmp, tsc_khz);
> +	res += tmp;
> +
> +	*tsc_ns = res;
> +}
> +EXPORT_SYMBOL(get_tsc_ns);
> +
> +u64 get_art_ns_now(void)
> +{
> +	struct system_counterval_t tsc_cycles;
> +	u64 tsc_ns;
> +
> +	get_tsc_ns(&tsc_cycles, &tsc_ns);
> +
> +	return tsc_ns;
> +}
> +EXPORT_SYMBOL(get_art_ns_now);

While the changes look innocuous I'm missing the big picture why this needs
to emulate ART instead of simply using TSC directly.
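
For reference, the arithmetic in the quoted get_tsc_ns() can be sketched
in userspace as below. This is an illustration only: cycles_to_ns() is a
hypothetical name, and plain / and % replace the kernel's do_div(). The
point of splitting quotient and remainder is that cycles * 1000000 would
overflow 64 bits after under two hours of uptime at 3 GHz, whereas
rem * 1000000 cannot overflow as long as khz fits in 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert a cycle count to nanoseconds for a counter running at khz kHz.
 * cycles / khz is whole milliseconds; the remainder is scaled separately
 * so that no intermediate product overflows 64 bits.
 */
static uint64_t cycles_to_ns(uint64_t cycles, uint32_t khz)
{
	uint64_t ms, rem;

	assert(khz != 0);

	ms = cycles / khz;   /* whole milliseconds */
	rem = cycles % khz;  /* leftover cycles, always < khz */

	return ms * 1000000ull + (rem * 1000000ull) / khz;
}
```

The result is truncated, so a single cycle at 3 GHz converts to 0 ns.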

Thanks,

	tglx

^ permalink raw reply

* memory leak in new_inode_pseudo (2)
From: syzbot @ 2019-07-16  7:38 UTC (permalink / raw)
  To: davem, linux-kernel, netdev, syzkaller-bugs

Hello,

syzbot found the following crash on:

HEAD commit:    fec88ab0 Merge tag 'for-linus-hmm' of git://git.kernel.org..
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=15a3da1fa00000
kernel config:  https://syzkaller.appspot.com/x/.config?x=8422fa55ce69212c
dashboard link: https://syzkaller.appspot.com/bug?extid=e682cca30bc101a4d9d9
compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=16ca5aa4600000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+e682cca30bc101a4d9d9@syzkaller.appspotmail.com

BUG: memory leak
unreferenced object 0xffff8881223e5980 (size 768):
   comm "syz-executor.0", pid 7093, jiffies 4294950175 (age 8.140s)
   hex dump (first 32 bytes):
     01 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00  ................
     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
   backtrace:
     [<0000000030f6ab07>] kmemleak_alloc_recursive include/linux/kmemleak.h:43 [inline]
     [<0000000030f6ab07>] slab_post_alloc_hook mm/slab.h:522 [inline]
     [<0000000030f6ab07>] slab_alloc mm/slab.c:3319 [inline]
     [<0000000030f6ab07>] kmem_cache_alloc+0x13f/0x2c0 mm/slab.c:3483
     [<0000000005b17a67>] sock_alloc_inode+0x1c/0xa0 net/socket.c:238
     [<00000000cae2a9b4>] alloc_inode+0x2c/0xe0 fs/inode.c:227
     [<000000004d22e56a>] new_inode_pseudo+0x18/0x70 fs/inode.c:916
     [<000000007bb4d82d>] sock_alloc+0x1c/0x90 net/socket.c:554
     [<00000000884dfd41>] __sock_create+0x8f/0x250 net/socket.c:1378
     [<000000009dc85063>] sock_create_kern+0x3b/0x50 net/socket.c:1483
     [<00000000ca0afb1d>] smc_create+0xae/0x160 net/smc/af_smc.c:1975
     [<00000000ff903d89>] __sock_create+0x164/0x250 net/socket.c:1414
     [<00000000c0787cdf>] sock_create net/socket.c:1465 [inline]
     [<00000000c0787cdf>] __sys_socket+0x69/0x110 net/socket.c:1507
     [<0000000067a4ade6>] __do_sys_socket net/socket.c:1516 [inline]
     [<0000000067a4ade6>] __se_sys_socket net/socket.c:1514 [inline]
     [<0000000067a4ade6>] __x64_sys_socket+0x1e/0x30 net/socket.c:1514
     [<000000001e7b04ac>] do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:296
     [<000000003fe40e36>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

BUG: memory leak
unreferenced object 0xffff88811f269f50 (size 56):
   comm "syz-executor.0", pid 7093, jiffies 4294950175 (age 8.140s)
   hex dump (first 32 bytes):
     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     00 5a 3e 22 81 88 ff ff 68 9f 26 1f 81 88 ff ff  .Z>"....h.&.....
   backtrace:
     [<0000000030f6ab07>] kmemleak_alloc_recursive include/linux/kmemleak.h:43 [inline]
     [<0000000030f6ab07>] slab_post_alloc_hook mm/slab.h:522 [inline]
     [<0000000030f6ab07>] slab_alloc mm/slab.c:3319 [inline]
     [<0000000030f6ab07>] kmem_cache_alloc+0x13f/0x2c0 mm/slab.c:3483
     [<000000005d4d6be7>] kmem_cache_zalloc include/linux/slab.h:738 [inline]
     [<000000005d4d6be7>] lsm_inode_alloc security/security.c:522 [inline]
     [<000000005d4d6be7>] security_inode_alloc+0x33/0xb0 security/security.c:875
     [<00000000ef89212c>] inode_init_always+0x108/0x200 fs/inode.c:169
     [<00000000647feaf5>] alloc_inode+0x49/0xe0 fs/inode.c:234
     [<000000004d22e56a>] new_inode_pseudo+0x18/0x70 fs/inode.c:916
     [<000000007bb4d82d>] sock_alloc+0x1c/0x90 net/socket.c:554
     [<00000000884dfd41>] __sock_create+0x8f/0x250 net/socket.c:1378
     [<000000009dc85063>] sock_create_kern+0x3b/0x50 net/socket.c:1483
     [<00000000ca0afb1d>] smc_create+0xae/0x160 net/smc/af_smc.c:1975
     [<00000000ff903d89>] __sock_create+0x164/0x250 net/socket.c:1414
     [<00000000c0787cdf>] sock_create net/socket.c:1465 [inline]
     [<00000000c0787cdf>] __sys_socket+0x69/0x110 net/socket.c:1507
     [<0000000067a4ade6>] __do_sys_socket net/socket.c:1516 [inline]
     [<0000000067a4ade6>] __se_sys_socket net/socket.c:1514 [inline]
     [<0000000067a4ade6>] __x64_sys_socket+0x1e/0x30 net/socket.c:1514
     [<000000001e7b04ac>] do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:296
     [<000000003fe40e36>] entry_SYSCALL_64_after_hwframe+0x44/0xa9



---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
syzbot can test patches for this bug, for details see:
https://goo.gl/tpsmEJ#testing-patches

^ permalink raw reply

* [RFC PATCH 3/5] PTP: implement PTP_EVENT_COUNT_TSTAMP ioctl
From: Felipe Balbi @ 2019-07-16  7:20 UTC (permalink / raw)
  To: Richard Cochran
  Cc: netdev, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, x86, linux-kernel, Christopher S . Hall,
	Felipe Balbi
In-Reply-To: <20190716072038.8408-1-felipe.balbi@linux.intel.com>

With this, we can request the underlying driver to count the number of
events that have been captured.

Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com>
---
 drivers/ptp/ptp_chardev.c      | 15 +++++++++++++++
 include/uapi/linux/ptp_clock.h |  2 ++
 2 files changed, 17 insertions(+)

diff --git a/drivers/ptp/ptp_chardev.c b/drivers/ptp/ptp_chardev.c
index 18ffe449efdf..a3e163a6acdc 100644
--- a/drivers/ptp/ptp_chardev.c
+++ b/drivers/ptp/ptp_chardev.c
@@ -114,6 +114,7 @@ long ptp_ioctl(struct posix_clock *pc, unsigned int cmd, unsigned long arg)
 	struct system_device_crosststamp xtstamp;
 	struct ptp_clock_info *ops = ptp->info;
 	struct ptp_sys_offset *sysoff = NULL;
+	struct ptp_event_count_tstamp counttstamp;
 	struct ptp_system_timestamp sts;
 	struct ptp_clock_request req;
 	struct ptp_clock_caps caps;
@@ -301,6 +302,20 @@ long ptp_ioctl(struct posix_clock *pc, unsigned int cmd, unsigned long arg)
 		mutex_unlock(&ptp->pincfg_mux);
 		break;
 
+	case PTP_EVENT_COUNT_TSTAMP:
+		if (!ops->counttstamp)
+			return -ENOTSUPP;
+		if (copy_from_user(&counttstamp, (void __user *)arg,
+				   sizeof(counttstamp))) {
+			err = -EFAULT;
+			break;
+		}
+		err = ops->counttstamp(ops, &counttstamp);
+		if (!err && copy_to_user((void __user *)arg, &counttstamp,
+						sizeof(counttstamp)))
+			err = -EFAULT;
+		break;
+
 	default:
 		err = -ENOTTY;
 		break;
diff --git a/include/uapi/linux/ptp_clock.h b/include/uapi/linux/ptp_clock.h
index 1bc794ad957a..674db7de64f3 100644
--- a/include/uapi/linux/ptp_clock.h
+++ b/include/uapi/linux/ptp_clock.h
@@ -148,6 +148,8 @@ struct ptp_pin_desc {
 	_IOWR(PTP_CLK_MAGIC, 8, struct ptp_sys_offset_precise)
 #define PTP_SYS_OFFSET_EXTENDED \
 	_IOWR(PTP_CLK_MAGIC, 9, struct ptp_sys_offset_extended)
+#define PTP_EVENT_COUNT_TSTAMP \
+	_IOWR(PTP_CLK_MAGIC, 6, struct ptp_event_count_tstamp)
 
 struct ptp_extts_event {
 	struct ptp_clock_time t; /* Time event occured. */
-- 
2.22.0


^ permalink raw reply related

* [RFC PATCH 4/5] PTP: Add flag for non-periodic output
From: Felipe Balbi @ 2019-07-16  7:20 UTC (permalink / raw)
  To: Richard Cochran
  Cc: netdev, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, x86, linux-kernel, Christopher S . Hall,
	Felipe Balbi
In-Reply-To: <20190716072038.8408-1-felipe.balbi@linux.intel.com>

When this new flag is set, we can use single-shot output.

Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com>
---
 include/uapi/linux/ptp_clock.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/ptp_clock.h b/include/uapi/linux/ptp_clock.h
index 674db7de64f3..439cbdfc3d9b 100644
--- a/include/uapi/linux/ptp_clock.h
+++ b/include/uapi/linux/ptp_clock.h
@@ -67,7 +67,9 @@ struct ptp_perout_request {
 	struct ptp_clock_time start;  /* Absolute start time. */
 	struct ptp_clock_time period; /* Desired period, zero means disable. */
 	unsigned int index;           /* Which channel to configure. */
-	unsigned int flags;           /* Reserved for future use. */
+
+#define PTP_PEROUT_ONE_SHOT BIT(0)
+	unsigned int flags;           /* Bit 0 -> oneshot output. */
 	unsigned int rsv[4];          /* Reserved for future use. */
 };
 
-- 
2.22.0


^ permalink raw reply related

* [RFC PATCH 5/5] PTP: Add support for Intel PMC Timed GPIO Controller
From: Felipe Balbi @ 2019-07-16  7:20 UTC (permalink / raw)
  To: Richard Cochran
  Cc: netdev, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, x86, linux-kernel, Christopher S . Hall,
	Felipe Balbi
In-Reply-To: <20190716072038.8408-1-felipe.balbi@linux.intel.com>

Add a driver supporting Intel Timed GPIO controller available as part
of some Intel PMCs.

Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com>
---
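
Note: intel_pmc_tgpio_counttstamp() below reads a 64-bit value as two
32-bit halves using a high/low/high sequence. A minimal userspace sketch
of that technique follows. It is an assumption-laden model, not the
driver: struct fake_counter, read_hi(), read_lo() and read_counter64()
are hypothetical, the fake counter advances on every access to imitate a
running timer, and the halves are combined into one 64-bit value rather
than stored as separate sec/nsec fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical free-running 64-bit counter readable only 32 bits at a time. */
struct fake_counter {
	uint64_t value;
	uint64_t step;  /* how far the counter moves between two accesses */
};

static uint32_t read_hi(struct fake_counter *c)
{
	uint32_t v = (uint32_t)(c->value >> 32);

	c->value += c->step;
	return v;
}

static uint32_t read_lo(struct fake_counter *c)
{
	uint32_t v = (uint32_t)c->value;

	c->value += c->step;
	return v;
}

/*
 * Read high, low, high. If the high half changed in between, the low half
 * tells us on which side of the wrap it was sampled: a large low value
 * (MSB set) was taken before the wrap, a small one after it.
 */
static uint64_t read_counter64(struct fake_counter *c)
{
	uint32_t hi_before, lo, hi_after, hi;

	assert(c != NULL);

	hi_before = read_hi(c);
	lo = read_lo(c);
	hi_after = read_hi(c);

	if (hi_before != hi_after && (lo & 0x80000000u))
		hi = hi_before;
	else
		hi = hi_after;

	return ((uint64_t)hi << 32) | lo;
}
```

The heuristic assumes the counter advances by far less than 2^31 between
the three reads, which is what makes the MSB test meaningful.
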
 drivers/ptp/Kconfig               |   8 +
 drivers/ptp/Makefile              |   1 +
 drivers/ptp/ptp-intel-pmc-tgpio.c | 378 ++++++++++++++++++++++++++++++
 3 files changed, 387 insertions(+)
 create mode 100644 drivers/ptp/ptp-intel-pmc-tgpio.c

diff --git a/drivers/ptp/Kconfig b/drivers/ptp/Kconfig
index 9b8fee5178e8..bb0fce70a783 100644
--- a/drivers/ptp/Kconfig
+++ b/drivers/ptp/Kconfig
@@ -107,6 +107,14 @@ config PTP_1588_CLOCK_PCH
 	  To compile this driver as a module, choose M here: the module
 	  will be called ptp_pch.
 
+config PTP_INTEL_PMC_TGPIO
+	tristate "Intel PMC Timed GPIO"
+	depends on X86
+	depends on ACPI
+	imply PTP_1588_CLOCK
+	help
+	  This driver adds support for Intel PMC Timed GPIO Controller
+
 config PTP_1588_CLOCK_KVM
 	tristate "KVM virtual PTP clock"
 	depends on PTP_1588_CLOCK
diff --git a/drivers/ptp/Makefile b/drivers/ptp/Makefile
index 677d1d178a3e..ff89c90ace82 100644
--- a/drivers/ptp/Makefile
+++ b/drivers/ptp/Makefile
@@ -7,6 +7,7 @@ ptp-y					:= ptp_clock.o ptp_chardev.o ptp_sysfs.o
 obj-$(CONFIG_PTP_1588_CLOCK)		+= ptp.o
 obj-$(CONFIG_PTP_1588_CLOCK_DTE)	+= ptp_dte.o
 obj-$(CONFIG_PTP_1588_CLOCK_IXP46X)	+= ptp_ixp46x.o
+obj-$(CONFIG_PTP_INTEL_PMC_TGPIO)	+= ptp-intel-pmc-tgpio.o
 obj-$(CONFIG_PTP_1588_CLOCK_PCH)	+= ptp_pch.o
 obj-$(CONFIG_PTP_1588_CLOCK_KVM)	+= ptp_kvm.o
 obj-$(CONFIG_PTP_1588_CLOCK_QORIQ)	+= ptp-qoriq.o
diff --git a/drivers/ptp/ptp-intel-pmc-tgpio.c b/drivers/ptp/ptp-intel-pmc-tgpio.c
new file mode 100644
index 000000000000..880ece34868a
--- /dev/null
+++ b/drivers/ptp/ptp-intel-pmc-tgpio.c
@@ -0,0 +1,378 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Intel Timed GPIO Controller Driver
+ *
+ * Copyright (C) 2018 Intel Corporation
+ * Author: Felipe Balbi <felipe.balbi@linux.intel.com>
+ */
+
+#include <linux/acpi.h>
+#include <linux/bitops.h>
+#include <linux/gpio.h>
+#include <linux/io-64-nonatomic-lo-hi.h>
+#include <linux/kthread.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/platform_device.h>
+#include <linux/ptp_clock_kernel.h>
+#include <asm/tsc.h>
+
+#define TGPIOCTL		0x00
+#define TGPIOCOMPV31_0		0x10
+#define TGPIOCOMPV63_32		0x14
+#define TGPIOPIV31_0		0x18
+#define TGPIOPIV63_32		0x1c
+#define TGPIOTCV31_0		0x20
+#define TGPIOTCV63_32		0x24
+#define TGPIOECCV31_0		0x28
+#define TGPIOECCV63_32		0x2c
+#define TGPIOEC31_0		0x30
+#define TGPIOEC63_32		0x34
+
+/* Control Register */
+#define TGPIOCTL_EN		BIT(0)
+#define TGPIOCTL_DIR		BIT(1)
+#define TGPIOCTL_EP		GENMASK(3, 2)
+#define TGPIOCTL_EP_RISING_EDGE	(0 << 2)
+#define TGPIOCTL_EP_FALLING_EDGE (1 << 2)
+#define TGPIOCTL_EP_TOGGLE_EDGE	(2 << 2)
+#define TGPIOCTL_PM		BIT(4)
+
+#define NSECS_PER_SEC		1000000000
+#define TGPIO_MAX_ADJ_TIME	999999900
+
+struct intel_pmc_tgpio {
+	struct ptp_clock_info	info;
+	struct ptp_clock	*clock;
+
+	struct mutex		lock;
+	struct device		*dev;
+	void __iomem		*base;
+
+	struct task_struct	*event_thread;
+	bool			input;
+};
+#define to_intel_pmc_tgpio(i)	(container_of((i), struct intel_pmc_tgpio, info))
+
+static inline u64 to_intel_pmc_tgpio_time(struct ptp_clock_time *t)
+{
+	return t->sec * NSECS_PER_SEC + t->nsec;
+}
+
+static inline u64 intel_pmc_tgpio_readq(void __iomem *base, u32 offset)
+{
+	return lo_hi_readq(base + offset);
+}
+
+static inline void intel_pmc_tgpio_writeq(void __iomem *base, u32 offset, u64 v)
+{
+	return lo_hi_writeq(v, base + offset);
+}
+
+static inline u32 intel_pmc_tgpio_readl(void __iomem *base, u32 offset)
+{
+	return readl(base + offset);
+}
+
+static inline void intel_pmc_tgpio_writel(void __iomem *base, u32 offset, u32 value)
+{
+	writel(value, base + offset);
+}
+
+static struct ptp_pin_desc intel_pmc_tgpio_pin_config[] = {
+	{					\
+		.name	= "pin0",		\
+		.index	= 0,			\
+		.func	= PTP_PF_NONE,		\
+		.chan	= 0,			\
+	}
+};
+
+static int intel_pmc_tgpio_gettime64(struct ptp_clock_info *info,
+		struct timespec64 *ts)
+{
+	struct intel_pmc_tgpio	*tgpio = to_intel_pmc_tgpio(info);
+	u64 now;
+
+	mutex_lock(&tgpio->lock);
+	now = get_art_ns_now();
+	*ts = ns_to_timespec64(now);
+	mutex_unlock(&tgpio->lock);
+
+	return 0;
+}
+
+static int intel_pmc_tgpio_settime64(struct ptp_clock_info *info,
+		const struct timespec64 *ts)
+{
+	return -EOPNOTSUPP;
+}
+
+static int intel_pmc_tgpio_event_thread(void *_tgpio)
+{
+	struct intel_pmc_tgpio	*tgpio = _tgpio;
+	u64 reg;
+
+	while (!kthread_should_stop()) {
+		bool input;
+		int i;
+
+		mutex_lock(&tgpio->lock);
+		input = tgpio->input;
+		mutex_unlock(&tgpio->lock);
+
+		if (!input)
+			schedule();
+
+		reg = intel_pmc_tgpio_readq(tgpio->base, TGPIOEC31_0);
+
+		for (i = 0; i < reg; i++) {
+			struct ptp_clock_event event;
+
+			event.type = PTP_CLOCK_EXTTS;
+			event.index = 0;
+			event.timestamp = intel_pmc_tgpio_readq(tgpio->base,
+					TGPIOTCV31_0);
+
+			ptp_clock_event(tgpio->clock, &event);
+		}
+		schedule_timeout_interruptible(10);
+	}
+
+	return 0;
+}
+
+static int intel_pmc_tgpio_config_input(struct intel_pmc_tgpio *tgpio,
+		struct ptp_extts_request *extts, int on)
+{
+	u32			ctrl;
+	bool			input;
+
+	ctrl = intel_pmc_tgpio_readl(tgpio->base, TGPIOCTL);
+	ctrl &= ~TGPIOCTL_EN;
+	intel_pmc_tgpio_writel(tgpio->base, TGPIOCTL, ctrl);
+
+	if (on) {
+		ctrl |= TGPIOCTL_DIR;
+
+		if (extts->flags & PTP_RISING_EDGE &&
+				extts->flags & PTP_FALLING_EDGE)
+			ctrl |= TGPIOCTL_EP_TOGGLE_EDGE;
+		else if (extts->flags & PTP_RISING_EDGE)
+			ctrl |= TGPIOCTL_EP_RISING_EDGE;
+		else if (extts->flags & PTP_FALLING_EDGE)
+			ctrl |= TGPIOCTL_EP_FALLING_EDGE;
+
+		/* gotta program all other bits before EN bit is set */
+		intel_pmc_tgpio_writel(tgpio->base, TGPIOCTL, ctrl);
+		ctrl |= TGPIOCTL_EN;
+		input = true;
+	} else {
+		ctrl &= ~(TGPIOCTL_DIR | TGPIOCTL_EN);
+		input = false;
+	}
+
+	intel_pmc_tgpio_writel(tgpio->base, TGPIOCTL, ctrl);
+	tgpio->input = input;
+
+	if (input)
+		wake_up_process(tgpio->event_thread);
+
+	return 0;
+}
+
+static int intel_pmc_tgpio_config_output(struct intel_pmc_tgpio *tgpio,
+		struct ptp_perout_request *perout, int on)
+{
+	u32			ctrl;
+
+	ctrl = intel_pmc_tgpio_readl(tgpio->base, TGPIOCTL);
+	if (on) {
+		struct ptp_clock_time *period = &perout->period;
+		struct ptp_clock_time *start = &perout->start;
+
+		if (ctrl & TGPIOCTL_EN)
+			return 0;
+
+		intel_pmc_tgpio_writeq(tgpio->base, TGPIOCOMPV31_0,
+				to_intel_pmc_tgpio_time(start));
+
+		intel_pmc_tgpio_writeq(tgpio->base, TGPIOPIV31_0,
+				to_intel_pmc_tgpio_time(period));
+
+		ctrl &= ~TGPIOCTL_DIR;
+		if (perout->flags & PTP_PEROUT_ONE_SHOT)
+			ctrl &= ~TGPIOCTL_PM;
+		else
+			ctrl |= TGPIOCTL_PM;
+
+		/* gotta program all other bits before EN bit is set */
+		intel_pmc_tgpio_writel(tgpio->base, TGPIOCTL, ctrl);
+
+		ctrl |= TGPIOCTL_EN;
+		intel_pmc_tgpio_writel(tgpio->base, TGPIOCTL, ctrl);
+	} else {
+		if (!(ctrl & TGPIOCTL_EN))
+			return 0;
+
+		ctrl &= ~(TGPIOCTL_EN | TGPIOCTL_PM);
+		intel_pmc_tgpio_writel(tgpio->base, TGPIOCTL, ctrl);
+	}
+
+	return 0;
+}
+
+static int intel_pmc_tgpio_enable(struct ptp_clock_info *info,
+		struct ptp_clock_request *req, int on)
+{
+	struct intel_pmc_tgpio	*tgpio = to_intel_pmc_tgpio(info);
+	int			ret = -EOPNOTSUPP;
+
+	mutex_lock(&tgpio->lock);
+	switch (req->type) {
+	case PTP_CLK_REQ_EXTTS:
+		ret = intel_pmc_tgpio_config_input(tgpio, &req->extts, on);
+		break;
+	case PTP_CLK_REQ_PEROUT:
+		ret = intel_pmc_tgpio_config_output(tgpio, &req->perout, on);
+		break;
+	default:
+		break;
+	}
+	mutex_unlock(&tgpio->lock);
+
+	return ret;
+}
+
+static int intel_pmc_tgpio_get_time_fn(ktime_t *device_time,
+		struct system_counterval_t *system_counter, void *_tgpio)
+{
+	get_tsc_ns(system_counter, device_time);
+	return 0;
+}
+
+static int intel_pmc_tgpio_getcrosststamp(struct ptp_clock_info *info,
+		struct system_device_crosststamp *cts)
+{
+	struct intel_pmc_tgpio	*tgpio = to_intel_pmc_tgpio(info);
+
+	return get_device_system_crosststamp(intel_pmc_tgpio_get_time_fn, tgpio,
+			NULL, cts);
+}
+
+static int intel_pmc_tgpio_counttstamp(struct ptp_clock_info *info,
+		struct ptp_event_count_tstamp *count)
+{
+	struct intel_pmc_tgpio	*tgpio = to_intel_pmc_tgpio(info);
+	u32 dt_hi_tmp;
+	u32 dt_hi;
+	u32 dt_lo;
+
+	dt_hi_tmp = intel_pmc_tgpio_readl(tgpio->base, TGPIOTCV63_32);
+	dt_lo = intel_pmc_tgpio_readl(tgpio->base, TGPIOTCV31_0);
+
+	count->event_count = intel_pmc_tgpio_readl(tgpio->base, TGPIOECCV63_32);
+	count->event_count <<= 32;
+	count->event_count |= intel_pmc_tgpio_readl(tgpio->base, TGPIOECCV31_0);
+
+	dt_hi = intel_pmc_tgpio_readl(tgpio->base, TGPIOTCV63_32);
+
+	if (dt_hi_tmp != dt_hi && dt_lo & 0x80000000)
+		count->device_time.sec = dt_hi_tmp;
+	else
+		count->device_time.sec = dt_hi;
+
+	count->device_time.nsec = dt_lo;
+
+	return 0;
+}
+
+static int intel_pmc_tgpio_verify(struct ptp_clock_info *ptp, unsigned int pin,
+		enum ptp_pin_function func, unsigned int chan)
+{
+	return 0;
+}
+
+static const struct ptp_clock_info intel_pmc_tgpio_info = {
+	.owner		= THIS_MODULE,
+	.name		= "Intel PMC TGPIO",
+	.max_adj	= 50000000,
+	.n_pins		= 1,
+	.n_ext_ts	= 1,
+	.n_per_out	= 1,
+	.pin_config	= intel_pmc_tgpio_pin_config,
+	.gettime64	= intel_pmc_tgpio_gettime64,
+	.settime64	= intel_pmc_tgpio_settime64,
+	.enable		= intel_pmc_tgpio_enable,
+	.getcrosststamp	= intel_pmc_tgpio_getcrosststamp,
+	.counttstamp	= intel_pmc_tgpio_counttstamp,
+	.verify		= intel_pmc_tgpio_verify,
+};
+
+static int intel_pmc_tgpio_probe(struct platform_device *pdev)
+{
+	struct intel_pmc_tgpio	*tgpio;
+	struct device		*dev;
+	struct resource		*res;
+
+	dev = &pdev->dev;
+	tgpio = devm_kzalloc(dev, sizeof(*tgpio), GFP_KERNEL);
+	if (!tgpio)
+		return -ENOMEM;
+
+	tgpio->dev = dev;
+	tgpio->info = intel_pmc_tgpio_info;
+
+	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
+	tgpio->base = devm_ioremap_resource(dev, res);
+	if (IS_ERR(tgpio->base))
+		return PTR_ERR(tgpio->base);
+
+	mutex_init(&tgpio->lock);
+	platform_set_drvdata(pdev, tgpio);
+
+	tgpio->event_thread = kthread_create(intel_pmc_tgpio_event_thread,
+			tgpio, dev_name(tgpio->dev));
+	if (IS_ERR(tgpio->event_thread))
+		return PTR_ERR(tgpio->event_thread);
+
+	tgpio->clock = ptp_clock_register(&tgpio->info, &pdev->dev);
+	if (IS_ERR(tgpio->clock))
+		return PTR_ERR(tgpio->clock);
+
+	wake_up_process(tgpio->event_thread);
+
+	return 0;
+}
+
+static int intel_pmc_tgpio_remove(struct platform_device *pdev)
+{
+	struct intel_pmc_tgpio	*tgpio = platform_get_drvdata(pdev);
+
+	ptp_clock_unregister(tgpio->clock);
+
+	return 0;
+}
+
+static const struct acpi_device_id intel_pmc_acpi_match[] = {
+	/* TODO */
+
+	{  },
+};
+
+/* MODULE_ALIAS("acpi*:TODO:*"); */
+
+static struct platform_driver intel_pmc_tgpio_driver = {
+	.probe		= intel_pmc_tgpio_probe,
+	.remove		= intel_pmc_tgpio_remove,
+	.driver		= {
+		.name	= "intel-pmc-tgpio",
+		.acpi_match_table = ACPI_PTR(intel_pmc_acpi_match),
+	},
+};
+
+module_platform_driver(intel_pmc_tgpio_driver);
+
+MODULE_AUTHOR("Felipe Balbi <felipe.balbi@linux.intel.com>");
+MODULE_LICENSE("GPL v2");
+MODULE_DESCRIPTION("Intel PMC Timed GPIO Controller Driver");
-- 
2.22.0
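
Note on the counttstamp callback in the patch above: it reads the upper
32 bits of the time capture value twice, once before and once after the
lower 32 bits, so that a carry between the reads can be detected. The
sketch below replays that pattern in userspace C against a simulated
counter. struct fake_counter, fake_read_hi(), fake_read_lo(),
read_counter64() and sample_at() are made-up names for illustration and
are not part of the driver. The sketch also glues the halves into one
64-bit value for easy checking, whereas the driver stores them as sec
and nsec.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the device: a 64-bit counter that advances
 * by `step` every time one of its 32-bit halves is read. */
struct fake_counter {
	uint64_t value;
	uint64_t step;
};

static uint32_t fake_read_hi(struct fake_counter *c)
{
	uint32_t v = (uint32_t)(c->value >> 32);

	c->value += c->step;
	return v;
}

static uint32_t fake_read_lo(struct fake_counter *c)
{
	uint32_t v = (uint32_t)c->value;

	c->value += c->step;
	return v;
}

/* Read hi, lo, hi. If the upper half changed and the lower half is still
 * in its top range, the lower half was sampled before the carry, so it
 * belongs with the first upper read. Otherwise use the second one. */
static uint64_t read_counter64(struct fake_counter *c)
{
	uint32_t hi_before, lo, hi_after, hi;

	assert(c);

	hi_before = fake_read_hi(c);
	lo = fake_read_lo(c);
	hi_after = fake_read_hi(c);

	if (hi_before != hi_after && (lo & 0x80000000u))
		hi = hi_before;
	else
		hi = hi_after;

	return ((uint64_t)hi << 32) | lo;
}

/* Convenience wrapper: sample a counter that starts at `start`. */
static uint64_t sample_at(uint64_t start)
{
	struct fake_counter c = { start, 1 };

	return read_counter64(&c);
}
```

With a step of one, sample_at() returns the value the counter held at
the moment the lower half was read, whether or not the carry happened
between the first and the last read.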


^ permalink raw reply related

* [RFC PATCH 1/5] x86: tsc: add tsc to art helpers
From: Felipe Balbi @ 2019-07-16  7:20 UTC (permalink / raw)
  To: Richard Cochran
  Cc: netdev, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, x86, linux-kernel, Christopher S . Hall,
	Felipe Balbi
In-Reply-To: <20190716072038.8408-1-felipe.balbi@linux.intel.com>

Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com>
---
 arch/x86/include/asm/tsc.h |  2 ++
 arch/x86/kernel/tsc.c      | 32 ++++++++++++++++++++++++++++++++
 2 files changed, 34 insertions(+)

diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index 8a0c25c6bf09..b7a9f4385a82 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -32,6 +32,8 @@ static inline cycles_t get_cycles(void)
 
 extern struct system_counterval_t convert_art_to_tsc(u64 art);
 extern struct system_counterval_t convert_art_ns_to_tsc(u64 art_ns);
+extern void get_tsc_ns(struct system_counterval_t *tsc_counterval, u64 *tsc_ns);
+extern u64 get_art_ns_now(void);
 
 extern void tsc_early_init(void);
 extern void tsc_init(void);
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 0b29e58f288e..333fffc1db7c 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1215,6 +1215,38 @@ struct system_counterval_t convert_art_to_tsc(u64 art)
 }
 EXPORT_SYMBOL(convert_art_to_tsc);
 
+void get_tsc_ns(struct system_counterval_t *tsc_counterval, u64 *tsc_ns)
+{
+	u64 tmp, res, rem;
+	u64 cycles;
+
+	tsc_counterval->cycles = clocksource_tsc.read(NULL);
+	cycles = tsc_counterval->cycles;
+	tsc_counterval->cs = art_related_clocksource;
+
+	rem = do_div(cycles, tsc_khz);
+
+	res = cycles * USEC_PER_SEC;
+	tmp = rem * USEC_PER_SEC;
+
+	do_div(tmp, tsc_khz);
+	res += tmp;
+
+	*tsc_ns = res;
+}
+EXPORT_SYMBOL(get_tsc_ns);
+
+u64 get_art_ns_now(void)
+{
+	struct system_counterval_t tsc_cycles;
+	u64 tsc_ns;
+
+	get_tsc_ns(&tsc_cycles, &tsc_ns);
+
+	return tsc_ns;
+}
+EXPORT_SYMBOL(get_art_ns_now);
+
 /**
  * convert_art_ns_to_tsc() - Convert ART in nanoseconds to TSC.
  * @art_ns: ART (Always Running Timer) in unit of nanoseconds
-- 
2.22.0
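
Note on the arithmetic in get_tsc_ns() above: dividing the cycle count
by tsc_khz yields whole milliseconds, and the remainder is scaled
separately, so the multiplication by 1000000 never overflows 64 bits. A
minimal userspace sketch of the same split follows. cycles_to_ns() and
NSEC_PER_MSEC_SKETCH are names chosen here for illustration; the patch
itself uses do_div() and USEC_PER_SEC, which has the same numeric value.

```c
#include <assert.h>
#include <stdint.h>

/* 1000000: nanoseconds per millisecond (hypothetical local name). */
#define NSEC_PER_MSEC_SKETCH 1000000ULL

/* Convert a cycle count to nanoseconds for a counter running at
 * freq_khz. Multiplying cycles by 1000000 directly would overflow once
 * cycles exceeds roughly 1.8e13, which is under two hours at 3 GHz. */
static uint64_t cycles_to_ns(uint64_t cycles, uint32_t freq_khz)
{
	uint64_t msec, rem;

	assert(freq_khz != 0);

	msec = cycles / freq_khz;	/* whole milliseconds */
	rem = cycles % freq_khz;	/* leftover cycles, below freq_khz */

	/* rem < 2^32, so rem * 1000000 stays well inside 64 bits. */
	return msec * NSEC_PER_MSEC_SKETCH +
	       (rem * NSEC_PER_MSEC_SKETCH) / freq_khz;
}
```

For example, 4500000 cycles at 3000000 kHz (3 GHz) is 1.5 ms, so
cycles_to_ns(4500000, 3000000) gives 1500000.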


^ permalink raw reply related

* [RFC PATCH 2/5] PTP: add a callback for counting timestamp events
From: Felipe Balbi @ 2019-07-16  7:20 UTC (permalink / raw)
  To: Richard Cochran
  Cc: netdev, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, x86, linux-kernel, Christopher S . Hall,
	Felipe Balbi
In-Reply-To: <20190716072038.8408-1-felipe.balbi@linux.intel.com>

This will be used for frequency discipline adjustments.

Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com>
---
 include/linux/ptp_clock_kernel.h | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/include/linux/ptp_clock_kernel.h b/include/linux/ptp_clock_kernel.h
index 28eb9c792522..1a4e3f916128 100644
--- a/include/linux/ptp_clock_kernel.h
+++ b/include/linux/ptp_clock_kernel.h
@@ -35,6 +35,16 @@ struct ptp_system_timestamp {
 	struct timespec64 post_ts;
 };
 
+/**
+ * struct ptp_event_count_tstamp - device time vs event count for frequency discipline
+ */
+struct ptp_event_count_tstamp {
+	unsigned int index;
+
+	struct ptp_clock_time device_time;
+	u64 event_count;
+};
+
 /**
  * struct ptp_clock_info - decribes a PTP hardware clock
  *
@@ -134,6 +144,8 @@ struct ptp_clock_info {
 			  struct ptp_system_timestamp *sts);
 	int (*getcrosststamp)(struct ptp_clock_info *ptp,
 			      struct system_device_crosststamp *cts);
+	int (*counttstamp)(struct ptp_clock_info *ptp,
+			   struct ptp_event_count_tstamp *count);
 	int (*settime64)(struct ptp_clock_info *p, const struct timespec64 *ts);
 	int (*enable)(struct ptp_clock_info *ptp,
 		      struct ptp_clock_request *request, int on);
-- 
2.22.0


^ permalink raw reply related

* [RFC PATCH 0/5] PTP: add support for Intel's TGPIO controller
From: Felipe Balbi @ 2019-07-16  7:20 UTC (permalink / raw)
  To: Richard Cochran
  Cc: netdev, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, x86, linux-kernel, Christopher S . Hall,
	Felipe Balbi

TGPIO is a new IP which allows for time synchronization between systems
without any other means of synchronization such as PTP or NTP. The
driver is implemented as part of the PTP framework since its features
cover most of what this controller can do.

There are a few things that made me send this as an RFC, however:

(1) This version of the controller lacks an interrupt line. Currently I
	use a kthread that starts polling the controller whenever its
	pin is configured as input. Any better ideas for allowing
	userspace to control the polling rate? Perhaps tap into ptp_poll()?

(2) ACPI IDs can't be shared at this moment, unfortunately.

(3) The change in arch/x86/kernel/tsc.c needs to be reviewed at length
	before going in.

Let me know what you guys think,
Cheers

Felipe Balbi (5):
  x86: tsc: add tsc to art helpers
  PTP: add a callback for counting timestamp events
  PTP: implement PTP_EVENT_COUNT_TSTAMP ioctl
  PTP: Add flag for non-periodic output
  PTP: Add support for Intel PMC Timed GPIO Controller

 arch/x86/include/asm/tsc.h        |   2 +
 arch/x86/kernel/tsc.c             |  32 +++
 drivers/ptp/Kconfig               |   8 +
 drivers/ptp/Makefile              |   1 +
 drivers/ptp/ptp-intel-pmc-tgpio.c | 378 ++++++++++++++++++++++++++++++
 drivers/ptp/ptp_chardev.c         |  15 ++
 include/linux/ptp_clock_kernel.h  |  12 +
 include/uapi/linux/ptp_clock.h    |   6 +-
 8 files changed, 453 insertions(+), 1 deletion(-)
 create mode 100644 drivers/ptp/ptp-intel-pmc-tgpio.c

-- 
2.22.0


^ permalink raw reply

* [PATCH] net/sched: Make NET_ACT_CT depends on NF_NAT
From: YueHaibing @ 2019-07-16  7:16 UTC (permalink / raw)
  To: jhs, xiyou.wangcong, jiri, davem; +Cc: linux-kernel, netdev, YueHaibing

If NF_NAT is m and NET_ACT_CT is y, build fails:

net/sched/act_ct.o: In function `tcf_ct_act':
act_ct.c:(.text+0x21ac): undefined reference to `nf_ct_nat_ext_add'
act_ct.c:(.text+0x229a): undefined reference to `nf_nat_icmp_reply_translation'
act_ct.c:(.text+0x233a): undefined reference to `nf_nat_setup_info'
act_ct.c:(.text+0x234a): undefined reference to `nf_nat_alloc_null_binding'
act_ct.c:(.text+0x237c): undefined reference to `nf_nat_packet'

Reported-by: Hulk Robot <hulkci@huawei.com>
Fixes: b57dc7c13ea9 ("net/sched: Introduce action ct")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
---
 net/sched/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index dd55b9a..afd2ba1 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -942,7 +942,7 @@ config NET_ACT_TUNNEL_KEY
 
 config NET_ACT_CT
         tristate "connection tracking tc action"
-        depends on NET_CLS_ACT && NF_CONNTRACK
+        depends on NET_CLS_ACT && NF_CONNTRACK && NF_NAT
         help
 	  Say Y here to allow sending the packets to conntrack module.
 
-- 
2.7.4
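
Note on why the added dependency fixes the link error: Kconfig treats
n, m and y as ordered values, "&&" takes the minimum of its operands,
and a symbol's value is capped by its "depends on" expression. With
NF_NAT=m, NET_ACT_CT can therefore be at most m. The sketch below models
only that rule in plain C. The enum and the function names are
illustrative and are not Kconfig internals.

```c
#include <assert.h>

/* Tristate values ordered as Kconfig orders them: n < m < y. */
enum tristate { TRI_N = 0, TRI_M = 1, TRI_Y = 2 };

/* Kconfig "&&" on tristates: the minimum of both operands. */
static enum tristate tri_and(enum tristate a, enum tristate b)
{
	return a < b ? a : b;
}

/* Value a symbol ends up with: the requested value, capped by the
 * value of its "depends on" expression. */
static enum tristate effective(enum tristate requested, enum tristate deps)
{
	assert(requested >= TRI_N && requested <= TRI_Y);
	return tri_and(requested, deps);
}

/* depends on NET_CLS_ACT && NF_CONNTRACK (before the patch) */
static enum tristate act_ct_before(enum tristate requested,
				   enum tristate net_cls_act,
				   enum tristate nf_conntrack)
{
	return effective(requested, tri_and(net_cls_act, nf_conntrack));
}

/* depends on NET_CLS_ACT && NF_CONNTRACK && NF_NAT (after the patch) */
static enum tristate act_ct_after(enum tristate requested,
				  enum tristate net_cls_act,
				  enum tristate nf_conntrack,
				  enum tristate nf_nat)
{
	return effective(requested,
			 tri_and(tri_and(net_cls_act, nf_conntrack), nf_nat));
}
```

In the failing configuration, act_ct_before(TRI_Y, TRI_Y, TRI_Y) stays
at y while NF_NAT is a module, which is what produces the undefined
references. act_ct_after(TRI_Y, TRI_Y, TRI_Y, TRI_M) is limited to m.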



^ permalink raw reply related

* Re: [PATCH iproute2-rc 1/8] rdma: Update uapi headers to add statistic counter support
From: Leon Romanovsky @ 2019-07-16  6:54 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev, David Ahern, Mark Zhang, RDMA mailing list
In-Reply-To: <20190715135238.7c0c7242@hermes.lan>

On Mon, Jul 15, 2019 at 01:52:38PM -0700, Stephen Hemminger wrote:
> On Wed, 10 Jul 2019 10:24:48 +0300
> Leon Romanovsky <leon@kernel.org> wrote:
>
> > From: Mark Zhang <markz@mellanox.com>
> >
> > Update rdma_netlink.h to kernel commit 6e7be47a5345 ("RDMA/nldev:
> > Allow get default counter statistics through RDMA netlink").
> >
> > Signed-off-by: Mark Zhang <markz@mellanox.com>
> > Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
>
> I am waiting on this until it gets to Linus's tree.

It was merged tonight.
https://git.kernel.org/torvalds/c/2a3c389a0fde49b241430df806a34276568cfb29

Thanks

^ permalink raw reply

* Re: linux-next: Tree for Jul 15 (HEADERS_TEST w/ netfilter tables offload)
From: Masahiro Yamada @ 2019-07-16  6:44 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Laura Garcia, Randy Dunlap, Stephen Rothwell,
	Linux Next Mailing List, Linux Kernel Mailing List, linux-kbuild,
	netdev@vger.kernel.org, Netfilter Development Mailing list
In-Reply-To: <20190715180905.rytaht5kslpbatcy@salvia>

On Tue, Jul 16, 2019 at 3:09 AM Pablo Neira Ayuso <pablo@netfilter.org> wrote:
>
> On Tue, Jul 16, 2019 at 02:56:09AM +0900, Masahiro Yamada wrote:
> > On Tue, Jul 16, 2019 at 2:33 AM Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > >
> > > On Mon, Jul 15, 2019 at 07:28:04PM +0200, Laura Garcia wrote:
> > > > CC'ing netfilter.
> > > >
> > > > On Mon, Jul 15, 2019 at 6:45 PM Randy Dunlap <rdunlap@infradead.org> wrote:
> > > > >
> > > > > On 7/14/19 9:48 PM, Stephen Rothwell wrote:
> > > > > > Hi all,
> > > > > >
> > > > > > Please do not add v5.4 material to your linux-next included branches
> > > > > > until after v5.3-rc1 has been released.
> > > > > >
> > > > > > Changes since 20190712:
> > > > > >
> > > > >
> > > > > Hi,
> > > > >
> > > > > I am seeing these build errors from HEADERS_TEST (or KERNEL_HEADERS_TEST)
> > > > > for include/net/netfilter/nf_tables_offload.h.s:
> > > > >
> > > > >   CC      include/net/netfilter/nf_tables_offload.h.s
> > > [...]
> > > > > Should this header file not be tested?
> >
> > This means you must endlessly exclude
> > headers that include nf_tables.h
> >
> >
> > > Yes, it should indeed be added.
> >
> > Adding 'header-test-' is the last resort.
>
> OK, so policy now is that all internal headers should compile
> standalone, right?

I would not say that.
I just want to put as much code as possible into the test-coverage.

If there is a good reason to opt out of the header-test, that is OK.
We should take a look at the cause of the error
before blindly adding it into the blacklist.


For this particular case, I just thought some functions
could be localized in net/netfilter/, and would be cleaner.

Having said that, I am not familiar enough with
the netfilter subsystem.
So, this should be reviewed by the experts in the area.


Anyway, CONFIG_NF_TABLES seems mandatory to compile
include/net/netfilter/nf_tables_*.h

So, I will queue the following patch
to suppress the error for now.

diff --git a/include/Kbuild b/include/Kbuild
index 7e9f1acb9dd5..e59605243bca 100644
--- a/include/Kbuild
+++ b/include/Kbuild
@@ -905,10 +905,11 @@ header-test-                      += net/netfilter/nf_nat_redirect.h
 header-test-                   += net/netfilter/nf_queue.h
 header-test-                   += net/netfilter/nf_reject.h
 header-test-                   += net/netfilter/nf_synproxy.h
-header-test-                   += net/netfilter/nf_tables.h
-header-test-                   += net/netfilter/nf_tables_core.h
-header-test-                   += net/netfilter/nf_tables_ipv4.h
+header-test-$(CONFIG_NF_TABLES)        += net/netfilter/nf_tables.h
+header-test-$(CONFIG_NF_TABLES)        += net/netfilter/nf_tables_core.h
+header-test-$(CONFIG_NF_TABLES)        += net/netfilter/nf_tables_ipv4.h
 header-test-                   += net/netfilter/nf_tables_ipv6.h
+header-test-$(CONFIG_NF_TABLES)        += net/netfilter/nf_tables_offload.h
 header-test-                   += net/netfilter/nft_fib.h
 header-test-                   += net/netfilter/nft_meta.h
 header-test-                   += net/netfilter/nft_reject.h



This test just landed upstream,
and it will take some time to iron out the issues.

If I am disturbing people too much,
I perhaps need to loosen the policy.
Sorry if this test is too annoying.


Thanks.


--
Best Regards
Masahiro Yamada

^ permalink raw reply related

* [PATCH] net: ethernet: mediatek: mtk_eth_soc: Add of_node_put() before goto
From: Nishka Dasgupta @ 2019-07-16  5:55 UTC (permalink / raw)
  To: nbd, john, sean.wang, davem, netdev, matthias.bgg,
	linux-arm-kernel, linux-mediatek
  Cc: Nishka Dasgupta

Each iteration of for_each_child_of_node puts the previous node, but in
the case of a goto from the middle of the loop, there is no put, thus
causing a memory leak. Hence add an of_node_put before the goto.
Issue found with Coccinelle.

Signed-off-by: Nishka Dasgupta <nishkadg.linux@gmail.com>
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index b20b3a5a1ebb..c39d7f4ab1d4 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -2548,8 +2548,10 @@ static int mtk_probe(struct platform_device *pdev)
 			continue;
 
 		err = mtk_add_mac(eth, mac_np);
-		if (err)
+		if (err) {
+			of_node_put(mac_np);
 			goto err_deinit_hw;
+		}
 	}
 
 	if (MTK_HAS_CAPS(eth->soc->caps, MTK_SHARED_INT)) {
-- 
2.19.1


^ permalink raw reply related

* [PATCH] net: ethernet: mscc: ocelot_board: Add of_node_put() before return
From: Nishka Dasgupta @ 2019-07-16  5:52 UTC (permalink / raw)
  To: alexandre.belloni, unglinuxdriver, davem, netdev; +Cc: Nishka Dasgupta

Each iteration of for_each_available_child_of_node puts the previous
node, but in the case of a return from the middle of the loop, there is
no put, thus causing a memory leak. Hence add an of_node_put before the
return in two places.
Issue found with Coccinelle.

Signed-off-by: Nishka Dasgupta <nishkadg.linux@gmail.com>
---
 drivers/net/ethernet/mscc/ocelot_board.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mscc/ocelot_board.c b/drivers/net/ethernet/mscc/ocelot_board.c
index 58bde1a9eacb..2451d4a96490 100644
--- a/drivers/net/ethernet/mscc/ocelot_board.c
+++ b/drivers/net/ethernet/mscc/ocelot_board.c
@@ -291,8 +291,10 @@ static int mscc_ocelot_probe(struct platform_device *pdev)
 			continue;
 
 		err = ocelot_probe_port(ocelot, port, regs, phy);
-		if (err)
+		if (err) {
+			of_node_put(portnp);
 			return err;
+		}
 
 		phy_mode = of_get_phy_mode(portnp);
 		if (phy_mode < 0)
@@ -318,6 +320,7 @@ static int mscc_ocelot_probe(struct platform_device *pdev)
 			dev_err(ocelot->dev,
 				"invalid phy mode for port%d, (Q)SGMII only\n",
 				port);
+			of_node_put(portnp);
 			return -EINVAL;
 		}
 
-- 
2.19.1


^ permalink raw reply related

* [PATCH] net: ethernet: ti: cpsw: Add of_node_put() before return and break
From: Nishka Dasgupta @ 2019-07-16  5:48 UTC (permalink / raw)
  To: grygorii.strashko, davem, ivan.khoronzhuk, linux-omap, netdev
  Cc: Nishka Dasgupta

Each iteration of for_each_available_child_of_node puts the previous
node, but in the case of a return or break from the middle of the loop,
there is no put, thus causing a memory leak.
Hence, for function cpsw_probe_dt, create an extra label err_node_put
that puts the last used node and returns ret; modify the return
statements in the loop to save the return value in ret and goto this new
label.
For function cpsw_remove_dt, add an of_node_put before the break.
Issue found with Coccinelle.

Signed-off-by: Nishka Dasgupta <nishkadg.linux@gmail.com>
---
 drivers/net/ethernet/ti/cpsw.c | 26 ++++++++++++++++++--------
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index f320f9a0de8b..32a89744972d 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -2570,7 +2570,7 @@ static int cpsw_probe_dt(struct cpsw_platform_data *data,
 			ret = PTR_ERR(slave_data->ifphy);
 			dev_err(&pdev->dev,
 				"%d: Error retrieving port phy: %d\n", i, ret);
-			return ret;
+			goto err_node_put;
 		}
 
 		slave_data->slave_node = slave_node;
@@ -2589,7 +2589,7 @@ static int cpsw_probe_dt(struct cpsw_platform_data *data,
 			if (ret) {
 				if (ret != -EPROBE_DEFER)
 					dev_err(&pdev->dev, "failed to register fixed-link phy: %d\n", ret);
-				return ret;
+				goto err_node_put;
 			}
 			slave_data->phy_node = of_node_get(slave_node);
 		} else if (parp) {
@@ -2607,7 +2607,8 @@ static int cpsw_probe_dt(struct cpsw_platform_data *data,
 			of_node_put(mdio_node);
 			if (!mdio) {
 				dev_err(&pdev->dev, "Missing mdio platform device\n");
-				return -EINVAL;
+				ret = -EINVAL;
+				goto err_node_put;
 			}
 			snprintf(slave_data->phy_id, sizeof(slave_data->phy_id),
 				 PHY_ID_FMT, mdio->name, phyid);
@@ -2622,7 +2623,8 @@ static int cpsw_probe_dt(struct cpsw_platform_data *data,
 		if (slave_data->phy_if < 0) {
 			dev_err(&pdev->dev, "Missing or malformed slave[%d] phy-mode property\n",
 				i);
-			return slave_data->phy_if;
+			ret = slave_data->phy_if;
+			goto err_node_put;
 		}
 
 no_phy_slave:
@@ -2633,7 +2635,7 @@ static int cpsw_probe_dt(struct cpsw_platform_data *data,
 			ret = ti_cm_get_macid(&pdev->dev, i,
 					      slave_data->mac_addr);
 			if (ret)
-				return ret;
+				goto err_node_put;
 		}
 		if (data->dual_emac) {
 			if (of_property_read_u32(slave_node, "dual_emac_res_vlan",
@@ -2648,11 +2650,17 @@ static int cpsw_probe_dt(struct cpsw_platform_data *data,
 		}
 
 		i++;
-		if (i == data->slaves)
-			break;
+		if (i == data->slaves) {
+			ret = 0;
+			goto err_node_put;
+		}
 	}
 
 	return 0;
+
+err_node_put:
+	of_node_put(slave_node);
+	return ret;
 }
 
 static void cpsw_remove_dt(struct platform_device *pdev)
@@ -2675,8 +2683,10 @@ static void cpsw_remove_dt(struct platform_device *pdev)
 		of_node_put(slave_data->phy_node);
 
 		i++;
-		if (i == data->slaves)
+		if (i == data->slaves) {
+			of_node_put(slave_node);
 			break;
+		}
 	}
 
 	of_platform_depopulate(&pdev->dev);
-- 
2.19.1


^ permalink raw reply related

* Re: [PATCH bpf] bpf: net: Set sk_bpf_storage back to NULL for cloned sk
From: Martin Lau @ 2019-07-16  5:46 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf@vger.kernel.org, netdev@vger.kernel.org, Alexei Starovoitov,
	Daniel Borkmann, David Miller, Kernel Team
In-Reply-To: <20190709163321.GB22061@mini-arch>

On Tue, Jul 09, 2019 at 09:33:21AM -0700, Stanislav Fomichev wrote:
> On 06/11, Martin KaFai Lau wrote:
> > The cloned sk should not carry its parent-listener's sk_bpf_storage.
> > This patch fixes it by setting it back to NULL.
> Have you thought about some kind of inheritance for listener sockets'
> storage? Suppose I have a situation where I write something
> to listener's sk storage (directly or via recently added sockopts hooks)
> and I want to inherit that state for a freshly established connection.
> 
> I was looking into adding the possibility to call bpf_get_listener_sock from
> the BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB callback to manually
> copy some data from the listener socket, but I don't think
> at this point there is any association between newly established
> socket and the listener.
Right, at that point, the child sk has no reference back
to the listener's sk.

After a quick look, it also seems the listener sk may not always be
available (e.g. the backlog processing case).  Hence, adding
the listener sk to the bpf running ctx is not obvious
either.

> 
> Thoughts/ideas?
I think cloning the listener's bpf sk storage could be added
to the existing sk cloning logic.  It seems to be a more
straightforward approach than figuring out the right place to call
another bpf prog to clone it.

Quick thoughts off the top of my head:
1. Default should be not-to-clone.  Have a way (a map's flag?) to opt-in.
2. The listener's sk storage could be being modified while being cloned.
   One possibility is to check if the value has bpf_spin_lock.
   If there is, lock it before cloning.

^ permalink raw reply

* [PATCH] rculist: Add build check for single optional list argument
From: Joel Fernandes (Google) @ 2019-07-16  4:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google), Paul McKenney, Alexey Kuznetsov,
	Bjorn Helgaas, Borislav Petkov, c0d1n61at3, David S. Miller,
	edumazet, Greg Kroah-Hartman, Hideaki YOSHIFUJI, H. Peter Anvin,
	Ingo Molnar, Jonathan Corbet, Josh Triplett, keescook,
	kernel-hardening, kernel-team, Lai Jiangshan, Len Brown,
	linux-acpi, linux-doc, linux-pci, linux-pm, Mathieu Desnoyers,
	neilb, netdev, Oleg Nesterov, Paul E. McKenney, Pavel Machek,
	peterz, Rafael J. Wysocki, Rasmus Villemoes, rcu, Steven Rostedt,
	Tejun Heo, Thomas Gleixner, will,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)

In a previous patch series [1], we added an optional lockdep expression
argument to list_for_each_entry_rcu() and the hlist equivalent. This
also meant that more than one optional argument could be passed to them,
with that error going unnoticed. To fix this, let us force a compiler
error if more than one optional argument is passed.

[1] https://lore.kernel.org/patchwork/project/lkml/list/?series=402150

Suggested-by: Paul McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 include/linux/rculist.h | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index 1048160625bb..86659f6d72dc 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -44,14 +44,18 @@ static inline void INIT_LIST_HEAD_RCU(struct list_head *list)
  * Check during list traversal that we are within an RCU reader
  */
 
+#define check_arg_count_one(dummy)
+
 #ifdef CONFIG_PROVE_RCU_LIST
-#define __list_check_rcu(dummy, cond, ...)				\
+#define __list_check_rcu(dummy, cond, extra...)				\
 	({								\
+	check_arg_count_one(extra);					\
 	RCU_LOCKDEP_WARN(!cond && !rcu_read_lock_any_held(),		\
 			 "RCU-list traversed in non-reader section!");	\
 	 })
 #else
-#define __list_check_rcu(dummy, cond, ...) ({})
+#define __list_check_rcu(dummy, cond, extra...)				\
+	({ check_arg_count_one(extra); })
 #endif
 
 /*
-- 
2.22.0.510.g264f2c817a-goog


^ permalink raw reply related

* Re: [PATCH 7/9] x86/pci: Pass lockdep condition to pcm_mmcfg_list iterator (v1)
From: Joel Fernandes @ 2019-07-16  4:03 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-kernel, Alexey Kuznetsov, Borislav Petkov, c0d1n61at3,
	David S. Miller, edumazet, Greg Kroah-Hartman, Hideaki YOSHIFUJI,
	H. Peter Anvin, Ingo Molnar, Jonathan Corbet, Josh Triplett,
	keescook, kernel-hardening, kernel-team, Lai Jiangshan, Len Brown,
	linux-acpi, linux-doc, linux-pci, linux-pm, Mathieu Desnoyers,
	neilb, netdev, Oleg Nesterov, Paul E. McKenney, Pavel Machek,
	peterz, Rafael J. Wysocki, Rasmus Villemoes, rcu, Steven Rostedt,
	Tejun Heo, Thomas Gleixner, will,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)
In-Reply-To: <20190715200235.GG46935@google.com>

On Mon, Jul 15, 2019 at 03:02:35PM -0500, Bjorn Helgaas wrote:
> On Mon, Jul 15, 2019 at 10:37:03AM -0400, Joel Fernandes (Google) wrote:
> > The pci_mmcfg_list is traversed with list_for_each_entry_rcu without a
> > reader-lock held, because the pci_mmcfg_lock is already held. Make this
> > known to the list macro so that it fixes new lockdep warnings that
> > trigger due to lockdep checks added to list_for_each_entry_rcu().
> > 
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> 
> Ingo takes care of most patches to this file, but FWIW,
> 
> Acked-by: Bjorn Helgaas <bhelgaas@google.com>

Thanks.

> I would personally prefer if you capitalized the subject to match the
> "x86/PCI:" convention that's used fairly consistently in
> arch/x86/pci/.
> 
> Also, I didn't apply this to be sure, but it looks like this might
> make a line or two wider than 80 columns, which I would rewrap if I
> were applying this.

Updated below is the patch with the nits corrected:

---8<-----------------------

From 73fab09d7e33ca2110c24215f8ed428c12625dbe Mon Sep 17 00:00:00 2001
From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Date: Sat, 1 Jun 2019 15:05:49 -0400
Subject: [PATCH] x86/PCI: Pass lockdep condition to pci_mmcfg_list iterator
 (v1)

The pci_mmcfg_list is traversed with list_for_each_entry_rcu without a
reader-lock held, because the pci_mmcfg_lock is already held. Make this
known to the list macro so that it fixes new lockdep warnings that
trigger due to lockdep checks added to list_for_each_entry_rcu().

Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 arch/x86/pci/mmconfig-shared.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/pci/mmconfig-shared.c b/arch/x86/pci/mmconfig-shared.c
index 7389db538c30..9e3250ec5a37 100644
--- a/arch/x86/pci/mmconfig-shared.c
+++ b/arch/x86/pci/mmconfig-shared.c
@@ -29,6 +29,7 @@
 static bool pci_mmcfg_running_state;
 static bool pci_mmcfg_arch_init_failed;
 static DEFINE_MUTEX(pci_mmcfg_lock);
+#define pci_mmcfg_lock_held() lock_is_held(&(pci_mmcfg_lock).dep_map)
 
 LIST_HEAD(pci_mmcfg_list);
 
@@ -54,7 +55,8 @@ static void list_add_sorted(struct pci_mmcfg_region *new)
 	struct pci_mmcfg_region *cfg;
 
 	/* keep list sorted by segment and starting bus number */
-	list_for_each_entry_rcu(cfg, &pci_mmcfg_list, list) {
+	list_for_each_entry_rcu(cfg, &pci_mmcfg_list, list,
+				pci_mmcfg_lock_held()) {
 		if (cfg->segment > new->segment ||
 		    (cfg->segment == new->segment &&
 		     cfg->start_bus >= new->start_bus)) {
@@ -118,7 +120,8 @@ struct pci_mmcfg_region *pci_mmconfig_lookup(int segment, int bus)
 {
 	struct pci_mmcfg_region *cfg;
 
-	list_for_each_entry_rcu(cfg, &pci_mmcfg_list, list)
+	list_for_each_entry_rcu(cfg, &pci_mmcfg_list, list,
+				pci_mmcfg_lock_held())
 		if (cfg->segment == segment &&
 		    cfg->start_bus <= bus && bus <= cfg->end_bus)
 			return cfg;
-- 
2.22.0.510.g264f2c817a-goog


^ permalink raw reply related
