Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Song Liu @ 2019-08-05  7:36 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kees Cook, Networking, bpf, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Lorenz Bauer, Jann Horn, Greg KH, Linux API,
	LSM List
In-Reply-To: <CALCETrWtE2U4EvZVYeq8pSmQjBzF2PHH+KxYW8FSeF+W=1FYjw@mail.gmail.com>

Hi Andy, 

> On Aug 4, 2019, at 10:47 PM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Sun, Aug 4, 2019 at 5:08 PM Andy Lutomirski <luto@kernel.org> wrote:
>> 
>> On Sun, Aug 4, 2019 at 3:16 PM Andy Lutomirski <luto@kernel.org> wrote:
>>> 
>>> On Fri, Aug 2, 2019 at 12:22 AM Song Liu <songliubraving@fb.com> wrote:
>>>> 
>>>> Hi Andy,
>>>> 
>>>>> I actually agree CAP_BPF_ADMIN makes sense. The hard part is to make
>>>>>> existing tools (setcap, getcap, etc.) and libraries aware of the new CAP.
>>>>> 
>>>>> It's been done before -- it's not that hard.  IMO the main tricky bit
>>>>> would be try be somewhat careful about defining exactly what
>>>>> CAP_BPF_ADMIN does.
>>>> 
>>>> Agreed. I think defining CAP_BPF_ADMIN could be a good topic for the
>>>> Plumbers conference.
>>>> 
>>>> OTOH, I don't think we have to wait for CAP_BPF_ADMIN to allow daemons
>>>> like systemd to do sys_bpf() without root.
>>> 
>>> I don't understand the use case here.  Are you talking about systemd
>>> --user?  As far as I know, a user is expected to be able to fully
>>> control their systemd --user process, so giving it unrestricted bpf
>>> access is very close to giving it superuser access, and this doesn't
>>> sound like a good idea.  I think that, if systemd --user needs bpf(),
>>> it either needs real unprivileged bpf() or it needs a privileged
>>> helper (SUID or a daemon) to intermediate this access.
>>> 
>>>> 
>>>>> 
>>>>>>> I don't see why you need to invent a whole new mechanism for this.
>>>>>>> The entire cgroup ecosystem outside bpf() does just fine using the
>>>>>>> write permission on files in cgroupfs to control access.  Why can't
>>>>>>> bpf() do the same thing?
>>>>>> 
>>>>>> It is easier to use write permission for BPF_PROG_ATTACH. But it is
>>>>>> not easy to do the same for other bpf commands: BPF_PROG_LOAD and
>>>>>> BPF_MAP_*. A lot of these commands don't have target concept. Maybe
>>>>>> we should have target concept for all these commands. But that is a
>>>>>> much bigger project. OTOH, "all or nothing" model allows all these
>>>>>> commands at once.
>>>>> 
>>>>> For BPF_PROG_LOAD, I admit I've never understood why permission is
>>>>> required at all.  I think that CAP_SYS_ADMIN or similar should be
>>>>> needed to get is_priv in the verifier, but I think that should mainly
>>>>> be useful for tracing, and that requires lots of privilege anyway.
>>>>> BPF_MAP_* is probably the trickiest part.  One solution would be some
>>>>> kind of bpffs, but I'm sure other solutions are possible.
>>>> 
>>>> Improving permission management of cgroup_bpf is another good topic to
>>>> discuss. However, it is also an overkill for current use case.
>>>> 
>>> 
>>> I looked at the code some more, and I don't think this is so hard
>>> after all.  As I understand it, all of the map..by_id stuff is, to
>>> some extent, deprecated in favor of persistent maps.  As I see it, the
>>> map..by_id calls should require privilege forever, although I can
>>> imagine ways to scope that privilege to a namespace if the maps
>>> themselves were to be scoped to a namespace.
>>> 
>>> Instead, unprivileged tools would use the persistent map interface
>>> roughly like this:
>>> 
>>> $ bpftool map create /sys/fs/bpf/my_dir/filename type hash key 8 value
>>> 8 entries 64 name mapname
>>> 
>>> This would require that the caller have either CAP_DAC_OVERRIDE or
>>> that the caller have permission to create files in /sys/fs/bpf/my_dir
>>> (using the same rules as for any filesystem), and the resulting map
>>> would end up owned by the creating user and have mode 0600 (or maybe
>>> 0666, or maybe a new bpf_attr parameter) modified by umask.  Then all
>>> the various capable() checks that are currently involved in accessing
>>> a persistent map would instead check FMODE_READ or FMODE_WRITE on the
>>> map file as appropriate.
>>> 
>>> Half of this stuff already works.  I just set my system up like this:
>>> 
>>> $ ls -l /sys/fs/bpf
>>> total 0
>>> drwxr-xr-x. 3 luto luto 0 Aug  4 15:10 luto
>>> 
>>> $ mkdir /sys/fs/bpf/luto/test
>>> 
>>> $ ls -l /sys/fs/bpf/luto
>>> total 0
>>> drwxrwxr-x. 2 luto luto 0 Aug  4 15:10 test
>>> 
>>> I bet that making the bpf() syscalls work appropriately in this
>>> context without privilege would only be a couple of hours of work.
>>> The hard work, creating bpffs and making it function, is already done
>>> :)
>>> 
>>> P.S. The docs for bpftool create are less than fantastic.  The
>>> complete lack of any error message at all when the syscall returns
>>> -EACCES is also not fantastic.
>> 
>> This isn't remotely finished, but I spent a bit of time fiddling with this:
>> 
>> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=bpf/perms
>> 
>> What do you think?  (It's obviously not done.  It doesn't compile, and
>> I haven't gotten to the permissions needed to do map operations.  I
>> also haven't touched the capable() checks.)
> 
> I updated the branch.  It compiles, and basic map functionality works!

Thanks a lot for trying this out. This is a very interesting direction
that we will explore. 

> 
> # mount -t bpf bpf /sys/fs/bpf
> # cd /sys/fs/bpf
> # mkdir luto
> # chown luto: luto
> # setpriv --euid=1000 --ruid=1000 bash
> $ pwd
> /sys/fs/bpf
> bash-5.0$ ls -l
> total 0
> drwxr-xr-x 2 luto luto 0 Aug  4 22:41 luto
> bash-5.0$ bpftool map create /sys/fs/bpf/luto/filename type hash key 8
> value 8 entries 64 name mapname
> bash-5.0$ bpftool map dump pinned /sys/fs/bpf/luto/filename
> Found 0 elements
> 
> # chown root: /sys/fs/bpf/luto/filename
> 
> $ bpftool map dump pinned /sys/fs/bpf/luto/filename
> Error: bpf obj get (/sys/fs/bpf/luto): Permission denied
> 
> So I think it's possible to get a respectable subset of bpf()
> functionality working without privilege in short order :)

I think we have two key questions to answer: 
  1. What subset of bpf() functionality will the users need?
  2. Who are the users? 

Different answers to these two questions lead to different directions.


In our use case, the answers are 
  1) almost all bpf() functionality
  2) highly trusted users (sudoers)

So our initial approach of /dev/bpf allows all bpf() functionality
in one bit in task_struct. (Yes, we can just sudo. But, we would 
rather not use sudo when possible.)


"cgroup management" use case may have answers like:
  1) cgroup_bpf only
  2) users in their own containers

For this case, getting cgroup_bpf related features (cgroup_bpf progs; 
some map types, etc.) work with unprivileged users would be the right 
direction. 


"USDT tracing" use case may have answers like:
  1) uprobe, stockmap, histogram, etc.
  2) unprivileged user, w/ or w/o containers

For this case, the first step is likely hacking sys_perf_event_open(). 


I guess we will need more discussions to decide how to make bpf() 
work better for all these (and more) use cases. 

Thanks,
Song


^ permalink raw reply

* Re: [PATCH ethtool] ethtool: dump nested registers
From: Michal Kubecek @ 2019-08-05  8:04 UTC (permalink / raw)
  To: netdev; +Cc: Vivien Didelot, f.fainelli, andrew, davem, linville, cphealy
In-Reply-To: <20190802193455.17126-1-vivien.didelot@gmail.com>

On Fri, Aug 02, 2019 at 03:34:54PM -0400, Vivien Didelot wrote:
> Usually kernel drivers set the regs->len value to the same length as
> info->regdump_len, which was used for the allocation. In case where
> regs->len is smaller than the allocated info->regdump_len length,
> we may assume that the dump contains a nested set of registers.
> 
> This becomes handy for kernel drivers to expose registers of an
> underlying network conduit unfortunately not exposed to userspace,
> as found in network switching equipment for example.
> 
> This patch adds support for recursing into the dump operation if there
> is enough room for a nested ethtool_drvinfo structure containing a
> valid driver name, followed by a ethtool_regs structure like this:
> 
>     0      regs->len                        info->regdump_len
>     v              v                                        v
>     +--------------+-----------------+--------------+-- - --+
>     | ethtool_regs | ethtool_drvinfo | ethtool_regs |       |
>     +--------------+-----------------+--------------+-- - --+
> 
> Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
> ---

I'm not sure about this approach. If these additional objects with their
own registers are represented by a network device, we can query their
registers directly. If they are not (which, IIUC, is the case in your
use case), we should use an appropriate interface. AFAIK the CPU ports
are already represented in devlink, shouldn't devlink be also used to
query their registers?

>  ethtool.c | 13 +++++++++++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/ethtool.c b/ethtool.c
> index 05fe05a08..c0e2903c5 100644
> --- a/ethtool.c
> +++ b/ethtool.c
> @@ -1245,7 +1245,7 @@ static int dump_regs(int gregs_dump_raw, int gregs_dump_hex,
>  
>  	if (gregs_dump_raw) {
>  		fwrite(regs->data, regs->len, 1, stdout);
> -		return 0;
> +		goto nested;
>  	}
>  
>  	if (!gregs_dump_hex)
> @@ -1253,7 +1253,7 @@ static int dump_regs(int gregs_dump_raw, int gregs_dump_hex,
>  			if (!strncmp(driver_list[i].name, info->driver,
>  				     ETHTOOL_BUSINFO_LEN)) {
>  				if (driver_list[i].func(info, regs) == 0)
> -					return 0;
> +					goto nested;
>  				/* This version (or some other
>  				 * variation in the dump format) is
>  				 * not handled; fall back to hex
> @@ -1263,6 +1263,15 @@ static int dump_regs(int gregs_dump_raw, int gregs_dump_hex,
>  
>  	dump_hex(stdout, regs->data, regs->len, 0);
>  
> +nested:
> +	/* Recurse dump if some drvinfo and regs structures are nested */
> +	if (info->regdump_len > regs->len + sizeof(*info) + sizeof(*regs)) {
> +		info = (struct ethtool_drvinfo *)(&regs->data[0] + regs->len);
> +		regs = (struct ethtool_regs *)(&regs->data[0] + regs->len + sizeof(*info));
> +
> +		return dump_regs(gregs_dump_raw, gregs_dump_hex, info, regs);
> +	}
> +
>  	return 0;
>  }
>  

For raw and hex dumps, this will dump only the payloads without any
metadata allowing to identify what are the additional blocks for the
other related objects, i.e. where they start, how long they are and what
they belong to. That doesn't seem very useful.

Michal Kubecek

^ permalink raw reply

* Re: [PATCH v3] net: sched: Fix a possible null-pointer dereference in dequeue_func()
From: Eric Dumazet @ 2019-08-05  8:05 UTC (permalink / raw)
  To: Cong Wang, Jia-Ju Bai
  Cc: Jamal Hadi Salim, Jiri Pirko, David Miller,
	Linux Kernel Network Developers, LKML
In-Reply-To: <CAM_iQpU0L6hgzg1N9Ay=72Ee-2Ni-ovPJX8SyJaRDv7dbhZs_Q@mail.gmail.com>



On 7/29/19 7:23 PM, Cong Wang wrote:
> On Mon, Jul 29, 2019 at 1:24 AM Jia-Ju Bai <baijiaju1990@gmail.com> wrote:
>>
>> In dequeue_func(), there is an if statement on line 74 to check whether
>> skb is NULL:
>>     if (skb)
>>
>> When skb is NULL, it is used on line 77:
>>     prefetch(&skb->end);
>>
>> Thus, a possible null-pointer dereference may occur.
>>
>> To fix this bug, skb->end is used when skb is not NULL.
>>
>> This bug is found by a static analysis tool STCheck written by us.
> 
> It doesn't dereference the pointer, it simply calculates the address
> and passes it to gcc builtin prefetch. Both are fine when it is NULL,
> as prefetching a NULL address should be okay for kernel.
> 
> So your changelog is misleading and it doesn't fix any bug,
> although it does very slightly make the code better.
> 

Original code was intentional.

A prefetch() need to be done as early as possible.

At the time of commit 76e3cc126bb223013a6b9a0e2a51238d1ef2e409
this was pretty clear :

+static struct sk_buff *dequeue(struct codel_vars *vars, struct Qdisc *sch)
+{
+	struct sk_buff *skb = __skb_dequeue(&sch->q);
+
+	prefetch(&skb->end); /* we'll need skb_shinfo() */
+	return skb;
+}


Really static analysis should learn about prefetch(X) being legal for _any_ value of X

^ permalink raw reply

* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
From: Jason Wang @ 2019-08-05  8:18 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, virtualization, netdev, linux-kernel, linux-mm, jgg,
	Paul E. McKenney
In-Reply-To: <20190803173825-mutt-send-email-mst@kernel.org>


On 2019/8/4 上午5:54, Michael S. Tsirkin wrote:
> On Thu, Aug 01, 2019 at 04:06:13AM -0400, Jason Wang wrote:
>> On 2019/8/1 上午2:29, Michael S. Tsirkin wrote:
>>> On Wed, Jul 31, 2019 at 04:46:53AM -0400, Jason Wang wrote:
>>>> We used to use RCU to synchronize MMU notifier with worker. This leads
>>>> calling synchronize_rcu() in invalidate_range_start(). But on a busy
>>>> system, there would be many factors that may slow down the
>>>> synchronize_rcu() which makes it unsuitable to be called in MMU
>>>> notifier.
>>>>
>>>> A solution is SRCU but its overhead is obvious with the expensive full
>>>> memory barrier. Another choice is to use seqlock, but it doesn't
>>>> provide a synchronization method between readers and writers. The last
>>>> choice is to use vq mutex, but it need to deal with the worst case
>>>> that MMU notifier must be blocked and wait for the finish of swap in.
>>>>
>>>> So this patch switches use a counter to track whether or not the map
>>>> was used. The counter was increased when vq try to start or finish
>>>> uses the map. This means, when it was even, we're sure there's no
>>>> readers and MMU notifier is synchronized. When it was odd, it means
>>>> there's a reader we need to wait it to be even again then we are
>>>> synchronized. To avoid full memory barrier, store_release +
>>>> load_acquire on the counter is used.
>>> Unfortunately this needs a lot of review and testing, so this can't make
>>> rc2, and I don't think this is the kind of patch I can merge after rc3.
>>> Subtle memory barrier tricks like this can introduce new bugs while they
>>> are fixing old ones.
>> I admit the patch is tricky. Some questions:
>>
>> - Do we must address the case of e.g swap in? If not, a simple
>>    vhost_work_flush() instead of synchronize_rcu() may work.
>> - Having some hard thought, I think we can use seqlock, it looks
>>    to me smp_wmb() is in write_segcount_begin() is sufficient, we don't
>>    care vq->map read before smp_wmb(), and for the other we all have
>>    good data devendency so smp_wmb() in the write_seqbegin_end() is
>>    sufficient.
> If we need an mb in the begin() we can switch to
> dependent_ptr_mb. if you need me to fix it up
> and repost, let me know.


Yes, but please let me figure out whether mb is really necessary here.


>
> Why isn't it a problem if the map is
> accessed outside the lock?


Correct me if I was wrong. E.g for vhost_put_avail_event()

vhost_vq_access_map_begin(vq);

                 map = vq->maps[VHOST_ADDR_USED];
                 if (likely(map)) {
                         used = map->addr;
                         *((__virtio16 *)&used->ring[vq->num]) =
                                 cpu_to_vhost16(vq, vq->avail_idx);
vhost_vq_access_map_end(vq);
                         return 0;
}

                 vhost_vq_access_map_end(vq);


We dont' care whether map is accessed before vhost_vq_access_map_begin() 
since MMU notifier can only change map from non-NULL to NULL. If we read 
it too early, we will only get NULL and won't use the map at all. And 
smp_wmb() in vhost_vq_access_map_begin() can make sure the real access 
to map->addr is done after we increasing the counter.


>
>
>
>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>> index db2c81cb1e90..6d9501303258 100644
>> --- a/drivers/vhost/vhost.c
>> +++ b/drivers/vhost/vhost.c
>> @@ -363,39 +363,29 @@ static bool vhost_map_range_overlap(struct vhost_uaddr *uaddr,
>>   
>>   static void inline vhost_vq_access_map_begin(struct vhost_virtqueue *vq)
>>   {
>> -	int ref = READ_ONCE(vq->ref);
>> -
>> -	smp_store_release(&vq->ref, ref + 1);
>> -	/* Make sure ref counter is visible before accessing the map */
>> -	smp_load_acquire(&vq->ref);
>> +	write_seqcount_begin(&vq->seq);
>>   }
>>   
>>   static void inline vhost_vq_access_map_end(struct vhost_virtqueue *vq)
>>   {
>> -	int ref = READ_ONCE(vq->ref);
>> -
>> -	/* Make sure vq access is done before increasing ref counter */
>> -	smp_store_release(&vq->ref, ref + 1);
>> +	write_seqcount_end(&vq->seq);
>>   }
>>   
>>   static void inline vhost_vq_sync_access(struct vhost_virtqueue *vq)
>>   {
>> -	int ref;
>> +	unsigned int ret;
>>   
>>   	/* Make sure map change was done before checking ref counter */
>>   	smp_mb();
>> -
>> -	ref = READ_ONCE(vq->ref);
>> -	if (ref & 0x1) {
>> -		/* When ref change, we are sure no reader can see
>> +	ret = raw_read_seqcount(&vq->seq);
>> +	if (ret & 0x1) {
>> +		/* When seq changes, we are sure no reader can see
>>   		 * previous map */
>> -		while (READ_ONCE(vq->ref) == ref) {
>> -			set_current_state(TASK_RUNNING);
>> +		while (raw_read_seqcount(&vq->seq) == ret)
>>   			schedule();
>
> So why do we set state here?


No need, just a artifact of previous patch.


> And should not we
> check need_sched?


We need use need_sched().


>
>
>> -		}
>>   	}
>> -	/* Make sure ref counter was checked before any other
>> -	 * operations that was dene on map. */
>> +	/* Make sure seq was checked before any other operations that
>> +	 * was dene on map. */
>>   	smp_mb();
>>   }
>>   
>> @@ -691,7 +681,7 @@ void vhost_dev_init(struct vhost_dev *dev,
>>   		vq->indirect = NULL;
>>   		vq->heads = NULL;
>>   		vq->dev = dev;
>> -		vq->ref = 0;
>> +		seqcount_init(&vq->seq);
>>   		mutex_init(&vq->mutex);
>>   		spin_lock_init(&vq->mmu_lock);
>>   		vhost_vq_reset(dev, vq);
>> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
>> index 3d10da0ae511..1a705e181a84 100644
>> --- a/drivers/vhost/vhost.h
>> +++ b/drivers/vhost/vhost.h
>> @@ -125,7 +125,7 @@ struct vhost_virtqueue {
>>   	 */
>>   	struct vhost_uaddr uaddrs[VHOST_NUM_ADDRS];
>>   #endif
>> -	int ref;
>> +	seqcount_t seq;
>>   	const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS];
>>   
>>   	struct file *kick;
>> -- 
>> 2.18.1
>>
>>>
>>>
>>>
>>>
>>>> Consider the read critical section is pretty small the synchronization
>>>> should be done very fast.
>>>>
>>>> Note the patch lead about 3% PPS dropping.
>>> Sorry what do you mean by this last sentence? This degrades performance
>>> compared to what?
>> Compare to without this patch.
> OK is the feature still a performance win? or should we drop it for now?


Still a win, just a drop from 23% improvement to 20% improvement.


>>>> Reported-by: Michael S. Tsirkin <mst@redhat.com>
>>>> Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address")
>>>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>>>> ---
>>>>   drivers/vhost/vhost.c | 145 ++++++++++++++++++++++++++----------------
>>>>   drivers/vhost/vhost.h |   7 +-
>>>>   2 files changed, 94 insertions(+), 58 deletions(-)
>>>>
>>>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>>>> index cfc11f9ed9c9..db2c81cb1e90 100644
>>>> --- a/drivers/vhost/vhost.c
>>>> +++ b/drivers/vhost/vhost.c
>>>> @@ -324,17 +324,16 @@ static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq)
>>>>   
>>>>   	spin_lock(&vq->mmu_lock);
>>>>   	for (i = 0; i < VHOST_NUM_ADDRS; i++) {
>>>> -		map[i] = rcu_dereference_protected(vq->maps[i],
>>>> -				  lockdep_is_held(&vq->mmu_lock));
>>>> +		map[i] = vq->maps[i];
>>>>   		if (map[i]) {
>>>>   			vhost_set_map_dirty(vq, map[i], i);
>>>> -			rcu_assign_pointer(vq->maps[i], NULL);
>>>> +			vq->maps[i] = NULL;
>>>>   		}
>>>>   	}
>>>>   	spin_unlock(&vq->mmu_lock);
>>>>   
>>>> -	/* No need for synchronize_rcu() or kfree_rcu() since we are
>>>> -	 * serialized with memory accessors (e.g vq mutex held).
>>>> +	/* No need for synchronization since we are serialized with
>>>> +	 * memory accessors (e.g vq mutex held).
>>>>   	 */
>>>>   
>>>>   	for (i = 0; i < VHOST_NUM_ADDRS; i++)
>>>> @@ -362,6 +361,44 @@ static bool vhost_map_range_overlap(struct vhost_uaddr *uaddr,
>>>>   	return !(end < uaddr->uaddr || start > uaddr->uaddr - 1 + uaddr->size);
>>>>   }
>>>>   
>>>> +static void inline vhost_vq_access_map_begin(struct vhost_virtqueue *vq)
>>>> +{
>>>> +	int ref = READ_ONCE(vq->ref);
>>>> +
>>>> +	smp_store_release(&vq->ref, ref + 1);
>>>> +	/* Make sure ref counter is visible before accessing the map */
>>>> +	smp_load_acquire(&vq->ref);
>>> The map access is after this sequence, correct?
>> Yes.
>>
>>> Just going by the rules in Documentation/memory-barriers.txt,
>>> I think that this pair will not order following accesses with ref store.
>>>
>>> Documentation/memory-barriers.txt says:
>>>
>>>
>>> +     In addition, a RELEASE+ACQUIRE
>>> +     pair is -not- guaranteed to act as a full memory barrier.
>>>
>>>
>>>
>>> The guarantee that is made is this:
>>> 	after
>>>       an ACQUIRE on a given variable, all memory accesses preceding any prior
>>>       RELEASE on that same variable are guaranteed to be visible.
>> Yes, but it's not clear about the order of ACQUIRE the same location
>> of previous RELEASE. And it only has a example like:
>>
>> "
>> 	*A = a;
>> 	RELEASE M
>> 	ACQUIRE N
>> 	*B = b;
>>
>> could occur as:
>>
>> 	ACQUIRE N, STORE *B, STORE *A, RELEASE M
>> "
>>
>> But it doesn't explain what happen when
>>
>> *A = a
>> RELEASE M
>> ACQUIRE M
>> *B = b;
>>
>> And tools/memory-model/Documentation said
>>
>> "
>> First, when a lock-acquire reads from a lock-release, the LKMM
>> requires that every instruction po-before the lock-release must
>> execute before any instruction po-after the lock-acquire.
>> "
>>
>> Is this a hint that I was correct?
> I don't think it's correct since by this logic
> memory barriers can be nops on x86.


It not a nop, instead, it goes to a write and then read to one same 
location of memory.


>
>>>
>>> And if we also had the reverse rule we'd end up with a full barrier,
>>> won't we?
>>>
>>> Cc Paul in case I missed something here. And if I'm right,
>>> maybe we should call this out, adding
>>>
>>> 	"The opposite is not true: a prior RELEASE is not
>>> 	 guaranteed to be visible before memory accesses following
>>> 	 the subsequent ACQUIRE".
>> That kinds of violates the RELEASE?
>>
>> "
>>       This also acts as a one-way permeable barrier.  It guarantees that all
>>       memory operations before the RELEASE operation will appear to happen
>>       before the RELEASE operation with respect to the other components of the
>> "
>
> yes but we are talking about RELEASE itself versus stuff
> that comes after it.


Unless RELEASE and ACQUIRE on the same address can be reordered (at 
least doesn't happen x86). The following ACQUIRE can make sure stuff 
after ACQUIRE id done after RELEASE.


>
>>>
>>>
>>>> +}
>>>> +
>>>> +static void inline vhost_vq_access_map_end(struct vhost_virtqueue *vq)
>>>> +{
>>>> +	int ref = READ_ONCE(vq->ref);
>>>> +
>>>> +	/* Make sure vq access is done before increasing ref counter */
>>>> +	smp_store_release(&vq->ref, ref + 1);
>>>> +}
>>>> +
>>>> +static void inline vhost_vq_sync_access(struct vhost_virtqueue *vq)
>>>> +{
>>>> +	int ref;
>>>> +
>>>> +	/* Make sure map change was done before checking ref counter */
>>>> +	smp_mb();
>>>> +
>>>> +	ref = READ_ONCE(vq->ref);
>>>> +	if (ref & 0x1) {
>>> Please document the even/odd trick here too, not just in the commit log.
>>>
>> Ok.
>>
>>>> +		/* When ref change,
>>> changes
>>>
>>>> we are sure no reader can see
>>>> +		 * previous map */
>>>> +		while (READ_ONCE(vq->ref) == ref) {
>>>
>>> what is the below line in aid of?
>>>
>>>> +			set_current_state(TASK_RUNNING);
> any answers here?


It's unecessary.


>
>>>> +			schedule();
>>>                          if (need_resched())
>>>                                  schedule();
>>>
>>> ?
>> Yes, better.
>>
>>>> +		}
>>> On an interruptible kernel, there's a risk here is that
>>> a task got preempted with an odd ref.
>>> So I suspect we'll have to disable preemption when we
>>> make ref odd.
>> I'm not sure I get, if the odd is not the original value we read,
>> we're sure it won't read the new map here I believe.
> But we will spin for a very long time in this case.


Yes, but do we disable preemption in MMU notifier callback. If not it 
should be the same as MMU notifier was preempted for long time.

We can disable the preempt count here, but it needs an extra cacheline.

Thanks


>
>>>
>>>> +	}
>>>> +	/* Make sure ref counter was checked before any other
>>>> +	 * operations that was dene on map. */
>>> was dene -> were done?
>>>
>> Yes.
>>
>>>> +	smp_mb();
>>>> +}
>>>> +
>>>>   static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
>>>>   				      int index,
>>>>   				      unsigned long start,
>>>> @@ -376,16 +413,15 @@ static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
>>>>   	spin_lock(&vq->mmu_lock);
>>>>   	++vq->invalidate_count;
>>>>   
>>>> -	map = rcu_dereference_protected(vq->maps[index],
>>>> -					lockdep_is_held(&vq->mmu_lock));
>>>> +	map = vq->maps[index];
>>>>   	if (map) {
>>>>   		vhost_set_map_dirty(vq, map, index);
>>>> -		rcu_assign_pointer(vq->maps[index], NULL);
>>>> +		vq->maps[index] = NULL;
>>>>   	}
>>>>   	spin_unlock(&vq->mmu_lock);
>>>>   
>>>>   	if (map) {
>>>> -		synchronize_rcu();
>>>> +		vhost_vq_sync_access(vq);
>>>>   		vhost_map_unprefetch(map);
>>>>   	}
>>>>   }
>>>> @@ -457,7 +493,7 @@ static void vhost_init_maps(struct vhost_dev *dev)
>>>>   	for (i = 0; i < dev->nvqs; ++i) {
>>>>   		vq = dev->vqs[i];
>>>>   		for (j = 0; j < VHOST_NUM_ADDRS; j++)
>>>> -			RCU_INIT_POINTER(vq->maps[j], NULL);
>>>> +			vq->maps[j] = NULL;
>>>>   	}
>>>>   }
>>>>   #endif
>>>> @@ -655,6 +691,7 @@ void vhost_dev_init(struct vhost_dev *dev,
>>>>   		vq->indirect = NULL;
>>>>   		vq->heads = NULL;
>>>>   		vq->dev = dev;
>>>> +		vq->ref = 0;
>>>>   		mutex_init(&vq->mutex);
>>>>   		spin_lock_init(&vq->mmu_lock);
>>>>   		vhost_vq_reset(dev, vq);
>>>> @@ -921,7 +958,7 @@ static int vhost_map_prefetch(struct vhost_virtqueue *vq,
>>>>   	map->npages = npages;
>>>>   	map->pages = pages;
>>>>   
>>>> -	rcu_assign_pointer(vq->maps[index], map);
>>>> +	vq->maps[index] = map;
>>>>   	/* No need for a synchronize_rcu(). This function should be
>>>>   	 * called by dev->worker so we are serialized with all
>>>>   	 * readers.
>>>> @@ -1216,18 +1253,18 @@ static inline int vhost_put_avail_event(struct vhost_virtqueue *vq)
>>>>   	struct vring_used *used;
>>>>   
>>>>   	if (!vq->iotlb) {
>>>> -		rcu_read_lock();
>>>> +		vhost_vq_access_map_begin(vq);
>>>>   
>>>> -		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
>>>> +		map = vq->maps[VHOST_ADDR_USED];
>>>>   		if (likely(map)) {
>>>>   			used = map->addr;
>>>>   			*((__virtio16 *)&used->ring[vq->num]) =
>>>>   				cpu_to_vhost16(vq, vq->avail_idx);
>>>> -			rcu_read_unlock();
>>>> +			vhost_vq_access_map_end(vq);
>>>>   			return 0;
>>>>   		}
>>>>   
>>>> -		rcu_read_unlock();
>>>> +		vhost_vq_access_map_end(vq);
>>>>   	}
>>>>   #endif
>>>>   
>>>> @@ -1245,18 +1282,18 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
>>>>   	size_t size;
>>>>   
>>>>   	if (!vq->iotlb) {
>>>> -		rcu_read_lock();
>>>> +		vhost_vq_access_map_begin(vq);
>>>>   
>>>> -		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
>>>> +		map = vq->maps[VHOST_ADDR_USED];
>>>>   		if (likely(map)) {
>>>>   			used = map->addr;
>>>>   			size = count * sizeof(*head);
>>>>   			memcpy(used->ring + idx, head, size);
>>>> -			rcu_read_unlock();
>>>> +			vhost_vq_access_map_end(vq);
>>>>   			return 0;
>>>>   		}
>>>>   
>>>> -		rcu_read_unlock();
>>>> +		vhost_vq_access_map_end(vq);
>>>>   	}
>>>>   #endif
>>>>   
>>>> @@ -1272,17 +1309,17 @@ static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
>>>>   	struct vring_used *used;
>>>>   
>>>>   	if (!vq->iotlb) {
>>>> -		rcu_read_lock();
>>>> +		vhost_vq_access_map_begin(vq);
>>>>   
>>>> -		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
>>>> +		map = vq->maps[VHOST_ADDR_USED];
>>>>   		if (likely(map)) {
>>>>   			used = map->addr;
>>>>   			used->flags = cpu_to_vhost16(vq, vq->used_flags);
>>>> -			rcu_read_unlock();
>>>> +			vhost_vq_access_map_end(vq);
>>>>   			return 0;
>>>>   		}
>>>>   
>>>> -		rcu_read_unlock();
>>>> +		vhost_vq_access_map_end(vq);
>>>>   	}
>>>>   #endif
>>>>   
>>>> @@ -1298,17 +1335,17 @@ static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)
>>>>   	struct vring_used *used;
>>>>   
>>>>   	if (!vq->iotlb) {
>>>> -		rcu_read_lock();
>>>> +		vhost_vq_access_map_begin(vq);
>>>>   
>>>> -		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
>>>> +		map = vq->maps[VHOST_ADDR_USED];
>>>>   		if (likely(map)) {
>>>>   			used = map->addr;
>>>>   			used->idx = cpu_to_vhost16(vq, vq->last_used_idx);
>>>> -			rcu_read_unlock();
>>>> +			vhost_vq_access_map_end(vq);
>>>>   			return 0;
>>>>   		}
>>>>   
>>>> -		rcu_read_unlock();
>>>> +		vhost_vq_access_map_end(vq);
>>>>   	}
>>>>   #endif
>>>>   
>>>> @@ -1362,17 +1399,17 @@ static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
>>>>   	struct vring_avail *avail;
>>>>   
>>>>   	if (!vq->iotlb) {
>>>> -		rcu_read_lock();
>>>> +		vhost_vq_access_map_begin(vq);
>>>>   
>>>> -		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
>>>> +		map = vq->maps[VHOST_ADDR_AVAIL];
>>>>   		if (likely(map)) {
>>>>   			avail = map->addr;
>>>>   			*idx = avail->idx;
>>>> -			rcu_read_unlock();
>>>> +			vhost_vq_access_map_end(vq);
>>>>   			return 0;
>>>>   		}
>>>>   
>>>> -		rcu_read_unlock();
>>>> +		vhost_vq_access_map_end(vq);
>>>>   	}
>>>>   #endif
>>>>   
>>>> @@ -1387,17 +1424,17 @@ static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
>>>>   	struct vring_avail *avail;
>>>>   
>>>>   	if (!vq->iotlb) {
>>>> -		rcu_read_lock();
>>>> +		vhost_vq_access_map_begin(vq);
>>>>   
>>>> -		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
>>>> +		map = vq->maps[VHOST_ADDR_AVAIL];
>>>>   		if (likely(map)) {
>>>>   			avail = map->addr;
>>>>   			*head = avail->ring[idx & (vq->num - 1)];
>>>> -			rcu_read_unlock();
>>>> +			vhost_vq_access_map_end(vq);
>>>>   			return 0;
>>>>   		}
>>>>   
>>>> -		rcu_read_unlock();
>>>> +		vhost_vq_access_map_end(vq);
>>>>   	}
>>>>   #endif
>>>>   
>>>> @@ -1413,17 +1450,17 @@ static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
>>>>   	struct vring_avail *avail;
>>>>   
>>>>   	if (!vq->iotlb) {
>>>> -		rcu_read_lock();
>>>> +		vhost_vq_access_map_begin(vq);
>>>>   
>>>> -		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
>>>> +		map = vq->maps[VHOST_ADDR_AVAIL];
>>>>   		if (likely(map)) {
>>>>   			avail = map->addr;
>>>>   			*flags = avail->flags;
>>>> -			rcu_read_unlock();
>>>> +			vhost_vq_access_map_end(vq);
>>>>   			return 0;
>>>>   		}
>>>>   
>>>> -		rcu_read_unlock();
>>>> +		vhost_vq_access_map_end(vq);
>>>>   	}
>>>>   #endif
>>>>   
>>>> @@ -1438,15 +1475,15 @@ static inline int vhost_get_used_event(struct vhost_virtqueue *vq,
>>>>   	struct vring_avail *avail;
>>>>   
>>>>   	if (!vq->iotlb) {
>>>> -		rcu_read_lock();
>>>> -		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
>>>> +		vhost_vq_access_map_begin(vq);
>>>> +		map = vq->maps[VHOST_ADDR_AVAIL];
>>>>   		if (likely(map)) {
>>>>   			avail = map->addr;
>>>>   			*event = (__virtio16)avail->ring[vq->num];
>>>> -			rcu_read_unlock();
>>>> +			vhost_vq_access_map_end(vq);
>>>>   			return 0;
>>>>   		}
>>>> -		rcu_read_unlock();
>>>> +		vhost_vq_access_map_end(vq);
>>>>   	}
>>>>   #endif
>>>>   
>>>> @@ -1461,17 +1498,17 @@ static inline int vhost_get_used_idx(struct vhost_virtqueue *vq,
>>>>   	struct vring_used *used;
>>>>   
>>>>   	if (!vq->iotlb) {
>>>> -		rcu_read_lock();
>>>> +		vhost_vq_access_map_begin(vq);
>>>>   
>>>> -		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
>>>> +		map = vq->maps[VHOST_ADDR_USED];
>>>>   		if (likely(map)) {
>>>>   			used = map->addr;
>>>>   			*idx = used->idx;
>>>> -			rcu_read_unlock();
>>>> +			vhost_vq_access_map_end(vq);
>>>>   			return 0;
>>>>   		}
>>>>   
>>>> -		rcu_read_unlock();
>>>> +		vhost_vq_access_map_end(vq);
>>>>   	}
>>>>   #endif
>>>>   
>>>> @@ -1486,17 +1523,17 @@ static inline int vhost_get_desc(struct vhost_virtqueue *vq,
>>>>   	struct vring_desc *d;
>>>>   
>>>>   	if (!vq->iotlb) {
>>>> -		rcu_read_lock();
>>>> +		vhost_vq_access_map_begin(vq);
>>>>   
>>>> -		map = rcu_dereference(vq->maps[VHOST_ADDR_DESC]);
>>>> +		map = vq->maps[VHOST_ADDR_DESC];
>>>>   		if (likely(map)) {
>>>>   			d = map->addr;
>>>>   			*desc = *(d + idx);
>>>> -			rcu_read_unlock();
>>>> +			vhost_vq_access_map_end(vq);
>>>>   			return 0;
>>>>   		}
>>>>   
>>>> -		rcu_read_unlock();
>>>> +		vhost_vq_access_map_end(vq);
>>>>   	}
>>>>   #endif
>>>>   
>>>> @@ -1843,13 +1880,11 @@ static bool iotlb_access_ok(struct vhost_virtqueue *vq,
>>>>   #if VHOST_ARCH_CAN_ACCEL_UACCESS
>>>>   static void vhost_vq_map_prefetch(struct vhost_virtqueue *vq)
>>>>   {
>>>> -	struct vhost_map __rcu *map;
>>>> +	struct vhost_map *map;
>>>>   	int i;
>>>>   
>>>>   	for (i = 0; i < VHOST_NUM_ADDRS; i++) {
>>>> -		rcu_read_lock();
>>>> -		map = rcu_dereference(vq->maps[i]);
>>>> -		rcu_read_unlock();
>>>> +		map = vq->maps[i];
>>>>   		if (unlikely(!map))
>>>>   			vhost_map_prefetch(vq, i);
>>>>   	}
>>>> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
>>>> index a9a2a93857d2..f9e9558a529d 100644
>>>> --- a/drivers/vhost/vhost.h
>>>> +++ b/drivers/vhost/vhost.h
>>>> @@ -115,16 +115,17 @@ struct vhost_virtqueue {
>>>>   #if VHOST_ARCH_CAN_ACCEL_UACCESS
>>>>   	/* Read by memory accessors, modified by meta data
>>>>   	 * prefetching, MMU notifier and vring ioctl().
>>>> -	 * Synchonrized through mmu_lock (writers) and RCU (writers
>>>> -	 * and readers).
>>>> +	 * Synchonrized through mmu_lock (writers) and ref counters,
>>>> +	 * see vhost_vq_access_map_begin()/vhost_vq_access_map_end().
>>>>   	 */
>>>> -	struct vhost_map __rcu *maps[VHOST_NUM_ADDRS];
>>>> +	struct vhost_map *maps[VHOST_NUM_ADDRS];
>>>>   	/* Read by MMU notifier, modified by vring ioctl(),
>>>>   	 * synchronized through MMU notifier
>>>>   	 * registering/unregistering.
>>>>   	 */
>>>>   	struct vhost_uaddr uaddrs[VHOST_NUM_ADDRS];
>>>>   #endif
>>>> +	int ref;
>>> Is it important that this is signed? If not I'd do unsigned here:
>>> even though kernel does compile with 2s complement sign overflow,
>>> it seems cleaner not to depend on that.
>> Not a must, let me fix.
>>
>> Thanks
>>
>>>>   	const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS];
>>>>   
>>>>   	struct file *kick;
>>>> -- 
>>>> 2.18.1

^ permalink raw reply

* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
From: Jason Wang @ 2019-08-05  8:21 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Gunthorpe, kvm, virtualization, netdev, linux-kernel,
	linux-mm
In-Reply-To: <20190805020752-mutt-send-email-mst@kernel.org>


On 2019/8/5 下午2:28, Michael S. Tsirkin wrote:
> On Mon, Aug 05, 2019 at 12:33:45PM +0800, Jason Wang wrote:
>> On 2019/8/2 下午10:03, Michael S. Tsirkin wrote:
>>> On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote:
>>>> Btw, I come up another idea, that is to disable preemption when vhost thread
>>>> need to access the memory. Then register preempt notifier and if vhost
>>>> thread is preempted, we're sure no one will access the memory and can do the
>>>> cleanup.
>>> Great, more notifiers :(
>>>
>>> Maybe can live with
>>> 1- disable preemption while using the cached pointer
>>> 2- teach vhost to recover from memory access failures,
>>>      by switching to regular from/to user path
>>
>> I don't get this, I believe we want to recover from regular from/to user
>> path, isn't it?
> That (disable copy to/from user completely) would be a nice to have
> since it would reduce the attack surface of the driver, but e.g. your
> code already doesn't do that.
>

Yes since it requires a lot of changes.


>
>>> So if you want to try that, fine since it's a step in
>>> the right direction.
>>>
>>> But I think fundamentally it's not what we want to do long term.
>>
>> Yes.
>>
>>
>>> It's always been a fundamental problem with this patch series that only
>>> metadata is accessed through a direct pointer.
>>>
>>> The difference in ways you handle metadata and data is what is
>>> now coming and messing everything up.
>>
>> I do propose soemthing like this in the past:
>> https://www.spinics.net/lists/linux-virtualization/msg36824.html. But looks
>> like you have some concern about its locality.
> Right and it doesn't go away. You'll need to come up
> with a test that messes it up and triggers a worst-case
> scenario, so we can measure how bad is that worst-case.




>
>> But the problem still there, GUP can do page fault, so still need to
>> synchronize it with MMU notifiers.
> I think the idea was, if GUP would need a pagefault, don't
> do a GUP and do to/from user instead.


But this still need to be synchronized with MMU notifiers (or using 
dedicated work for GUP).


>   Hopefully that
> will fault the page in and the next access will go through.
>
>> The solution might be something like
>> moving GUP to a dedicated kind of vhost work.
> Right, generally GUP.
>
>>> So if continuing the direct map approach,
>>> what is needed is a cache of mapped VM memory, then on a cache miss
>>> we'd queue work along the lines of 1-2 above.
>>>
>>> That's one direction to take. Another one is to give up on that and
>>> write our own version of uaccess macros.  Add a "high security" flag to
>>> the vhost module and if not active use these for userspace memory
>>> access.
>>
>> Or using SET_BACKEND_FEATURES?
> No, I don't think it's considered best practice to allow unpriveledged
> userspace control over whether kernel enables security features.


Get this.


>
>> But do you mean permanent GUP as I did in
>> original RFC https://lkml.org/lkml/2018/12/13/218?
>>
>> Thanks
> Permanent GUP breaks THP and NUMA.


Yes.

Thanks


>
>>>

^ permalink raw reply

* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
From: Jason Wang @ 2019-08-05  8:22 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Gunthorpe, kvm, virtualization, netdev, linux-kernel,
	linux-mm
In-Reply-To: <20190805022833-mutt-send-email-mst@kernel.org>


On 2019/8/5 下午2:30, Michael S. Tsirkin wrote:
> On Mon, Aug 05, 2019 at 12:36:40PM +0800, Jason Wang wrote:
>> On 2019/8/2 下午10:27, Michael S. Tsirkin wrote:
>>> On Fri, Aug 02, 2019 at 09:46:13AM -0300, Jason Gunthorpe wrote:
>>>> On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote:
>>>>>> This must be a proper barrier, like a spinlock, mutex, or
>>>>>> synchronize_rcu.
>>>>> I start with synchronize_rcu() but both you and Michael raise some
>>>>> concern.
>>>> I've also idly wondered if calling synchronize_rcu() under the various
>>>> mm locks is a deadlock situation.
>>>>
>>>>> Then I try spinlock and mutex:
>>>>>
>>>>> 1) spinlock: add lots of overhead on datapath, this leads 0 performance
>>>>> improvement.
>>>> I think the topic here is correctness not performance improvement
>>> The topic is whether we should revert
>>> commit 7f466032dc9 ("vhost: access vq metadata through kernel virtual address")
>>>
>>> or keep it in. The only reason to keep it is performance.
>>
>> Maybe it's time to introduce the config option?
> Depending on CONFIG_BROKEN? I'm not sure it's a good idea.


Ok.


>>> Now as long as all this code is disabled anyway, we can experiment a
>>> bit.
>>>
>>> I personally feel we would be best served by having two code paths:
>>>
>>> - Access to VM memory directly mapped into kernel
>>> - Access to userspace
>>>
>>>
>>> Having it all cleanly split will allow a bunch of optimizations, for
>>> example for years now we planned to be able to process an incoming short
>>> packet directly on softirq path, or an outgoing on directly within
>>> eventfd.
>>
>> It's not hard consider we've already had our own accssors. But the question
>> is (as asked in another thread), do you want permanent GUP or still use MMU
>> notifiers.
>>
>> Thanks
> We want THP and NUMA to work. Both are important for performance.
>

Yes.

Thanks


^ permalink raw reply

* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
From: Jason Wang @ 2019-08-05  8:24 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Gunthorpe, linux-mm, netdev, linux-kernel, kvm,
	virtualization
In-Reply-To: <20190805023106-mutt-send-email-mst@kernel.org>


On 2019/8/5 下午2:40, Michael S. Tsirkin wrote:
> On Mon, Aug 05, 2019 at 12:41:45PM +0800, Jason Wang wrote:
>> On 2019/8/5 下午12:36, Jason Wang wrote:
>>> On 2019/8/2 下午10:27, Michael S. Tsirkin wrote:
>>>> On Fri, Aug 02, 2019 at 09:46:13AM -0300, Jason Gunthorpe wrote:
>>>>> On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote:
>>>>>>> This must be a proper barrier, like a spinlock, mutex, or
>>>>>>> synchronize_rcu.
>>>>>> I start with synchronize_rcu() but both you and Michael raise some
>>>>>> concern.
>>>>> I've also idly wondered if calling synchronize_rcu() under the various
>>>>> mm locks is a deadlock situation.
>>>>>
>>>>>> Then I try spinlock and mutex:
>>>>>>
>>>>>> 1) spinlock: add lots of overhead on datapath, this leads 0
>>>>>> performance
>>>>>> improvement.
>>>>> I think the topic here is correctness not performance improvement
>>>> The topic is whether we should revert
>>>> commit 7f466032dc9 ("vhost: access vq metadata through kernel
>>>> virtual address")
>>>>
>>>> or keep it in. The only reason to keep it is performance.
>>>
>>> Maybe it's time to introduce the config option?
>>
>> Or does it make sense if I post a V3 with:
>>
>> - introduce config option and disable the optimization by default
>>
>> - switch from synchronize_rcu() to vhost_flush_work(), but the rest are the
>> same
>>
>> This can give us some breath to decide which way should go for next release?
>>
>> Thanks
> As is, with preempt enabled?  Nope I don't think blocking an invalidator
> on swap IO is ok, so I don't believe this stuff is going into this
> release at this point.
>
> So it's more a question of whether it's better to revert and apply a clean
> patch on top, or just keep the code around but disabled with an ifdef as is.
> I'm open to both options, and would like your opinion on this.


Then I prefer to leave current code (VHOST_ARCH_CAN_ACCEL to 0) as is. 
This can also save efforts on rebasing packed virtqueues.

Thanks


>
>>>
>>>> Now as long as all this code is disabled anyway, we can experiment a
>>>> bit.
>>>>
>>>> I personally feel we would be best served by having two code paths:
>>>>
>>>> - Access to VM memory directly mapped into kernel
>>>> - Access to userspace
>>>>
>>>>
>>>> Having it all cleanly split will allow a bunch of optimizations, for
>>>> example for years now we planned to be able to process an incoming short
>>>> packet directly on softirq path, or an outgoing on directly within
>>>> eventfd.
>>>
>>> It's not hard consider we've already had our own accssors. But the
>>> question is (as asked in another thread), do you want permanent GUP or
>>> still use MMU notifiers.
>>>
>>> Thanks
>>>
>>> _______________________________________________
>>> Virtualization mailing list
>>> Virtualization@lists.linux-foundation.org
>>> https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* [PATCH net-next v2] net: dsa: mv88e6xxx: extend PTP gettime function to read system clock
From: Hubert Feurstein @ 2019-08-05  8:26 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: Hubert Feurstein, Richard Cochran, Andrew Lunn, Vivien Didelot,
	Florian Fainelli, David S. Miller

From: Hubert Feurstein <h.feurstein@gmail.com>

This adds support for the PTP_SYS_OFFSET_EXTENDED ioctl.

Signed-off-by: Hubert Feurstein <h.feurstein@gmail.com>
---
 drivers/net/dsa/mv88e6xxx/ptp.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/ptp.c b/drivers/net/dsa/mv88e6xxx/ptp.c
index 073cbd0bb91b..2ebc7db0fd4a 100644
--- a/drivers/net/dsa/mv88e6xxx/ptp.c
+++ b/drivers/net/dsa/mv88e6xxx/ptp.c
@@ -235,14 +235,17 @@ static int mv88e6xxx_ptp_adjtime(struct ptp_clock_info *ptp, s64 delta)
 	return 0;
 }
 
-static int mv88e6xxx_ptp_gettime(struct ptp_clock_info *ptp,
-				 struct timespec64 *ts)
+static int mv88e6xxx_ptp_gettimex(struct ptp_clock_info *ptp,
+				  struct timespec64 *ts,
+				  struct ptp_system_timestamp *sts)
 {
 	struct mv88e6xxx_chip *chip = ptp_to_chip(ptp);
 	u64 ns;
 
 	mv88e6xxx_reg_lock(chip);
+	ptp_read_system_prets(sts);
 	ns = timecounter_read(&chip->tstamp_tc);
+	ptp_read_system_postts(sts);
 	mv88e6xxx_reg_unlock(chip);
 
 	*ts = ns_to_timespec64(ns);
@@ -426,7 +429,7 @@ static void mv88e6xxx_ptp_overflow_check(struct work_struct *work)
 	struct mv88e6xxx_chip *chip = dw_overflow_to_chip(dw);
 	struct timespec64 ts;
 
-	mv88e6xxx_ptp_gettime(&chip->ptp_clock_info, &ts);
+	mv88e6xxx_ptp_gettimex(&chip->ptp_clock_info, &ts, NULL);
 
 	schedule_delayed_work(&chip->overflow_work,
 			      MV88E6XXX_TAI_OVERFLOW_PERIOD);
@@ -472,7 +475,7 @@ int mv88e6xxx_ptp_setup(struct mv88e6xxx_chip *chip)
 	chip->ptp_clock_info.max_adj    = MV88E6XXX_MAX_ADJ_PPB;
 	chip->ptp_clock_info.adjfine	= mv88e6xxx_ptp_adjfine;
 	chip->ptp_clock_info.adjtime	= mv88e6xxx_ptp_adjtime;
-	chip->ptp_clock_info.gettime64	= mv88e6xxx_ptp_gettime;
+	chip->ptp_clock_info.gettimex64	= mv88e6xxx_ptp_gettimex;
 	chip->ptp_clock_info.settime64	= mv88e6xxx_ptp_settime;
 	chip->ptp_clock_info.enable	= ptp_ops->ptp_enable;
 	chip->ptp_clock_info.verify	= ptp_ops->ptp_verify;
-- 
2.22.0


^ permalink raw reply related

* Re: [PATCH 05/14] gpio: lpc32xx: allow building on non-lpc32xx targets
From: Bartosz Golaszewski @ 2019-08-05  8:27 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: soc, arm-soc, Vladimir Zapolskiy, Sylvain Lemieux, Russell King,
	Gregory Clement, Linus Walleij, Jason Cooper, Andrew Lunn,
	Sebastian Hesselbarth, David S. Miller, Greg Kroah-Hartman,
	Alan Stern, Guenter Roeck, linux-gpio, netdev, linux-serial,
	USB list, LINUXWATCHDOG, Lee Jones, LKML
In-Reply-To: <CAK8P3a3KpKvRKXY72toE_5eAp4ER_Mre0GX3guwGeQgsY2HX+g@mail.gmail.com>

pt., 2 sie 2019 o 13:20 Arnd Bergmann <arnd@arndb.de> napisał(a):
>
> On Fri, Aug 2, 2019 at 9:10 AM Bartosz Golaszewski
> <bgolaszewski@baylibre.com> wrote:
> > > -#include <mach/hardware.h>
> > > -#include <mach/platform.h>
> > > +#define _GPREG(x)                              (x)
> >
> > What purpose does this macro serve?
> >
> > >
> > >  #define LPC32XX_GPIO_P3_INP_STATE              _GPREG(0x000)
> > >  #define LPC32XX_GPIO_P3_OUTP_SET               _GPREG(0x004)
>
> In the existing code base, this macro converts a register offset to
> an __iomem pointer for a gpio register. I changed the definition of the
> macro here to keep the number of changes down, but I it's just
> as easy to remove it if you prefer.

Could you just add a comment so that it's clear at first glance?

>
> > > @@ -167,14 +166,26 @@ struct lpc32xx_gpio_chip {
> > >         struct gpio_regs        *gpio_grp;
> > >  };
> > >
> > > +void __iomem *gpio_reg_base;
> >
> > Any reason why this can't be made part of struct lpc32xx_gpio_chip?
>
> It could be, but it's the same for each instance, and not known until
> probe() time, so the same pointer would need to be copied into each
> instance that is otherwise read-only.
>
> Let me know if you'd prefer me to rework these two things or leave
> them as they are.

I would prefer not to have global state in the driver, let's just
store the pointer in the data passed to gpiochip_add_data().

Bart

>
> > > +static inline u32 gpreg_read(unsigned long offset)
> >
> > Here and elsewhere: could you please keep the lpc32xx_gpio prefix for
> > all symbols?
>
> Sure.
>
>       Arnd

^ permalink raw reply

* Re: [PATCH net-next 1/3,v2] net: sched: use major priority number as hardware priority
From: Jiri Pirko @ 2019-08-05  8:36 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: netfilter-devel, davem, netdev, jakub.kicinski, marcelo.leitner,
	saeedm, wenxu, gerlitz.or, paulb
In-Reply-To: <20190802132846.3067-2-pablo@netfilter.org>

Fri, Aug 02, 2019 at 03:28:44PM CEST, pablo@netfilter.org wrote:
>tc transparently maps the software priority number to hardware. Update
>it to pass the major priority which is what most drivers expect. Update
>drivers too so they do not need to lshift the priority field of the
>flow_cls_common_offload object. The stmmac driver is an exception, since
>this code assumes the tc software priority is fine, therefore, lshift it
>just to be conservative.
>
>Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

Acked-by: Jiri Pirko <jiri@mellanox.com>

^ permalink raw reply

* Re: [PATCH net] net: dsa: mv88e6xxx: drop adjust_link to enabled phylink
From: Hubert Feurstein @ 2019-08-05  8:49 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: netdev, linux-kernel, Vivien Didelot, Florian Fainelli,
	David S. Miller
In-Reply-To: <20190804151013.GD6800@lunn.ch>

Hi Andrew,

It looks like some work is still needed in b53_phylink_mac_config to
take over the
functionality of the current adjust_link implementation.

Hubert

Am So., 4. Aug. 2019 um 17:10 Uhr schrieb Andrew Lunn <andrew@lunn.ch>:
>
> On Wed, Jul 31, 2019 at 05:42:39PM +0200, Hubert Feurstein wrote:
> > We have to drop the adjust_link callback in order to finally migrate to
> > phylink.
> >
> > Otherwise we get the following warning during startup:
> >   "mv88e6xxx 2188000.ethernet-1:10: Using legacy PHYLIB callbacks. Please
> >    migrate to PHYLINK!"
>
> Hi Hubert
>
> Do we need the same patch for the b53 driver?
>
>    Andrew

^ permalink raw reply

* Re: [RFC] net: dsa: mv88e6xxx: ptp: improve phc2sys precision for mv88e6xxx switch in combination with imx6-fec
From: Vladimir Oltean @ 2019-08-05  9:54 UTC (permalink / raw)
  To: Hubert Feurstein, mlichvar, Richard Cochran
  Cc: Andrew Lunn, netdev, lkml, Vivien Didelot, Florian Fainelli,
	David S. Miller
In-Reply-To: <20190802163248.11152-1-h.feurstein@gmail.com>

Hi Hubert,

On Fri, 2 Aug 2019 at 19:33, Hubert Feurstein <h.feurstein@gmail.com> wrote:
>
> With this patch the phc2sys synchronisation precision improved to +/-500ns.
>
> Signed-off-by: Hubert Feurstein <h.feurstein@gmail.com>
> ---
>
> This patch should only be the base for a discussion about improving precision of
> phc2sys (from the linuxptp package) in combination with a mv88e6xxx switch and
> imx6-fec.
>
> When I started my work on PTP at the beginning of this week, I was positively
> supprised about the sync precision of ptp4l. After adding support for the
> MV88E6220 I was able to synchronize two of our boards within +/- 10 nanoseconds.
> Remebering that the PTP system in the MV88E6220 is clocked with 100MHz, this is
> I think the best what can be expected. Big thanks to Richard and the other
> developers who made this possible.
>
> But then I tried to synchornize the PTP clock with the system clock by using
> phc2sys (phc2sys -rr -amq -l6) and I quickly was very disapointed about the
> precision.
>
>   Min:          -17829 ns
>   Max:          21694 ns
>   StdDev:       8520 ns
>   Count:        127
>
> So I tried to find the reason for this. And as you probably already know, the
> reason for that is the MDIO latency, which is terrible (up to 800 usecs).
>
> As a next step, I tried to to implement the gettimex64 callback (see: "[PATCH]
> net: dsa: mv88e6xxx: extend PTP gettime function to read system clock"). With
> this in place (and a patched linuxptp-master version which really uses the
> PTP_SYS_OFFSET_EXTENDED-ioctl), I got the following results:
>
>   Min:          -12144 ns
>   Max:          10891 ns
>   StdDev:       4046,71 ns
>   Count:        112
>
> So, things improved, but this is still unacceptable. It was still not possible
> to compensate the MDIO latency issue.
>
> According to my understanding, the timestamps (by using
> ptp_read_system_{pre|post}ts) have to be captured at a place where we have an
> constant offset related to the PHC in the switch. The only point where these
> timestamps can be captured is the mdio_write callback in the imx_fec. Because,
> reading the PHC timestamp will result in the follwing MDIO accesses:
>
>   (several) reads of the AVB_CMD register (to poll for the busy-flag)
>   write AVB_CMD (AVBOp=110b Read with post-incerement of PHC timestamp)
>   read AVB_DATA (PTP Global Time [15:0])
>   read AVB_DATA (PTP Global Time [31:16])
>
> With this sequence in mind, the Marvell switch has to snapshot the PHC
> timestamp at the write-AVB_CMD in order to be able to get sane values later by
> reading AVB_DATA. So the best place to capture the system timestamps is this
> one and only write operation directly in the imx_fec. By using the patch below
> (without the changes to the system clock resolution) I got the following
> results:
>
>   Min:          -464 ns
>   Max:          525 ns
>   StdDev:       210,31 ns
>   Count:        401
>
> I would say that is a huge improvement.
>
> I realized, that the system clock (at least on the imx6) has a resolution of
> 333ns. So I tried to speed up this clock by using the PER-clock instead of
> OSC_PER. This gave me 15ns resolution. The results were:
>
>   Min:          -476 ns
>   Max:          439 ns
>   StdDev:       176,52 ns
>   Count:        630
>
> So, things got improved again a little bit (at least the StdDev).
>
> According to my understanding, this is almost the best which is possible,
> because there is one more clock which influences the results. This is the MDIO
> bus clock, which is just 2.5MHz (or 400ns). So, I would say that +/- 400ns
> jitter is caused only by the MDIO bus clock, since the changes in imx_fec should
> not introduce any unpredictable latencies.
>
> My question to the experienced kernel developers is, how can this be implemented
> in a more generic way? Because this hack only works under these circumstances,
> and of course can never be part of the mainline kernel.
>
> I am not 100% sure that all my statements or assumptions are correct, so feel
> free to correct me.
>
> Hubert
>

You guessed correctly (since you copied me) that I'm battling much of
the same issues with the sja1105 and its spi-fsl-dspi controller
driver.
In fact I will refrain from saying anything about your
PTP_SYS_OFFSET_EXTENDED solution/hack combo, but will ask some
questions instead:

- You said you patched linuxptp master. Could you share the patch? Is
there anything else that's needed except compiling against the board's
real kernel headers (aka point KBUILD_OUTPUT to the extracted location
of /sys/kernel/kheaders.tar.xz)? I've done that and I do see phc2sys
probing and using the new ioctl, but I don't see a big improvement in
my case (that's probably because my SPI interface has real jitter
caused by peripheral clock instability, but I really need to put a
scope on it to be sure, so that's a discussion for another time).
- I really wonder what your jitter is caused by. Maybe it is just
contention on the mii_bus->mdio_lock? If that's the case, maybe you
don't need to go all the way down to the driver level, and taking the
ptp_sts at the subsystem (MDIO, SPI, I2C, etc) level is "good enough".
- Along the lines of the above, I wonder how badly would the
maintainers shout at the proposal of adding a struct
ptp_system_timestamp pointer and its associated lock in struct device.
That way at least you don't have to change the API, like you did with
mdiobus_write_nested_ptp. Relatively speaking, this is the least
amount of intrusion (although, again, far from beautiful).
- The software timestamps help you in the particular case of phc2sys,
but are they enough to cover all your needs? I haven't spent even 1
second looking at Marvell switch datasheets, but is its free-running
timer only used for PTP timestamping? At least the sja1105 does also
support time-based egress scheduling and ingress policing, so for that
scenario, the timecounter/cyclecounter implementation will no longer
cut it (you do need to synchronize the hardware clock). If your
hardware supports these PTP-based features as well, I can only assume
the reason why mv88e6xxx went for a timecounter is the same as the
reason why I did: the MDIO/SPI jitter while accessing the frequency
and offset correction registers is bad enough that the servo loop goes
haywire. And implementing gettimex64 will not solve that.

I also added Miroslav Lichvar, who originally created the
PTP_SYS_OFFSET_EXTENDED ioctl, maybe he can share some ideas on how it
is best served.

>  drivers/clocksource/timer-imx-gpt.c       |  9 ++++++++-
>  drivers/net/dsa/mv88e6xxx/chip.h          |  2 ++
>  drivers/net/dsa/mv88e6xxx/ptp.c           | 11 +++++++----
>  drivers/net/dsa/mv88e6xxx/smi.c           |  2 +-
>  drivers/net/ethernet/freescale/fec_main.c | 24 ++++++++++++++++++-----
>  drivers/net/phy/mdio_bus.c                | 16 +++++++++++++++
>  include/linux/mdio.h                      |  2 ++
>  include/linux/phy.h                       |  1 +
>  8 files changed, 56 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/clocksource/timer-imx-gpt.c b/drivers/clocksource/timer-imx-gpt.c
> index 706c0d0ff56c..84695a2d8ff7 100644
> --- a/drivers/clocksource/timer-imx-gpt.c
> +++ b/drivers/clocksource/timer-imx-gpt.c
> @@ -471,8 +471,15 @@ static int __init mxc_timer_init_dt(struct device_node *np,  enum imx_gpt_type t
>
>         /* Try osc_per first, and fall back to per otherwise */
>         imxtm->clk_per = of_clk_get_by_name(np, "osc_per");
> -       if (IS_ERR(imxtm->clk_per))
> +
> +       /* Force PER clock to be used, so we get 15ns instead of 333ns */
> +       //if (IS_ERR(imxtm->clk_per)) {
> +       if (1) {
>                 imxtm->clk_per = of_clk_get_by_name(np, "per");
> +               pr_info("==> Using PER clock\n");
> +       } else {
> +               pr_info("==> Using OSC_PER clock\n");
> +       }
>
>         imxtm->type = type;
>
> diff --git a/drivers/net/dsa/mv88e6xxx/chip.h b/drivers/net/dsa/mv88e6xxx/chip.h
> index 01963ee94c50..9e14dc406415 100644
> --- a/drivers/net/dsa/mv88e6xxx/chip.h
> +++ b/drivers/net/dsa/mv88e6xxx/chip.h
> @@ -277,6 +277,8 @@ struct mv88e6xxx_chip {
>         struct ptp_clock_info   ptp_clock_info;
>         struct delayed_work     tai_event_work;
>         struct ptp_pin_desc     pin_config[MV88E6XXX_MAX_GPIO];
> +       struct ptp_system_timestamp *ptp_sts;
> +
>         u16 trig_config;
>         u16 evcap_config;
>         u16 enable_count;
> diff --git a/drivers/net/dsa/mv88e6xxx/ptp.c b/drivers/net/dsa/mv88e6xxx/ptp.c
> index 073cbd0bb91b..cf6e52ee9e0a 100644
> --- a/drivers/net/dsa/mv88e6xxx/ptp.c
> +++ b/drivers/net/dsa/mv88e6xxx/ptp.c
> @@ -235,14 +235,17 @@ static int mv88e6xxx_ptp_adjtime(struct ptp_clock_info *ptp, s64 delta)
>         return 0;
>  }
>
> -static int mv88e6xxx_ptp_gettime(struct ptp_clock_info *ptp,
> -                                struct timespec64 *ts)
> +static int mv88e6xxx_ptp_gettimex(struct ptp_clock_info *ptp,
> +                                 struct timespec64 *ts,
> +                                 struct ptp_system_timestamp *sts)
>  {
>         struct mv88e6xxx_chip *chip = ptp_to_chip(ptp);
>         u64 ns;
>
>         mv88e6xxx_reg_lock(chip);
> +       chip->ptp_sts = sts;
>         ns = timecounter_read(&chip->tstamp_tc);
> +       chip->ptp_sts = NULL;
>         mv88e6xxx_reg_unlock(chip);
>
>         *ts = ns_to_timespec64(ns);
> @@ -426,7 +429,7 @@ static void mv88e6xxx_ptp_overflow_check(struct work_struct *work)
>         struct mv88e6xxx_chip *chip = dw_overflow_to_chip(dw);
>         struct timespec64 ts;
>
> -       mv88e6xxx_ptp_gettime(&chip->ptp_clock_info, &ts);
> +       mv88e6xxx_ptp_gettimex(&chip->ptp_clock_info, &ts, NULL);
>
>         schedule_delayed_work(&chip->overflow_work,
>                               MV88E6XXX_TAI_OVERFLOW_PERIOD);
> @@ -472,7 +475,7 @@ int mv88e6xxx_ptp_setup(struct mv88e6xxx_chip *chip)
>         chip->ptp_clock_info.max_adj    = MV88E6XXX_MAX_ADJ_PPB;
>         chip->ptp_clock_info.adjfine    = mv88e6xxx_ptp_adjfine;
>         chip->ptp_clock_info.adjtime    = mv88e6xxx_ptp_adjtime;
> -       chip->ptp_clock_info.gettime64  = mv88e6xxx_ptp_gettime;
> +       chip->ptp_clock_info.gettimex64 = mv88e6xxx_ptp_gettimex;
>         chip->ptp_clock_info.settime64  = mv88e6xxx_ptp_settime;
>         chip->ptp_clock_info.enable     = ptp_ops->ptp_enable;
>         chip->ptp_clock_info.verify     = ptp_ops->ptp_verify;
> diff --git a/drivers/net/dsa/mv88e6xxx/smi.c b/drivers/net/dsa/mv88e6xxx/smi.c
> index 5fc78a063843..801fd4abba5a 100644
> --- a/drivers/net/dsa/mv88e6xxx/smi.c
> +++ b/drivers/net/dsa/mv88e6xxx/smi.c
> @@ -45,7 +45,7 @@ static int mv88e6xxx_smi_direct_write(struct mv88e6xxx_chip *chip,
>  {
>         int ret;
>
> -       ret = mdiobus_write_nested(chip->bus, dev, reg, data);
> +       ret = mdiobus_write_nested_ptp(chip->bus, dev, reg, data, chip->ptp_sts);
>         if (ret < 0)
>                 return ret;
>
> diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
> index 2f6057e7335d..20f589dc5b8b 100644
> --- a/drivers/net/ethernet/freescale/fec_main.c
> +++ b/drivers/net/ethernet/freescale/fec_main.c
> @@ -1814,11 +1814,25 @@ static int fec_enet_mdio_write(struct mii_bus *bus, int mii_id, int regnum,
>
>         reinit_completion(&fep->mdio_done);
>
> -       /* start a write op */
> -       writel(FEC_MMFR_ST | FEC_MMFR_OP_WRITE |
> -               FEC_MMFR_PA(mii_id) | FEC_MMFR_RA(regnum) |
> -               FEC_MMFR_TA | FEC_MMFR_DATA(value),
> -               fep->hwp + FEC_MII_DATA);
> +       if (bus->ptp_sts) {
> +               unsigned long flags = 0;
> +               local_irq_save(flags);
> +               __iowmb();
> +               /* >Take the timestamp *after* the memory barrier */
> +               ptp_read_system_prets(bus->ptp_sts);
> +               writel_relaxed(FEC_MMFR_ST | FEC_MMFR_OP_WRITE |
> +                       FEC_MMFR_PA(mii_id) | FEC_MMFR_RA(regnum) |
> +                       FEC_MMFR_TA | FEC_MMFR_DATA(value),
> +                       fep->hwp + FEC_MII_DATA);
> +               ptp_read_system_postts(bus->ptp_sts);
> +               local_irq_restore(flags);
> +       } else {
> +               /* start a write op */
> +               writel(FEC_MMFR_ST | FEC_MMFR_OP_WRITE |
> +                       FEC_MMFR_PA(mii_id) | FEC_MMFR_RA(regnum) |
> +                       FEC_MMFR_TA | FEC_MMFR_DATA(value),
> +                       fep->hwp + FEC_MII_DATA);
> +       }
>
>         /* wait for end of transfer */
>         time_left = wait_for_completion_timeout(&fep->mdio_done,
> diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
> index bd04fe762056..f076487db11f 100644
> --- a/drivers/net/phy/mdio_bus.c
> +++ b/drivers/net/phy/mdio_bus.c
> @@ -672,6 +672,22 @@ int mdiobus_write_nested(struct mii_bus *bus, int addr, u32 regnum, u16 val)
>  }
>  EXPORT_SYMBOL(mdiobus_write_nested);
>
> +int mdiobus_write_nested_ptp(struct mii_bus *bus, int addr, u32 regnum, u16 val, struct ptp_system_timestamp *ptp_sts)
> +{
> +       int err;
> +
> +       BUG_ON(in_interrupt());
> +
> +       mutex_lock_nested(&bus->mdio_lock, MDIO_MUTEX_NESTED);
> +       bus->ptp_sts = ptp_sts;
> +       err = __mdiobus_write(bus, addr, regnum, val);
> +       bus->ptp_sts = NULL;
> +       mutex_unlock(&bus->mdio_lock);
> +
> +       return err;
> +}
> +EXPORT_SYMBOL(mdiobus_write_nested_ptp);
> +
>  /**
>   * mdiobus_write - Convenience function for writing a given MII mgmt register
>   * @bus: the mii_bus struct
> diff --git a/include/linux/mdio.h b/include/linux/mdio.h
> index e8242ad88c81..7f9767babdc3 100644
> --- a/include/linux/mdio.h
> +++ b/include/linux/mdio.h
> @@ -9,6 +9,7 @@
>  #include <uapi/linux/mdio.h>
>  #include <linux/mod_devicetable.h>
>
> +struct ptp_system_timestamp;
>  struct gpio_desc;
>  struct mii_bus;
>
> @@ -310,6 +311,7 @@ int mdiobus_read(struct mii_bus *bus, int addr, u32 regnum);
>  int mdiobus_read_nested(struct mii_bus *bus, int addr, u32 regnum);
>  int mdiobus_write(struct mii_bus *bus, int addr, u32 regnum, u16 val);
>  int mdiobus_write_nested(struct mii_bus *bus, int addr, u32 regnum, u16 val);
> +int mdiobus_write_nested_ptp(struct mii_bus *bus, int addr, u32 regnum, u16 val, struct ptp_system_timestamp *ptp_sts);
>
>  int mdiobus_register_device(struct mdio_device *mdiodev);
>  int mdiobus_unregister_device(struct mdio_device *mdiodev);
> diff --git a/include/linux/phy.h b/include/linux/phy.h
> index 462b90b73f93..fd4e33654863 100644
> --- a/include/linux/phy.h
> +++ b/include/linux/phy.h
> @@ -252,6 +252,7 @@ struct mii_bus {
>         int reset_delay_us;
>         /* RESET GPIO descriptor pointer */
>         struct gpio_desc *reset_gpiod;
> +       struct ptp_system_timestamp *ptp_sts;
>  };
>  #define to_mii_bus(d) container_of(d, struct mii_bus, dev)
>
> --
> 2.22.0
>

Regards,
-Vladimir

^ permalink raw reply

* [patch iproute2] devlink: finish queue.h to list.h transition
From: Jiri Pirko @ 2019-08-05  9:56 UTC (permalink / raw)
  To: netdev; +Cc: stephen, dsahern, slyfox, ayal, mlxsw

From: Jiri Pirko <jiri@mellanox.com>

Loose the "q" from the names and name the structure fields in the same
way rest of the code does. Also, fix list_add arg order which leads
to segfault.

Fixes: 33267017faf1 ("iproute2: devlink: port from sys/queue.h to list.h")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 devlink/devlink.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/devlink/devlink.c b/devlink/devlink.c
index 0ea401ae432a..91c85dc1de73 100644
--- a/devlink/devlink.c
+++ b/devlink/devlink.c
@@ -5978,35 +5978,36 @@ static int fmsg_value_show(struct dl *dl, int type, struct nlattr *nl_data)
 	return MNL_CB_OK;
 }
 
-struct nest_qentry {
+struct nest_entry {
 	int attr_type;
-	struct list_head nest_entries;
+	struct list_head list;
 };
 
 struct fmsg_cb_data {
 	struct dl *dl;
 	uint8_t value_type;
-	struct list_head qhead;
+	struct list_head entry_list;
 };
 
 static int cmd_fmsg_nest_queue(struct fmsg_cb_data *fmsg_data,
 			       uint8_t *attr_value, bool insert)
 {
-	struct nest_qentry *entry = NULL;
+	struct nest_entry *entry;
 
 	if (insert) {
-		entry = malloc(sizeof(struct nest_qentry));
+		entry = malloc(sizeof(struct nest_entry));
 		if (!entry)
 			return -ENOMEM;
 
 		entry->attr_type = *attr_value;
-		list_add(&fmsg_data->qhead, &entry->nest_entries);
+		list_add(&entry->list, &fmsg_data->entry_list);
 	} else {
-		if (list_empty(&fmsg_data->qhead))
+		if (list_empty(&fmsg_data->entry_list))
 			return MNL_CB_ERROR;
-		entry = list_first_entry(&fmsg_data->qhead, struct nest_qentry, nest_entries);
+		entry = list_first_entry(&fmsg_data->entry_list,
+					 struct nest_entry, list);
 		*attr_value = entry->attr_type;
-		list_del(&entry->nest_entries);
+		list_del(&entry->list);
 		free(entry);
 	}
 	return MNL_CB_OK;
@@ -6115,7 +6116,7 @@ static int cmd_health_object_common(struct dl *dl, uint8_t cmd, uint16_t flags)
 		return err;
 
 	data.dl = dl;
-	INIT_LIST_HEAD(&data.qhead);
+	INIT_LIST_HEAD(&data.entry_list);
 	err = _mnlg_socket_sndrcv(dl->nlg, nlh, cmd_fmsg_object_cb, &data);
 	return err;
 }
-- 
2.21.0


^ permalink raw reply related

* [PATCH] NFC: nfcmrvl: fix gpio-handling regression
From: Johan Hovold @ 2019-08-05 10:00 UTC (permalink / raw)
  To: netdev
  Cc: Linus Walleij, Vincent Cuissard, Andrey Konovalov, Dmitry Vyukov,
	linux-usb, linux-kernel, Johan Hovold, stable,
	syzbot+cf35b76f35e068a1107f

Fix two reset-gpio sanity checks which were never converted to use
gpio_is_valid(), and make sure to use -EINVAL to indicate a missing
reset line also for the UART-driver module parameter and for the USB
driver.

This specifically prevents the UART and USB drivers from incidentally
trying to request and use gpio 0, and also avoids triggering a WARN() in
gpio_to_desc() during probe when no valid reset line has been specified.

Fixes: e33a3f84f88f ("NFC: nfcmrvl: allow gpio 0 for reset signalling")
Cc: stable <stable@vger.kernel.org>	# 4.13
Reported-by: syzbot+cf35b76f35e068a1107f@syzkaller.appspotmail.com
Tested-by: syzbot+cf35b76f35e068a1107f@syzkaller.appspotmail.com
Signed-off-by: Johan Hovold <johan@kernel.org>
---
 drivers/nfc/nfcmrvl/main.c | 4 ++--
 drivers/nfc/nfcmrvl/uart.c | 4 ++--
 drivers/nfc/nfcmrvl/usb.c  | 1 +
 3 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/nfc/nfcmrvl/main.c b/drivers/nfc/nfcmrvl/main.c
index e65d027b91fa..529be35ac178 100644
--- a/drivers/nfc/nfcmrvl/main.c
+++ b/drivers/nfc/nfcmrvl/main.c
@@ -244,7 +244,7 @@ void nfcmrvl_chip_reset(struct nfcmrvl_private *priv)
 	/* Reset possible fault of previous session */
 	clear_bit(NFCMRVL_PHY_ERROR, &priv->flags);
 
-	if (priv->config.reset_n_io) {
+	if (gpio_is_valid(priv->config.reset_n_io)) {
 		nfc_info(priv->dev, "reset the chip\n");
 		gpio_set_value(priv->config.reset_n_io, 0);
 		usleep_range(5000, 10000);
@@ -255,7 +255,7 @@ void nfcmrvl_chip_reset(struct nfcmrvl_private *priv)
 
 void nfcmrvl_chip_halt(struct nfcmrvl_private *priv)
 {
-	if (priv->config.reset_n_io)
+	if (gpio_is_valid(priv->config.reset_n_io))
 		gpio_set_value(priv->config.reset_n_io, 0);
 }
 
diff --git a/drivers/nfc/nfcmrvl/uart.c b/drivers/nfc/nfcmrvl/uart.c
index 9a22056e8d9e..e5a622ce4b95 100644
--- a/drivers/nfc/nfcmrvl/uart.c
+++ b/drivers/nfc/nfcmrvl/uart.c
@@ -26,7 +26,7 @@
 static unsigned int hci_muxed;
 static unsigned int flow_control;
 static unsigned int break_control;
-static unsigned int reset_n_io;
+static int reset_n_io = -EINVAL;
 
 /*
 ** NFCMRVL NCI OPS
@@ -231,5 +231,5 @@ MODULE_PARM_DESC(break_control, "Tell if UART driver must drive break signal.");
 module_param(hci_muxed, uint, 0);
 MODULE_PARM_DESC(hci_muxed, "Tell if transport is muxed in HCI one.");
 
-module_param(reset_n_io, uint, 0);
+module_param(reset_n_io, int, 0);
 MODULE_PARM_DESC(reset_n_io, "GPIO that is wired to RESET_N signal.");
diff --git a/drivers/nfc/nfcmrvl/usb.c b/drivers/nfc/nfcmrvl/usb.c
index 945cc903d8f1..888e298f610b 100644
--- a/drivers/nfc/nfcmrvl/usb.c
+++ b/drivers/nfc/nfcmrvl/usb.c
@@ -305,6 +305,7 @@ static int nfcmrvl_probe(struct usb_interface *intf,
 
 	/* No configuration for USB */
 	memset(&config, 0, sizeof(config));
+	config.reset_n_io = -EINVAL;
 
 	nfc_info(&udev->dev, "intf %p id %p\n", intf, id);
 
-- 
2.22.0


^ permalink raw reply related

* Re: [RFC] net: dsa: mv88e6xxx: ptp: improve phc2sys precision for mv88e6xxx switch in combination with imx6-fec
From: Miroslav Lichvar @ 2019-08-05 10:36 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Hubert Feurstein, Richard Cochran, Andrew Lunn, netdev, lkml,
	Vivien Didelot, Florian Fainelli, David S. Miller
In-Reply-To: <CA+h21hr835sdvtgVOA2M9SWeCXDOrDG1S3FnNgJd_9NP2X_TaQ@mail.gmail.com>

On Mon, Aug 05, 2019 at 12:54:49PM +0300, Vladimir Oltean wrote:
> - Along the lines of the above, I wonder how badly would the
> maintainers shout at the proposal of adding a struct
> ptp_system_timestamp pointer and its associated lock in struct device.
> That way at least you don't have to change the API, like you did with
> mdiobus_write_nested_ptp. Relatively speaking, this is the least
> amount of intrusion (although, again, far from beautiful).

That would make sense to me, if there are other drivers that could use
it.

> I also added Miroslav Lichvar, who originally created the
> PTP_SYS_OFFSET_EXTENDED ioctl, maybe he can share some ideas on how it
> is best served.

The idea behind that ioctl was to allow drivers to take the system
timestamps as close to the reading of the HW clock as possible,
excluding delays (and jitter) due to other operations that are
necessary to get that timestamp. In the ethernet drivers that was a
single PCI read. If in this case it's a "write" operation that
triggers the reading of the HW clock, then I think the patch is
using is the ptp_read_system_*ts() calls correctly.

> > --- a/drivers/net/ethernet/freescale/fec_main.c
> > +++ b/drivers/net/ethernet/freescale/fec_main.c
> > @@ -1814,11 +1814,25 @@ static int fec_enet_mdio_write(struct mii_bus *bus, int mii_id, int regnum,
> >
> >         reinit_completion(&fep->mdio_done);
> >
> > -       /* start a write op */
> > -       writel(FEC_MMFR_ST | FEC_MMFR_OP_WRITE |
> > -               FEC_MMFR_PA(mii_id) | FEC_MMFR_RA(regnum) |
> > -               FEC_MMFR_TA | FEC_MMFR_DATA(value),
> > -               fep->hwp + FEC_MII_DATA);
> > +       if (bus->ptp_sts) {
> > +               unsigned long flags = 0;
> > +               local_irq_save(flags);
> > +               __iowmb();
> > +               /* >Take the timestamp *after* the memory barrier */
> > +               ptp_read_system_prets(bus->ptp_sts);
> > +               writel_relaxed(FEC_MMFR_ST | FEC_MMFR_OP_WRITE |
> > +                       FEC_MMFR_PA(mii_id) | FEC_MMFR_RA(regnum) |
> > +                       FEC_MMFR_TA | FEC_MMFR_DATA(value),
> > +                       fep->hwp + FEC_MII_DATA);
> > +               ptp_read_system_postts(bus->ptp_sts);
> > +               local_irq_restore(flags);
> > +       } else {
> > +               /* start a write op */
> > +               writel(FEC_MMFR_ST | FEC_MMFR_OP_WRITE |
> > +                       FEC_MMFR_PA(mii_id) | FEC_MMFR_RA(regnum) |
> > +                       FEC_MMFR_TA | FEC_MMFR_DATA(value),
> > +                       fep->hwp + FEC_MII_DATA);
> > +       }

-- 
Miroslav Lichvar

^ permalink raw reply

* Re: [PATCH] tools: bpftool: fix reading from /proc/config.gz
From: Quentin Monnet @ 2019-08-05 10:41 UTC (permalink / raw)
  To: Peter Wu, Alexei Starovoitov, Daniel Borkmann
  Cc: netdev, Stanislav Fomichev, Jakub Kicinski
In-Reply-To: <20190805001541.8096-1-peter@lekensteyn.nl>

Hi Peter,

Thanks for looking into this (and for the fixes)! Some comments below.

2019-08-05 01:15 UTC+0100 ~ Peter Wu <peter@lekensteyn.nl>
> /proc/config has never existed as far as I can see, but /proc/config.gz

As far as I understood (from examining Cilium [0]), /proc/config _is_
used by some distributions, such as CoreOS. This is why we look at that
location in bpftool.

[0] https://github.com/cilium/cilium/blob/master/bpf/run_probes.sh#L42

> is present on Arch Linux. Execute an external gunzip program to avoid
> linking to zlib and rework the option scanning code since a pipe is not
> seekable. This also fixes a file handle leak on some error paths.
> 
> Fixes: 4567b983f78c ("tools: bpftool: add probes for kernel configuration options")
> Cc: Quentin Monnet <quentin.monnet@netronome.com>
> Signed-off-by: Peter Wu <peter@lekensteyn.nl>
> ---
>  tools/bpf/bpftool/feature.c | 92 +++++++++++++++++++++----------------
>  1 file changed, 52 insertions(+), 40 deletions(-)
> 
> diff --git a/tools/bpf/bpftool/feature.c b/tools/bpf/bpftool/feature.c
> index d672d9086fff..e9e10f582047 100644
> --- a/tools/bpf/bpftool/feature.c
> +++ b/tools/bpf/bpftool/feature.c
> @@ -284,34 +284,34 @@ static void probe_jit_limit(void)
>  	}
>  }
>  
> -static char *get_kernel_config_option(FILE *fd, const char *option)
> +static bool get_kernel_config_option(FILE *fd, char **buf_p, size_t *n_p,
> +				     char **value)

Maybe we could rename this function, and have "next" appear in it
somewhere? After your changes, it does not return the value for a
specific option anymore.

>  {
> -	size_t line_n = 0, optlen = strlen(option);
> -	char *res, *strval, *line = NULL;
> -	ssize_t n;
> +	char *sep;
> +	ssize_t linelen;

Please order the declarations in reverse-Christmas tree style.

>  
> -	rewind(fd);
> -	while ((n = getline(&line, &line_n, fd)) > 0) {
> -		if (strncmp(line, option, optlen))
> +	while ((linelen = getline(buf_p, n_p, fd)) > 0) {
> +		char *line = *buf_p;

Please leave a blank line after declarations.

> +		if (strncmp(line, "CONFIG_", 7))
>  			continue;
> -		/* Check we have at least '=', value, and '\n' */
> -		if (strlen(line) < optlen + 3)
> -			continue;
> -		if (*(line + optlen) != '=')
> +
> +		sep = memchr(line, '=', linelen);
> +		if (!sep)
>  			continue;
>  
>  		/* Trim ending '\n' */
> -		line[strlen(line) - 1] = '\0';
> +		line[linelen - 1] = '\0';
> +
> +		/* Split on '=' and ensure that a value is present. */
> +		*sep = '\0';
> +		if (!sep[1])
> +			continue;
>  
> -		/* Copy and return config option value */
> -		strval = line + optlen + 1;
> -		res = strdup(strval);
> -		free(line);
> -		return res;
> +		*value = sep + 1;
> +		return true;
>  	}
> -	free(line);
>  
> -	return NULL;
> +	return false;
>  }
>  
>  static void probe_kernel_image_config(void)
> @@ -386,31 +386,34 @@ static void probe_kernel_image_config(void)
>  		/* test_bpf module for BPF tests */
>  		"CONFIG_TEST_BPF",
>  	};
> +	char *values[ARRAY_SIZE(options)] = { };
>  	char *value, *buf = NULL;
>  	struct utsname utsn;
>  	char path[PATH_MAX];
>  	size_t i, n;
>  	ssize_t ret;
> -	FILE *fd;
> +	FILE *fd = NULL;
> +	bool is_pipe = false;

Reverse-Christmas-tree style please.

>  
>  	if (uname(&utsn))
> -		goto no_config;
> +		goto end_parse;

Just thinking, maybe if uname() fails we can skip /boot/config-$(uname
-r) but still attempt to parse /proc/config{,.gz} instead of printing
only NULL options?

>  
>  	snprintf(path, sizeof(path), "/boot/config-%s", utsn.release);
>  
>  	fd = fopen(path, "r");
>  	if (!fd && errno == ENOENT) {
> -		/* Some distributions put the config file at /proc/config, give
> -		 * it a try.
> -		 * Sometimes it is also at /proc/config.gz but we do not try
> -		 * this one for now, it would require linking against libz.
> +		/* Some distributions build with CONFIG_IKCONFIG=y and put the
> +		 * config file at /proc/config.gz. We try to invoke an external
> +		 * gzip program to avoid linking to libz.
> +		 * Hide stderr to avoid interference with the JSON output.

Because some distributions do use /proc/config, we should keep this. You
can probably add /proc/config.gz as another attempt below (or even
above) the current case?

>  		 */
> -		fd = fopen("/proc/config", "r");
> +		fd = popen("gunzip -c /proc/config.gz 2>/dev/null", "r");
> +		is_pipe = true;
>  	}
>  	if (!fd) {
>  		p_info("skipping kernel config, can't open file: %s",
>  		       strerror(errno));
> -		goto no_config;
> +		goto end_parse;
>  	}
>  	/* Sanity checks */
>  	ret = getline(&buf, &n, fd);
> @@ -418,27 +421,36 @@ static void probe_kernel_image_config(void)
>  	if (!buf || !ret) {
>  		p_info("skipping kernel config, can't read from file: %s",
>  		       strerror(errno));
> -		free(buf);
> -		goto no_config;
> +		goto end_parse;
>  	}
>  	if (strcmp(buf, "# Automatically generated file; DO NOT EDIT.\n")) {
>  		p_info("skipping kernel config, can't find correct file");
> -		free(buf);
> -		goto no_config;
> +		goto end_parse;
> +	}
> +
> +	while (get_kernel_config_option(fd, &buf, &n, &value))> +		for (i = 0; i < ARRAY_SIZE(options); i++) {
> +			if (values[i] || strcmp(buf, options[i]))

Can we have an option set multiple times in the config file? If so,
maybe have a p_info() if values are different to warn users that
conflicting values were found?

(Reading the file just once is a nice improvement over my version, by
the way, thanks!)

> +				continue;
> +
> +			values[i] = strdup(value);
> +		}
> +	}
> +
> +end_parse:
> +	if (fd) {
> +		if (is_pipe) {
> +			if (pclose(fd))
> +				p_info("failed to read /proc/config.gz");
> +		} else
> +			fclose(fd);

Please use braces on all branches of the conditional statement (else).

Thanks,
Quentin

^ permalink raw reply

* [PATCH]][next] selftests: nettest: fix spelling mistake: "potocol" -> "protocol"
From: Colin King @ 2019-08-05 10:52 UTC (permalink / raw)
  To: David S . Miller, Shuah Khan, netdev, linux-kselftest
  Cc: kernel-janitors, linux-kernel

From: Colin Ian King <colin.king@canonical.com>

There is a spelling mistake in an error messgae. Fix it.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
---
 tools/testing/selftests/net/nettest.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/net/nettest.c b/tools/testing/selftests/net/nettest.c
index 9278f8460d75..83515e5ea4dc 100644
--- a/tools/testing/selftests/net/nettest.c
+++ b/tools/testing/selftests/net/nettest.c
@@ -1627,7 +1627,7 @@ int main(int argc, char *argv[])
 				args.protocol = pe->p_proto;
 			} else {
 				if (str_to_uint(optarg, 0, 0xffff, &tmp) != 0) {
-					fprintf(stderr, "Invalid potocol\n");
+					fprintf(stderr, "Invalid protocol\n");
 					return 1;
 				}
 				args.protocol = tmp;
-- 
2.20.1


^ permalink raw reply related

* Re: [v2] net: can: Fix compiling warning
From: Markus Elfring @ 2019-08-05 10:56 UTC (permalink / raw)
  To: Mao Wenan, David S. Miller, Oliver Hartkopp, netdev
  Cc: kernel-janitors, linux-kernel
In-Reply-To: <20190805012637.62314-1-maowenan@huawei.com>

>  v1->v2: change patch description typo error, …

Will an other commit subject like “[PATCH v3] net: can: Fix compilation warnings
for two functions” be more appropriate here?

Regards,
Markus

^ permalink raw reply

* Re: [PATCH net 1/2] net: sched: police: allow accessing police->params with rtnl
From: Pieter Jansen van Vuuren @ 2019-08-05 11:38 UTC (permalink / raw)
  To: Vlad Buslov, netdev; +Cc: jhs, xiyou.wangcong, jiri, davem
In-Reply-To: <20190803133619.10574-2-vladbu@mellanox.com>

On 2019/08/03 14:36, Vlad Buslov wrote:
> Recently implemented support for police action in flow_offload infra leads
> to following rcu usage warning:
> 
> [ 1925.881092] =============================
> [ 1925.881094] WARNING: suspicious RCU usage
> [ 1925.881098] 5.3.0-rc1+ #574 Not tainted
> [ 1925.881100] -----------------------------
> [ 1925.881104] include/net/tc_act/tc_police.h:57 suspicious rcu_dereference_check() usage!
> [ 1925.881106]
>                other info that might help us debug this:
> 
> [ 1925.881109]
>                rcu_scheduler_active = 2, debug_locks = 1
> [ 1925.881112] 1 lock held by tc/18591:
> [ 1925.881115]  #0: 00000000b03cb918 (rtnl_mutex){+.+.}, at: tc_new_tfilter+0x47c/0x970
> [ 1925.881124]
>                stack backtrace:
> [ 1925.881127] CPU: 2 PID: 18591 Comm: tc Not tainted 5.3.0-rc1+ #574
> [ 1925.881130] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017
> [ 1925.881132] Call Trace:
> [ 1925.881138]  dump_stack+0x85/0xc0
> [ 1925.881145]  tc_setup_flow_action+0x1771/0x2040
> [ 1925.881155]  fl_hw_replace_filter+0x11f/0x2e0 [cls_flower]
> [ 1925.881175]  fl_change+0xd24/0x1b30 [cls_flower]
> [ 1925.881200]  tc_new_tfilter+0x3e0/0x970
> [ 1925.881231]  ? tc_del_tfilter+0x720/0x720
> [ 1925.881243]  rtnetlink_rcv_msg+0x389/0x4b0
> [ 1925.881250]  ? netlink_deliver_tap+0x95/0x400
> [ 1925.881257]  ? rtnl_dellink+0x2d0/0x2d0
> [ 1925.881264]  netlink_rcv_skb+0x49/0x110
> [ 1925.881275]  netlink_unicast+0x171/0x200
> [ 1925.881284]  netlink_sendmsg+0x224/0x3f0
> [ 1925.881299]  sock_sendmsg+0x5e/0x60
> [ 1925.881305]  ___sys_sendmsg+0x2ae/0x330
> [ 1925.881309]  ? task_work_add+0x43/0x50
> [ 1925.881314]  ? fput_many+0x45/0x80
> [ 1925.881329]  ? __lock_acquire+0x248/0x1930
> [ 1925.881342]  ? find_held_lock+0x2b/0x80
> [ 1925.881347]  ? task_work_run+0x7b/0xd0
> [ 1925.881359]  __sys_sendmsg+0x59/0xa0
> [ 1925.881375]  do_syscall_64+0x5c/0xb0
> [ 1925.881381]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [ 1925.881384] RIP: 0033:0x7feb245047b8
> [ 1925.881388] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 65 8f 0c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83
>  ec 28 89 54
> [ 1925.881391] RSP: 002b:00007ffc2d2a5788 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
> [ 1925.881395] RAX: ffffffffffffffda RBX: 000000005d4497ed RCX: 00007feb245047b8
> [ 1925.881398] RDX: 0000000000000000 RSI: 00007ffc2d2a57f0 RDI: 0000000000000003
> [ 1925.881400] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000006
> [ 1925.881403] R10: 0000000000404ec2 R11: 0000000000000246 R12: 0000000000000001
> [ 1925.881406] R13: 0000000000480640 R14: 0000000000000012 R15: 0000000000000001
> 
> Change tcf_police_rate_bytes_ps() and tcf_police_tcfp_burst() helpers to
> allow using them from both rtnl and rcu protected contexts.
> 
> Fixes: 8c8cfc6ed274 ("net/sched: add police action to the hardware intermediate representation")
> Signed-off-by: Vlad Buslov <vladbu@mellanox.com>

Thank you Vlad.
Reviewed-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>

^ permalink raw reply

* Re: [PATCH net 2/2] net: sched: sample: allow accessing psample_group with rtnl
From: Pieter Jansen van Vuuren @ 2019-08-05 11:39 UTC (permalink / raw)
  To: Vlad Buslov, netdev; +Cc: jhs, xiyou.wangcong, jiri, davem
In-Reply-To: <20190803133619.10574-3-vladbu@mellanox.com>

On 2019/08/03 14:36, Vlad Buslov wrote:
> Recently implemented support for sample action in flow_offload infra leads
> to following rcu usage warning:
> 
> [ 1938.234856] =============================
> [ 1938.234858] WARNING: suspicious RCU usage
> [ 1938.234863] 5.3.0-rc1+ #574 Not tainted
> [ 1938.234866] -----------------------------
> [ 1938.234869] include/net/tc_act/tc_sample.h:47 suspicious rcu_dereference_check() usage!
> [ 1938.234872]
>                other info that might help us debug this:
> 
> [ 1938.234875]
>                rcu_scheduler_active = 2, debug_locks = 1
> [ 1938.234879] 1 lock held by tc/19540:
> [ 1938.234881]  #0: 00000000b03cb918 (rtnl_mutex){+.+.}, at: tc_new_tfilter+0x47c/0x970
> [ 1938.234900]
>                stack backtrace:
> [ 1938.234905] CPU: 2 PID: 19540 Comm: tc Not tainted 5.3.0-rc1+ #574
> [ 1938.234908] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017
> [ 1938.234911] Call Trace:
> [ 1938.234922]  dump_stack+0x85/0xc0
> [ 1938.234930]  tc_setup_flow_action+0xed5/0x2040
> [ 1938.234944]  fl_hw_replace_filter+0x11f/0x2e0 [cls_flower]
> [ 1938.234965]  fl_change+0xd24/0x1b30 [cls_flower]
> [ 1938.234990]  tc_new_tfilter+0x3e0/0x970
> [ 1938.235021]  ? tc_del_tfilter+0x720/0x720
> [ 1938.235028]  rtnetlink_rcv_msg+0x389/0x4b0
> [ 1938.235038]  ? netlink_deliver_tap+0x95/0x400
> [ 1938.235044]  ? rtnl_dellink+0x2d0/0x2d0
> [ 1938.235053]  netlink_rcv_skb+0x49/0x110
> [ 1938.235063]  netlink_unicast+0x171/0x200
> [ 1938.235073]  netlink_sendmsg+0x224/0x3f0
> [ 1938.235091]  sock_sendmsg+0x5e/0x60
> [ 1938.235097]  ___sys_sendmsg+0x2ae/0x330
> [ 1938.235111]  ? __handle_mm_fault+0x12cd/0x19e0
> [ 1938.235125]  ? __handle_mm_fault+0x12cd/0x19e0
> [ 1938.235138]  ? find_held_lock+0x2b/0x80
> [ 1938.235147]  ? do_user_addr_fault+0x22d/0x490
> [ 1938.235160]  __sys_sendmsg+0x59/0xa0
> [ 1938.235178]  do_syscall_64+0x5c/0xb0
> [ 1938.235187]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [ 1938.235192] RIP: 0033:0x7ff9a4d597b8
> [ 1938.235197] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 65 8f 0c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83
>  ec 28 89 54
> [ 1938.235200] RSP: 002b:00007ffcfe381c48 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
> [ 1938.235205] RAX: ffffffffffffffda RBX: 000000005d4497f9 RCX: 00007ff9a4d597b8
> [ 1938.235208] RDX: 0000000000000000 RSI: 00007ffcfe381cb0 RDI: 0000000000000003
> [ 1938.235211] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000006
> [ 1938.235214] R10: 0000000000404ec2 R11: 0000000000000246 R12: 0000000000000001
> [ 1938.235217] R13: 0000000000480640 R14: 0000000000000012 R15: 0000000000000001
> 
> Change tcf_sample_psample_group() helper to allow using it from both rtnl
> and rcu protected contexts.
> 
> Fixes: a7a7be6087b0 ("net/sched: add sample action to the hardware intermediate representation")
> Signed-off-by: Vlad Buslov <vladbu@mellanox.com>

Thanks
Reviewed-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>

^ permalink raw reply

* Re: [PATCH 1/2] usb: serial: option: Add the BroadMobi BM818 card
From: Johan Hovold @ 2019-08-05 11:47 UTC (permalink / raw)
  To: Angus Ainslie (Purism)
  Cc: kernel, Bjørn Mork, David S. Miller, Johan Hovold,
	Greg Kroah-Hartman, netdev, linux-usb, linux-kernel, Bob Ham
In-Reply-To: <20190724145227.27169-2-angus@akkea.ca>

On Wed, Jul 24, 2019 at 07:52:26AM -0700, Angus Ainslie (Purism) wrote:
> From: Bob Ham <bob.ham@puri.sm>
> 
> Add a VID:PID for the BroadModi BM818 M.2 card

Would you mind posting the output of usb-devices (or lsusb -v) for this
device?

> Signed-off-by: Bob Ham <bob.ham@puri.sm>
> Signed-off-by: Angus Ainslie (Purism) <angus@akkea.ca>
> ---
>  drivers/usb/serial/option.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/usb/serial/option.c b/drivers/usb/serial/option.c
> index c1582fbd1150..674a68ee9564 100644
> --- a/drivers/usb/serial/option.c
> +++ b/drivers/usb/serial/option.c
> @@ -1975,6 +1975,8 @@ static const struct usb_device_id option_ids[] = {
>  	  .driver_info = RSVD(4) | RSVD(5) },
>  	{ USB_DEVICE_INTERFACE_CLASS(0x2cb7, 0x0105, 0xff),			/* Fibocom NL678 series */
>  	  .driver_info = RSVD(6) },
> +	{ USB_DEVICE(0x2020, 0x2060),						/* BroadMobi  */

Looks like you forgot to include the model in the comment here.

And please move this one after the other 0x2020 (PID 0x2031) entry.

Should you also be using USB_DEVICE_INTERFACE_CLASS() (e.g. to avoid
matching a mass-storage interface)?

> +	  .driver_info = RSVD(4) },
>  	{ } /* Terminating entry */
>  };
>  MODULE_DEVICE_TABLE(usb, option_ids);

Johan

^ permalink raw reply

* RE: [PATCH] net: sctp: Rename fallthrough label to unhandled
From: David Laight @ 2019-08-05 11:49 UTC (permalink / raw)
  To: 'Joe Perches', Neil Horman
  Cc: Vlad Yasevich, Marcelo Ricardo Leitner, David S. Miller,
	linux-sctp@vger.kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <a9d003ddd0d59fb144db3ecda3453b3d9c0cb139.camel@perches.com>

From: Joe Perches
> Sent: 01 August 2019 18:43
> On Thu, 2019-08-01 at 06:50 -0400, Neil Horman wrote:
> > On Wed, Jul 31, 2019 at 03:23:46PM -0700, Joe Perches wrote:
> []
> > You can say that if you want, but you made the point that your think the macro
> > as you have written is more readable.  You example illustrates though that /*
> > fallthrough */ is a pretty common comment, and not prefixing it makes it look
> > like someone didn't add a comment that they meant to.  The __ prefix is standard
> > practice for defining macros to attributes (212 instances of it by my count).  I
> > don't mind rewriting the goto labels at all, but I think consistency is
> > valuable.
> 
> Hey Neil.
> 
> Perhaps you want to make this argument on the RFC patch thread
> that introduces the fallthrough pseudo-keyword.
> 
> https://lore.kernel.org/patchwork/patch/1108577/

ISTM that the only place where you need something other than the
traditional comment is inside a #define where (almost certainly)
the comments have to get stripped too early.

Adding a 'fallthough' as unknown C keyword sucks...


	David

 

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply

* [PATCH net-next v3] net: can: Fix compiling warnings for two functions
From: Mao Wenan @ 2019-08-05 11:57 UTC (permalink / raw)
  To: davem, socketcan; +Cc: netdev, linux-kernel, kernel-janitors, maowenan
In-Reply-To: <6fd68e9b-a8ae-4e5e-9b23-c099b5ca9aa4@web.de>

There are two warnings in net/can, fix them by setting bcm_sock_no_ioctlcmd
and raw_sock_no_ioctlcmd as static.

net/can/bcm.c:1683:5: warning: symbol 'bcm_sock_no_ioctlcmd' was not declared. Should it be static?
net/can/raw.c:840:5: warning: symbol 'raw_sock_no_ioctlcmd' was not declared. Should it be static?

Fixes: 473d924d7d46 ("can: fix ioctl function removal")

Signed-off-by: Mao Wenan <maowenan@huawei.com>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
---
 v1->v2: change patch description typo error, 'warings' to 'warnings'.
 v2->v3: change subject of patch.
 net/can/bcm.c | 2 +-
 net/can/raw.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/can/bcm.c b/net/can/bcm.c
index bf1d0bbecec8..b8a32b4ac368 100644
--- a/net/can/bcm.c
+++ b/net/can/bcm.c
@@ -1680,7 +1680,7 @@ static int bcm_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
 	return size;
 }
 
-int bcm_sock_no_ioctlcmd(struct socket *sock, unsigned int cmd,
+static int bcm_sock_no_ioctlcmd(struct socket *sock, unsigned int cmd,
 			 unsigned long arg)
 {
 	/* no ioctls for socket layer -> hand it down to NIC layer */
diff --git a/net/can/raw.c b/net/can/raw.c
index da386f1fa815..a01848ff9b12 100644
--- a/net/can/raw.c
+++ b/net/can/raw.c
@@ -837,7 +837,7 @@ static int raw_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
 	return size;
 }
 
-int raw_sock_no_ioctlcmd(struct socket *sock, unsigned int cmd,
+static int raw_sock_no_ioctlcmd(struct socket *sock, unsigned int cmd,
 			 unsigned long arg)
 {
 	/* no ioctls for socket layer -> hand it down to NIC layer */
-- 
2.20.1


^ permalink raw reply related

* Re: [drop_monitor]  98ffbd6cd2:  will-it-scale.per_thread_ops -17.5% regression
From: Ido Schimmel @ 2019-08-05 11:56 UTC (permalink / raw)
  To: kernel test robot
  Cc: netdev, davem, nhorman, dsahern, roopa, nikolay, jakub.kicinski,
	toke, andy, f.fainelli, andrew, vivien.didelot, mlxsw,
	Ido Schimmel, lkp
In-Reply-To: <20190729095213.GQ22106@shao2-debian>

On Mon, Jul 29, 2019 at 05:52:13PM +0800, kernel test robot wrote:
> Greeting,
> 
> FYI, we noticed a -17.5% regression of will-it-scale.per_thread_ops due to commit:
> 
> 
> commit: 98ffbd6cd2b25fc6cbb0695e03b4fd43b5e116e6 ("[RFC PATCH net-next 10/12] drop_monitor: Add packet alert mode")
> url: https://github.com/0day-ci/linux/commits/Ido-Schimmel/drop_monitor-Capture-dropped-packets-and-metadata/20190723-135834
> 
> 
> in testcase: will-it-scale
> on test machine: 288 threads Intel(R) Xeon Phi(TM) CPU 7295 @ 1.50GHz with 80G memory
> with following parameters:
> 
> 	nr_task: 100%
> 	mode: thread
> 	test: lock1
> 	cpufreq_governor: performance

Hi,

Thanks for the report. The test ('lock1') has nothing to do with the
networking subsystem and the commit that you cite is not doing anything
unless you're actually running drop monitor in this newly introduced
mode. I assume you're not running drop monitor at all? Therefore, these
results seem very strange to me.

The only thing I could think of to explain this is that somehow the
addition of 'struct sk_buff_head' to the per-CPU variable might have
affected alignment elsewhere.

I used your kernel config on my system and tried to run the test like
you did [1][2]. Did not get conclusive results [3]. Took measurements on
vanilla net-next and with my entire patchset applied (with some changes
since RFC).

If you look at the operations per seconds in the 'threads' column when
there are 4 tasks you can see that the average before my patchset is
2325577, while the average after is 2340328.

Do you see anything obviously wrong in how I ran the test? If not, in
your experience, how reliable are your results? I found a similar report
[4] that did not make a lot of sense as well.

Thanks

[1]
#!/bin/bash

for cpu_dir in /sys/devices/system/cpu/cpu[0-9]*
do
        online_file="$cpu_dir"/online
        [ -f "$online_file" ] && [ "$(cat "$online_file")" -eq 0 ] && continue

        file="$cpu_dir"/cpufreq/scaling_governor
        [ -f "$file" ] && echo "performance" > "$file"
done

[2]
# ./runtest.py lock1

[3]
before1.csv

tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
1,610132,74.98,594558,74.40,610132
2,1230153,49.95,1184090,49.95,1220264
3,1844832,24.92,1758502,25.07,1830396
4,2454858,0.20,2311086,0.18,2440528

before2.csv

tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
1,607417,74.92,584035,75.03,607417
2,1227674,50.02,1170271,50.05,1214834
3,1846440,24.91,1761115,25.03,1822251
4,2482559,0.23,2343761,0.20,2429668

before3.csv

tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
1,609516,74.96,594691,74.85,609516
2,1231126,49.82,1176170,50.07,1219032
3,1858004,24.93,1761192,25.06,1828548
4,2460096,0.29,2321886,0.20,2438064

after1.csv 

tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
1,623846,75.01,598565,75.01,623846
2,1237010,50.01,1163000,50.06,1247692
3,1858541,24.99,1752192,24.98,1871538
4,2477562,0.20,2338462,0.20,2495384

after2.csv

tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
1,624175,74.98,593229,60.28,624175
2,1237561,45.43,1168572,50.01,1248350
3,1850211,25.03,1744378,24.90,1872525
4,2481224,0.20,2335768,0.20,2496700

after3.csv

tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
1,617805,74.99,590419,75.02,617805
2,1230908,50.01,1158534,50.06,1235610
3,1851623,25.06,1728419,24.94,1853415
4,2470115,0.20,2346754,0.20,2471220

[4] https://lkml.org/lkml/2019/2/19/351

^ permalink raw reply

* [PATCH] dt-bindings: net: meson-dwmac: convert to yaml
From: Neil Armstrong @ 2019-08-05 12:25 UTC (permalink / raw)
  To: robh+dt
  Cc: Neil Armstrong, martin.blumenstingl, devicetree, netdev,
	linux-amlogic, linux-arm-kernel, linux-kernel

Now that we have the DT validation in place, let's convert the device tree
bindings for the Synopsys DWMAC Glue for Amlogic SoCs over to a YAML schemas.

Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
---
Rob, 

I keep getting :
.../devicetree/bindings/net/amlogic,meson-dwmac.example.dt.yaml: ethernet@c9410000: reg: [[3376480256, 65536], [3364046144, 8]] is too long

for the example DT

and for the board DT :
../amlogic/meson-gxl-s905x-libretech-cc.dt.yaml: ethernet@c9410000: reg: [[0, 3376480256, 0, 65536, 0, 3364046144, 0, 4]] is too short
../amlogic/meson-gxl-s905x-nexbox-a95x.dt.yaml: soc: ethernet@c9410000:reg:0: [0, 3376480256, 0, 65536, 0, 3364046144, 0, 4] is too long

and I don't know how to get rid of it.

What should I do for reg ?

Neil

 .../bindings/net/amlogic,meson-dwmac.yaml     | 113 ++++++++++++++++++
 .../devicetree/bindings/net/meson-dwmac.txt   |  71 -----------
 .../devicetree/bindings/net/snps,dwmac.yaml   |   5 +
 3 files changed, 118 insertions(+), 71 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/net/amlogic,meson-dwmac.yaml
 delete mode 100644 Documentation/devicetree/bindings/net/meson-dwmac.txt

diff --git a/Documentation/devicetree/bindings/net/amlogic,meson-dwmac.yaml b/Documentation/devicetree/bindings/net/amlogic,meson-dwmac.yaml
new file mode 100644
index 000000000000..ae91aa9d8616
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/amlogic,meson-dwmac.yaml
@@ -0,0 +1,113 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+# Copyright 2019 BayLibre, SAS
+%YAML 1.2
+---
+$id: "http://devicetree.org/schemas/net/amlogic,meson-dwmac.yaml#"
+$schema: "http://devicetree.org/meta-schemas/core.yaml#"
+
+title: Amlogic Meson DWMAC Ethernet controller
+
+maintainers:
+  - Neil Armstrong <narmstrong@baylibre.com>
+  - Martin Blumenstingl <martin.blumenstingl@googlemail.com>
+
+# We need a select here so we don't match all nodes with 'snps,dwmac'
+select:
+  properties:
+    compatible:
+      contains:
+        enum:
+          - amlogic,meson6-dwmac
+          - amlogic,meson8b-dwmac
+          - amlogic,meson8m2-dwmac
+          - amlogic,meson-gxbb-dwmac
+          - amlogic,meson-axg-dwmac
+  required:
+    - compatible
+
+allOf:
+  - $ref: "snps,dwmac.yaml#"
+  - if:
+      properties:
+        compatible:
+          contains:
+            enum:
+              - amlogic,meson8b-dwmac
+              - amlogic,meson8m2-dwmac
+              - amlogic,meson-gxbb-dwmac
+              - amlogic,meson-axg-dwmac
+
+    then:
+      properties:
+        clocks:
+          items:
+            - description: GMAC main clock
+            - description: First parent clock of the internal mux
+            - description: Second parent clock of the internal mux
+
+        clock-names:
+          minItems: 3
+          maxItems: 3
+          items:
+            - const: stmmaceth
+            - const: clkin0
+            - const: clkin1
+
+        amlogic,tx-delay-ns:
+          $ref: /schemas/types.yaml#definitions/uint32
+          description:
+            The internal RGMII TX clock delay (provided by this driver) in
+            nanoseconds. Allowed values are 0ns, 2ns, 4ns, 6ns.
+            When phy-mode is set to "rgmii" then the TX delay should be
+            explicitly configured. When not configured a fallback of 2ns is
+            used. When the phy-mode is set to either "rgmii-id" or "rgmii-txid"
+            the TX clock delay is already provided by the PHY. In that case
+            this property should be set to 0ns (which disables the TX clock
+            delay in the MAC to prevent the clock from going off because both
+            PHY and MAC are adding a delay).
+            Any configuration is ignored when the phy-mode is set to "rmii".
+
+properties:
+  compatible:
+    additionalItems: true
+    maxItems: 3
+    items:
+      - enum:
+          - amlogic,meson6-dwmac
+          - amlogic,meson8b-dwmac
+          - amlogic,meson8m2-dwmac
+          - amlogic,meson-gxbb-dwmac
+          - amlogic,meson-axg-dwmac
+    contains:
+      enum:
+        - snps,dwmac-3.70a
+        - snps,dwmac
+
+  reg:
+    items:
+      - description:
+          The first register range should be the one of the DWMAC controller
+      - description:
+          The second range is is for the Amlogic specific configuration
+          (for example the PRG_ETHERNET register range on Meson8b and newer)
+
+required:
+  - compatible
+  - reg
+  - interrupts
+  - interrupt-names
+  - clocks
+  - clock-names
+  - phy-mode
+
+examples:
+  - |
+    ethmac: ethernet@c9410000 {
+         compatible = "amlogic,meson-gxbb-dwmac", "snps,dwmac";
+         reg = <0xc9410000 0x10000>, <0xc8834540 0x8>;
+         interrupts = <8>;
+         interrupt-names = "macirq";
+         clocks = <&clk_eth>, <&clkc_fclk_div2>, <&clk_mpll2>;
+         clock-names = "stmmaceth", "clkin0", "clkin1";
+         phy-mode = "rgmii";
+    };
diff --git a/Documentation/devicetree/bindings/net/meson-dwmac.txt b/Documentation/devicetree/bindings/net/meson-dwmac.txt
deleted file mode 100644
index 1321bb194ed9..000000000000
--- a/Documentation/devicetree/bindings/net/meson-dwmac.txt
+++ /dev/null
@@ -1,71 +0,0 @@
-* Amlogic Meson DWMAC Ethernet controller
-
-The device inherits all the properties of the dwmac/stmmac devices
-described in the file stmmac.txt in the current directory with the
-following changes.
-
-Required properties on all platforms:
-
-- compatible:	Depending on the platform this should be one of:
-			- "amlogic,meson6-dwmac"
-			- "amlogic,meson8b-dwmac"
-			- "amlogic,meson8m2-dwmac"
-			- "amlogic,meson-gxbb-dwmac"
-			- "amlogic,meson-axg-dwmac"
-		Additionally "snps,dwmac" and any applicable more
-		detailed version number described in net/stmmac.txt
-		should be used.
-
-- reg:	The first register range should be the one of the DWMAC
-	controller. The second range is is for the Amlogic specific
-	configuration (for example the PRG_ETHERNET register range
-	on Meson8b and newer)
-
-Required properties on Meson8b, Meson8m2, GXBB and newer:
-- clock-names:	Should contain the following:
-		- "stmmaceth" - see stmmac.txt
-		- "clkin0" - first parent clock of the internal mux
-		- "clkin1" - second parent clock of the internal mux
-
-Optional properties on Meson8b, Meson8m2, GXBB and newer:
-- amlogic,tx-delay-ns:	The internal RGMII TX clock delay (provided
-			by this driver) in nanoseconds. Allowed values
-			are: 0ns, 2ns, 4ns, 6ns.
-			When phy-mode is set to "rgmii" then the TX
-			delay should be explicitly configured. When
-			not configured a fallback of 2ns is used.
-			When the phy-mode is set to either "rgmii-id"
-			or "rgmii-txid" the TX clock delay is already
-			provided by the PHY. In that case this
-			property should be set to 0ns (which disables
-			the TX clock delay in the MAC to prevent the
-			clock from going off because both PHY and MAC
-			are adding a delay).
-			Any configuration is ignored when the phy-mode
-			is set to "rmii".
-
-Example for Meson6:
-
-	ethmac: ethernet@c9410000 {
-		compatible = "amlogic,meson6-dwmac", "snps,dwmac";
-		reg = <0xc9410000 0x10000
-		       0xc1108108 0x4>;
-		interrupts = <0 8 1>;
-		interrupt-names = "macirq";
-		clocks = <&clk81>;
-		clock-names = "stmmaceth";
-	}
-
-Example for GXBB:
-	ethmac: ethernet@c9410000 {
-		compatible = "amlogic,meson-gxbb-dwmac", "snps,dwmac";
-		reg = <0x0 0xc9410000 0x0 0x10000>,
-			<0x0 0xc8834540 0x0 0x8>;
-		interrupts = <0 8 1>;
-		interrupt-names = "macirq";
-		clocks = <&clkc CLKID_ETH>,
-				<&clkc CLKID_FCLK_DIV2>,
-				<&clkc CLKID_MPLL2>;
-		clock-names = "stmmaceth", "clkin0", "clkin1";
-		phy-mode = "rgmii";
-	};
diff --git a/Documentation/devicetree/bindings/net/snps,dwmac.yaml b/Documentation/devicetree/bindings/net/snps,dwmac.yaml
index 76fea2be66ac..ff1dc662a4e3 100644
--- a/Documentation/devicetree/bindings/net/snps,dwmac.yaml
+++ b/Documentation/devicetree/bindings/net/snps,dwmac.yaml
@@ -50,6 +50,11 @@ properties:
         - allwinner,sun8i-r40-emac
         - allwinner,sun8i-v3s-emac
         - allwinner,sun50i-a64-emac
+        - amlogic,meson6-dwmac
+        - amlogic,meson8b-dwmac
+        - amlogic,meson8m2-dwmac
+        - amlogic,meson-gxbb-dwmac
+        - amlogic,meson-axg-dwmac
         - snps,dwmac
         - snps,dwmac-3.50a
         - snps,dwmac-3.610
-- 
2.22.0


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox