Netdev List
* Re: [PATCH net-next] net: dsa: Expose tagging protocol to user-space
From: Andrew Lunn @ 2018-09-10  3:04 UTC
  To: Jiri Pirko
  Cc: Florian Fainelli, netdev, Vivien Didelot, David S. Miller,
	open list
In-Reply-To: <20180908094331.GB3246@nanopsycho.orion>

On Sat, Sep 08, 2018 at 11:43:31AM +0200, Jiri Pirko wrote:
> Fri, Sep 07, 2018 at 08:09:02PM CEST, f.fainelli@gmail.com wrote:
> >There is no way for user-space to know what a given DSA network device's
> >tagging protocol is. Expose this information through a dsa/tagging
> >attribute which reflects the tagging protocol currently in use.
> >
> >This is helpful for configuration (e.g: none behaves dramatically
> >different wrt. bridges) as well as for packet capture tools when there
> >is not a proper Ethernet type available.
> 
> 
> Hmm, I wonder. Is this something that varies between ports of an
> individual ASIC? Or is it rather something defined per-ASIC. If so, this
> looks more like a devlink-api material.

Hi Jiri

This is between the CPU Ethernet device and the switch port that
interface is connected to.

For the Marvell devices, any switch port can be connected to the CPU
Ethernet interface, and the same protocol is used. However, some
switches have a special port which should be used to connect to the
CPU ethernet and supports this tagging protocol. If the designer gets
it wrong and uses a different port to connect the CPU, no tagging
protocol can be used, which as Florian indicated, has a big impact on
bridging, etc.

And just to make it more complex, Marvell has two tagging
schemes. Older devices use DSA, newer devices use EDSA. However, for
the very new devices in the 6390 family, Marvell made a subtle change
to how EDSA works, which broke it, so we had to go back to DSA.

Of the different tagging protocols used by the 50 or so switches Linux
supports, only the EDSA tagging protocol makes use of an
Ethertype. tcpdump knows how to decode these packets. For all the
other tagging protocols, it has no idea, the Ethertype is all messed
up, and it just prints hex. What I think Florian wants to do is stuff
the tagging protocol into the pcap-ng header so that tcpdump knows
what protocol is in use, and can put the correct protocol dissector in
the chain.
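
A rough sketch of why EDSA is the easy case for a capture tool, assuming
the frame starts with the two 6-byte MAC addresses and that the EDSA tag
is announced by Ethertype 0xDADA (the value of ETH_P_EDSA in the kernel's
if_ether.h). The helper name is made up for this example; the other
tagging protocols have no such marker, which is why the protocol would
have to come from somewhere else, such as the capture header.

```c
#include <stddef.h>
#include <stdint.h>

#define EDSA_ETHERTYPE 0xDADA /* value of ETH_P_EDSA in linux/if_ether.h */

/*
 * Hypothetical helper: returns 1 if the 16-bit field that follows the
 * destination and source MAC addresses carries the EDSA Ethertype,
 * 0 otherwise (including frames too short to hold that field).
 */
static int frame_has_edsa_ethertype(const uint8_t *frame, size_t len)
{
	uint16_t ethertype;

	if (len < 14)
		return 0;

	/* The Ethertype is stored in network (big-endian) byte order. */
	ethertype = (uint16_t)((frame[12] << 8) | frame[13]);
	return ethertype == EDSA_ETHERTYPE;
}
```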

And just for completeness, there potentially is a second tagging
scheme when you have multiple switches connected together in a
cluster, but that is internal to the cluster. The CPU is unaware of
it. But if you are snooping on the traffic, you need to know what the
protocol is, so you can decode the frames. In theory, that could be
EDSA or DSA, and can be selected per intra-switch link. In practice,
they are all DSA, and I've not heard of anybody actually snooping this
traffic.

	Andrew


* Re: Re: [PATCH net-next v2 0/5] virtio: support packed ring
From: Tiwei Bie @ 2018-09-10  3:00 UTC
  To: Michael S. Tsirkin
  Cc: jasowang, virtualization, linux-kernel, netdev, virtio-dev, wexu,
	jfreimann
In-Reply-To: <20180907084509-mutt-send-email-mst@kernel.org>

On Fri, Sep 07, 2018 at 09:00:49AM -0400, Michael S. Tsirkin wrote:
> On Fri, Sep 07, 2018 at 09:22:25AM +0800, Tiwei Bie wrote:
> > On Mon, Aug 27, 2018 at 05:00:40PM +0300, Michael S. Tsirkin wrote:
> > > Are there still plans to test the performance with vhost pmd?
> > > vhost doesn't seem to show a performance gain ...
> > > 
> > 
> > I tried some performance tests with vhost PMD. In guest, the
> > XDP program will return XDP_DROP directly. And in host, testpmd
> > will do txonly fwd.
> > 
> > When burst size is 1 and packet size is 64 in testpmd and
> > testpmd needs to iterate 5 Tx queues (but only the first two
> > queues are enabled) to prepare and inject packets, I got ~12%
> > performance boost (5.7Mpps -> 6.4Mpps). And if the vhost PMD
> > is faster (e.g. just need to iterate the first two queues to
> > prepare and inject packets), then I got similar performance
> > for both rings (~9.9Mpps) (packed ring's performance can be
> > lower, because it's more complicated in driver.)
> > 
> > I think packed ring makes vhost PMD faster, but it doesn't make
> > the driver faster. In packed ring, the ring is simplified, and
> > the handling of the ring in vhost (device) is also simplified,
> > but things are not simplified in driver, e.g. although there is
> > no desc table in the virtqueue anymore, driver still needs to
> > maintain a private desc state table (which is still managed as
> > a list in this patch set) to support the out-of-order desc
> > processing in vhost (device).
> > 
> > I think this patch set is mainly to make the driver have a full
> > functional support for the packed ring, which makes it possible
> > to leverage the packed ring feature in vhost (device). But I'm
> > not sure whether there is any other better idea, I'd like to
> > hear your thoughts. Thanks!
> 
> Just this: Jens seems to report a nice gain with virtio and
> vhost pmd across the board. Try to compare virtio and
> virtio pmd to see what does pmd do better?

The virtio PMD (drivers/net/virtio) in DPDK doesn't need to share
the virtio ring operation code with other drivers and is highly
optimized for networking. E.g. in Rx, the Rx burst function won't
chain descs. So the ID management for the Rx ring can be quite
simple and straightforward, we just need to initialize these IDs
when initializing the ring and don't need to change these IDs
in data path anymore (the mergeable Rx code in that patch set
assumes the descs will be written back in order, which should be
fixed. I.e., the ID in the desc should be used to index vq->descx[]).
The Tx code in that patch set also assumes the descs will be
written back by device in order, which should be fixed.

But in the kernel virtio driver, virtio_ring.c is very generic.
The enqueue (virtqueue_add()) and dequeue (virtqueue_get_buf_ctx())
functions need to support all the virtio devices and should be
able to handle all the possible cases that may happen. So although
the packed ring can be very efficient in some cases, there is
currently not much room to optimize performance in the kernel's
virtio_ring.c. If we want to take full advantage of the
packed ring's efficiency, we need further changes (e.g. API changes)
in virtio_ring.c, which shouldn't be part of this patch set. So
I still think this patch set is mainly about giving the kernel virtio
driver fully functional support for the packed ring, and
we can't expect an impressive performance boost from it.

> 
> 
> > 
> > > 
> > > On Wed, Jul 11, 2018 at 10:27:06AM +0800, Tiwei Bie wrote:
> > > > Hello everyone,
> > > > 
> > > > This patch set implements packed ring support in virtio driver.
> > > > 
> > > > Some functional tests have been done with Jason's
> > > > packed ring implementation in vhost:
> > > > 
> > > > https://lkml.org/lkml/2018/7/3/33
> > > > 
> > > > Both of ping and netperf worked as expected.
> > > > 
> > > > v1 -> v2:
> > > > - Use READ_ONCE() to read event off_wrap and flags together (Jason);
> > > > - Add comments related to ccw (Jason);
> > > > 
> > > > RFC (v6) -> v1:
> > > > - Avoid extra virtio_wmb() in virtqueue_enable_cb_delayed_packed()
> > > >   when event idx is off (Jason);
> > > > - Fix bufs calculation in virtqueue_enable_cb_delayed_packed() (Jason);
> > > > - Test the state of the desc at used_idx instead of last_used_idx
> > > >   in virtqueue_enable_cb_delayed_packed() (Jason);
> > > > - Save wrap counter (as part of queue state) in the return value
> > > >   of virtqueue_enable_cb_prepare_packed();
> > > > - Refine the packed ring definitions in uapi;
> > > > - Rebase on the net-next tree;
> > > > 
> > > > RFC v5 -> RFC v6:
> > > > - Avoid tracking addr/len/flags when DMA API isn't used (MST/Jason);
> > > > - Define wrap counter as bool (Jason);
> > > > - Use ALIGN() in vring_init_packed() (Jason);
> > > > - Avoid using pointer to track `next` in detach_buf_packed() (Jason);
> > > > - Add comments for barriers (Jason);
> > > > - Don't enable RING_PACKED on ccw for now (noticed by Jason);
> > > > - Refine the memory barrier in virtqueue_poll();
> > > > - Add a missing memory barrier in virtqueue_enable_cb_delayed_packed();
> > > > - Remove the hacks in virtqueue_enable_cb_prepare_packed();
> > > > 
> > > > RFC v4 -> RFC v5:
> > > > - Save DMA addr, etc in desc state (Jason);
> > > > - Track used wrap counter;
> > > > 
> > > > RFC v3 -> RFC v4:
> > > > - Make ID allocation support out-of-order (Jason);
> > > > - Various fixes for EVENT_IDX support;
> > > > 
> > > > RFC v2 -> RFC v3:
> > > > - Split into small patches (Jason);
> > > > - Add helper virtqueue_use_indirect() (Jason);
> > > > - Just set id for the last descriptor of a list (Jason);
> > > > - Calculate the prev in virtqueue_add_packed() (Jason);
> > > > - Fix/improve desc suppression code (Jason/MST);
> > > > - Refine the code layout for XXX_split/packed and wrappers (MST);
> > > > - Fix the comments and API in uapi (MST);
> > > > - Remove the BUG_ON() for indirect (Jason);
> > > > - Some other refinements and bug fixes;
> > > > 
> > > > RFC v1 -> RFC v2:
> > > > - Add indirect descriptor support - compile test only;
> > > > - Add event suppression support - compile test only;
> > > > - Move vring_packed_init() out of uapi (Jason, MST);
> > > > - Merge two loops into one in virtqueue_add_packed() (Jason);
> > > > - Split vring_unmap_one() for packed ring and split ring (Jason);
> > > > - Avoid using '%' operator (Jason);
> > > > - Rename free_head -> next_avail_idx (Jason);
> > > > - Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
> > > > - Some other refinements and bug fixes;
> > > > 
> > > > Thanks!
> > > > 
> > > > Tiwei Bie (5):
> > > >   virtio: add packed ring definitions
> > > >   virtio_ring: support creating packed ring
> > > >   virtio_ring: add packed ring support
> > > >   virtio_ring: add event idx support in packed ring
> > > >   virtio_ring: enable packed ring
> > > > 
> > > >  drivers/s390/virtio/virtio_ccw.c   |   14 +
> > > >  drivers/virtio/virtio_ring.c       | 1365 ++++++++++++++++++++++------
> > > >  include/linux/virtio_ring.h        |    8 +-
> > > >  include/uapi/linux/virtio_config.h |    3 +
> > > >  include/uapi/linux/virtio_ring.h   |   43 +
> > > >  5 files changed, 1157 insertions(+), 276 deletions(-)
> > > > 
> > > > -- 
> > > > 2.18.0
> > > 


* Re: [PATCH net V2] ip: frags: fix crash in ip_do_fragment()
From: David Miller @ 2018-09-09 21:51 UTC
  To: edumazet; +Cc: ap420073, posk, netdev, pablo, fw
In-Reply-To: <CANn89iKXTr+zy7ms5v7RNaz0BvQ1DMt+A7OfyPkypJZ9BvzeDg@mail.gmail.com>

From: Eric Dumazet <edumazet@google.com>
Date: Sun, 9 Sep 2018 13:51:21 -0700

>> Hi Eric,
>> Thank you for review!
>>
>> I think netfilter side ipv6 code change is needed
>> because netfilter ipv6 defrag routine also set fp->ip_defrag_offset value
>> so that fp->sk will not be NULL.
>> And I think these crash in ip_do_fragment() and ip6_fragment() are
>> actually same bug.
>>
> 
> Right you are, thanks.
> 
> Reviewed-by: Eric Dumazet <edumazet@google.com>

Applied, thanks everyone.


* Re: Re: [PATCH net-next v2 4/5] virtio_ring: add event idx support in packed ring
From: Tiwei Bie @ 2018-09-10  2:35 UTC
  To: Michael S. Tsirkin
  Cc: jasowang, virtualization, linux-kernel, netdev, virtio-dev, wexu,
	jfreimann
In-Reply-To: <20180907100341-mutt-send-email-mst@kernel.org>

On Fri, Sep 07, 2018 at 10:10:14AM -0400, Michael S. Tsirkin wrote:
> On Wed, Jul 11, 2018 at 10:27:10AM +0800, Tiwei Bie wrote:
> > This commit introduces the EVENT_IDX support in packed ring.
> > 
> > Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
> 
> Besides the usual comment about hard-coded constants like <<15:
> does this actually do any good for performance? We don't
> have to if we do not want to.

Got it. Thanks.

> 
> > ---
> >  drivers/virtio/virtio_ring.c | 73 ++++++++++++++++++++++++++++++++----
> >  1 file changed, 65 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > index f317b485ba54..f79a1e17f7d1 100644
> > --- a/drivers/virtio/virtio_ring.c
> > +++ b/drivers/virtio/virtio_ring.c
> > @@ -1050,7 +1050,7 @@ static inline int virtqueue_add_packed(struct virtqueue *_vq,
> >  static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
> >  {
> >  	struct vring_virtqueue *vq = to_vvq(_vq);
> > -	u16 flags;
> > +	u16 new, old, off_wrap, flags, wrap_counter, event_idx;
> >  	bool needs_kick;
> >  	u32 snapshot;
> >  
> > @@ -1059,9 +1059,19 @@ static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
> >  	 * suppressions. */
> >  	virtio_mb(vq->weak_barriers);
> >  
> > +	old = vq->next_avail_idx - vq->num_added;
> > +	new = vq->next_avail_idx;
> > +	vq->num_added = 0;
> > +
> >  	snapshot = READ_ONCE(*(u32 *)vq->vring_packed.device);
> > +	off_wrap = virtio16_to_cpu(_vq->vdev, (__virtio16)(snapshot & 0xffff));
> >  	flags = virtio16_to_cpu(_vq->vdev, (__virtio16)(snapshot >> 16)) & 0x3;
> 
> 
> some kind of struct union would be helpful to make this readable.

I will define a struct/union for this.
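
Something along these lines, perhaps. This is only a rough sketch in
plain C rather than kernel code: the type and function names are
placeholders, uint16_t stands in for the on-ring 16-bit type, and the
per-field byte-order conversion is left out.

```c
#include <stdint.h>
#include <string.h>

/*
 * Placeholder mirror of the two 16-bit fields of the event suppression
 * structure, in the order they appear in memory.
 */
struct event_suppression_snapshot {
	uint16_t off_wrap;	/* event offset + wrap counter */
	uint16_t flags;		/* event flags */
};

/*
 * Split one 32-bit read of the structure into its two named fields.
 * memcpy() keeps the fields in memory order, so callers no longer need
 * open-coded masks and shifts to tell them apart.
 */
static struct event_suppression_snapshot snapshot_split(uint32_t raw)
{
	struct event_suppression_snapshot s;

	memcpy(&s, &raw, sizeof(s));
	return s;
}
```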

> 
> >  
> > +	wrap_counter = off_wrap >> 15;
> > +	event_idx = off_wrap & ~(1 << 15);
> > +	if (wrap_counter != vq->avail_wrap_counter)
> > +		event_idx -= vq->vring_packed.num;
> > +
> >  #ifdef DEBUG
> >  	if (vq->last_add_time_valid) {
> >  		WARN_ON(ktime_to_ms(ktime_sub(ktime_get(),
> > @@ -1070,7 +1080,10 @@ static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
> >  	vq->last_add_time_valid = false;
> >  #endif
> >  
> > -	needs_kick = (flags != VRING_EVENT_F_DISABLE);
> > +	if (flags == VRING_EVENT_F_DESC)
> > +		needs_kick = vring_need_event(event_idx, new, old);
> > +	else
> > +		needs_kick = (flags != VRING_EVENT_F_DISABLE);
> >  	END_USE(vq);
> >  	return needs_kick;
> >  }
> > @@ -1185,6 +1198,15 @@ static void *virtqueue_get_buf_ctx_packed(struct virtqueue *_vq,
> >  	ret = vq->desc_state_packed[id].data;
> >  	detach_buf_packed(vq, id, ctx);
> >  
> > +	/* If we expect an interrupt for the next entry, tell host
> > +	 * by writing event index and flush out the write before
> > +	 * the read in the next get_buf call. */
> > +	if (vq->event_flags_shadow == VRING_EVENT_F_DESC)
> > +		virtio_store_mb(vq->weak_barriers,
> > +				&vq->vring_packed.driver->off_wrap,
> > +				cpu_to_virtio16(_vq->vdev, vq->last_used_idx |
> > +					((u16)vq->used_wrap_counter << 15)));
> > +
> >  #ifdef DEBUG
> >  	vq->last_add_time_valid = false;
> >  #endif
> > @@ -1213,8 +1235,18 @@ static unsigned virtqueue_enable_cb_prepare_packed(struct virtqueue *_vq)
> >  	/* We optimistically turn back on interrupts, then check if there was
> >  	 * more to do. */
> >  
> > +	if (vq->event) {
> > +		vq->vring_packed.driver->off_wrap = cpu_to_virtio16(_vq->vdev,
> > +				vq->last_used_idx |
> > +				((u16)vq->used_wrap_counter << 15));
> > +		/* We need to update event offset and event wrap
> > +		 * counter first before updating event flags. */
> > +		virtio_wmb(vq->weak_barriers);
> > +	}
> > +
> >  	if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
> > -		vq->event_flags_shadow = VRING_EVENT_F_ENABLE;
> > +		vq->event_flags_shadow = vq->event ? VRING_EVENT_F_DESC :
> > +						     VRING_EVENT_F_ENABLE;
> >  		vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
> >  							vq->event_flags_shadow);
> >  	}
> > @@ -1238,22 +1270,47 @@ static bool virtqueue_poll_packed(struct virtqueue *_vq, unsigned off_wrap)
> >  static bool virtqueue_enable_cb_delayed_packed(struct virtqueue *_vq)
> >  {
> >  	struct vring_virtqueue *vq = to_vvq(_vq);
> > +	u16 bufs, used_idx, wrap_counter;
> >  
> >  	START_USE(vq);
> >  
> >  	/* We optimistically turn back on interrupts, then check if there was
> >  	 * more to do. */
> >  
> > +	if (vq->event) {
> > +		/* TODO: tune this threshold */
> > +		bufs = (vq->vring_packed.num - _vq->num_free) * 3 / 4;
> > +		wrap_counter = vq->used_wrap_counter;
> > +
> > +		used_idx = vq->last_used_idx + bufs;
> > +		if (used_idx >= vq->vring_packed.num) {
> > +			used_idx -= vq->vring_packed.num;
> > +			wrap_counter ^= 1;
> > +		}
> > +
> > +		vq->vring_packed.driver->off_wrap = cpu_to_virtio16(_vq->vdev,
> > +				used_idx | (wrap_counter << 15));
> > +
> > +		/* We need to update event offset and event wrap
> > +		 * counter first before updating event flags. */
> > +		virtio_wmb(vq->weak_barriers);
> > +	} else {
> > +		used_idx = vq->last_used_idx;
> > +		wrap_counter = vq->used_wrap_counter;
> > +	}
> > +
> >  	if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
> > -		vq->event_flags_shadow = VRING_EVENT_F_ENABLE;
> > +		vq->event_flags_shadow = vq->event ? VRING_EVENT_F_DESC :
> > +						     VRING_EVENT_F_ENABLE;
> >  		vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
> >  							vq->event_flags_shadow);
> > -		/* We need to enable interrupts first before re-checking
> > -		 * for more used buffers. */
> > -		virtio_mb(vq->weak_barriers);
> >  	}
> >  
> > -	if (more_used_packed(vq)) {
> > +	/* We need to update event suppression structure first
> > +	 * before re-checking for more used buffers. */
> > +	virtio_mb(vq->weak_barriers);
> > +
> 
> mb is expensive. We should not do it if we changed nothing.

I will try to avoid it when possible.

> 
> > +	if (is_used_desc_packed(vq, used_idx, wrap_counter)) {
> >  		END_USE(vq);
> >  		return false;
> >  	}
> > -- 
> > 2.18.0
> 


* Re: [PATCH net-next v2 2/5] virtio_ring: support creating packed ring
From: Tiwei Bie @ 2018-09-10  2:28 UTC
  To: Michael S. Tsirkin
  Cc: jasowang, virtualization, linux-kernel, netdev, virtio-dev, wexu,
	jfreimann
In-Reply-To: <20180907095208-mutt-send-email-mst@kernel.org>

On Fri, Sep 07, 2018 at 10:03:24AM -0400, Michael S. Tsirkin wrote:
> On Wed, Jul 11, 2018 at 10:27:08AM +0800, Tiwei Bie wrote:
> > This commit introduces the support for creating packed ring.
> > All split ring specific functions are added _split suffix.
> > Some necessary stubs for packed ring are also added.
> > 
> > Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
> 
> I'd rather have a patch just renaming split functions, then
> add all packed stuff in as a separate patch on top.

Sure, I will do that.

> 
> 
> > ---
> >  drivers/virtio/virtio_ring.c | 801 +++++++++++++++++++++++------------
> >  include/linux/virtio_ring.h  |   8 +-
> >  2 files changed, 546 insertions(+), 263 deletions(-)
> > 
> > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > index 814b395007b2..c4f8abc7445a 100644
> > --- a/drivers/virtio/virtio_ring.c
> > +++ b/drivers/virtio/virtio_ring.c
> > @@ -60,11 +60,15 @@ struct vring_desc_state {
> >  	struct vring_desc *indir_desc;	/* Indirect descriptor, if any. */
> >  };
> >  
> > +struct vring_desc_state_packed {
> > +	int next;			/* The next desc state. */
> 
> So this can go away with IN_ORDER?

Yes. If IN_ORDER is negotiated, next won't be needed anymore.
Currently, IN_ORDER isn't included in this patch set, because
some changes for split ring are needed to make sure that it
will use the descs in the expected order. After that,
optimizations can be done for both the split ring and the packed
ring.
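
To illustrate why `next` is needed when IN_ORDER is not negotiated, here
is a simplified user-space sketch of a free list threaded through the
desc state table. All names (id_pool, id_alloc, id_free) are invented for
this example; it is not the code from the patch.

```c
#define POOL_SIZE 8

/*
 * Invented for illustration: a desc state table whose free entries are
 * chained together through `next`.
 */
struct id_pool {
	int next[POOL_SIZE];	/* next free id, or -1 at the end of the list */
	int free_head;		/* first free id, or -1 if none are free */
};

static void id_pool_init(struct id_pool *p)
{
	int i;

	for (i = 0; i < POOL_SIZE; i++)
		p->next[i] = (i + 1 < POOL_SIZE) ? i + 1 : -1;
	p->free_head = 0;
}

/* Take an id from the head of the free list; returns -1 if it is empty. */
static int id_alloc(struct id_pool *p)
{
	int id = p->free_head;

	if (id >= 0)
		p->free_head = p->next[id];
	return id;
}

/* Give an id back in any order: it simply becomes the new list head. */
static void id_free(struct id_pool *p, int id)
{
	p->next[id] = p->free_head;
	p->free_head = id;
}
```

Because the device may return buffers out of order, freed ids can come
back in any sequence, and the list absorbs that. If buffers always came
back in the order they were made available, a running index could
presumably replace the list.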

> 
> > +};
> > +
> >  struct vring_virtqueue {
> >  	struct virtqueue vq;
> >  
> > -	/* Actual memory layout for this queue */
> > -	struct vring vring;
> > +	/* Is this a packed ring? */
> > +	bool packed;
> >  
> >  	/* Can we use weak barriers? */
> >  	bool weak_barriers;
> > @@ -86,11 +90,39 @@ struct vring_virtqueue {
> >  	/* Last used index we've seen. */
> >  	u16 last_used_idx;
> >  
> > -	/* Last written value to avail->flags */
> > -	u16 avail_flags_shadow;
> > +	union {
> > +		/* Available for split ring */
> > +		struct {
> > +			/* Actual memory layout for this queue. */
> > +			struct vring vring;
> >  
> > -	/* Last written value to avail->idx in guest byte order */
> > -	u16 avail_idx_shadow;
> > +			/* Last written value to avail->flags */
> > +			u16 avail_flags_shadow;
> > +
> > +			/* Last written value to avail->idx in
> > +			 * guest byte order. */
> > +			u16 avail_idx_shadow;
> > +		};
> 
> Name this field split so it's easier to detect misuse of e.g.
> packed fields in split code?

Good point, I'll do that.

> 
> > +
> > +		/* Available for packed ring */
> > +		struct {
> > +			/* Actual memory layout for this queue. */
> > +			struct vring_packed vring_packed;
> > +
> > +			/* Driver ring wrap counter. */
> > +			bool avail_wrap_counter;
> > +
> > +			/* Device ring wrap counter. */
> > +			bool used_wrap_counter;
> > +
> > +			/* Index of the next avail descriptor. */
> > +			u16 next_avail_idx;
> > +
> > +			/* Last written value to driver->flags in
> > +			 * guest byte order. */
> > +			u16 event_flags_shadow;
> > +		};
> > +	};
[...]
> > +
> > +/*
> > + * The layout for the packed ring is a continuous chunk of memory
> > + * which looks like this.
> > + *
> > + * struct vring_packed {
> > + *	// The actual descriptors (16 bytes each)
> > + *	struct vring_packed_desc desc[num];
> > + *
> > + *	// Padding to the next align boundary.
> > + *	char pad[];
> > + *
> > + *	// Driver Event Suppression
> > + *	struct vring_packed_desc_event driver;
> > + *
> > + *	// Device Event Suppression
> > + *	struct vring_packed_desc_event device;
> > + * };
> > + */
> 
> Why not just allocate event structures separately?
> Is it a win to have them share a cache line for some reason?

Will do that.

> 
> > +static inline void vring_init_packed(struct vring_packed *vr, unsigned int num,
> > +				     void *p, unsigned long align)
> > +{
> > +	vr->num = num;
> > +	vr->desc = p;
> > +	vr->driver = (void *)ALIGN(((uintptr_t)p +
> > +		sizeof(struct vring_packed_desc) * num), align);
> > +	vr->device = vr->driver + 1;
> > +}
> 
> What's all this about alignment? Where does it come from?

It comes from the `vring_align` parameter of vring_create_virtqueue()
and vring_new_virtqueue():

https://github.com/torvalds/linux/blob/a49a9dcce802/drivers/virtio/virtio_ring.c#L1061
https://github.com/torvalds/linux/blob/a49a9dcce802/drivers/virtio/virtio_ring.c#L1123

Should I just ignore it in packed ring?

CCW defined this:

#define KVM_VIRTIO_CCW_RING_ALIGN 4096

I'm not familiar with CCW. Currently, in this patch set, packed ring
isn't enabled on CCW because some legacy accessors are not implemented
in packed ring yet.
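
For reference, this is how the alignment affects the layout. It is a
stand-alone sketch that mirrors the arithmetic of vring_size_packed() in
the patch, with the 16-byte descriptor and 4-byte event structure sizes
written out as constants; the names below are invented for the example.

```c
#include <stddef.h>

#define PACKED_DESC_SIZE	16	/* addr (8) + len (4) + id (2) + flags (2) */
#define PACKED_EVENT_SIZE	4	/* off_wrap (2) + flags (2) */

/*
 * Offset of the driver event structure: the descriptor array rounded
 * up to `align`, which must be a power of two.
 */
static size_t packed_driver_offset(size_t num, size_t align)
{
	return (PACKED_DESC_SIZE * num + align - 1) & ~(align - 1);
}

/* Total size: aligned descriptors plus driver and device event structures. */
static size_t packed_ring_size(size_t num, size_t align)
{
	return packed_driver_offset(num, align) + 2 * PACKED_EVENT_SIZE;
}
```

With num = 256 and the 4096-byte CCW alignment, the descriptors take
exactly 4096 bytes, so the total is 4104 bytes. With num = 64 the 1024
bytes of descriptors are still padded out to 4096.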

> 
> > +
> > +static inline unsigned vring_size_packed(unsigned int num, unsigned long align)
> > +{
> > +	return ((sizeof(struct vring_packed_desc) * num + align - 1)
> > +		& ~(align - 1)) + sizeof(struct vring_packed_desc_event) * 2;
> > +}
[...]
> > @@ -1129,10 +1388,17 @@ struct virtqueue *vring_new_virtqueue(unsigned int index,
> >  				      void (*callback)(struct virtqueue *vq),
> >  				      const char *name)
> >  {
> > -	struct vring vring;
> > -	vring_init(&vring, num, pages, vring_align);
> > -	return __vring_new_virtqueue(index, vring, vdev, weak_barriers, context,
> > -				     notify, callback, name);
> > +	union vring_union vring;
> > +	bool packed;
> > +
> > +	packed = virtio_has_feature(vdev, VIRTIO_F_RING_PACKED);
> > +	if (packed)
> > +		vring_init_packed(&vring.vring_packed, num, pages, vring_align);
> > +	else
> > +		vring_init(&vring.vring_split, num, pages, vring_align);
> 
> 
> vring_init in the UAPI header is more or less a bug.
> I'd just stop using it, keep it around for legacy userspace.

Got it. I'd like to do that. Thanks.

> 
> > +
> > +	return __vring_new_virtqueue(index, vring, packed, vdev, weak_barriers,
> > +				     context, notify, callback, name);
> >  }
> >  EXPORT_SYMBOL_GPL(vring_new_virtqueue);
> >  
> > @@ -1142,7 +1408,9 @@ void vring_del_virtqueue(struct virtqueue *_vq)
> >  
> >  	if (vq->we_own_ring) {
> >  		vring_free_queue(vq->vq.vdev, vq->queue_size_in_bytes,
> > -				 vq->vring.desc, vq->queue_dma_addr);
> > +				 vq->packed ? (void *)vq->vring_packed.desc :
> > +					      (void *)vq->vring.desc,
> > +				 vq->queue_dma_addr);
> >  	}
> >  	list_del(&_vq->list);
> >  	kfree(vq);
> > @@ -1184,7 +1452,7 @@ unsigned int virtqueue_get_vring_size(struct virtqueue *_vq)
> >  
> >  	struct vring_virtqueue *vq = to_vvq(_vq);
> >  
> > -	return vq->vring.num;
> > +	return vq->packed ? vq->vring_packed.num : vq->vring.num;
> >  }
> >  EXPORT_SYMBOL_GPL(virtqueue_get_vring_size);
> >  
> > @@ -1227,6 +1495,10 @@ dma_addr_t virtqueue_get_avail_addr(struct virtqueue *_vq)
> >  
> >  	BUG_ON(!vq->we_own_ring);
> >  
> > +	if (vq->packed)
> > +		return vq->queue_dma_addr + ((char *)vq->vring_packed.driver -
> > +				(char *)vq->vring_packed.desc);
> > +
> >  	return vq->queue_dma_addr +
> >  		((char *)vq->vring.avail - (char *)vq->vring.desc);
> >  }
> > @@ -1238,11 +1510,16 @@ dma_addr_t virtqueue_get_used_addr(struct virtqueue *_vq)
> >  
> >  	BUG_ON(!vq->we_own_ring);
> >  
> > +	if (vq->packed)
> > +		return vq->queue_dma_addr + ((char *)vq->vring_packed.device -
> > +				(char *)vq->vring_packed.desc);
> > +
> >  	return vq->queue_dma_addr +
> >  		((char *)vq->vring.used - (char *)vq->vring.desc);
> >  }
> >  EXPORT_SYMBOL_GPL(virtqueue_get_used_addr);
> >  
> > +/* Only available for split ring */
> >  const struct vring *virtqueue_get_vring(struct virtqueue *vq)
> >  {
> >  	return &to_vvq(vq)->vring;
> > diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h
> > index fab02133a919..992142b35f55 100644
> > --- a/include/linux/virtio_ring.h
> > +++ b/include/linux/virtio_ring.h
> > @@ -60,6 +60,11 @@ static inline void virtio_store_mb(bool weak_barriers,
> >  struct virtio_device;
> >  struct virtqueue;
> >  
> > +union vring_union {
> > +	struct vring vring_split;
> > +	struct vring_packed vring_packed;
> > +};
> > +
> >  /*
> >   * Creates a virtqueue and allocates the descriptor ring.  If
> >   * may_reduce_num is set, then this may allocate a smaller ring than
> > @@ -79,7 +84,8 @@ struct virtqueue *vring_create_virtqueue(unsigned int index,
> >  
> >  /* Creates a virtqueue with a custom layout. */
> >  struct virtqueue *__vring_new_virtqueue(unsigned int index,
> > -					struct vring vring,
> > +					union vring_union vring,
> > +					bool packed,
> >  					struct virtio_device *vdev,
> >  					bool weak_barriers,
> >  					bool ctx,
> > -- 
> > 2.18.0


* Re: [PATCH net-next v2 1/5] virtio: add packed ring definitions
From: Tiwei Bie @ 2018-09-10  2:13 UTC
  To: Michael S. Tsirkin
  Cc: jasowang, virtualization, linux-kernel, netdev, virtio-dev, wexu,
	jfreimann
In-Reply-To: <20180907094922-mutt-send-email-mst@kernel.org>

On Fri, Sep 07, 2018 at 09:51:23AM -0400, Michael S. Tsirkin wrote:
> On Wed, Jul 11, 2018 at 10:27:07AM +0800, Tiwei Bie wrote:
> > Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
> > ---
> >  include/uapi/linux/virtio_config.h |  3 +++
> >  include/uapi/linux/virtio_ring.h   | 43 ++++++++++++++++++++++++++++++
> >  2 files changed, 46 insertions(+)
> > 
> > diff --git a/include/uapi/linux/virtio_config.h b/include/uapi/linux/virtio_config.h
> > index 449132c76b1c..1196e1c1d4f6 100644
> > --- a/include/uapi/linux/virtio_config.h
> > +++ b/include/uapi/linux/virtio_config.h
> > @@ -75,6 +75,9 @@
> >   */
> >  #define VIRTIO_F_IOMMU_PLATFORM		33
> >  
> > +/* This feature indicates support for the packed virtqueue layout. */
> > +#define VIRTIO_F_RING_PACKED		34
> > +
> >  /*
> >   * Does the device support Single Root I/O Virtualization?
> >   */
> > diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
> > index 6d5d5faa989b..0254a2ba29cf 100644
> > --- a/include/uapi/linux/virtio_ring.h
> > +++ b/include/uapi/linux/virtio_ring.h
> > @@ -44,6 +44,10 @@
> >  /* This means the buffer contains a list of buffer descriptors. */
> >  #define VRING_DESC_F_INDIRECT	4
> >  
> > +/* Mark a descriptor as available or used. */
> > +#define VRING_DESC_F_AVAIL	(1ul << 7)
> > +#define VRING_DESC_F_USED	(1ul << 15)
> > +
> >  /* The Host uses this in used->flags to advise the Guest: don't kick me when
> >   * you add a buffer.  It's unreliable, so it's simply an optimization.  Guest
> >   * will still kick if it's out of buffers. */
> > @@ -53,6 +57,17 @@
> >   * optimization.  */
> >  #define VRING_AVAIL_F_NO_INTERRUPT	1
> >  
> > +/* Enable events. */
> > +#define VRING_EVENT_F_ENABLE	0x0
> > +/* Disable events. */
> > +#define VRING_EVENT_F_DISABLE	0x1
> > +/*
> > + * Enable events for a specific descriptor
> > + * (as specified by Descriptor Ring Change Event Offset/Wrap Counter).
> > + * Only valid if VIRTIO_RING_F_EVENT_IDX has been negotiated.
> > + */
> > +#define VRING_EVENT_F_DESC	0x2
> > +
> >  /* We support indirect buffer descriptors */
> >  #define VIRTIO_RING_F_INDIRECT_DESC	28
> >  
> 
> These are for the packed ring, right? Pls prefix accordingly.

How about something like this:

#define VRING_PACKED_DESC_F_AVAIL	(1u << 7)
#define VRING_PACKED_DESC_F_USED	(1u << 15)

#define VRING_PACKED_EVENT_F_ENABLE	0x0
#define VRING_PACKED_EVENT_F_DISABLE	0x1
#define VRING_PACKED_EVENT_F_DESC	0x2


> Also, you likely need macros for the wrap counters.

How about something like this:

#define VRING_PACKED_EVENT_WRAP_COUNTER_SHIFT	15
#define VRING_PACKED_EVENT_WRAP_COUNTER_MASK	\
			(1u << VRING_PACKED_EVENT_WRAP_COUNTER_SHIFT)
#define VRING_PACKED_EVENT_OFFSET_MASK	\
			((1u << VRING_PACKED_EVENT_WRAP_COUNTER_SHIFT) - 1)
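
With macros like these, packing and unpacking off_wrap could look roughly
like this. It is a user-space sketch: the helper names are made up, and
the macro definitions are repeated so that the example compiles on its
own.

```c
#include <stdint.h>

#define VRING_PACKED_EVENT_WRAP_COUNTER_SHIFT	15
#define VRING_PACKED_EVENT_WRAP_COUNTER_MASK	\
			(1u << VRING_PACKED_EVENT_WRAP_COUNTER_SHIFT)
#define VRING_PACKED_EVENT_OFFSET_MASK	\
			((1u << VRING_PACKED_EVENT_WRAP_COUNTER_SHIFT) - 1)

/* Hypothetical helper: combine a ring offset and a wrap counter bit. */
static uint16_t off_wrap_pack(uint16_t offset, int wrap_counter)
{
	return (uint16_t)((offset & VRING_PACKED_EVENT_OFFSET_MASK) |
			  ((wrap_counter ? 1u : 0u) <<
			   VRING_PACKED_EVENT_WRAP_COUNTER_SHIFT));
}

/* Hypothetical helper: extract the ring offset (low 15 bits). */
static uint16_t off_wrap_offset(uint16_t off_wrap)
{
	return off_wrap & VRING_PACKED_EVENT_OFFSET_MASK;
}

/* Hypothetical helper: extract the wrap counter (bit 15) as 0 or 1. */
static int off_wrap_counter(uint16_t off_wrap)
{
	return (off_wrap & VRING_PACKED_EVENT_WRAP_COUNTER_MASK) != 0;
}
```

That would replace the open-coded `>> 15` and `~(1 << 15)` in
virtqueue_kick_prepare_packed() and friends.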

> 
> > @@ -171,4 +186,32 @@ static inline int vring_need_event(__u16 event_idx, __u16 new_idx, __u16 old)
> >  	return (__u16)(new_idx - event_idx - 1) < (__u16)(new_idx - old);
> >  }
> >  
> > +struct vring_packed_desc_event {
> > +	/* Descriptor Ring Change Event Offset/Wrap Counter. */
> > +	__virtio16 off_wrap;
> > +	/* Descriptor Ring Change Event Flags. */
> > +	__virtio16 flags;
> > +};
> > +
> > +struct vring_packed_desc {
> > +	/* Buffer Address. */
> > +	__virtio64 addr;
> > +	/* Buffer Length. */
> > +	__virtio32 len;
> > +	/* Buffer ID. */
> > +	__virtio16 id;
> > +	/* The flags depending on descriptor type. */
> > +	__virtio16 flags;
> > +};
> 
> Don't use __virtioXX types, just __leXX ones.

Got it, will do that.

> 
> > +
> > +struct vring_packed {
> > +	unsigned int num;
> > +
> > +	struct vring_packed_desc *desc;
> > +
> > +	struct vring_packed_desc_event *driver;
> > +
> > +	struct vring_packed_desc_event *device;
> > +};
> > +
> >  #endif /* _UAPI_LINUX_VIRTIO_RING_H */
> > -- 
> > 2.18.0


* Re: [PATCH net-next] ieee802154: ca8210: Use kmemdup instead of duplicating it in ca8210_test_int_driver_write
From: YueHaibing @ 2018-09-10  2:08 UTC
  To: davem, h.morris, alex.aring, stefan; +Cc: linux-kernel, netdev, linux-wpan
In-Reply-To: <20180906034228.7660-1-yuehaibing@huawei.com>

Sorry, please ignore this duplicate patch.

On 2018/9/6 11:42, YueHaibing wrote:
> Replace calls to kmalloc followed by a memcpy with a direct call to kmemdup.
> 
> Patch found using coccinelle.
> 
> Signed-off-by: YueHaibing <yuehaibing@huawei.com>
> ---
>  drivers/net/ieee802154/ca8210.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ieee802154/ca8210.c b/drivers/net/ieee802154/ca8210.c
> index 58299fb..e21279d 100644
> --- a/drivers/net/ieee802154/ca8210.c
> +++ b/drivers/net/ieee802154/ca8210.c
> @@ -634,10 +634,9 @@ static int ca8210_test_int_driver_write(
>  	for (i = 0; i < len; i++)
>  		dev_dbg(&priv->spi->dev, "%#03x\n", buf[i]);
>  
> -	fifo_buffer = kmalloc(len, GFP_KERNEL);
> +	fifo_buffer = kmemdup(buf, len, GFP_KERNEL);
>  	if (!fifo_buffer)
>  		return -ENOMEM;
> -	memcpy(fifo_buffer, buf, len);
>  	kfifo_in(&test->up_fifo, &fifo_buffer, 4);
>  	wake_up_interruptible(&priv->test.readq);
>  
> 

* Re: [PATCH net-next v2 3/5] virtio_ring: add packed ring support
From: Tiwei Bie @ 2018-09-10  2:03 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: virtio-dev, netdev, linux-kernel, virtualization, wexu
In-Reply-To: <20180907083500-mutt-send-email-mst@kernel.org>

On Fri, Sep 07, 2018 at 09:49:14AM -0400, Michael S. Tsirkin wrote:
> On Wed, Jul 11, 2018 at 10:27:09AM +0800, Tiwei Bie wrote:
> > This commit introduces the support (without EVENT_IDX) for
> > packed ring.
> > 
> > Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
> > ---
> >  drivers/virtio/virtio_ring.c | 495 ++++++++++++++++++++++++++++++++++-
> >  1 file changed, 487 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > index c4f8abc7445a..f317b485ba54 100644
> > --- a/drivers/virtio/virtio_ring.c
> > +++ b/drivers/virtio/virtio_ring.c
> > @@ -55,12 +55,21 @@
> >  #define END_USE(vq)
> >  #endif
> >  
> > +#define _VRING_DESC_F_AVAIL(b)	((__u16)(b) << 7)
> > +#define _VRING_DESC_F_USED(b)	((__u16)(b) << 15)
> 
> Underscore followed by an upper case letter isn't a good idea
> for a symbol. And it's not nice that this does not
> use the VRING_DESC_F_USED from the header.
> How about ((b) ? VRING_DESC_F_USED : 0) instead?
> Is the produced code worse then?

Yes, I think the produced code is worse when we use a
conditional expression. Below is a simple test:

#define foo1(b) ((b) << 10)
#define foo2(b) ((b) ? (1 << 10) : 0)

unsigned short bar(unsigned short b)
{
	return foo1(b);
}

unsigned short baz(unsigned short b)
{
	return foo2(b);
}

With `gcc -O3 -S`, I got the assembly code below:

	.file	"tmp.c"
	.text
	.p2align 4,,15
	.globl	bar
	.type	bar, @function
bar:
.LFB0:
	.cfi_startproc
	movl	%edi, %eax
	sall	$10, %eax
	ret
	.cfi_endproc
.LFE0:
	.size	bar, .-bar
	.p2align 4,,15
	.globl	baz
	.type	baz, @function
baz:
.LFB1:
	.cfi_startproc
	xorl	%eax, %eax
	testw	%di, %di
	setne	%al
	sall	$10, %eax
	ret
	.cfi_endproc
.LFE1:
	.size	baz, .-baz
	.ident	"GCC: (Debian 8.2.0-4) 8.2.0"
	.section	.note.GNU-stack,"",@progbits

That is to say, for `((b) << 10)`, the compiler shifts the
register directly. But for `((b) ? (1 << 10) : 0)`, the
code above zeroes eax first, sets al to 1 if di is
non-zero, and then shifts eax.
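
One caveat when comparing the two forms: they only produce the same
value when b is 0 or 1. The sketch below is standalone userspace C, not
code from the patch; the macro and function names are hypothetical and
only mirror foo1/foo2 above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for foo1() and foo2() from the test above. */
#define FLAG_SHIFT(b) ((uint16_t)((b) << 10))
#define FLAG_COND(b)  ((uint16_t)((b) ? (1 << 10) : 0))

/* Returns 1 when both macros yield the same value for b. */
static int flag_macros_agree(uint16_t b)
{
	return FLAG_SHIFT(b) == FLAG_COND(b);
}

/* The shift form is only a valid flag builder for a 0/1 input. */
static uint16_t flag_shift_checked(uint16_t b)
{
	assert(b <= 1);
	return FLAG_SHIFT(b);
}
```

So the cheaper shift form relies on the caller always passing a 0/1
value, which holds for a wrap counter that is only ever toggled with
`^= 1`.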

> 
> > +
> >  struct vring_desc_state {
> >  	void *data;			/* Data for callback. */
> >  	struct vring_desc *indir_desc;	/* Indirect descriptor, if any. */
> >  };
> >  
> >  struct vring_desc_state_packed {
> > +	void *data;			/* Data for callback. */
> > +	struct vring_packed_desc *indir_desc; /* Indirect descriptor, if any. */
> 
> Include vring_desc_state for these?

Sure.

> 
> > +	int num;			/* Descriptor list length. */
> > +	dma_addr_t addr;		/* Buffer DMA addr. */
> > +	u32 len;			/* Buffer length. */
> > +	u16 flags;			/* Descriptor flags. */
> 
> This seems only to be used for indirect. Check indirect field above
> instead and drop this.

The `flags` field is also needed to know the direction, i.e. DMA_FROM_DEVICE
or DMA_TO_DEVICE, when doing the DMA unmap.

> 
> >  	int next;			/* The next desc state. */
> 
> Packing things into 16 bit integers will help reduce
> cache pressure here.
> 
> Also, this is only used for unmap, so when dma API is not used,
> maintaining addr and len list management is unnecessary. How about we
> maintain addr/len in a separate array, avoiding accesses on unmap in that case?

Sure. I'll give it a try.

> 
> Also, lots of architectures have a nop unmap in the DMA API.
> See a proposed patch at the end for optimizing that case.

Got it. Thanks.

> 
> >  };
> >  
> > @@ -660,7 +669,6 @@ static bool virtqueue_poll_split(struct virtqueue *_vq, unsigned last_used_idx)
> >  {
> >  	struct vring_virtqueue *vq = to_vvq(_vq);
> >  
> > -	virtio_mb(vq->weak_barriers);
> >  	return (u16)last_used_idx != virtio16_to_cpu(_vq->vdev, vq->vring.used->idx);
> >  }
> 
> why is this changing the split queue implementation?

Because the above barrier is also needed by virtqueue_poll_packed(),
I moved it to virtqueue_poll() and also added a comment
for it.

> 
> 
> >  
> > @@ -757,6 +765,72 @@ static inline unsigned vring_size_packed(unsigned int num, unsigned long align)
> >  		& ~(align - 1)) + sizeof(struct vring_packed_desc_event) * 2;
> >  }
> >  
> > +static void vring_unmap_state_packed(const struct vring_virtqueue *vq,
> > +				     struct vring_desc_state_packed *state)
> > +{
> > +	u16 flags;
> > +
> > +	if (!vring_use_dma_api(vq->vq.vdev))
> > +		return;
> > +
> > +	flags = state->flags;
> > +
> > +	if (flags & VRING_DESC_F_INDIRECT) {
> > +		dma_unmap_single(vring_dma_dev(vq),
> > +				 state->addr, state->len,
> > +				 (flags & VRING_DESC_F_WRITE) ?
> > +				 DMA_FROM_DEVICE : DMA_TO_DEVICE);
> > +	} else {
> > +		dma_unmap_page(vring_dma_dev(vq),
> > +			       state->addr, state->len,
> > +			       (flags & VRING_DESC_F_WRITE) ?
> > +			       DMA_FROM_DEVICE : DMA_TO_DEVICE);
> > +	}
> > +}
> > +
> > +static void vring_unmap_desc_packed(const struct vring_virtqueue *vq,
> > +				   struct vring_packed_desc *desc)
> > +{
> > +	u16 flags;
> > +
> > +	if (!vring_use_dma_api(vq->vq.vdev))
> > +		return;
> > +
> > +	flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
> 
> I see no reason to use virtioXX wrappers for the packed ring.
> That's there to support legacy devices. Just use leXX.

Okay.

> 
> > +
> > +	if (flags & VRING_DESC_F_INDIRECT) {
> > +		dma_unmap_single(vring_dma_dev(vq),
> > +				 virtio64_to_cpu(vq->vq.vdev, desc->addr),
> > +				 virtio32_to_cpu(vq->vq.vdev, desc->len),
> > +				 (flags & VRING_DESC_F_WRITE) ?
> > +				 DMA_FROM_DEVICE : DMA_TO_DEVICE);
> > +	} else {
> > +		dma_unmap_page(vring_dma_dev(vq),
> > +			       virtio64_to_cpu(vq->vq.vdev, desc->addr),
> > +			       virtio32_to_cpu(vq->vq.vdev, desc->len),
> > +			       (flags & VRING_DESC_F_WRITE) ?
> > +			       DMA_FROM_DEVICE : DMA_TO_DEVICE);
> > +	}
> > +}
> > +
> > +static struct vring_packed_desc *alloc_indirect_packed(struct virtqueue *_vq,
> > +						       unsigned int total_sg,
> > +						       gfp_t gfp)
> > +{
> > +	struct vring_packed_desc *desc;
> > +
> > +	/*
> > +	 * We require lowmem mappings for the descriptors because
> > +	 * otherwise virt_to_phys will give us bogus addresses in the
> > +	 * virtqueue.
> 
> Where is virt_to_phys used? I don't see it in this patch.

In vring_map_single(), virt_to_phys() will be used to translate
the address to a physical address if the DMA API isn't used:

https://github.com/torvalds/linux/blob/a49a9dcce802/drivers/virtio/virtio_ring.c#L197

> 
> > +	 */
> > +	gfp &= ~__GFP_HIGHMEM;
> > +
> > +	desc = kmalloc(total_sg * sizeof(struct vring_packed_desc), gfp);
> > +
> > +	return desc;
> > +}
> > +
> >  static inline int virtqueue_add_packed(struct virtqueue *_vq,
> >  				       struct scatterlist *sgs[],
> >  				       unsigned int total_sg,
> > @@ -766,47 +840,449 @@ static inline int virtqueue_add_packed(struct virtqueue *_vq,
> >  				       void *ctx,
> >  				       gfp_t gfp)
> >  {
> > +	struct vring_virtqueue *vq = to_vvq(_vq);
> > +	struct vring_packed_desc *desc;
> > +	struct scatterlist *sg;
> > +	unsigned int i, n, descs_used, uninitialized_var(prev), err_idx;
> > +	__virtio16 uninitialized_var(head_flags), flags;
> > +	u16 head, avail_wrap_counter, id, curr;
> > +	bool indirect;
> > +
> > +	START_USE(vq);
> > +
> > +	BUG_ON(data == NULL);
> > +	BUG_ON(ctx && vq->indirect);
> > +
> > +	if (unlikely(vq->broken)) {
> > +		END_USE(vq);
> > +		return -EIO;
> > +	}
> > +
> > +#ifdef DEBUG
> > +	{
> > +		ktime_t now = ktime_get();
> > +
> > +		/* No kick or get, with .1 second between?  Warn. */
> > +		if (vq->last_add_time_valid)
> > +			WARN_ON(ktime_to_ms(ktime_sub(now, vq->last_add_time))
> > +					    > 100);
> > +		vq->last_add_time = now;
> > +		vq->last_add_time_valid = true;
> > +	}
> > +#endif
> 
> Add inline helpers for this debug stuff rather than
> duplicate it from split ring?

Sure. I'd like to do that.

> 
> 
> > +
> > +	BUG_ON(total_sg == 0);
> > +
> > +	head = vq->next_avail_idx;
> > +	avail_wrap_counter = vq->avail_wrap_counter;
> > +
> > +	if (virtqueue_use_indirect(_vq, total_sg))
> > +		desc = alloc_indirect_packed(_vq, total_sg, gfp);
> > +	else {
> > +		desc = NULL;
> > +		WARN_ON_ONCE(total_sg > vq->vring_packed.num && !vq->indirect);
> > +	}
> 
> 
> Apparently this attempts to treat indirect descriptors same as
> direct ones. This is how it is in the split ring, but not in
> the packed one. I think you want two separate functions, for
> direct and indirect.

Okay, I'll do that.

> 
> > +
> > +	if (desc) {
> > +		/* Use a single buffer which doesn't continue */
> > +		indirect = true;
> > +		/* Set up rest to use this indirect table. */
> > +		i = 0;
> > +		descs_used = 1;
> > +	} else {
> > +		indirect = false;
> > +		desc = vq->vring_packed.desc;
> > +		i = head;
> > +		descs_used = total_sg;
> > +	}
> > +
> > +	if (vq->vq.num_free < descs_used) {
> > +		pr_debug("Can't add buf len %i - avail = %i\n",
> > +			 descs_used, vq->vq.num_free);
> > +		/* FIXME: for historical reasons, we force a notify here if
> > +		 * there are outgoing parts to the buffer.  Presumably the
> > +		 * host should service the ring ASAP. */
> > +		if (out_sgs)
> > +			vq->notify(&vq->vq);
> > +		if (indirect)
> > +			kfree(desc);
> > +		END_USE(vq);
> > +		return -ENOSPC;
> > +	}
> > +
> > +	id = vq->free_head;
> > +	BUG_ON(id == vq->vring_packed.num);
> > +
> > +	curr = id;
> > +	for (n = 0; n < out_sgs + in_sgs; n++) {
> > +		for (sg = sgs[n]; sg; sg = sg_next(sg)) {
> > +			dma_addr_t addr = vring_map_one_sg(vq, sg, n < out_sgs ?
> > +					       DMA_TO_DEVICE : DMA_FROM_DEVICE);
> > +			if (vring_mapping_error(vq, addr))
> > +				goto unmap_release;
> > +
> > +			flags = cpu_to_virtio16(_vq->vdev, VRING_DESC_F_NEXT |
> > +				  (n < out_sgs ? 0 : VRING_DESC_F_WRITE) |
> > +				  _VRING_DESC_F_AVAIL(vq->avail_wrap_counter) |
> > +				  _VRING_DESC_F_USED(!vq->avail_wrap_counter));
> 
> Spec says:
> 	The VIRTQ_DESC_F_WRITE flags bit is the only valid flag for descriptors in the
> 	indirect table.
> 
> All this logic isn't needed for indirect.
> 
> Also, why re-calculate avail/used flags every time? They only change
> when you wrap around.

Will do that. Thanks.
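
A minimal sketch of that idea, assuming the AVAIL/USED bit positions
used by the macros in this patch (bits 7 and 15) and a wrap counter
that starts at 1. The struct and function names are hypothetical and
this is standalone userspace C, not the driver code: the combined
flags are cached and only flipped when the index wraps.

```c
#include <assert.h>
#include <stdint.h>

#define DESC_F_AVAIL (1u << 7)	/* same bit as _VRING_DESC_F_AVAIL(1) */
#define DESC_F_USED  (1u << 15)	/* same bit as _VRING_DESC_F_USED(1) */

/* Hypothetical, reduced view of the per-queue state. */
struct ring_state {
	uint16_t num;			/* ring size */
	uint16_t next_avail_idx;	/* next slot to fill */
	uint16_t avail_used_flags;	/* cached bits for the current lap */
};

static void ring_init(struct ring_state *r, uint16_t num)
{
	assert(num > 0);
	r->num = num;
	r->next_avail_idx = 0;
	/* Wrap counter 1: AVAIL set, USED clear. */
	r->avail_used_flags = DESC_F_AVAIL;
}

/* Return the flags for the current slot, then advance by one slot. */
static uint16_t ring_next_flags(struct ring_state *r)
{
	uint16_t flags = r->avail_used_flags;

	if (++r->next_avail_idx >= r->num) {
		r->next_avail_idx = 0;
		/* Both bits invert exactly once per wrap-around. */
		r->avail_used_flags ^= DESC_F_AVAIL | DESC_F_USED;
	}
	return flags;
}
```

With this, the per-descriptor path only ORs in the cached value instead
of rebuilding both bits from the wrap counter each time.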

> 
> 
> > +			if (!indirect && i == head)
> > +				head_flags = flags;
> > +			else
> > +				desc[i].flags = flags;
> > +
> > +			desc[i].addr = cpu_to_virtio64(_vq->vdev, addr);
> > +			desc[i].len = cpu_to_virtio32(_vq->vdev, sg->length);
> > +			i++;
> > +			if (!indirect) {
> > +				if (vring_use_dma_api(_vq->vdev)) {
> > +					vq->desc_state_packed[curr].addr = addr;
> > +					vq->desc_state_packed[curr].len =
> > +						sg->length;
> > +					vq->desc_state_packed[curr].flags =
> > +						virtio16_to_cpu(_vq->vdev,
> > +								flags);
> > +				}
> > +				curr = vq->desc_state_packed[curr].next;
> > +
> > +				if (i >= vq->vring_packed.num) {
> > +					i = 0;
> > +					vq->avail_wrap_counter ^= 1;
> > +				}
> > +			}
> > +		}
> > +	}
> > +
> > +	prev = (i > 0 ? i : vq->vring_packed.num) - 1;
> > +	desc[prev].id = cpu_to_virtio16(_vq->vdev, id);
> 
> Is it easier to write this out in all descriptors, to avoid the need to
> calculate prev?

Yeah, I'll do that.

> 
> > +
> > +	/* Last one doesn't continue. */
> > +	if (total_sg == 1)
> > +		head_flags &= cpu_to_virtio16(_vq->vdev, ~VRING_DESC_F_NEXT);
> > +	else
> > +		desc[prev].flags &= cpu_to_virtio16(_vq->vdev,
> > +						~VRING_DESC_F_NEXT);
> 
> Wouldn't it be easier to avoid setting VRING_DESC_F_NEXT
> in the first place?
> 	 if (n != in_sgs - 1 && n != out_sgs - 1)

Will this affect the branch prediction?

> 
> must be better than writing descriptor an extra time.

Not quite sure about this. I think this descriptor has just
been written, so it should still be in the cache.

> 
> > +
> > +	if (indirect) {
> > +		/* Now that the indirect table is filled in, map it. */
> > +		dma_addr_t addr = vring_map_single(
> > +			vq, desc, total_sg * sizeof(struct vring_packed_desc),
> > +			DMA_TO_DEVICE);
> > +		if (vring_mapping_error(vq, addr))
> > +			goto unmap_release;
> > +
> > +		head_flags = cpu_to_virtio16(_vq->vdev, VRING_DESC_F_INDIRECT |
> > +				      _VRING_DESC_F_AVAIL(avail_wrap_counter) |
> > +				      _VRING_DESC_F_USED(!avail_wrap_counter));
> > +		vq->vring_packed.desc[head].addr = cpu_to_virtio64(_vq->vdev,
> > +								   addr);
> > +		vq->vring_packed.desc[head].len = cpu_to_virtio32(_vq->vdev,
> > +				total_sg * sizeof(struct vring_packed_desc));
> > +		vq->vring_packed.desc[head].id = cpu_to_virtio16(_vq->vdev, id);
> > +
> > +		if (vring_use_dma_api(_vq->vdev)) {
> > +			vq->desc_state_packed[id].addr = addr;
> > +			vq->desc_state_packed[id].len = total_sg *
> > +					sizeof(struct vring_packed_desc);
> > +			vq->desc_state_packed[id].flags =
> > +					virtio16_to_cpu(_vq->vdev, head_flags);
> > +		}
> > +	}
> > +
> > +	/* We're using some buffers from the free list. */
> > +	vq->vq.num_free -= descs_used;
> > +
> > +	/* Update free pointer */
> > +	if (indirect) {
> > +		n = head + 1;
> > +		if (n >= vq->vring_packed.num) {
> > +			n = 0;
> > +			vq->avail_wrap_counter ^= 1;
> > +		}
> > +		vq->next_avail_idx = n;
> > +		vq->free_head = vq->desc_state_packed[id].next;
> > +	} else {
> > +		vq->next_avail_idx = i;
> > +		vq->free_head = curr;
> > +	}
> > +
> > +	/* Store token and indirect buffer state. */
> > +	vq->desc_state_packed[id].num = descs_used;
> > +	vq->desc_state_packed[id].data = data;
> > +	if (indirect)
> > +		vq->desc_state_packed[id].indir_desc = desc;
> > +	else
> > +		vq->desc_state_packed[id].indir_desc = ctx;
> > +
> > +	/* A driver MUST NOT make the first descriptor in the list
> > +	 * available before all subsequent descriptors comprising
> > +	 * the list are made available. */
> > +	virtio_wmb(vq->weak_barriers);
> > +	vq->vring_packed.desc[head].flags = head_flags;
> > +	vq->num_added += descs_used;
> > +
> > +	pr_debug("Added buffer head %i to %p\n", head, vq);
> > +	END_USE(vq);
> > +
> > +	return 0;
> > +
> > +unmap_release:
> > +	err_idx = i;
> > +	i = head;
> > +
> > +	for (n = 0; n < total_sg; n++) {
> > +		if (i == err_idx)
> > +			break;
> > +		vring_unmap_desc_packed(vq, &desc[i]);
> > +		i++;
> > +		if (!indirect && i >= vq->vring_packed.num)
> > +			i = 0;
> > +	}
> > +
> > +	vq->avail_wrap_counter = avail_wrap_counter;
> > +
> > +	if (indirect)
> > +		kfree(desc);
> > +
> > +	END_USE(vq);
> >  	return -EIO;
> >  }
> >  
> >  static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
> >  {
> > -	return false;
> > +	struct vring_virtqueue *vq = to_vvq(_vq);
> > +	u16 flags;
> > +	bool needs_kick;
> > +	u32 snapshot;
> > +
> > +	START_USE(vq);
> > +	/* We need to expose the new flags value before checking notification
> > +	 * suppressions. */
> > +	virtio_mb(vq->weak_barriers);
> > +
> > +	snapshot = READ_ONCE(*(u32 *)vq->vring_packed.device);
> > +	flags = virtio16_to_cpu(_vq->vdev, (__virtio16)(snapshot >> 16)) & 0x3;
> 
> What are all these hard-coded things? Also either we do READ_ONCE
> everywhere or nowhere. Why is this place special? And why read 32 bit
> if you only want the 16?

Yeah, READ_ONCE() and reading 32 bits are not really needed in
this patch, but they are needed in the next patch. I thought it
wasn't wrong to do this, so I introduced them in the first place.

Just to double check: is the below code (apart from the
hard-coded value and virtio16) from the next patch OK?

"""
    snapshot = READ_ONCE(*(u32 *)vq->vring_packed.device);
+   off_wrap = virtio16_to_cpu(_vq->vdev, (__virtio16)(snapshot & 0xffff));
    flags = virtio16_to_cpu(_vq->vdev, (__virtio16)(snapshot >> 16)) & 0x3;
"""
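
To make the intent of the snapshot concrete, here is a standalone
sketch of splitting one 32-bit read into the two 16-bit fields. It
assumes the little-endian layout { off_wrap; flags } from
struct vring_packed_desc_event; EVENT_FLAGS_MASK and the function name
are hypothetical, and the value is assembled from bytes so the example
does not depend on host endianness.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EVENT_FLAGS_MASK 0x3u	/* hypothetical name for the hard-coded 0x3 */

struct event_snapshot {
	uint16_t off_wrap;
	uint16_t flags;
};

/* Decode 4 little-endian bytes laid out as { le16 off_wrap; le16 flags; }. */
static struct event_snapshot decode_event(const uint8_t *raw)
{
	struct event_snapshot ev;
	uint32_t snapshot;

	assert(raw != NULL);
	snapshot = (uint32_t)raw[0] |
		   ((uint32_t)raw[1] << 8) |
		   ((uint32_t)raw[2] << 16) |
		   ((uint32_t)raw[3] << 24);

	/* Low half is off_wrap, high half carries the flags. */
	ev.off_wrap = (uint16_t)(snapshot & 0xffff);
	ev.flags = (uint16_t)((snapshot >> 16) & EVENT_FLAGS_MASK);
	return ev;
}
```

Reading both fields from a single snapshot means off_wrap and flags are
taken from the same point in time, which is the reason for the 32-bit
read.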

> 
> And doesn't sparse complain about cast to __virtio16?

I'll give it a try in the next version.

> 
> > +
> > +#ifdef DEBUG
> > +	if (vq->last_add_time_valid) {
> > +		WARN_ON(ktime_to_ms(ktime_sub(ktime_get(),
> > +					      vq->last_add_time)) > 100);
> > +	}
> > +	vq->last_add_time_valid = false;
> > +#endif
> > +
> > +	needs_kick = (flags != VRING_EVENT_F_DISABLE);
> > +	END_USE(vq);
> > +	return needs_kick;
> > +}
> > +
> > +static void detach_buf_packed(struct vring_virtqueue *vq,
> > +			      unsigned int id, void **ctx)
> > +{
> > +	struct vring_desc_state_packed *state = NULL;
> > +	struct vring_packed_desc *desc;
> > +	unsigned int curr, i;
> > +
> > +	/* Clear data ptr. */
> > +	vq->desc_state_packed[id].data = NULL;
> > +
> > +	curr = id;
> > +	for (i = 0; i < vq->desc_state_packed[id].num; i++) {
> > +		state = &vq->desc_state_packed[curr];
> > +		vring_unmap_state_packed(vq, state);
> > +		curr = state->next;
> > +	}
> > +
> > +	BUG_ON(state == NULL);
> > +	vq->vq.num_free += vq->desc_state_packed[id].num;
> > +	state->next = vq->free_head;
> > +	vq->free_head = id;
> > +
> > +	if (vq->indirect) {
> > +		u32 len;
> > +
> > +		/* Free the indirect table, if any, now that it's unmapped. */
> > +		desc = vq->desc_state_packed[id].indir_desc;
> > +		if (!desc)
> > +			return;
> > +
> > +		if (vring_use_dma_api(vq->vq.vdev)) {
> > +			len = vq->desc_state_packed[id].len;
> > +			for (i = 0; i < len / sizeof(struct vring_packed_desc);
> > +					i++)
> > +				vring_unmap_desc_packed(vq, &desc[i]);
> > +		}
> > +		kfree(desc);
> > +		vq->desc_state_packed[id].indir_desc = NULL;
> > +	} else if (ctx) {
> > +		*ctx = vq->desc_state_packed[id].indir_desc;
> > +	}
> > +}
> > +
> > +static inline bool is_used_desc_packed(const struct vring_virtqueue *vq,
> > +				       u16 idx, bool used_wrap_counter)
> > +{
> > +	u16 flags;
> > +	bool avail, used;
> > +
> > +	flags = virtio16_to_cpu(vq->vq.vdev,
> > +				vq->vring_packed.desc[idx].flags);
> > +	avail = !!(flags & VRING_DESC_F_AVAIL);
> > +	used = !!(flags & VRING_DESC_F_USED);
> > +
> > +	return avail == used && used == used_wrap_counter;
> 
> I think that you don't need to look at avail flag to detect a used
> descriptor. The reason device writes it is to avoid confusing
> *device* next time descriptor wraps.

Okay, I'll just check the used flag.
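
Roughly, the simplified check would look like the sketch below. This is
only an illustration of what is agreed in this exchange, written as
standalone C with hypothetical names and host-endian flags; it is not
the code that will be posted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DESC_F_USED (1u << 15)

/* Compare only the USED bit against the driver's used wrap counter. */
static bool desc_is_used(uint16_t flags, bool used_wrap_counter)
{
	bool used = !!(flags & DESC_F_USED);

	return used == used_wrap_counter;
}

/* Same check on slot idx of a flags array standing in for the ring. */
static bool ring_desc_is_used(const uint16_t *ring_flags, uint16_t num,
			      uint16_t idx, bool used_wrap_counter)
{
	assert(idx < num);
	return desc_is_used(ring_flags[idx], used_wrap_counter);
}
```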

> 
> >  }
> >  
> >  static inline bool more_used_packed(const struct vring_virtqueue *vq)
> >  {
> > -	return false;
> > +	return is_used_desc_packed(vq, vq->last_used_idx,
> > +			vq->used_wrap_counter);
> >  }
> >  
> >  static void *virtqueue_get_buf_ctx_packed(struct virtqueue *_vq,
> >  					  unsigned int *len,
> >  					  void **ctx)
> >  {
> > -	return NULL;
> > +	struct vring_virtqueue *vq = to_vvq(_vq);
> > +	u16 last_used, id;
> > +	void *ret;
> > +
> > +	START_USE(vq);
> > +
> > +	if (unlikely(vq->broken)) {
> > +		END_USE(vq);
> > +		return NULL;
> > +	}
> > +
> > +	if (!more_used_packed(vq)) {
> > +		pr_debug("No more buffers in queue\n");
> > +		END_USE(vq);
> > +		return NULL;
> > +	}
> > +
> > +	/* Only get used elements after they have been exposed by host. */
> > +	virtio_rmb(vq->weak_barriers);
> > +
> > +	last_used = vq->last_used_idx;
> > +	id = virtio16_to_cpu(_vq->vdev, vq->vring_packed.desc[last_used].id);
> > +	*len = virtio32_to_cpu(_vq->vdev, vq->vring_packed.desc[last_used].len);
> > +
> > +	if (unlikely(id >= vq->vring_packed.num)) {
> > +		BAD_RING(vq, "id %u out of range\n", id);
> > +		return NULL;
> > +	}
> > +	if (unlikely(!vq->desc_state_packed[id].data)) {
> > +		BAD_RING(vq, "id %u is not a head!\n", id);
> > +		return NULL;
> > +	}
> > +
> > +	vq->last_used_idx += vq->desc_state_packed[id].num;
> > +	if (vq->last_used_idx >= vq->vring_packed.num) {
> > +		vq->last_used_idx -= vq->vring_packed.num;
> > +		vq->used_wrap_counter ^= 1;
> > +	}
> > +
> > +	/* detach_buf_packed clears data, so grab it now. */
> > +	ret = vq->desc_state_packed[id].data;
> > +	detach_buf_packed(vq, id, ctx);
> > +
> > +#ifdef DEBUG
> > +	vq->last_add_time_valid = false;
> > +#endif
> > +
> > +	END_USE(vq);
> > +	return ret;
> >  }
> >  
> >  static void virtqueue_disable_cb_packed(struct virtqueue *_vq)
> >  {
> > +	struct vring_virtqueue *vq = to_vvq(_vq);
> > +
> > +	if (vq->event_flags_shadow != VRING_EVENT_F_DISABLE) {
> > +		vq->event_flags_shadow = VRING_EVENT_F_DISABLE;
> > +		vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
> > +							vq->event_flags_shadow);
> > +	}
> >  }
> >  
> >  static unsigned virtqueue_enable_cb_prepare_packed(struct virtqueue *_vq)
> >  {
> > -	return 0;
> > +	struct vring_virtqueue *vq = to_vvq(_vq);
> > +
> > +	START_USE(vq);
> > +
> > +	/* We optimistically turn back on interrupts, then check if there was
> > +	 * more to do. */
> > +
> > +	if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
> > +		vq->event_flags_shadow = VRING_EVENT_F_ENABLE;
> > +		vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
> > +							vq->event_flags_shadow);
> > +	}
> > +
> > +	END_USE(vq);
> > +	return vq->last_used_idx | ((u16)vq->used_wrap_counter << 15);
> >  }
> >  
> > -static bool virtqueue_poll_packed(struct virtqueue *_vq, unsigned last_used_idx)
> > +static bool virtqueue_poll_packed(struct virtqueue *_vq, unsigned off_wrap)
> >  {
> > -	return false;
> > +	struct vring_virtqueue *vq = to_vvq(_vq);
> > +	bool wrap_counter;
> > +	u16 used_idx;
> > +
> > +	wrap_counter = off_wrap >> 15;
> > +	used_idx = off_wrap & ~(1 << 15);
> > +
> > +	return is_used_desc_packed(vq, used_idx, wrap_counter);
> 
> These >> 15 << 15 all over the place duplicate info.
> Pls put 15 in the header.

Sure.

> 
> Also can you maintain the counters properly shifted?
> Then just use them.

Then we may need to maintain both the shifted wrap
counters and the un-shifted wrap counters at the same time.
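
For the first point, a small sketch of how the 15 could be named once
and reused for packing and unpacking. EVENT_F_WRAP_CTR and the helper
names are hypothetical (nothing here is from the header yet), and this
is standalone C rather than driver code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EVENT_F_WRAP_CTR 15	/* hypothetical name for the wrap counter bit */

/* Combine a ring index and a wrap counter into one 16-bit value. */
static uint16_t pack_off_wrap(uint16_t idx, bool wrap_counter)
{
	assert(idx < (1u << EVENT_F_WRAP_CTR));
	return (uint16_t)(idx | ((unsigned int)wrap_counter << EVENT_F_WRAP_CTR));
}

/* Extract the ring index (low 15 bits). */
static uint16_t off_wrap_idx(uint16_t off_wrap)
{
	return (uint16_t)(off_wrap & ~(1u << EVENT_F_WRAP_CTR));
}

/* Extract the wrap counter (top bit). */
static bool off_wrap_counter(uint16_t off_wrap)
{
	return (off_wrap >> EVENT_F_WRAP_CTR) != 0;
}
```

With helpers like these, virtqueue_enable_cb_prepare_packed() and
virtqueue_poll_packed() would share one definition of the layout while
the wrap counter itself stays un-shifted.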

> 
> 
> >  }
> >  
> >  static bool virtqueue_enable_cb_delayed_packed(struct virtqueue *_vq)
> >  {
> > -	return false;
> > +	struct vring_virtqueue *vq = to_vvq(_vq);
> > +
> > +	START_USE(vq);
> > +
> > +	/* We optimistically turn back on interrupts, then check if there was
> > +	 * more to do. */
> > +
> > +	if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
> > +		vq->event_flags_shadow = VRING_EVENT_F_ENABLE;
> > +		vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
> > +							vq->event_flags_shadow);
> > +		/* We need to enable interrupts first before re-checking
> > +		 * for more used buffers. */
> > +		virtio_mb(vq->weak_barriers);
> > +	}
> > +
> > +	if (more_used_packed(vq)) {
> > +		END_USE(vq);
> > +		return false;
> > +	}
> > +
> > +	END_USE(vq);
> > +	return true;
> >  }
> >  
> >  static void *virtqueue_detach_unused_buf_packed(struct virtqueue *_vq)
> >  {
> > +	struct vring_virtqueue *vq = to_vvq(_vq);
> > +	unsigned int i;
> > +	void *buf;
> > +
> > +	START_USE(vq);
> > +
> > +	for (i = 0; i < vq->vring_packed.num; i++) {
> > +		if (!vq->desc_state_packed[i].data)
> > +			continue;
> > +		/* detach_buf clears data, so grab it now. */
> > +		buf = vq->desc_state_packed[i].data;
> > +		detach_buf_packed(vq, i, NULL);
> > +		END_USE(vq);
> > +		return buf;
> > +	}
> > +	/* That should have freed everything. */
> > +	BUG_ON(vq->vq.num_free != vq->vring_packed.num);
> > +
> > +	END_USE(vq);
> >  	return NULL;
> >  }
> >  
> > @@ -1083,6 +1559,9 @@ bool virtqueue_poll(struct virtqueue *_vq, unsigned last_used_idx)
> >  {
> >  	struct vring_virtqueue *vq = to_vvq(_vq);
> >  
> > +	/* We need to enable interrupts first before re-checking
> > +	 * for more used buffers. */
> > +	virtio_mb(vq->weak_barriers);
> >  	return vq->packed ? virtqueue_poll_packed(_vq, last_used_idx) :
> >  			    virtqueue_poll_split(_vq, last_used_idx);
> >  }
> 
> Possible optimization for when dma API is in use:

Got it. Will give it a try!

Best regards,
Tiwei Bie

> 
> --->
> 
> dma: detecting nop unmap
> 
> drivers need to maintain the dma address for unmap purposes,
> but these cycles are wasted when unmap callback is not
> defined. Add an API for drivers to check that and avoid
> unmap completely. Debug builds still have unmap.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> 
> ----
> 
> diff --git a/include/linux/dma-debug.h b/include/linux/dma-debug.h
> index a785f2507159..38b2c27c8449 100644
> --- a/include/linux/dma-debug.h
> +++ b/include/linux/dma-debug.h
> @@ -42,6 +42,11 @@ extern void debug_dma_mapping_error(struct device *dev, dma_addr_t dma_addr);
>  extern void debug_dma_unmap_page(struct device *dev, dma_addr_t addr,
>  				 size_t size, int direction, bool map_single);
>  
> +static inline bool has_debug_dma_unmap(struct device *dev)
> +{
> +	return true;
> +}
> +
>  extern void debug_dma_map_sg(struct device *dev, struct scatterlist *sg,
>  			     int nents, int mapped_ents, int direction);
>  
> @@ -121,6 +126,11 @@ static inline void debug_dma_unmap_page(struct device *dev, dma_addr_t addr,
>  {
>  }
>  
> +static inline bool has_debug_dma_unmap(struct device *dev)
> +{
> +	return false;
> +}
> +
>  static inline void debug_dma_map_sg(struct device *dev, struct scatterlist *sg,
>  				    int nents, int mapped_ents, int direction)
>  {
> diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
> index 1db6a6b46d0d..656f3e518166 100644
> --- a/include/linux/dma-mapping.h
> +++ b/include/linux/dma-mapping.h
> @@ -241,6 +241,14 @@ static inline dma_addr_t dma_map_single_attrs(struct device *dev, void *ptr,
>  	return addr;
>  }
>  
> +static inline bool has_dma_unmap(struct device *dev)
> +{
> +	const struct dma_map_ops *ops = get_dma_ops(dev);
> +
> +	return ops->unmap_page || ops->unmap_sg || ops->unmap_resource ||
> +		has_debug_dma_unmap(dev);
> +}
> +
>  static inline void dma_unmap_single_attrs(struct device *dev, dma_addr_t addr,
>  					  size_t size,
>  					  enum dma_data_direction dir,

* Re: [PATCH v4 3/3] IB/ipoib: Log sysfs 'dev_id' accesses from userspace
From: Arseny Maslennikov @ 2018-09-09 20:55 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: linux-rdma, Doug Ledford, netdev
In-Reply-To: <20180909181146.GA29632@cello.Dlink>

On Sun, Sep 09, 2018 at 09:11:46PM +0300, Arseny Maslennikov wrote:
> On Fri, Sep 07, 2018 at 09:43:59AM -0600, Jason Gunthorpe wrote:
> > On Thu, Sep 06, 2018 at 05:51:12PM +0300, Arseny Maslennikov wrote:
> > > Some tools may currently be using only the deprecated attribute;
> > > let's print an elaborate and clear deprecation notice to kmsg.
> > > 
> > > To do that, we have to replace the whole sysfs file, since we inherit
> > > the original one from netdev.
> > > 
> > > Signed-off-by: Arseny Maslennikov <ar@cs.msu.ru>
> > >  drivers/infiniband/ulp/ipoib/ipoib_main.c | 31 +++++++++++++++++++++++
> > >  1 file changed, 31 insertions(+)
> > > 
> > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> > > index 30f840f874b3..74732726ec6f 100644
> > > +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> > > @@ -2386,6 +2386,35 @@ int ipoib_add_pkey_attr(struct net_device *dev)
> > >  	return device_create_file(&dev->dev, &dev_attr_pkey);
> > >  }
> > >  
> > > +/*
> > > + * We erroneously exposed the iface's port number in the dev_id
> > > + * sysfs field long after dev_port was introduced for that purpose[1],
> > > + * and we need to stop everyone from relying on that.
> > > + * Let's overload the shower routine for the dev_id file here
> > > + * to gently bring the issue up.
> > > + *
> > > + * [1] https://www.spinics.net/lists/netdev/msg272123.html
> > > + */
> > > +static ssize_t dev_id_show(struct device *dev,
> > > +			   struct device_attribute *attr, char *buf)
> > > +{
> > > +	struct net_device *ndev = to_net_dev(dev);
> > > +
> > > +	if (ndev->dev_id == ndev->dev_port)
> > > +		netdev_info_once(ndev,
> > > +			"\"%s\" wants to know my dev_id. Should it look at dev_port instead? See Documentation/ABI/testing/sysfs-class-net for more info.\n",
> > > +			current->comm);
> > > +
> > > +	return sprintf(buf, "%#x\n", ndev->dev_id);
> > > +}
> > > +static DEVICE_ATTR_RO(dev_id);
> > > +
> > > +int ipoib_intercept_dev_id_attr(struct net_device *dev)
> > > +{
> > > +	device_remove_file(&dev->dev, &dev_attr_dev_id);
> > > +	return device_create_file(&dev->dev, &dev_attr_dev_id);
> > > +}
> > 
> > Isn't this racey with userspace? Ie what happens if udev is querying
> > the dev_id right here?
> 
> udev in particular does not use dev_id at all since 2014, because "why
> would we keep using dev_id if it is not the right thing to use?".
> 
> > 
> > Do we know there is no userspace doing this?
> > 
> 
> Not for sure.
> 
> If we move all the sysfs handling stuff we introduce in _add_port():
>  - pkey
>  - umcast
>  - {create,delete}_child
>  - connected/datagram mode
> to _ndo_init(), which is called by register_netdev before it sends
> the netlink message, would that suffice to eliminate the race?
> (Sysfs files for {create,delete}_child go to _parent_init() then).
> 

No, we can't, sorry for the noise. ndo_init() runs before the kobject
becomes available.

Anyway, our sysfs attributes being racy is unrelated to the subject of
this patch series, and I can't come up with any other ideas about what to
do with them that do not involve adjustments to register_netdev.

> > >  static struct net_device *ipoib_add_port(const char *format,
> > >  					 struct ib_device *hca, u8 port)
> > >  {
> > > @@ -2427,6 +2456,8 @@ static struct net_device *ipoib_add_port(const char *format,
> > >  	 */
> > >  	ndev->priv_destructor = ipoib_intf_free;
> > >  
> > > +	if (ipoib_intercept_dev_id_attr(ndev))
> > > +		goto sysfs_failed;
> > 
> > No device_remove_file needed?
> > 
> 
> As far as I understand, unregister_netdevice takes care of it
> while destroying the related kobject.
> 
> > Jason



* Re: [PATCH net V2] ip: frags: fix crash in ip_do_fragment()
From: Eric Dumazet @ 2018-09-09 20:51 UTC (permalink / raw)
  To: Taehee Yoo
  Cc: David Miller, Peter Oskolkov, netdev, Pablo Neira Ayuso,
	Florian Westphal
In-Reply-To: <CAMArcTWjDRy-LpEKXzv4Zs-MiMGC25SkzCmNkB5Ks35_MWhoKQ@mail.gmail.com>

> Hi Eric,
> Thank you for review!
>
> I think the netfilter-side IPv6 code change is needed
> because the netfilter IPv6 defrag routine also sets fp->ip_defrag_offset,
> so fp->sk will not be NULL.
> And I think the crashes in ip_do_fragment() and ip6_fragment() are
> actually the same bug.
>

Right you are, thanks.

Reviewed-by: Eric Dumazet <edumazet@google.com>

* [PATCH v3 net-next 6/6] net: dsa: Add Lantiq / Intel DSA driver for vrx200
From: Hauke Mehrtens @ 2018-09-09 20:20 UTC (permalink / raw)
  To: davem
  Cc: netdev, andrew, vivien.didelot, f.fainelli, john, linux-mips, dev,
	hauke.mehrtens, devicetree
In-Reply-To: <20180909201647.32727-1-hauke@hauke-m.d>

This adds the DSA driver for the GSWIP switch found in the VRX200 SoC.
The switch is integrated in the DSL SoC, which uses GSWIP version 2.1.
Other SoCs use different versions of this IP block, but this driver was
only tested with the version found in the VRX200. Currently only the
basic features are implemented, which forward all packets to the CPU and
let the CPU do the forwarding. The hardware also supports Layer 2
offloading, which is not yet implemented in this driver.

The GPHY FW loading is now done by this driver and no longer by the
separate driver in drivers/soc/lantiq/gphy.c; I will remove that driver
in a separate patch. To make use of the GPHY, this switch driver is
needed anyway. Other SoCs have more embedded GPHYs, so this driver should
support a variable number of GPHYs. After the firmware is loaded, the
GPHY can be probed on the MDIO bus and behaves like an external GPHY;
without the firmware it cannot be probed on the MDIO bus.

The clock names in the sysctrl.c file have to be changed because the
clocks are now used by a different driver. This should be cleaned up and
a real common clock driver should provide the clocks instead.

Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de>
---
 MAINTAINERS                     |    2 +
 arch/mips/lantiq/xway/sysctrl.c |    8 +-
 drivers/net/dsa/Kconfig         |    8 +
 drivers/net/dsa/Makefile        |    1 +
 drivers/net/dsa/lantiq_gswip.c  | 1169 +++++++++++++++++++++++++++++++++++++++
 drivers/net/dsa/lantiq_pce.h    |  153 +++++
 6 files changed, 1337 insertions(+), 4 deletions(-)
 create mode 100644 drivers/net/dsa/lantiq_gswip.c
 create mode 100644 drivers/net/dsa/lantiq_pce.h
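
Note for reviewers: the MDIO accessors in this patch build a single
control word from a busy flag, a read or write opcode, the PHY address
and the register number. The sketch below restates that bit layout as
stand-alone C; the MDIO_CTRL_* names and mdio_ctrl_word() are
hypothetical and only mirror the GSWIP_MDIO_CTRL_* values defined in
lantiq_gswip.c:

```c
#include <stdint.h>

/* Values mirror the GSWIP_MDIO_CTRL_* definitions in the patch. */
#define MDIO_CTRL_BUSY        (1u << 12)
#define MDIO_CTRL_RD          (1u << 11)
#define MDIO_CTRL_WR          (1u << 10)
#define MDIO_CTRL_PHYAD_MASK  0x1fu
#define MDIO_CTRL_PHYAD_SHIFT 5
#define MDIO_CTRL_REGAD_MASK  0x1fu

/* Compose the control word for one MDIO access; op is MDIO_CTRL_RD or MDIO_CTRL_WR. */
static uint32_t mdio_ctrl_word(uint32_t op, unsigned int phy_addr,
			       unsigned int reg)
{
	return MDIO_CTRL_BUSY | op |
	       ((phy_addr & MDIO_CTRL_PHYAD_MASK) << MDIO_CTRL_PHYAD_SHIFT) |
	       (reg & MDIO_CTRL_REGAD_MASK);
}
```

Out-of-range PHY addresses and register numbers are masked to five bits
rather than rejected, which matches what the accessors in the patch do.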

diff --git a/MAINTAINERS b/MAINTAINERS
index ffff912d31b5..5ae7385840e3 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8172,6 +8172,8 @@ L:	netdev@vger.kernel.org
 S:	Maintained
 F:	net/dsa/tag_gswip.c
 F:	drivers/net/ethernet/lantiq_xrx200.c
+F:	drivers/net/dsa/lantiq_pce.h
+F:	drivers/net/dsa/lantiq_gswip.c
 
 LANTIQ MIPS ARCHITECTURE
 M:	John Crispin <john@phrozen.org>
diff --git a/arch/mips/lantiq/xway/sysctrl.c b/arch/mips/lantiq/xway/sysctrl.c
index eeb89a37e27e..fe25c99089b7 100644
--- a/arch/mips/lantiq/xway/sysctrl.c
+++ b/arch/mips/lantiq/xway/sysctrl.c
@@ -516,8 +516,8 @@ void __init ltq_soc_init(void)
 		clkdev_add_pmu("1e10b308.eth", NULL, 0, 0, PMU_SWITCH |
 			       PMU_PPE_DP | PMU_PPE_TC);
 		clkdev_add_pmu("1da00000.usif", "NULL", 1, 0, PMU_USIF);
-		clkdev_add_pmu("1f203020.gphy", NULL, 1, 0, PMU_GPHY);
-		clkdev_add_pmu("1f203068.gphy", NULL, 1, 0, PMU_GPHY);
+		clkdev_add_pmu("1e108000.gswip", "gphy0", 0, 0, PMU_GPHY);
+		clkdev_add_pmu("1e108000.gswip", "gphy1", 0, 0, PMU_GPHY);
 		clkdev_add_pmu("1e103100.deu", NULL, 1, 0, PMU_DEU);
 		clkdev_add_pmu("1e116000.mei", "afe", 1, 2, PMU_ANALOG_DSL_AFE);
 		clkdev_add_pmu("1e116000.mei", "dfe", 1, 0, PMU_DFE);
@@ -540,8 +540,8 @@ void __init ltq_soc_init(void)
 				PMU_SWITCH | PMU_PPE_DPLUS | PMU_PPE_DPLUM |
 				PMU_PPE_EMA | PMU_PPE_TC | PMU_PPE_SLL01 |
 				PMU_PPE_QSB | PMU_PPE_TOP);
-		clkdev_add_pmu("1f203020.gphy", NULL, 0, 0, PMU_GPHY);
-		clkdev_add_pmu("1f203068.gphy", NULL, 0, 0, PMU_GPHY);
+		clkdev_add_pmu("1e108000.gswip", "gphy0", 0, 0, PMU_GPHY);
+		clkdev_add_pmu("1e108000.gswip", "gphy1", 0, 0, PMU_GPHY);
 		clkdev_add_pmu("1e103000.sdio", NULL, 1, 0, PMU_SDIO);
 		clkdev_add_pmu("1e103100.deu", NULL, 1, 0, PMU_DEU);
 		clkdev_add_pmu("1e116000.mei", "dfe", 1, 0, PMU_DFE);
diff --git a/drivers/net/dsa/Kconfig b/drivers/net/dsa/Kconfig
index d3ce1e4cb4d3..7c09d8f195ae 100644
--- a/drivers/net/dsa/Kconfig
+++ b/drivers/net/dsa/Kconfig
@@ -23,6 +23,14 @@ config NET_DSA_LOOP
 	  This enables support for a fake mock-up switch chip which
 	  exercises the DSA APIs.
 
+config NET_DSA_LANTIQ_GSWIP
+	tristate "Lantiq / Intel GSWIP"
+	depends on NET_DSA
+	select NET_DSA_TAG_GSWIP
+	---help---
+	  This enables support for the Lantiq / Intel GSWIP 2.1 found in
+	  the xrx200 / VR9 SoC.
+
 config NET_DSA_MT7530
 	tristate "Mediatek MT7530 Ethernet switch support"
 	depends on NET_DSA
diff --git a/drivers/net/dsa/Makefile b/drivers/net/dsa/Makefile
index 46c1cba91ffe..82e5d794c41f 100644
--- a/drivers/net/dsa/Makefile
+++ b/drivers/net/dsa/Makefile
@@ -5,6 +5,7 @@ obj-$(CONFIG_NET_DSA_LOOP)	+= dsa_loop.o
 ifdef CONFIG_NET_DSA_LOOP
 obj-$(CONFIG_FIXED_PHY)		+= dsa_loop_bdinfo.o
 endif
+obj-$(CONFIG_NET_DSA_LANTIQ_GSWIP) += lantiq_gswip.o
 obj-$(CONFIG_NET_DSA_MT7530)	+= mt7530.o
 obj-$(CONFIG_NET_DSA_MV88E6060) += mv88e6060.o
 obj-$(CONFIG_NET_DSA_QCA8K)	+= qca8k.o
diff --git a/drivers/net/dsa/lantiq_gswip.c b/drivers/net/dsa/lantiq_gswip.c
new file mode 100644
index 000000000000..9c28d0b2fcdb
--- /dev/null
+++ b/drivers/net/dsa/lantiq_gswip.c
@@ -0,0 +1,1169 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Lantiq / Intel GSWIP switch driver for VRX200 SoCs
+ *
+ * Copyright (C) 2010 Lantiq Deutschland
+ * Copyright (C) 2012 John Crispin <john@phrozen.org>
+ * Copyright (C) 2017 - 2018 Hauke Mehrtens <hauke@hauke-m.de>
+ */
+
+#include <linux/clk.h>
+#include <linux/etherdevice.h>
+#include <linux/firmware.h>
+#include <linux/if_bridge.h>
+#include <linux/if_vlan.h>
+#include <linux/iopoll.h>
+#include <linux/mfd/syscon.h>
+#include <linux/module.h>
+#include <linux/of_mdio.h>
+#include <linux/of_net.h>
+#include <linux/of_platform.h>
+#include <linux/phy.h>
+#include <linux/phylink.h>
+#include <linux/platform_device.h>
+#include <linux/regmap.h>
+#include <linux/reset.h>
+#include <net/dsa.h>
+#include <dt-bindings/mips/lantiq_rcu_gphy.h>
+
+#include "lantiq_pce.h"
+
+/* GSWIP MDIO Registers */
+#define GSWIP_MDIO_GLOB			0x00
+#define  GSWIP_MDIO_GLOB_ENABLE		BIT(15)
+#define GSWIP_MDIO_CTRL			0x08
+#define  GSWIP_MDIO_CTRL_BUSY		BIT(12)
+#define  GSWIP_MDIO_CTRL_RD		BIT(11)
+#define  GSWIP_MDIO_CTRL_WR		BIT(10)
+#define  GSWIP_MDIO_CTRL_PHYAD_MASK	0x1f
+#define  GSWIP_MDIO_CTRL_PHYAD_SHIFT	5
+#define  GSWIP_MDIO_CTRL_REGAD_MASK	0x1f
+#define GSWIP_MDIO_READ			0x09
+#define GSWIP_MDIO_WRITE		0x0A
+#define GSWIP_MDIO_MDC_CFG0		0x0B
+#define GSWIP_MDIO_MDC_CFG1		0x0C
+#define GSWIP_MDIO_PHYp(p)		(0x15 - (p))
+#define  GSWIP_MDIO_PHY_LINK_MASK	0x6000
+#define  GSWIP_MDIO_PHY_LINK_AUTO	0x0000
+#define  GSWIP_MDIO_PHY_LINK_DOWN	0x4000
+#define  GSWIP_MDIO_PHY_LINK_UP		0x2000
+#define  GSWIP_MDIO_PHY_SPEED_MASK	0x1800
+#define  GSWIP_MDIO_PHY_SPEED_AUTO	0x1800
+#define  GSWIP_MDIO_PHY_SPEED_M10	0x0000
+#define  GSWIP_MDIO_PHY_SPEED_M100	0x0800
+#define  GSWIP_MDIO_PHY_SPEED_G1	0x1000
+#define  GSWIP_MDIO_PHY_FDUP_MASK	0x0600
+#define  GSWIP_MDIO_PHY_FDUP_AUTO	0x0000
+#define  GSWIP_MDIO_PHY_FDUP_EN		0x0200
+#define  GSWIP_MDIO_PHY_FDUP_DIS	0x0600
+#define  GSWIP_MDIO_PHY_FCONTX_MASK	0x0180
+#define  GSWIP_MDIO_PHY_FCONTX_AUTO	0x0000
+#define  GSWIP_MDIO_PHY_FCONTX_EN	0x0100
+#define  GSWIP_MDIO_PHY_FCONTX_DIS	0x0180
+#define  GSWIP_MDIO_PHY_FCONRX_MASK	0x0060
+#define  GSWIP_MDIO_PHY_FCONRX_AUTO	0x0000
+#define  GSWIP_MDIO_PHY_FCONRX_EN	0x0020
+#define  GSWIP_MDIO_PHY_FCONRX_DIS	0x0060
+#define  GSWIP_MDIO_PHY_ADDR_MASK	0x001f
+#define  GSWIP_MDIO_PHY_MASK		(GSWIP_MDIO_PHY_ADDR_MASK | \
+					 GSWIP_MDIO_PHY_FCONRX_MASK | \
+					 GSWIP_MDIO_PHY_FCONTX_MASK | \
+					 GSWIP_MDIO_PHY_LINK_MASK | \
+					 GSWIP_MDIO_PHY_SPEED_MASK | \
+					 GSWIP_MDIO_PHY_FDUP_MASK)
+
+/* GSWIP MII Registers */
+#define GSWIP_MII_CFG0			0x00
+#define GSWIP_MII_CFG1			0x02
+#define GSWIP_MII_CFG5			0x04
+#define  GSWIP_MII_CFG_EN		BIT(14)
+#define  GSWIP_MII_CFG_LDCLKDIS		BIT(12)
+#define  GSWIP_MII_CFG_MODE_MIIP	0x0
+#define  GSWIP_MII_CFG_MODE_MIIM	0x1
+#define  GSWIP_MII_CFG_MODE_RMIIP	0x2
+#define  GSWIP_MII_CFG_MODE_RMIIM	0x3
+#define  GSWIP_MII_CFG_MODE_RGMII	0x4
+#define  GSWIP_MII_CFG_MODE_MASK	0xf
+#define  GSWIP_MII_CFG_RATE_M2P5	0x00
+#define  GSWIP_MII_CFG_RATE_M25	0x10
+#define  GSWIP_MII_CFG_RATE_M125	0x20
+#define  GSWIP_MII_CFG_RATE_M50	0x30
+#define  GSWIP_MII_CFG_RATE_AUTO	0x40
+#define  GSWIP_MII_CFG_RATE_MASK	0x70
+#define GSWIP_MII_PCDU0			0x01
+#define GSWIP_MII_PCDU1			0x03
+#define GSWIP_MII_PCDU5			0x05
+#define  GSWIP_MII_PCDU_TXDLY_MASK	GENMASK(2, 0)
+#define  GSWIP_MII_PCDU_RXDLY_MASK	GENMASK(9, 7)
+
+/* GSWIP Core Registers */
+#define GSWIP_SWRES			0x000
+#define  GSWIP_SWRES_R1			BIT(1)	/* GSWIP Software reset */
+#define  GSWIP_SWRES_R0			BIT(0)	/* GSWIP Hardware reset */
+#define GSWIP_VERSION			0x013
+#define  GSWIP_VERSION_REV_SHIFT	0
+#define  GSWIP_VERSION_REV_MASK		GENMASK(7, 0)
+#define  GSWIP_VERSION_MOD_SHIFT	8
+#define  GSWIP_VERSION_MOD_MASK		GENMASK(15, 8)
+#define   GSWIP_VERSION_2_0		0x100
+#define   GSWIP_VERSION_2_1		0x021
+#define   GSWIP_VERSION_2_2		0x122
+#define   GSWIP_VERSION_2_2_ETC		0x022
+
+#define GSWIP_BM_RAM_VAL(x)		(0x043 - (x))
+#define GSWIP_BM_RAM_ADDR		0x044
+#define GSWIP_BM_RAM_CTRL		0x045
+#define  GSWIP_BM_RAM_CTRL_BAS		BIT(15)
+#define  GSWIP_BM_RAM_CTRL_OPMOD	BIT(5)
+#define  GSWIP_BM_RAM_CTRL_ADDR_MASK	GENMASK(4, 0)
+#define GSWIP_BM_QUEUE_GCTRL		0x04A
+#define  GSWIP_BM_QUEUE_GCTRL_GL_MOD	BIT(10)
+/* buffer management Port Configuration Register */
+#define GSWIP_BM_PCFGp(p)		(0x080 + ((p) * 2))
+#define  GSWIP_BM_PCFG_CNTEN		BIT(0)	/* RMON Counter Enable */
+#define  GSWIP_BM_PCFG_IGCNT		BIT(1)	/* Ingress Special Tag RMON count */
+/* buffer management Port Control Register */
+#define GSWIP_BM_RMON_CTRLp(p)		(0x81 + ((p) * 2))
+#define  GSWIP_BM_CTRL_RMON_RAM1_RES	BIT(0)	/* Software Reset for RMON RAM 1 */
+#define  GSWIP_BM_CTRL_RMON_RAM2_RES	BIT(1)	/* Software Reset for RMON RAM 2 */
+
+/* PCE */
+#define GSWIP_PCE_TBL_KEY(x)		(0x447 - (x))
+#define GSWIP_PCE_TBL_MASK		0x448
+#define GSWIP_PCE_TBL_VAL(x)		(0x44D - (x))
+#define GSWIP_PCE_TBL_ADDR		0x44E
+#define GSWIP_PCE_TBL_CTRL		0x44F
+#define  GSWIP_PCE_TBL_CTRL_BAS		BIT(15)
+#define  GSWIP_PCE_TBL_CTRL_TYPE	BIT(13)
+#define  GSWIP_PCE_TBL_CTRL_VLD		BIT(12)
+#define  GSWIP_PCE_TBL_CTRL_KEYFORM	BIT(11)
+#define  GSWIP_PCE_TBL_CTRL_GMAP_MASK	GENMASK(10, 7)
+#define  GSWIP_PCE_TBL_CTRL_OPMOD_MASK	GENMASK(6, 5)
+#define  GSWIP_PCE_TBL_CTRL_OPMOD_ADRD	0x00
+#define  GSWIP_PCE_TBL_CTRL_OPMOD_ADWR	0x20
+#define  GSWIP_PCE_TBL_CTRL_OPMOD_KSRD	0x40
+#define  GSWIP_PCE_TBL_CTRL_OPMOD_KSWR	0x60
+#define  GSWIP_PCE_TBL_CTRL_ADDR_MASK	GENMASK(4, 0)
+#define GSWIP_PCE_PMAP1			0x453	/* Monitoring port map */
+#define GSWIP_PCE_PMAP2			0x454	/* Default Multicast port map */
+#define GSWIP_PCE_PMAP3			0x455	/* Default Unknown Unicast port map */
+#define GSWIP_PCE_GCTRL_0		0x456
+#define  GSWIP_PCE_GCTRL_0_MC_VALID	BIT(3)
+#define  GSWIP_PCE_GCTRL_0_VLAN		BIT(14) /* VLAN aware Switching */
+#define GSWIP_PCE_GCTRL_1		0x457
+#define  GSWIP_PCE_GCTRL_1_MAC_GLOCK	BIT(2)	/* MAC Address table lock */
+#define  GSWIP_PCE_GCTRL_1_MAC_GLOCK_MOD	BIT(3) /* Mac address table lock forwarding mode */
+#define GSWIP_PCE_PCTRL_0p(p)		(0x480 + ((p) * 0xA))
+#define  GSWIP_PCE_PCTRL_0_INGRESS	BIT(11)
+#define  GSWIP_PCE_PCTRL_0_PSTATE_LISTEN	0x0
+#define  GSWIP_PCE_PCTRL_0_PSTATE_RX		0x1
+#define  GSWIP_PCE_PCTRL_0_PSTATE_TX		0x2
+#define  GSWIP_PCE_PCTRL_0_PSTATE_LEARNING	0x3
+#define  GSWIP_PCE_PCTRL_0_PSTATE_FORWARDING	0x7
+#define  GSWIP_PCE_PCTRL_0_PSTATE_MASK	GENMASK(2, 0)
+
+#define GSWIP_MAC_FLEN			0x8C5
+#define GSWIP_MAC_CTRL_2p(p)		(0x905 + ((p) * 0xC))
+#define GSWIP_MAC_CTRL_2_MLEN		BIT(3) /* Maximum Untagged Frame Length */
+
+/* Ethernet Switch Fetch DMA Port Control Register */
+#define GSWIP_FDMA_PCTRLp(p)		(0xA80 + ((p) * 0x6))
+#define  GSWIP_FDMA_PCTRL_EN		BIT(0)	/* FDMA Port Enable */
+#define  GSWIP_FDMA_PCTRL_STEN		BIT(1)	/* Special Tag Insertion Enable */
+#define  GSWIP_FDMA_PCTRL_VLANMOD_MASK	GENMASK(4, 3)	/* VLAN Modification Control */
+#define  GSWIP_FDMA_PCTRL_VLANMOD_SHIFT	3	/* VLAN Modification Control */
+#define  GSWIP_FDMA_PCTRL_VLANMOD_DIS	(0x0 << GSWIP_FDMA_PCTRL_VLANMOD_SHIFT)
+#define  GSWIP_FDMA_PCTRL_VLANMOD_PRIO	(0x1 << GSWIP_FDMA_PCTRL_VLANMOD_SHIFT)
+#define  GSWIP_FDMA_PCTRL_VLANMOD_ID	(0x2 << GSWIP_FDMA_PCTRL_VLANMOD_SHIFT)
+#define  GSWIP_FDMA_PCTRL_VLANMOD_BOTH	(0x3 << GSWIP_FDMA_PCTRL_VLANMOD_SHIFT)
+
+/* Ethernet Switch Store DMA Port Control Register */
+#define GSWIP_SDMA_PCTRLp(p)		(0xBC0 + ((p) * 0x6))
+#define  GSWIP_SDMA_PCTRL_EN		BIT(0)	/* SDMA Port Enable */
+#define  GSWIP_SDMA_PCTRL_FCEN		BIT(1)	/* Flow Control Enable */
+#define  GSWIP_SDMA_PCTRL_PAUFWD	BIT(1)	/* Pause Frame Forwarding */
+
+#define XRX200_GPHY_FW_ALIGN	(16 * 1024)
+
+struct gswip_hw_info {
+	int max_ports;
+	int cpu_port;
+};
+
+struct xway_gphy_match_data {
+	char *fe_firmware_name;
+	char *ge_firmware_name;
+};
+
+struct gswip_gphy_fw {
+	struct clk *clk_gate;
+	struct reset_control *reset;
+	u32 fw_addr_offset;
+	char *fw_name;
+};
+
+struct gswip_priv {
+	__iomem void *gswip;
+	__iomem void *mdio;
+	__iomem void *mii;
+	const struct gswip_hw_info *hw_info;
+	const struct xway_gphy_match_data *gphy_fw_name_cfg;
+	struct dsa_switch *ds;
+	struct device *dev;
+	struct regmap *rcu_regmap;
+	int num_gphy_fw;
+	struct gswip_gphy_fw *gphy_fw;
+};
+
+struct gswip_rmon_cnt_desc {
+	unsigned int size;
+	unsigned int offset;
+	const char *name;
+};
+
+#define MIB_DESC(_size, _offset, _name) {.size = _size, .offset = _offset, .name = _name}
+
+static const struct gswip_rmon_cnt_desc gswip_rmon_cnt[] = {
+	/** Receive Packet Count (only packets that are accepted and not discarded). */
+	MIB_DESC(1, 0x1F, "RxGoodPkts"),
+	MIB_DESC(1, 0x23, "RxUnicastPkts"),
+	MIB_DESC(1, 0x22, "RxMulticastPkts"),
+	MIB_DESC(1, 0x21, "RxFCSErrorPkts"),
+	MIB_DESC(1, 0x1D, "RxUnderSizeGoodPkts"),
+	MIB_DESC(1, 0x1E, "RxUnderSizeErrorPkts"),
+	MIB_DESC(1, 0x1B, "RxOversizeGoodPkts"),
+	MIB_DESC(1, 0x1C, "RxOversizeErrorPkts"),
+	MIB_DESC(1, 0x20, "RxGoodPausePkts"),
+	MIB_DESC(1, 0x1A, "RxAlignErrorPkts"),
+	MIB_DESC(1, 0x12, "Rx64BytePkts"),
+	MIB_DESC(1, 0x13, "Rx127BytePkts"),
+	MIB_DESC(1, 0x14, "Rx255BytePkts"),
+	MIB_DESC(1, 0x15, "Rx511BytePkts"),
+	MIB_DESC(1, 0x16, "Rx1023BytePkts"),
+	/** Receive Size 1024-1522 (or more, if configured) Packet Count. */
+	MIB_DESC(1, 0x17, "RxMaxBytePkts"),
+	MIB_DESC(1, 0x18, "RxDroppedPkts"),
+	MIB_DESC(1, 0x19, "RxFilteredPkts"),
+	MIB_DESC(2, 0x24, "RxGoodBytes"),
+	MIB_DESC(2, 0x26, "RxBadBytes"),
+	MIB_DESC(1, 0x11, "TxAcmDroppedPkts"),
+	MIB_DESC(1, 0x0C, "TxGoodPkts"),
+	MIB_DESC(1, 0x06, "TxUnicastPkts"),
+	MIB_DESC(1, 0x07, "TxMulticastPkts"),
+	MIB_DESC(1, 0x00, "Tx64BytePkts"),
+	MIB_DESC(1, 0x01, "Tx127BytePkts"),
+	MIB_DESC(1, 0x02, "Tx255BytePkts"),
+	MIB_DESC(1, 0x03, "Tx511BytePkts"),
+	MIB_DESC(1, 0x04, "Tx1023BytePkts"),
+	/** Transmit Size 1024-1522 (or more, if configured) Packet Count. */
+	MIB_DESC(1, 0x05, "TxMaxBytePkts"),
+	MIB_DESC(1, 0x08, "TxSingleCollCount"),
+	MIB_DESC(1, 0x09, "TxMultCollCount"),
+	MIB_DESC(1, 0x0A, "TxLateCollCount"),
+	MIB_DESC(1, 0x0B, "TxExcessCollCount"),
+	MIB_DESC(1, 0x0D, "TxPauseCount"),
+	MIB_DESC(1, 0x10, "TxDroppedPkts"),
+	MIB_DESC(2, 0x0E, "TxGoodBytes"),
+};
+
+static u32 gswip_switch_r(struct gswip_priv *priv, u32 offset)
+{
+	return __raw_readl(priv->gswip + (offset * 4));
+}
+
+static void gswip_switch_w(struct gswip_priv *priv, u32 val, u32 offset)
+{
+	__raw_writel(val, priv->gswip + (offset * 4));
+}
+
+static void gswip_switch_mask(struct gswip_priv *priv, u32 clear, u32 set,
+			      u32 offset)
+{
+	u32 val = gswip_switch_r(priv, offset);
+
+	val &= ~(clear);
+	val |= set;
+	gswip_switch_w(priv, val, offset);
+}
+
+static u32 gswip_switch_r_timeout(struct gswip_priv *priv, u32 offset,
+				  u32 cleared)
+{
+	u32 val;
+
+	return readx_poll_timeout(__raw_readl, priv->gswip + (offset * 4), val,
+				  (val & cleared) == 0, 20, 50000);
+}
+
+static u32 gswip_mdio_r(struct gswip_priv *priv, u32 offset)
+{
+	return __raw_readl(priv->mdio + (offset * 4));
+}
+
+static void gswip_mdio_w(struct gswip_priv *priv, u32 val, u32 offset)
+{
+	__raw_writel(val, priv->mdio + (offset * 4));
+}
+
+static void gswip_mdio_mask(struct gswip_priv *priv, u32 clear, u32 set,
+			    u32 offset)
+{
+	u32 val = gswip_mdio_r(priv, offset);
+
+	val &= ~(clear);
+	val |= set;
+	gswip_mdio_w(priv, val, offset);
+}
+
+static u32 gswip_mii_r(struct gswip_priv *priv, u32 offset)
+{
+	return __raw_readl(priv->mii + (offset * 4));
+}
+
+static void gswip_mii_w(struct gswip_priv *priv, u32 val, u32 offset)
+{
+	__raw_writel(val, priv->mii + (offset * 4));
+}
+
+static void gswip_mii_mask(struct gswip_priv *priv, u32 clear, u32 set,
+			   u32 offset)
+{
+	u32 val = gswip_mii_r(priv, offset);
+
+	val &= ~(clear);
+	val |= set;
+	gswip_mii_w(priv, val, offset);
+}
+
+static void gswip_mii_mask_cfg(struct gswip_priv *priv, u32 clear, u32 set,
+			       int port)
+{
+	switch (port) {
+	case 0:
+		gswip_mii_mask(priv, clear, set, GSWIP_MII_CFG0);
+		break;
+	case 1:
+		gswip_mii_mask(priv, clear, set, GSWIP_MII_CFG1);
+		break;
+	case 5:
+		gswip_mii_mask(priv, clear, set, GSWIP_MII_CFG5);
+		break;
+	}
+}
+
+static void gswip_mii_mask_pcdu(struct gswip_priv *priv, u32 clear, u32 set,
+				int port)
+{
+	switch (port) {
+	case 0:
+		gswip_mii_mask(priv, clear, set, GSWIP_MII_PCDU0);
+		break;
+	case 1:
+		gswip_mii_mask(priv, clear, set, GSWIP_MII_PCDU1);
+		break;
+	case 5:
+		gswip_mii_mask(priv, clear, set, GSWIP_MII_PCDU5);
+		break;
+	}
+}
+
+static int gswip_mdio_poll(struct gswip_priv *priv)
+{
+	int cnt = 100;
+
+	while (likely(cnt--)) {
+		u32 ctrl = gswip_mdio_r(priv, GSWIP_MDIO_CTRL);
+
+		if ((ctrl & GSWIP_MDIO_CTRL_BUSY) == 0)
+			return 0;
+		usleep_range(20, 40);
+	}
+
+	return -ETIMEDOUT;
+}
+
+static int gswip_mdio_wr(struct mii_bus *bus, int addr, int reg, u16 val)
+{
+	struct gswip_priv *priv = bus->priv;
+	int err;
+
+	err = gswip_mdio_poll(priv);
+	if (err) {
+		dev_err(&bus->dev, "waiting for MDIO bus busy timed out\n");
+		return err;
+	}
+
+	gswip_mdio_w(priv, val, GSWIP_MDIO_WRITE);
+	gswip_mdio_w(priv, GSWIP_MDIO_CTRL_BUSY | GSWIP_MDIO_CTRL_WR |
+		((addr & GSWIP_MDIO_CTRL_PHYAD_MASK) << GSWIP_MDIO_CTRL_PHYAD_SHIFT) |
+		(reg & GSWIP_MDIO_CTRL_REGAD_MASK),
+		GSWIP_MDIO_CTRL);
+
+	return 0;
+}
+
+static int gswip_mdio_rd(struct mii_bus *bus, int addr, int reg)
+{
+	struct gswip_priv *priv = bus->priv;
+	int err;
+
+	err = gswip_mdio_poll(priv);
+	if (err) {
+		dev_err(&bus->dev, "waiting for MDIO bus busy timed out\n");
+		return err;
+	}
+
+	gswip_mdio_w(priv, GSWIP_MDIO_CTRL_BUSY | GSWIP_MDIO_CTRL_RD |
+		((addr & GSWIP_MDIO_CTRL_PHYAD_MASK) << GSWIP_MDIO_CTRL_PHYAD_SHIFT) |
+		(reg & GSWIP_MDIO_CTRL_REGAD_MASK),
+		GSWIP_MDIO_CTRL);
+
+	err = gswip_mdio_poll(priv);
+	if (err) {
+		dev_err(&bus->dev, "waiting for MDIO bus busy timed out\n");
+		return err;
+	}
+
+	return gswip_mdio_r(priv, GSWIP_MDIO_READ);
+}
+
+static int gswip_mdio(struct gswip_priv *priv, struct device_node *mdio_np)
+{
+	struct dsa_switch *ds = priv->ds;
+
+	ds->slave_mii_bus = devm_mdiobus_alloc(priv->dev);
+	if (!ds->slave_mii_bus)
+		return -ENOMEM;
+
+	ds->slave_mii_bus->priv = priv;
+	ds->slave_mii_bus->read = gswip_mdio_rd;
+	ds->slave_mii_bus->write = gswip_mdio_wr;
+	ds->slave_mii_bus->name = "lantiq,xrx200-mdio";
+	snprintf(ds->slave_mii_bus->id, MII_BUS_ID_SIZE, "%s-mii",
+		 dev_name(priv->dev));
+	ds->slave_mii_bus->parent = priv->dev;
+	ds->slave_mii_bus->phy_mask = ~ds->phys_mii_mask;
+
+	return of_mdiobus_register(ds->slave_mii_bus, mdio_np);
+}
+
+static int gswip_port_enable(struct dsa_switch *ds, int port,
+			     struct phy_device *phydev)
+{
+	struct gswip_priv *priv = ds->priv;
+
+	/* RMON Counter Enable for port */
+	gswip_switch_w(priv, GSWIP_BM_PCFG_CNTEN, GSWIP_BM_PCFGp(port));
+
+	/* enable port fetch/store dma & VLAN Modification */
+	gswip_switch_mask(priv, 0, GSWIP_FDMA_PCTRL_EN |
+				   GSWIP_FDMA_PCTRL_VLANMOD_BOTH,
+			 GSWIP_FDMA_PCTRLp(port));
+	gswip_switch_mask(priv, 0, GSWIP_SDMA_PCTRL_EN,
+			  GSWIP_SDMA_PCTRLp(port));
+	gswip_switch_mask(priv, 0, GSWIP_PCE_PCTRL_0_INGRESS,
+			  GSWIP_PCE_PCTRL_0p(port));
+
+	if (!dsa_is_cpu_port(ds, port)) {
+		u32 macconf = GSWIP_MDIO_PHY_LINK_AUTO |
+			      GSWIP_MDIO_PHY_SPEED_AUTO |
+			      GSWIP_MDIO_PHY_FDUP_AUTO |
+			      GSWIP_MDIO_PHY_FCONTX_AUTO |
+			      GSWIP_MDIO_PHY_FCONRX_AUTO |
+			      (phydev->mdio.addr & GSWIP_MDIO_PHY_ADDR_MASK);
+
+		gswip_mdio_w(priv, macconf, GSWIP_MDIO_PHYp(port));
+		/* Activate MDIO auto polling */
+		gswip_mdio_mask(priv, 0, BIT(port), GSWIP_MDIO_MDC_CFG0);
+	}
+
+	return 0;
+}
+
+static void gswip_port_disable(struct dsa_switch *ds, int port,
+			       struct phy_device *phy)
+{
+	struct gswip_priv *priv = ds->priv;
+
+	if (!dsa_is_cpu_port(ds, port)) {
+		gswip_mdio_mask(priv, GSWIP_MDIO_PHY_LINK_DOWN,
+				GSWIP_MDIO_PHY_LINK_MASK,
+				GSWIP_MDIO_PHYp(port));
+		/* Deactivate MDIO auto polling */
+		gswip_mdio_mask(priv, BIT(port), 0, GSWIP_MDIO_MDC_CFG0);
+	}
+
+	gswip_switch_mask(priv, GSWIP_FDMA_PCTRL_EN, 0,
+			  GSWIP_FDMA_PCTRLp(port));
+	gswip_switch_mask(priv, GSWIP_SDMA_PCTRL_EN, 0,
+			  GSWIP_SDMA_PCTRLp(port));
+}
+
+static int gswip_pce_load_microcode(struct gswip_priv *priv)
+{
+	int i;
+	int err;
+
+	gswip_switch_mask(priv, GSWIP_PCE_TBL_CTRL_ADDR_MASK |
+				GSWIP_PCE_TBL_CTRL_OPMOD_MASK,
+			  GSWIP_PCE_TBL_CTRL_OPMOD_ADWR, GSWIP_PCE_TBL_CTRL);
+	gswip_switch_w(priv, 0, GSWIP_PCE_TBL_MASK);
+
+	for (i = 0; i < ARRAY_SIZE(gswip_pce_microcode); i++) {
+		gswip_switch_w(priv, i, GSWIP_PCE_TBL_ADDR);
+		gswip_switch_w(priv, gswip_pce_microcode[i].val_0,
+			       GSWIP_PCE_TBL_VAL(0));
+		gswip_switch_w(priv, gswip_pce_microcode[i].val_1,
+			       GSWIP_PCE_TBL_VAL(1));
+		gswip_switch_w(priv, gswip_pce_microcode[i].val_2,
+			       GSWIP_PCE_TBL_VAL(2));
+		gswip_switch_w(priv, gswip_pce_microcode[i].val_3,
+			       GSWIP_PCE_TBL_VAL(3));
+
+		/* start the table access: */
+		gswip_switch_mask(priv, 0, GSWIP_PCE_TBL_CTRL_BAS,
+				  GSWIP_PCE_TBL_CTRL);
+		err = gswip_switch_r_timeout(priv, GSWIP_PCE_TBL_CTRL,
+					     GSWIP_PCE_TBL_CTRL_BAS);
+		if (err)
+			return err;
+	}
+
+	/* tell the switch that the microcode is loaded */
+	gswip_switch_mask(priv, 0, GSWIP_PCE_GCTRL_0_MC_VALID,
+			  GSWIP_PCE_GCTRL_0);
+
+	return 0;
+}
+
+static int gswip_setup(struct dsa_switch *ds)
+{
+	struct gswip_priv *priv = ds->priv;
+	unsigned int cpu_port = priv->hw_info->cpu_port;
+	int i;
+	int err;
+
+	gswip_switch_w(priv, GSWIP_SWRES_R0, GSWIP_SWRES);
+	usleep_range(5000, 10000);
+	gswip_switch_w(priv, 0, GSWIP_SWRES);
+
+	/* disable port fetch/store dma on all ports */
+	for (i = 0; i < priv->hw_info->max_ports; i++)
+		gswip_port_disable(ds, i, NULL);
+
+	/* enable Switch */
+	gswip_mdio_mask(priv, 0, GSWIP_MDIO_GLOB_ENABLE, GSWIP_MDIO_GLOB);
+
+	err = gswip_pce_load_microcode(priv);
+	if (err) {
+		dev_err(priv->dev, "writing PCE microcode failed, %i\n", err);
+		return err;
+	}
+
+	/* Default unknown Broadcast/Multicast/Unicast port maps */
+	gswip_switch_w(priv, BIT(cpu_port), GSWIP_PCE_PMAP1);
+	gswip_switch_w(priv, BIT(cpu_port), GSWIP_PCE_PMAP2);
+	gswip_switch_w(priv, BIT(cpu_port), GSWIP_PCE_PMAP3);
+
+	/* disable PHY auto polling */
+	gswip_mdio_w(priv, 0x0, GSWIP_MDIO_MDC_CFG0);
+	/* Configure the MDIO Clock 2.5 MHz */
+	gswip_mdio_mask(priv, 0xff, 0x09, GSWIP_MDIO_MDC_CFG1);
+
+	/* Disable the xMII link */
+	gswip_mii_mask_cfg(priv, GSWIP_MII_CFG_EN, 0, 0);
+	gswip_mii_mask_cfg(priv, GSWIP_MII_CFG_EN, 0, 1);
+	gswip_mii_mask_cfg(priv, GSWIP_MII_CFG_EN, 0, 5);
+
+	/* enable special tag insertion on cpu port */
+	gswip_switch_mask(priv, 0, GSWIP_FDMA_PCTRL_STEN,
+			  GSWIP_FDMA_PCTRLp(cpu_port));
+
+	gswip_switch_mask(priv, 0, GSWIP_MAC_CTRL_2_MLEN,
+			  GSWIP_MAC_CTRL_2p(cpu_port));
+	gswip_switch_w(priv, VLAN_ETH_FRAME_LEN + 8, GSWIP_MAC_FLEN);
+	gswip_switch_mask(priv, 0, GSWIP_BM_QUEUE_GCTRL_GL_MOD,
+			  GSWIP_BM_QUEUE_GCTRL);
+
+	/* VLAN aware Switching */
+	gswip_switch_mask(priv, 0, GSWIP_PCE_GCTRL_0_VLAN, GSWIP_PCE_GCTRL_0);
+
+	/* Mac Address Table Lock */
+	gswip_switch_mask(priv, 0, GSWIP_PCE_GCTRL_1_MAC_GLOCK |
+				   GSWIP_PCE_GCTRL_1_MAC_GLOCK_MOD,
+			  GSWIP_PCE_GCTRL_1);
+
+	gswip_port_enable(ds, cpu_port, NULL);
+	return 0;
+}
+
+static enum dsa_tag_protocol gswip_get_tag_protocol(struct dsa_switch *ds,
+						    int port)
+{
+	return DSA_TAG_PROTO_GSWIP;
+}
+
+static void gswip_phylink_validate(struct dsa_switch *ds, int port,
+				   unsigned long *supported,
+				   struct phylink_link_state *state)
+{
+	__ETHTOOL_DECLARE_LINK_MODE_MASK(mask) = { 0, };
+
+	switch (port) {
+	case 0:
+	case 1:
+		if (!phy_interface_mode_is_rgmii(state->interface) &&
+		    state->interface != PHY_INTERFACE_MODE_MII &&
+		    state->interface != PHY_INTERFACE_MODE_REVMII &&
+		    state->interface != PHY_INTERFACE_MODE_RMII) {
+			bitmap_zero(supported, __ETHTOOL_LINK_MODE_MASK_NBITS);
+			dev_err(ds->dev,
+			"Unsupported interface: %d\n", state->interface);
+			return;
+		}
+		break;
+	case 2:
+	case 3:
+	case 4:
+		if (state->interface != PHY_INTERFACE_MODE_INTERNAL) {
+			bitmap_zero(supported, __ETHTOOL_LINK_MODE_MASK_NBITS);
+			dev_err(ds->dev,
+			"Unsupported interface: %d\n", state->interface);
+			return;
+		}
+		break;
+	case 5:
+		if (!phy_interface_mode_is_rgmii(state->interface) &&
+		    state->interface != PHY_INTERFACE_MODE_INTERNAL) {
+			bitmap_zero(supported, __ETHTOOL_LINK_MODE_MASK_NBITS);
+			dev_err(ds->dev,
+			"Unsupported interface: %d\n", state->interface);
+			return;
+		}
+		break;
+	}
+
+	/* Allow all the expected bits */
+	phylink_set(mask, Autoneg);
+	phylink_set_port_modes(mask);
+	phylink_set(mask, Pause);
+	phylink_set(mask, Asym_Pause);
+
+	/* With the exclusion of MII and Reverse MII, we support Gigabit,
+	 * including Half duplex
+	 */
+	if (state->interface != PHY_INTERFACE_MODE_MII &&
+	    state->interface != PHY_INTERFACE_MODE_REVMII) {
+		phylink_set(mask, 1000baseT_Full);
+		phylink_set(mask, 1000baseT_Half);
+	}
+
+	phylink_set(mask, 10baseT_Half);
+	phylink_set(mask, 10baseT_Full);
+	phylink_set(mask, 100baseT_Half);
+	phylink_set(mask, 100baseT_Full);
+
+	bitmap_and(supported, supported, mask,
+		   __ETHTOOL_LINK_MODE_MASK_NBITS);
+	bitmap_and(state->advertising, state->advertising, mask,
+		   __ETHTOOL_LINK_MODE_MASK_NBITS);
+}
+
+static void gswip_phylink_mac_config(struct dsa_switch *ds, int port,
+				     unsigned int mode,
+				     const struct phylink_link_state *state)
+{
+	struct gswip_priv *priv = ds->priv;
+	u32 miicfg = 0;
+
+	miicfg |= GSWIP_MII_CFG_LDCLKDIS;
+
+	switch (state->interface) {
+	case PHY_INTERFACE_MODE_MII:
+	case PHY_INTERFACE_MODE_INTERNAL:
+		miicfg |= GSWIP_MII_CFG_MODE_MIIM;
+		break;
+	case PHY_INTERFACE_MODE_REVMII:
+		miicfg |= GSWIP_MII_CFG_MODE_MIIP;
+		break;
+	case PHY_INTERFACE_MODE_RMII:
+		miicfg |= GSWIP_MII_CFG_MODE_RMIIM;
+		break;
+	case PHY_INTERFACE_MODE_RGMII:
+	case PHY_INTERFACE_MODE_RGMII_ID:
+	case PHY_INTERFACE_MODE_RGMII_RXID:
+	case PHY_INTERFACE_MODE_RGMII_TXID:
+		miicfg |= GSWIP_MII_CFG_MODE_RGMII;
+		break;
+	default:
+		dev_err(ds->dev,
+			"Unsupported interface: %d\n", state->interface);
+		return;
+	}
+	gswip_mii_mask_cfg(priv, GSWIP_MII_CFG_MODE_MASK, miicfg, port);
+
+	switch (state->interface) {
+	case PHY_INTERFACE_MODE_RGMII_ID:
+		gswip_mii_mask_pcdu(priv, GSWIP_MII_PCDU_TXDLY_MASK |
+					  GSWIP_MII_PCDU_RXDLY_MASK, 0, port);
+		break;
+	case PHY_INTERFACE_MODE_RGMII_RXID:
+		gswip_mii_mask_pcdu(priv, GSWIP_MII_PCDU_RXDLY_MASK, 0, port);
+		break;
+	case PHY_INTERFACE_MODE_RGMII_TXID:
+		gswip_mii_mask_pcdu(priv, GSWIP_MII_PCDU_TXDLY_MASK, 0, port);
+		break;
+	default:
+		break;
+	}
+}
+
+static void gswip_phylink_mac_link_down(struct dsa_switch *ds, int port,
+					unsigned int mode,
+					phy_interface_t interface)
+{
+	struct gswip_priv *priv = ds->priv;
+
+	gswip_mii_mask_cfg(priv, GSWIP_MII_CFG_EN, 0, port);
+}
+
+static void gswip_phylink_mac_link_up(struct dsa_switch *ds, int port,
+				      unsigned int mode,
+				      phy_interface_t interface,
+				      struct phy_device *phydev)
+{
+	struct gswip_priv *priv = ds->priv;
+
+	/* Enable the xMII interface only for the external PHY */
+	if (interface != PHY_INTERFACE_MODE_INTERNAL)
+		gswip_mii_mask_cfg(priv, 0, GSWIP_MII_CFG_EN, port);
+}
+
+static void gswip_get_strings(struct dsa_switch *ds, int port, u32 stringset,
+			      uint8_t *data)
+{
+	int i;
+
+	if (stringset != ETH_SS_STATS)
+		return;
+
+	for (i = 0; i < ARRAY_SIZE(gswip_rmon_cnt); i++)
+		strncpy(data + i * ETH_GSTRING_LEN, gswip_rmon_cnt[i].name,
+			ETH_GSTRING_LEN);
+}
+
+static u32 gswip_bcm_ram_entry_read(struct gswip_priv *priv, u32 table,
+				    u32 index)
+{
+	u32 result;
+	int err;
+
+	gswip_switch_w(priv, index, GSWIP_BM_RAM_ADDR);
+	gswip_switch_mask(priv, GSWIP_BM_RAM_CTRL_ADDR_MASK |
+				GSWIP_BM_RAM_CTRL_OPMOD,
+			      table | GSWIP_BM_RAM_CTRL_BAS,
+			      GSWIP_BM_RAM_CTRL);
+
+	err = gswip_switch_r_timeout(priv, GSWIP_BM_RAM_CTRL,
+				     GSWIP_BM_RAM_CTRL_BAS);
+	if (err) {
+		dev_err(priv->dev, "timeout while reading table: %u, index: %u\n",
+			table, index);
+		return 0;
+	}
+
+	result = gswip_switch_r(priv, GSWIP_BM_RAM_VAL(0));
+	result |= gswip_switch_r(priv, GSWIP_BM_RAM_VAL(1)) << 16;
+
+	return result;
+}
+
+static void gswip_get_ethtool_stats(struct dsa_switch *ds, int port,
+				    uint64_t *data)
+{
+	struct gswip_priv *priv = ds->priv;
+	const struct gswip_rmon_cnt_desc *rmon_cnt;
+	int i;
+	u64 high;
+
+	for (i = 0; i < ARRAY_SIZE(gswip_rmon_cnt); i++) {
+		rmon_cnt = &gswip_rmon_cnt[i];
+
+		data[i] = gswip_bcm_ram_entry_read(priv, port,
+						   rmon_cnt->offset);
+		if (rmon_cnt->size == 2) {
+			high = gswip_bcm_ram_entry_read(priv, port,
+							rmon_cnt->offset + 1);
+			data[i] |= high << 32;
+		}
+	}
+}
+
+static int gswip_get_sset_count(struct dsa_switch *ds, int port, int sset)
+{
+	if (sset != ETH_SS_STATS)
+		return 0;
+
+	return ARRAY_SIZE(gswip_rmon_cnt);
+}
+
+static const struct dsa_switch_ops gswip_switch_ops = {
+	.get_tag_protocol	= gswip_get_tag_protocol,
+	.setup			= gswip_setup,
+	.port_enable		= gswip_port_enable,
+	.port_disable		= gswip_port_disable,
+	.phylink_validate	= gswip_phylink_validate,
+	.phylink_mac_config	= gswip_phylink_mac_config,
+	.phylink_mac_link_down	= gswip_phylink_mac_link_down,
+	.phylink_mac_link_up	= gswip_phylink_mac_link_up,
+	.get_strings		= gswip_get_strings,
+	.get_ethtool_stats	= gswip_get_ethtool_stats,
+	.get_sset_count		= gswip_get_sset_count,
+};
+
+static const struct xway_gphy_match_data xrx200a1x_gphy_data = {
+	.fe_firmware_name = "lantiq/xrx200_phy22f_a14.bin",
+	.ge_firmware_name = "lantiq/xrx200_phy11g_a14.bin",
+};
+
+static const struct xway_gphy_match_data xrx200a2x_gphy_data = {
+	.fe_firmware_name = "lantiq/xrx200_phy22f_a22.bin",
+	.ge_firmware_name = "lantiq/xrx200_phy11g_a22.bin",
+};
+
+static const struct xway_gphy_match_data xrx300_gphy_data = {
+	.fe_firmware_name = "lantiq/xrx300_phy22f_a21.bin",
+	.ge_firmware_name = "lantiq/xrx300_phy11g_a21.bin",
+};
+
+static const struct of_device_id xway_gphy_match[] = {
+	{ .compatible = "lantiq,xrx200-gphy-fw", .data = NULL },
+	{ .compatible = "lantiq,xrx200a1x-gphy-fw", .data = &xrx200a1x_gphy_data },
+	{ .compatible = "lantiq,xrx200a2x-gphy-fw", .data = &xrx200a2x_gphy_data },
+	{ .compatible = "lantiq,xrx300-gphy-fw", .data = &xrx300_gphy_data },
+	{ .compatible = "lantiq,xrx330-gphy-fw", .data = &xrx300_gphy_data },
+	{},
+};
+
+static int gswip_gphy_fw_load(struct gswip_priv *priv, struct gswip_gphy_fw *gphy_fw)
+{
+	struct device *dev = priv->dev;
+	const struct firmware *fw;
+	void *fw_addr;
+	dma_addr_t dma_addr;
+	dma_addr_t dev_addr;
+	size_t size;
+	int ret;
+
+	ret = clk_prepare_enable(gphy_fw->clk_gate);
+	if (ret)
+		return ret;
+
+	reset_control_assert(gphy_fw->reset);
+
+	ret = request_firmware(&fw, gphy_fw->fw_name, dev);
+	if (ret) {
+		dev_err(dev, "failed to load firmware: %s, error: %i\n",
+			gphy_fw->fw_name, ret);
+		return ret;
+	}
+
+	/* GPHY cores need the firmware code in a persistent and contiguous
+	 * memory area with a 16 kB boundary aligned start address.
+	 */
+	size = fw->size + XRX200_GPHY_FW_ALIGN;
+
+	fw_addr = dmam_alloc_coherent(dev, size, &dma_addr, GFP_KERNEL);
+	if (fw_addr) {
+		fw_addr = PTR_ALIGN(fw_addr, XRX200_GPHY_FW_ALIGN);
+		dev_addr = ALIGN(dma_addr, XRX200_GPHY_FW_ALIGN);
+		memcpy(fw_addr, fw->data, fw->size);
+	} else {
+		dev_err(dev, "failed to alloc firmware memory\n");
+		release_firmware(fw);
+		return -ENOMEM;
+	}
+
+	release_firmware(fw);
+
+	ret = regmap_write(priv->rcu_regmap, gphy_fw->fw_addr_offset, dev_addr);
+	if (ret)
+		return ret;
+
+	reset_control_deassert(gphy_fw->reset);
+
+	return ret;
+}
+
+static int gswip_gphy_fw_probe(struct gswip_priv *priv,
+			       struct gswip_gphy_fw *gphy_fw,
+			       struct device_node *gphy_fw_np, int i)
+{
+	struct device *dev = priv->dev;
+	u32 gphy_mode;
+	int ret;
+	char gphyname[10];
+
+	snprintf(gphyname, sizeof(gphyname), "gphy%d", i);
+
+	gphy_fw->clk_gate = devm_clk_get(dev, gphyname);
+	if (IS_ERR(gphy_fw->clk_gate)) {
+		dev_err(dev, "Failed to lookup gate clock\n");
+		return PTR_ERR(gphy_fw->clk_gate);
+	}
+
+	ret = of_property_read_u32(gphy_fw_np, "reg", &gphy_fw->fw_addr_offset);
+	if (ret)
+		return ret;
+
+	ret = of_property_read_u32(gphy_fw_np, "lantiq,gphy-mode", &gphy_mode);
+	/* Default to GE mode */
+	if (ret)
+		gphy_mode = GPHY_MODE_GE;
+
+	switch (gphy_mode) {
+	case GPHY_MODE_FE:
+		gphy_fw->fw_name = priv->gphy_fw_name_cfg->fe_firmware_name;
+		break;
+	case GPHY_MODE_GE:
+		gphy_fw->fw_name = priv->gphy_fw_name_cfg->ge_firmware_name;
+		break;
+	default:
+		dev_err(dev, "Unknown GPHY mode %d\n", gphy_mode);
+		return -EINVAL;
+	}
+
+	gphy_fw->reset = of_reset_control_array_get_exclusive(gphy_fw_np);
+	if (IS_ERR(gphy_fw->reset)) {
+		if (PTR_ERR(gphy_fw->reset) != -EPROBE_DEFER)
+			dev_err(dev, "Failed to lookup gphy reset\n");
+		return PTR_ERR(gphy_fw->reset);
+	}
+
+	return gswip_gphy_fw_load(priv, gphy_fw);
+}
+
+static void gswip_gphy_fw_remove(struct gswip_priv *priv,
+				 struct gswip_gphy_fw *gphy_fw)
+{
+	int ret;
+
+	/* check if the device was fully probed */
+	if (!gphy_fw->fw_name)
+		return;
+
+	ret = regmap_write(priv->rcu_regmap, gphy_fw->fw_addr_offset, 0);
+	if (ret)
+		dev_err(priv->dev, "can not reset GPHY FW pointer\n");
+
+	clk_disable_unprepare(gphy_fw->clk_gate);
+
+	reset_control_put(gphy_fw->reset);
+}
+
+static int gswip_gphy_fw_list(struct gswip_priv *priv,
+			      struct device_node *gphy_fw_list_np, u32 version)
+{
+	struct device *dev = priv->dev;
+	struct device_node *gphy_fw_np;
+	const struct of_device_id *match;
+	int err;
+	int i = 0;
+
+	/* The VRX200 rev 1.1 uses the GSWIP 2.0 and needs the older
+	 * GPHY firmware. The VRX200 rev 1.2 uses the GSWIP 2.1 and also
+	 * needs a different GPHY firmware.
+	 */
+	if (of_device_is_compatible(gphy_fw_list_np, "lantiq,xrx200-gphy-fw")) {
+		switch (version) {
+		case GSWIP_VERSION_2_0:
+			priv->gphy_fw_name_cfg = &xrx200a1x_gphy_data;
+			break;
+		case GSWIP_VERSION_2_1:
+			priv->gphy_fw_name_cfg = &xrx200a2x_gphy_data;
+			break;
+		default:
+			dev_err(dev, "unknown GSWIP version: 0x%x\n", version);
+			return -ENOENT;
+		}
+	}
+
+	match = of_match_node(xway_gphy_match, gphy_fw_list_np);
+	if (match && match->data)
+		priv->gphy_fw_name_cfg = match->data;
+
+	if (!priv->gphy_fw_name_cfg) {
+		dev_err(dev, "GPHY compatible type not supported\n");
+		return -ENOENT;
+	}
+
+	priv->num_gphy_fw = of_get_available_child_count(gphy_fw_list_np);
+	if (!priv->num_gphy_fw)
+		return -ENOENT;
+
+	priv->rcu_regmap = syscon_regmap_lookup_by_phandle(gphy_fw_list_np,
+							   "lantiq,rcu");
+	if (IS_ERR(priv->rcu_regmap))
+		return PTR_ERR(priv->rcu_regmap);
+
+	priv->gphy_fw = devm_kmalloc_array(dev, priv->num_gphy_fw,
+					   sizeof(*priv->gphy_fw),
+					   GFP_KERNEL | __GFP_ZERO);
+	if (!priv->gphy_fw)
+		return -ENOMEM;
+
+	for_each_available_child_of_node(gphy_fw_list_np, gphy_fw_np) {
+		err = gswip_gphy_fw_probe(priv, &priv->gphy_fw[i],
+					  gphy_fw_np, i);
+		if (err)
+			goto remove_gphy;
+		i++;
+	}
+
+	return 0;
+
+remove_gphy:
+	for (i = 0; i < priv->num_gphy_fw; i++)
+		gswip_gphy_fw_remove(priv, &priv->gphy_fw[i]);
+	return err;
+}
+
+static int gswip_probe(struct platform_device *pdev)
+{
+	struct gswip_priv *priv;
+	struct resource *gswip_res, *mdio_res, *mii_res;
+	struct device_node *mdio_np, *gphy_fw_np;
+	struct device *dev = &pdev->dev;
+	int err;
+	int i;
+	u32 version;
+
+	priv = devm_kzalloc(dev, sizeof(*priv), GFP_KERNEL);
+	if (!priv)
+		return -ENOMEM;
+
+	gswip_res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
+	priv->gswip = devm_ioremap_resource(dev, gswip_res);
+	if (IS_ERR(priv->gswip))
+		return PTR_ERR(priv->gswip);
+
+	mdio_res = platform_get_resource(pdev, IORESOURCE_MEM, 1);
+	priv->mdio = devm_ioremap_resource(dev, mdio_res);
+	if (IS_ERR(priv->mdio))
+		return PTR_ERR(priv->mdio);
+
+	mii_res = platform_get_resource(pdev, IORESOURCE_MEM, 2);
+	priv->mii = devm_ioremap_resource(dev, mii_res);
+	if (IS_ERR(priv->mii))
+		return PTR_ERR(priv->mii);
+
+	priv->hw_info = of_device_get_match_data(dev);
+	if (!priv->hw_info)
+		return -EINVAL;
+
+	priv->ds = dsa_switch_alloc(dev, priv->hw_info->max_ports);
+	if (!priv->ds)
+		return -ENOMEM;
+
+	priv->ds->priv = priv;
+	priv->ds->ops = &gswip_switch_ops;
+	priv->dev = dev;
+	version = gswip_switch_r(priv, GSWIP_VERSION);
+
+	/* load the GPHY firmware */
+	gphy_fw_np = of_find_compatible_node(pdev->dev.of_node, NULL,
+					     "lantiq,gphy-fw");
+	if (gphy_fw_np) {
+		err = gswip_gphy_fw_list(priv, gphy_fw_np, version);
+		if (err) {
+			dev_err(dev, "gphy fw probe failed\n");
+			return err;
+		}
+	}
+
+	/* bring up the mdio bus */
+	mdio_np = of_find_compatible_node(pdev->dev.of_node, NULL,
+					  "lantiq,xrx200-mdio");
+	if (mdio_np) {
+		err = gswip_mdio(priv, mdio_np);
+		if (err) {
+			dev_err(dev, "mdio probe failed\n");
+			goto gphy_fw;
+		}
+	}
+
+	err = dsa_register_switch(priv->ds);
+	if (err) {
+		dev_err(dev, "dsa switch register failed: %i\n", err);
+		goto mdio_bus;
+	}
+	if (priv->ds->dst->cpu_dp->index != priv->hw_info->cpu_port) {
+		dev_err(dev, "wrong CPU port defined, HW only supports port: %i\n",
+			priv->hw_info->cpu_port);
+		err = -EINVAL;
+		goto mdio_bus;
+	}
+
+	platform_set_drvdata(pdev, priv);
+
+	dev_info(dev, "probed GSWIP version %lx mod %lx\n",
+		 (version & GSWIP_VERSION_REV_MASK) >> GSWIP_VERSION_REV_SHIFT,
+		 (version & GSWIP_VERSION_MOD_MASK) >> GSWIP_VERSION_MOD_SHIFT);
+	return 0;
+
+mdio_bus:
+	if (mdio_np)
+		mdiobus_unregister(priv->ds->slave_mii_bus);
+gphy_fw:
+	for (i = 0; i < priv->num_gphy_fw; i++)
+		gswip_gphy_fw_remove(priv, &priv->gphy_fw[i]);
+	return err;
+}
+
+static int gswip_remove(struct platform_device *pdev)
+{
+	struct gswip_priv *priv = platform_get_drvdata(pdev);
+	int i;
+
+	if (!priv)
+		return 0;
+
+	/* disable the switch */
+	gswip_mdio_mask(priv, GSWIP_MDIO_GLOB_ENABLE, 0, GSWIP_MDIO_GLOB);
+
+	dsa_unregister_switch(priv->ds);
+
+	if (priv->ds->slave_mii_bus)
+		mdiobus_unregister(priv->ds->slave_mii_bus);
+
+	for (i = 0; i < priv->num_gphy_fw; i++)
+		gswip_gphy_fw_remove(priv, &priv->gphy_fw[i]);
+
+	return 0;
+}
+
+static const struct gswip_hw_info gswip_xrx200 = {
+	.max_ports = 7,
+	.cpu_port = 6,
+};
+
+static const struct of_device_id gswip_of_match[] = {
+	{ .compatible = "lantiq,xrx200-gswip", .data = &gswip_xrx200 },
+	{},
+};
+MODULE_DEVICE_TABLE(of, gswip_of_match);
+
+static struct platform_driver gswip_driver = {
+	.probe = gswip_probe,
+	.remove = gswip_remove,
+	.driver = {
+		.name = "gswip",
+		.of_match_table = gswip_of_match,
+	},
+};
+
+module_platform_driver(gswip_driver);
+
+MODULE_AUTHOR("Hauke Mehrtens <hauke@hauke-m.de>");
+MODULE_DESCRIPTION("Lantiq / Intel GSWIP driver");
+MODULE_LICENSE("GPL v2");
diff --git a/drivers/net/dsa/lantiq_pce.h b/drivers/net/dsa/lantiq_pce.h
new file mode 100644
index 000000000000..180663138e75
--- /dev/null
+++ b/drivers/net/dsa/lantiq_pce.h
@@ -0,0 +1,153 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * PCE microcode extracted from UGW 7.1.1 switch api
+ *
+ * Copyright (c) 2012, 2014, 2015 Lantiq Deutschland GmbH
+ * Copyright (C) 2012 John Crispin <john@phrozen.org>
+ * Copyright (C) 2017 - 2018 Hauke Mehrtens <hauke@hauke-m.de>
+ */
+
+enum {
+	OUT_MAC0 = 0,
+	OUT_MAC1,
+	OUT_MAC2,
+	OUT_MAC3,
+	OUT_MAC4,
+	OUT_MAC5,
+	OUT_ETHTYP,
+	OUT_VTAG0,
+	OUT_VTAG1,
+	OUT_ITAG0,
+	OUT_ITAG1,	/*10 */
+	OUT_ITAG2,
+	OUT_ITAG3,
+	OUT_IP0,
+	OUT_IP1,
+	OUT_IP2,
+	OUT_IP3,
+	OUT_SIP0,
+	OUT_SIP1,
+	OUT_SIP2,
+	OUT_SIP3,	/*20*/
+	OUT_SIP4,
+	OUT_SIP5,
+	OUT_SIP6,
+	OUT_SIP7,
+	OUT_DIP0,
+	OUT_DIP1,
+	OUT_DIP2,
+	OUT_DIP3,
+	OUT_DIP4,
+	OUT_DIP5,	/*30*/
+	OUT_DIP6,
+	OUT_DIP7,
+	OUT_SESID,
+	OUT_PROT,
+	OUT_APP0,
+	OUT_APP1,
+	OUT_IGMP0,
+	OUT_IGMP1,
+	OUT_IPOFF,	/*39*/
+	OUT_NONE = 63,
+};
+
+/* parser's microcode length type */
+#define INSTR		0
+#define IPV6		1
+#define LENACCU		2
+
+/* parser's microcode flag type */
+enum {
+	FLAG_ITAG = 0,
+	FLAG_VLAN,
+	FLAG_SNAP,
+	FLAG_PPPOE,
+	FLAG_IPV6,
+	FLAG_IPV6FL,
+	FLAG_IPV4,
+	FLAG_IGMP,
+	FLAG_TU,
+	FLAG_HOP,
+	FLAG_NN1,	/*10 */
+	FLAG_NN2,
+	FLAG_END,
+	FLAG_NO,	/*13*/
+};
+
+struct gswip_pce_microcode {
+	u16 val_3;
+	u16 val_2;
+	u16 val_1;
+	u16 val_0;
+};
+
+#define MC_ENTRY(val, msk, ns, out, len, type, flags, ipv4_len) \
+	{ val, msk, ((ns) << 10 | (out) << 4 | (len) >> 1),\
+		((len) & 1) << 15 | (type) << 13 | (flags) << 9 | (ipv4_len) << 8 }
+static const struct gswip_pce_microcode gswip_pce_microcode[] = {
+	/*      value    mask    ns  fields      L  type     flags       ipv4_len */
+	MC_ENTRY(0x88c3, 0xFFFF,  1, OUT_ITAG0,  4, INSTR,   FLAG_ITAG,  0),
+	MC_ENTRY(0x8100, 0xFFFF,  2, OUT_VTAG0,  2, INSTR,   FLAG_VLAN,  0),
+	MC_ENTRY(0x88A8, 0xFFFF,  1, OUT_VTAG0,  2, INSTR,   FLAG_VLAN,  0),
+	MC_ENTRY(0x8100, 0xFFFF,  1, OUT_VTAG0,  2, INSTR,   FLAG_VLAN,  0),
+	MC_ENTRY(0x8864, 0xFFFF, 17, OUT_ETHTYP, 1, INSTR,   FLAG_NO,    0),
+	MC_ENTRY(0x0800, 0xFFFF, 21, OUT_ETHTYP, 1, INSTR,   FLAG_NO,    0),
+	MC_ENTRY(0x86DD, 0xFFFF, 22, OUT_ETHTYP, 1, INSTR,   FLAG_NO,    0),
+	MC_ENTRY(0x8863, 0xFFFF, 16, OUT_ETHTYP, 1, INSTR,   FLAG_NO,    0),
+	MC_ENTRY(0x0000, 0xF800, 10, OUT_NONE,   0, INSTR,   FLAG_NO,    0),
+	MC_ENTRY(0x0000, 0x0000, 40, OUT_ETHTYP, 1, INSTR,   FLAG_NO,    0),
+	MC_ENTRY(0x0600, 0x0600, 40, OUT_ETHTYP, 1, INSTR,   FLAG_NO,    0),
+	MC_ENTRY(0x0000, 0x0000, 12, OUT_NONE,   1, INSTR,   FLAG_NO,    0),
+	MC_ENTRY(0xAAAA, 0xFFFF, 14, OUT_NONE,   1, INSTR,   FLAG_NO,    0),
+	MC_ENTRY(0x0000, 0x0000, 41, OUT_NONE,   0, INSTR,   FLAG_NO,    0),
+	MC_ENTRY(0x0300, 0xFF00, 41, OUT_NONE,   0, INSTR,   FLAG_SNAP,  0),
+	MC_ENTRY(0x0000, 0x0000, 41, OUT_NONE,   0, INSTR,   FLAG_NO,    0),
+	MC_ENTRY(0x0000, 0x0000, 41, OUT_DIP7,   3, INSTR,   FLAG_NO,    0),
+	MC_ENTRY(0x0000, 0x0000, 18, OUT_DIP7,   3, INSTR,   FLAG_PPPOE, 0),
+	MC_ENTRY(0x0021, 0xFFFF, 21, OUT_NONE,   1, INSTR,   FLAG_NO,    0),
+	MC_ENTRY(0x0057, 0xFFFF, 22, OUT_NONE,   1, INSTR,   FLAG_NO,    0),
+	MC_ENTRY(0x0000, 0x0000, 40, OUT_NONE,   0, INSTR,   FLAG_NO,    0),
+	MC_ENTRY(0x4000, 0xF000, 24, OUT_IP0,    4, INSTR,   FLAG_IPV4,  1),
+	MC_ENTRY(0x6000, 0xF000, 27, OUT_IP0,    3, INSTR,   FLAG_IPV6,  0),
+	MC_ENTRY(0x0000, 0x0000, 41, OUT_NONE,   0, INSTR,   FLAG_NO,    0),
+	MC_ENTRY(0x0000, 0x0000, 25, OUT_IP3,    2, INSTR,   FLAG_NO,    0),
+	MC_ENTRY(0x0000, 0x0000, 26, OUT_SIP0,   4, INSTR,   FLAG_NO,    0),
+	MC_ENTRY(0x0000, 0x0000, 40, OUT_NONE,   0, LENACCU, FLAG_NO,    0),
+	MC_ENTRY(0x1100, 0xFF00, 39, OUT_PROT,   1, INSTR,   FLAG_NO,    0),
+	MC_ENTRY(0x0600, 0xFF00, 39, OUT_PROT,   1, INSTR,   FLAG_NO,    0),
+	MC_ENTRY(0x0000, 0xFF00, 33, OUT_IP3,   17, INSTR,   FLAG_HOP,   0),
+	MC_ENTRY(0x2B00, 0xFF00, 33, OUT_IP3,   17, INSTR,   FLAG_NN1,   0),
+	MC_ENTRY(0x3C00, 0xFF00, 33, OUT_IP3,   17, INSTR,   FLAG_NN2,   0),
+	MC_ENTRY(0x0000, 0x0000, 39, OUT_PROT,   1, INSTR,   FLAG_NO,    0),
+	MC_ENTRY(0x0000, 0x00E0, 35, OUT_NONE,   0, INSTR,   FLAG_NO,    0),
+	MC_ENTRY(0x0000, 0x0000, 40, OUT_NONE,   0, INSTR,   FLAG_NO,    0),
+	MC_ENTRY(0x0000, 0xFF00, 33, OUT_NONE,   0, IPV6,    FLAG_HOP,   0),
+	MC_ENTRY(0x2B00, 0xFF00, 33, OUT_NONE,   0, IPV6,    FLAG_NN1,   0),
+	MC_ENTRY(0x3C00, 0xFF00, 33, OUT_NONE,   0, IPV6,    FLAG_NN2,   0),
+	MC_ENTRY(0x0000, 0x0000, 40, OUT_PROT,   1, IPV6,    FLAG_NO,    0),
+	MC_ENTRY(0x0000, 0x0000, 40, OUT_SIP0,  16, INSTR,   FLAG_NO,    0),
+	MC_ENTRY(0x0000, 0x0000, 41, OUT_APP0,   4, INSTR,   FLAG_IGMP,  0),
+	MC_ENTRY(0x0000, 0x0000, 41, OUT_NONE,   0, INSTR,   FLAG_END,   0),
+	MC_ENTRY(0x0000, 0x0000, 41, OUT_NONE,   0, INSTR,   FLAG_END,   0),
+	MC_ENTRY(0x0000, 0x0000, 41, OUT_NONE,   0, INSTR,   FLAG_END,   0),
+	MC_ENTRY(0x0000, 0x0000, 41, OUT_NONE,   0, INSTR,   FLAG_END,   0),
+	MC_ENTRY(0x0000, 0x0000, 41, OUT_NONE,   0, INSTR,   FLAG_END,   0),
+	MC_ENTRY(0x0000, 0x0000, 41, OUT_NONE,   0, INSTR,   FLAG_END,   0),
+	MC_ENTRY(0x0000, 0x0000, 41, OUT_NONE,   0, INSTR,   FLAG_END,   0),
+	MC_ENTRY(0x0000, 0x0000, 41, OUT_NONE,   0, INSTR,   FLAG_END,   0),
+	MC_ENTRY(0x0000, 0x0000, 41, OUT_NONE,   0, INSTR,   FLAG_END,   0),
+	MC_ENTRY(0x0000, 0x0000, 41, OUT_NONE,   0, INSTR,   FLAG_END,   0),
+	MC_ENTRY(0x0000, 0x0000, 41, OUT_NONE,   0, INSTR,   FLAG_END,   0),
+	MC_ENTRY(0x0000, 0x0000, 41, OUT_NONE,   0, INSTR,   FLAG_END,   0),
+	MC_ENTRY(0x0000, 0x0000, 41, OUT_NONE,   0, INSTR,   FLAG_END,   0),
+	MC_ENTRY(0x0000, 0x0000, 41, OUT_NONE,   0, INSTR,   FLAG_END,   0),
+	MC_ENTRY(0x0000, 0x0000, 41, OUT_NONE,   0, INSTR,   FLAG_END,   0),
+	MC_ENTRY(0x0000, 0x0000, 41, OUT_NONE,   0, INSTR,   FLAG_END,   0),
+	MC_ENTRY(0x0000, 0x0000, 41, OUT_NONE,   0, INSTR,   FLAG_END,   0),
+	MC_ENTRY(0x0000, 0x0000, 41, OUT_NONE,   0, INSTR,   FLAG_END,   0),
+	MC_ENTRY(0x0000, 0x0000, 41, OUT_NONE,   0, INSTR,   FLAG_END,   0),
+	MC_ENTRY(0x0000, 0x0000, 41, OUT_NONE,   0, INSTR,   FLAG_END,   0),
+	MC_ENTRY(0x0000, 0x0000, 41, OUT_NONE,   0, INSTR,   FLAG_END,   0),
+	MC_ENTRY(0x0000, 0x0000, 41, OUT_NONE,   0, INSTR,   FLAG_END,   0),
+	MC_ENTRY(0x0000, 0x0000, 41, OUT_NONE,   0, INSTR,   FLAG_END,   0),
+};
-- 
2.11.0
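
Reviewer note on gswip_gphy_fw_load(): the over-allocate-then-round-up
step can be sketched in plain C as below. This is a user-space
illustration only, not driver code. FW_ALIGN stands in for
XRX200_GPHY_FW_ALIGN (assumed to be 16 kB, as the comment in the patch
says), and align_up() / fw_alloc_size() are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for XRX200_GPHY_FW_ALIGN (16 kB boundary). */
#define FW_ALIGN ((uintptr_t)16 * 1024)

/* Round addr up to the next multiple of align (a power of two). */
static uintptr_t align_up(uintptr_t addr, uintptr_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (addr + align - 1) & ~(align - 1);
}

/*
 * Size to request so that an aligned window of fw_size bytes fits
 * wherever the allocator places the buffer: rounding up skips at
 * most FW_ALIGN - 1 bytes.
 */
static size_t fw_alloc_size(size_t fw_size)
{
	return fw_size + FW_ALIGN;
}
```

The patch applies the same rounding to the CPU pointer (PTR_ALIGN) and
to the DMA address (ALIGN), which appears to rely on both having the
same offset within a 16 kB window.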

* [PATCH v3 net-next 5/6] dt-bindings: net: dsa: Add lantiq,xrx200-gswip DT bindings
From: Hauke Mehrtens @ 2018-09-09 20:20 UTC (permalink / raw)
  To: davem
  Cc: netdev, andrew, vivien.didelot, f.fainelli, john, linux-mips, dev,
	hauke.mehrtens, devicetree
In-Reply-To: <20180909201647.32727-1-hauke@hauke-m.d>

This adds the binding for the GSWIP (Gigabit switch) core found in the
xrx200 / VR9 Lantiq / Intel SoC.

This part takes care of the switch, MDIO bus, and loading the FW into
the embedded GPHYs.

Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de>
Cc: devicetree@vger.kernel.org
---
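Not part of the binding itself: a minimal sketch of how the three reg
ranges in the example split into (base, size) pairs, in the order the
switch driver in this series requests them (index 0 switch, 1 MDIO,
2 MII). It assumes the parent bus uses #address-cells = <1> and
#size-cells = <1>. struct reg_range and reg_range_get() are
hypothetical names for illustration, not kernel API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct reg_range {
	uint32_t base;
	uint32_t size;
};

/* Cells copied from the reg property of the example node. */
static const uint32_t gswip_reg_cells[] = {
	0xE108000, 0x3000,	/* index 0: switch */
	0xE10B100, 0x70,	/* index 1: mdio */
	0xE10B1D8, 0x30,	/* index 2: mii */
};

/*
 * Fetch range number index from a flat cell array holding one address
 * cell and one size cell per entry. Returns 0 on success, -1 if the
 * array is too short for that index.
 */
static int reg_range_get(const uint32_t *cells, size_t ncells,
			 size_t index, struct reg_range *out)
{
	assert(cells != NULL && out != NULL);
	if ((index + 1) * 2 > ncells)
		return -1;
	out->base = cells[index * 2];
	out->size = cells[index * 2 + 1];
	return 0;
}
```

Swapping two ranges in the device tree would therefore hand the wrong
register block to the switch, MDIO, or MII accessors.
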
 .../devicetree/bindings/net/dsa/lantiq-gswip.txt   | 141 +++++++++++++++++++++
 1 file changed, 141 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/net/dsa/lantiq-gswip.txt

diff --git a/Documentation/devicetree/bindings/net/dsa/lantiq-gswip.txt b/Documentation/devicetree/bindings/net/dsa/lantiq-gswip.txt
new file mode 100644
index 000000000000..a089f5856778
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/dsa/lantiq-gswip.txt
@@ -0,0 +1,141 @@
+Lantiq GSWIP Ethernet switches
+==============================
+
+Required properties for GSWIP core:
+
+- compatible	: "lantiq,xrx200-gswip" for the embedded GSWIP in the
+		  xRX200 SoC
+- reg		: memory range of the GSWIP core registers
+		: memory range of the GSWIP MDIO registers
+		: memory range of the GSWIP MII registers
+
+See Documentation/devicetree/bindings/net/dsa/dsa.txt for a list of
+additional required and optional properties.
+
+
+Required properties for MDIO bus:
+- compatible	: "lantiq,xrx200-mdio" for the MDIO bus inside the GSWIP
+		  core of the xRX200 SoC and the PHYs connected to it.
+
+See Documentation/devicetree/bindings/net/mdio.txt for a list of additional
+required and optional properties.
+
+
+Required properties for GPHY firmware loading:
+- compatible	: "lantiq,gphy-fw" together with one of
+		  "lantiq,xrx200-gphy-fw", "lantiq,xrx200a1x-gphy-fw",
+		  "lantiq,xrx200a2x-gphy-fw", "lantiq,xrx300-gphy-fw", or
+		  "lantiq,xrx330-gphy-fw" for loading the firmware into
+		  the embedded GPHY core of the SoC.
+- lantiq,rcu	: reference to the rcu syscon
+
+The GPHY firmware loader has a list of GPHY entries, one for each
+embedded GPHY, with the following properties:
+
+- reg		: Offset of the GPHY firmware register in the RCU
+		  register range
+- resets	: list of resets of the embedded GPHY
+- reset-names	: list of names of the resets
+
+Example:
+
+Ethernet switch on the VRX200 SoC:
+
+gswip: gswip@E108000 {
+	#address-cells = <1>;
+	#size-cells = <0>;
+	compatible = "lantiq,xrx200-gswip";
+	reg = <	0xE108000 0x3000 /* switch */
+		0xE10B100 0x70 /* mdio */
+		0xE10B1D8 0x30 /* mii */
+		>;
+	dsa,member = <0 0>;
+
+	ports {
+		#address-cells = <1>;
+		#size-cells = <0>;
+
+		port@0 {
+			reg = <0>;
+			label = "lan3";
+			phy-mode = "rgmii";
+			phy-handle = <&phy0>;
+		};
+
+		port@1 {
+			reg = <1>;
+			label = "lan4";
+			phy-mode = "rgmii";
+			phy-handle = <&phy1>;
+		};
+
+		port@2 {
+			reg = <2>;
+			label = "lan2";
+			phy-mode = "internal";
+			phy-handle = <&phy11>;
+		};
+
+		port@4 {
+			reg = <4>;
+			label = "lan1";
+			phy-mode = "internal";
+			phy-handle = <&phy13>;
+		};
+
+		port@5 {
+			reg = <5>;
+			label = "wan";
+			phy-mode = "rgmii";
+			phy-handle = <&phy5>;
+		};
+
+		port@6 {
+			reg = <0x6>;
+			label = "cpu";
+			ethernet = <&eth0>;
+		};
+	};
+
+	mdio@0 {
+		#address-cells = <1>;
+		#size-cells = <0>;
+		compatible = "lantiq,xrx200-mdio";
+		reg = <0>;
+
+		phy0: ethernet-phy@0 {
+			reg = <0x0>;
+		};
+		phy1: ethernet-phy@1 {
+			reg = <0x1>;
+		};
+		phy5: ethernet-phy@5 {
+			reg = <0x5>;
+		};
+		phy11: ethernet-phy@11 {
+			reg = <0x11>;
+		};
+		phy13: ethernet-phy@13 {
+			reg = <0x13>;
+		};
+	};
+
+	gphy-fw {
+		compatible = "lantiq,xrx200-gphy-fw", "lantiq,gphy-fw";
+		lantiq,rcu = <&rcu0>;
+
+		gphy@20 {
+			reg = <0x20>;
+
+			resets = <&reset0 31 30>;
+			reset-names = "gphy";
+		};
+
+		gphy@68 {
+			reg = <0x68>;
+
+			resets = <&reset0 29 28>;
+			reset-names = "gphy";
+		};
+	};
+};
-- 
2.11.0

* [PATCH v3 net-next 4/6] net: lantiq: Add Lantiq / Intel VRX200 Ethernet driver
From: Hauke Mehrtens @ 2018-09-09 20:16 UTC (permalink / raw)
  To: davem
  Cc: netdev, andrew, vivien.didelot, f.fainelli, john, linux-mips, dev,
	hauke.mehrtens, devicetree, Hauke Mehrtens
In-Reply-To: <20180909201647.32727-1-hauke@hauke-m.de>

This drives the PMAC between the GSWIP Switch and the CPU in the VRX200
SoC. This is currently only the very basic version of the Ethernet
driver.

When the DMA channel is activated we receive some packets which were
sent to the SoC while it was still in U-Boot; these packets have the
wrong header. Resetting the IP cores did not work, so we read out the
extra packets at the beginning and discard them.

This also adapts the clock code in sysctrl.c to use the default name of
the device node so that the driver gets the correct clock. sysctrl.c
should be replaced with a proper common clock driver later.

Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de>
---
 MAINTAINERS                          |   1 +
 arch/mips/lantiq/xway/sysctrl.c      |   6 +-
 drivers/net/ethernet/Kconfig         |   7 +
 drivers/net/ethernet/Makefile        |   1 +
 drivers/net/ethernet/lantiq_xrx200.c | 564 +++++++++++++++++++++++++++++++++++
 5 files changed, 576 insertions(+), 3 deletions(-)
 create mode 100644 drivers/net/ethernet/lantiq_xrx200.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 4b2ee65f6086..ffff912d31b5 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8171,6 +8171,7 @@ M:	Hauke Mehrtens <hauke@hauke-m.de>
 L:	netdev@vger.kernel.org
 S:	Maintained
 F:	net/dsa/tag_gswip.c
+F:	drivers/net/ethernet/lantiq_xrx200.c
 
 LANTIQ MIPS ARCHITECTURE
 M:	John Crispin <john@phrozen.org>
diff --git a/arch/mips/lantiq/xway/sysctrl.c b/arch/mips/lantiq/xway/sysctrl.c
index e0af39b33e28..eeb89a37e27e 100644
--- a/arch/mips/lantiq/xway/sysctrl.c
+++ b/arch/mips/lantiq/xway/sysctrl.c
@@ -505,7 +505,7 @@ void __init ltq_soc_init(void)
 		clkdev_add_pmu("1a800000.pcie", "msi", 1, 1, PMU1_PCIE2_MSI);
 		clkdev_add_pmu("1a800000.pcie", "pdi", 1, 1, PMU1_PCIE2_PDI);
 		clkdev_add_pmu("1a800000.pcie", "ctl", 1, 1, PMU1_PCIE2_CTL);
-		clkdev_add_pmu("1e108000.eth", NULL, 0, 0, PMU_SWITCH | PMU_PPE_DP);
+		clkdev_add_pmu("1e10b308.eth", NULL, 0, 0, PMU_SWITCH | PMU_PPE_DP);
 		clkdev_add_pmu("1da00000.usif", "NULL", 1, 0, PMU_USIF);
 		clkdev_add_pmu("1e103100.deu", NULL, 1, 0, PMU_DEU);
 	} else if (of_machine_is_compatible("lantiq,ar10")) {
@@ -513,7 +513,7 @@ void __init ltq_soc_init(void)
 				  ltq_ar10_fpi_hz(), ltq_ar10_pp32_hz());
 		clkdev_add_pmu("1e101000.usb", "otg", 1, 0, PMU_USB0);
 		clkdev_add_pmu("1e106000.usb", "otg", 1, 0, PMU_USB1);
-		clkdev_add_pmu("1e108000.eth", NULL, 0, 0, PMU_SWITCH |
+		clkdev_add_pmu("1e10b308.eth", NULL, 0, 0, PMU_SWITCH |
 			       PMU_PPE_DP | PMU_PPE_TC);
 		clkdev_add_pmu("1da00000.usif", "NULL", 1, 0, PMU_USIF);
 		clkdev_add_pmu("1f203020.gphy", NULL, 1, 0, PMU_GPHY);
@@ -536,7 +536,7 @@ void __init ltq_soc_init(void)
 		clkdev_add_pmu(NULL, "ahb", 1, 0, PMU_AHBM | PMU_AHBS);
 
 		clkdev_add_pmu("1da00000.usif", "NULL", 1, 0, PMU_USIF);
-		clkdev_add_pmu("1e108000.eth", NULL, 0, 0,
+		clkdev_add_pmu("1e10b308.eth", NULL, 0, 0,
 				PMU_SWITCH | PMU_PPE_DPLUS | PMU_PPE_DPLUM |
 				PMU_PPE_EMA | PMU_PPE_TC | PMU_PPE_SLL01 |
 				PMU_PPE_QSB | PMU_PPE_TOP);
diff --git a/drivers/net/ethernet/Kconfig b/drivers/net/ethernet/Kconfig
index 6fde68aa13a4..885e00d17807 100644
--- a/drivers/net/ethernet/Kconfig
+++ b/drivers/net/ethernet/Kconfig
@@ -108,6 +108,13 @@ config LANTIQ_ETOP
 	---help---
 	  Support for the MII0 inside the Lantiq SoC
 
+config LANTIQ_XRX200
+	tristate "Lantiq / Intel xRX200 PMAC network driver"
+	depends on SOC_TYPE_XWAY
+	---help---
+	  Support for the PMAC of the Gigabit switch (GSWIP) inside the
+	  Lantiq / Intel VRX200 VDSL SoC
+
 source "drivers/net/ethernet/marvell/Kconfig"
 source "drivers/net/ethernet/mediatek/Kconfig"
 source "drivers/net/ethernet/mellanox/Kconfig"
diff --git a/drivers/net/ethernet/Makefile b/drivers/net/ethernet/Makefile
index b45d5f626b59..7b5bf9682066 100644
--- a/drivers/net/ethernet/Makefile
+++ b/drivers/net/ethernet/Makefile
@@ -49,6 +49,7 @@ obj-$(CONFIG_NET_VENDOR_XSCALE) += xscale/
 obj-$(CONFIG_JME) += jme.o
 obj-$(CONFIG_KORINA) += korina.o
 obj-$(CONFIG_LANTIQ_ETOP) += lantiq_etop.o
+obj-$(CONFIG_LANTIQ_XRX200) += lantiq_xrx200.o
 obj-$(CONFIG_NET_VENDOR_MARVELL) += marvell/
 obj-$(CONFIG_NET_VENDOR_MEDIATEK) += mediatek/
 obj-$(CONFIG_NET_VENDOR_MELLANOX) += mellanox/
diff --git a/drivers/net/ethernet/lantiq_xrx200.c b/drivers/net/ethernet/lantiq_xrx200.c
new file mode 100644
index 000000000000..c8b6d908f0cc
--- /dev/null
+++ b/drivers/net/ethernet/lantiq_xrx200.c
@@ -0,0 +1,564 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Lantiq / Intel PMAC driver for XRX200 SoCs
+ *
+ * Copyright (C) 2010 Lantiq Deutschland
+ * Copyright (C) 2012 John Crispin <john@phrozen.org>
+ * Copyright (C) 2017 - 2018 Hauke Mehrtens <hauke@hauke-m.de>
+ */
+
+#include <linux/etherdevice.h>
+#include <linux/module.h>
+#include <linux/platform_device.h>
+#include <linux/interrupt.h>
+#include <linux/clk.h>
+#include <linux/delay.h>
+
+#include <linux/of_net.h>
+#include <linux/of_platform.h>
+
+#include <xway_dma.h>
+
+/* DMA */
+#define XRX200_DMA_DATA_LEN	0x600
+#define XRX200_DMA_RX		0
+#define XRX200_DMA_TX		1
+
+/* cpu port mac */
+#define PMAC_RX_IPG		0x0024
+#define PMAC_RX_IPG_MASK	0xf
+
+#define PMAC_HD_CTL		0x0000
+/* Add Ethernet header to packets from DMA to PMAC */
+#define PMAC_HD_CTL_ADD		BIT(0)
+/* Add VLAN tag to Packets from DMA to PMAC */
+#define PMAC_HD_CTL_TAG		BIT(1)
+/* Add CRC to packets from DMA to PMAC */
+#define PMAC_HD_CTL_AC		BIT(2)
+/* Add status header to packets from PMAC to DMA */
+#define PMAC_HD_CTL_AS		BIT(3)
+/* Remove CRC from packets from PMAC to DMA */
+#define PMAC_HD_CTL_RC		BIT(4)
+/* Remove Layer-2 header from packets from PMAC to DMA */
+#define PMAC_HD_CTL_RL2		BIT(5)
+/* Status header is present from DMA to PMAC */
+#define PMAC_HD_CTL_RXSH	BIT(6)
+/* Add special tag from PMAC to switch */
+#define PMAC_HD_CTL_AST		BIT(7)
+/* Remove special tag from PMAC to DMA */
+#define PMAC_HD_CTL_RST		BIT(8)
+/* Check CRC from DMA to PMAC */
+#define PMAC_HD_CTL_CCRC	BIT(9)
+/* Enable reaction to Pause frames in the PMAC */
+#define PMAC_HD_CTL_FC		BIT(10)
+
+struct xrx200_chan {
+	int tx_free;
+
+	struct napi_struct napi;
+	struct ltq_dma_channel dma;
+	struct sk_buff *skb[LTQ_DESC_NUM];
+
+	struct xrx200_priv *priv;
+};
+
+struct xrx200_priv {
+	struct clk *clk;
+
+	struct xrx200_chan chan_tx;
+	struct xrx200_chan chan_rx;
+
+	struct net_device *net_dev;
+	struct device *dev;
+
+	__iomem void *pmac_reg;
+};
+
+static u32 xrx200_pmac_r32(struct xrx200_priv *priv, u32 offset)
+{
+	return __raw_readl(priv->pmac_reg + offset);
+}
+
+static void xrx200_pmac_w32(struct xrx200_priv *priv, u32 val, u32 offset)
+{
+	__raw_writel(val, priv->pmac_reg + offset);
+}
+
+static void xrx200_pmac_mask(struct xrx200_priv *priv, u32 clear, u32 set,
+			     u32 offset)
+{
+	u32 val = xrx200_pmac_r32(priv, offset);
+
+	val &= ~(clear);
+	val |= set;
+	xrx200_pmac_w32(priv, val, offset);
+}
+
+/* drop all the packets from the DMA ring */
+static void xrx200_flush_dma(struct xrx200_chan *ch)
+{
+	int i;
+
+	for (i = 0; i < LTQ_DESC_NUM; i++) {
+		struct ltq_dma_desc *desc = &ch->dma.desc_base[ch->dma.desc];
+
+		if ((desc->ctl & (LTQ_DMA_OWN | LTQ_DMA_C)) != LTQ_DMA_C)
+			break;
+
+		desc->ctl = LTQ_DMA_OWN | LTQ_DMA_RX_OFFSET(NET_IP_ALIGN) |
+			    XRX200_DMA_DATA_LEN;
+		ch->dma.desc++;
+		ch->dma.desc %= LTQ_DESC_NUM;
+	}
+}
+
+static int xrx200_open(struct net_device *net_dev)
+{
+	struct xrx200_priv *priv = netdev_priv(net_dev);
+	int err;
+
+	/* enable clock gate */
+	err = clk_prepare_enable(priv->clk);
+	if (err)
+		return err;
+
+	napi_enable(&priv->chan_tx.napi);
+	ltq_dma_open(&priv->chan_tx.dma);
+	ltq_dma_enable_irq(&priv->chan_tx.dma);
+
+	napi_enable(&priv->chan_rx.napi);
+	ltq_dma_open(&priv->chan_rx.dma);
+	/* The boot loader does not always deactivate the receiving of frames
+	 * on the ports and then some packets queue up in the PPE buffers.
+	 * They already passed the PMAC so they do not have the tags
+	 * configured here. Read these packets here and drop them.
+	 * The HW should have written them into memory after 10us.
+	 */
+	usleep_range(20, 40);
+	xrx200_flush_dma(&priv->chan_rx);
+	ltq_dma_enable_irq(&priv->chan_rx.dma);
+
+	netif_wake_queue(net_dev);
+
+	return 0;
+}
+
+static int xrx200_close(struct net_device *net_dev)
+{
+	struct xrx200_priv *priv = netdev_priv(net_dev);
+
+	netif_stop_queue(net_dev);
+
+	napi_disable(&priv->chan_rx.napi);
+	ltq_dma_close(&priv->chan_rx.dma);
+
+	napi_disable(&priv->chan_tx.napi);
+	ltq_dma_close(&priv->chan_tx.dma);
+
+	clk_disable_unprepare(priv->clk);
+
+	return 0;
+}
+
+static int xrx200_alloc_skb(struct xrx200_chan *ch)
+{
+	int ret = 0;
+
+	ch->skb[ch->dma.desc] = netdev_alloc_skb_ip_align(ch->priv->net_dev,
+							  XRX200_DMA_DATA_LEN);
+	if (!ch->skb[ch->dma.desc]) {
+		ret = -ENOMEM;
+		goto skip;
+	}
+
+	ch->dma.desc_base[ch->dma.desc].addr = dma_map_single(ch->priv->dev,
+			ch->skb[ch->dma.desc]->data, XRX200_DMA_DATA_LEN,
+			DMA_FROM_DEVICE);
+	if (unlikely(dma_mapping_error(ch->priv->dev,
+				       ch->dma.desc_base[ch->dma.desc].addr))) {
+		dev_kfree_skb_any(ch->skb[ch->dma.desc]);
+		ret = -ENOMEM;
+		goto skip;
+	}
+
+skip:
+	ch->dma.desc_base[ch->dma.desc].ctl =
+		LTQ_DMA_OWN | LTQ_DMA_RX_OFFSET(NET_IP_ALIGN) |
+		XRX200_DMA_DATA_LEN;
+
+	return ret;
+}
+
+static int xrx200_hw_receive(struct xrx200_chan *ch)
+{
+	struct xrx200_priv *priv = ch->priv;
+	struct ltq_dma_desc *desc = &ch->dma.desc_base[ch->dma.desc];
+	struct sk_buff *skb = ch->skb[ch->dma.desc];
+	int len = (desc->ctl & LTQ_DMA_SIZE_MASK);
+	struct net_device *net_dev = priv->net_dev;
+	int ret;
+
+	ret = xrx200_alloc_skb(ch);
+
+	ch->dma.desc++;
+	ch->dma.desc %= LTQ_DESC_NUM;
+
+	if (ret) {
+		netdev_err(net_dev, "failed to allocate new rx buffer\n");
+		return ret;
+	}
+
+	skb_put(skb, len);
+	skb->protocol = eth_type_trans(skb, net_dev);
+	netif_receive_skb(skb);
+	net_dev->stats.rx_packets++;
+	net_dev->stats.rx_bytes += len - ETH_FCS_LEN;
+
+	return 0;
+}
+
+static int xrx200_poll_rx(struct napi_struct *napi, int budget)
+{
+	struct xrx200_chan *ch = container_of(napi,
+				struct xrx200_chan, napi);
+	int rx = 0;
+	int ret;
+
+	while (rx < budget) {
+		struct ltq_dma_desc *desc = &ch->dma.desc_base[ch->dma.desc];
+
+		if ((desc->ctl & (LTQ_DMA_OWN | LTQ_DMA_C)) == LTQ_DMA_C) {
+			ret = xrx200_hw_receive(ch);
+			if (ret)
+				break;
+			rx++;
+		} else {
+			break;
+		}
+	}
+
+	if (rx < budget) {
+		napi_complete(&ch->napi);
+		ltq_dma_enable_irq(&ch->dma);
+	}
+
+	return rx;
+}
+
+static int xrx200_tx_housekeeping(struct napi_struct *napi, int budget)
+{
+	struct xrx200_chan *ch = container_of(napi,
+				struct xrx200_chan, napi);
+	struct net_device *net_dev = ch->priv->net_dev;
+	int pkts = 0;
+	int bytes = 0;
+
+	while (pkts < budget) {
+		struct ltq_dma_desc *desc = &ch->dma.desc_base[ch->tx_free];
+
+		if ((desc->ctl & (LTQ_DMA_OWN | LTQ_DMA_C)) == LTQ_DMA_C) {
+			struct sk_buff *skb = ch->skb[ch->tx_free];
+
+			pkts++;
+			bytes += skb->len;
+			ch->skb[ch->tx_free] = NULL;
+			consume_skb(skb);
+			memset(&ch->dma.desc_base[ch->tx_free], 0,
+			       sizeof(struct ltq_dma_desc));
+			ch->tx_free++;
+			ch->tx_free %= LTQ_DESC_NUM;
+		} else {
+			break;
+		}
+	}
+
+	net_dev->stats.tx_packets += pkts;
+	net_dev->stats.tx_bytes += bytes;
+	netdev_completed_queue(ch->priv->net_dev, pkts, bytes);
+
+	if (pkts < budget) {
+		napi_complete(&ch->napi);
+		ltq_dma_enable_irq(&ch->dma);
+	}
+
+	return pkts;
+}
+
+static int xrx200_start_xmit(struct sk_buff *skb, struct net_device *net_dev)
+{
+	struct xrx200_priv *priv = netdev_priv(net_dev);
+	struct xrx200_chan *ch = &priv->chan_tx;
+	struct ltq_dma_desc *desc = &ch->dma.desc_base[ch->dma.desc];
+	u32 byte_offset;
+	dma_addr_t mapping;
+	int len;
+
+	skb->dev = net_dev;
+	if (skb_put_padto(skb, ETH_ZLEN)) {
+		net_dev->stats.tx_dropped++;
+		return NETDEV_TX_OK;
+	}
+
+	len = skb->len;
+
+	if ((desc->ctl & (LTQ_DMA_OWN | LTQ_DMA_C)) || ch->skb[ch->dma.desc]) {
+		netdev_err(net_dev, "tx ring full\n");
+		netif_stop_queue(net_dev);
+		return NETDEV_TX_BUSY;
+	}
+
+	ch->skb[ch->dma.desc] = skb;
+
+	mapping = dma_map_single(priv->dev, skb->data, len, DMA_TO_DEVICE);
+	if (unlikely(dma_mapping_error(priv->dev, mapping)))
+		goto err_drop;
+
+	/* dma needs to start on a 16 byte aligned address */
+	byte_offset = mapping % 16;
+
+	desc->addr = mapping - byte_offset;
+	/* Make sure the address is written before we give it to HW */
+	wmb();
+	desc->ctl = LTQ_DMA_OWN | LTQ_DMA_SOP | LTQ_DMA_EOP |
+		LTQ_DMA_TX_OFFSET(byte_offset) | (len & LTQ_DMA_SIZE_MASK);
+	ch->dma.desc++;
+	ch->dma.desc %= LTQ_DESC_NUM;
+	if (ch->dma.desc == ch->tx_free)
+		netif_stop_queue(net_dev);
+
+	netdev_sent_queue(net_dev, len);
+
+	return NETDEV_TX_OK;
+
+err_drop:
+	dev_kfree_skb(skb);
+	net_dev->stats.tx_dropped++;
+	net_dev->stats.tx_errors++;
+	return NETDEV_TX_OK;
+}
+
+static const struct net_device_ops xrx200_netdev_ops = {
+	.ndo_open		= xrx200_open,
+	.ndo_stop		= xrx200_close,
+	.ndo_start_xmit		= xrx200_start_xmit,
+	.ndo_set_mac_address	= eth_mac_addr,
+	.ndo_validate_addr	= eth_validate_addr,
+	.ndo_change_mtu		= eth_change_mtu,
+};
+
+static irqreturn_t xrx200_dma_irq(int irq, void *ptr)
+{
+	struct xrx200_chan *ch = ptr;
+
+	ltq_dma_disable_irq(&ch->dma);
+	ltq_dma_ack_irq(&ch->dma);
+
+	napi_schedule(&ch->napi);
+
+	return IRQ_HANDLED;
+}
+
+static int xrx200_dma_init(struct xrx200_priv *priv)
+{
+	struct xrx200_chan *ch_rx = &priv->chan_rx;
+	struct xrx200_chan *ch_tx = &priv->chan_tx;
+	int ret = 0;
+	int i;
+
+	ltq_dma_init_port(DMA_PORT_ETOP);
+
+	ch_rx->dma.nr = XRX200_DMA_RX;
+	ch_rx->dma.dev = priv->dev;
+	ch_rx->priv = priv;
+
+	ltq_dma_alloc_rx(&ch_rx->dma);
+	for (ch_rx->dma.desc = 0; ch_rx->dma.desc < LTQ_DESC_NUM;
+	     ch_rx->dma.desc++) {
+		ret = xrx200_alloc_skb(ch_rx);
+		if (ret)
+			goto rx_free;
+	}
+	ch_rx->dma.desc = 0;
+	ret = devm_request_irq(priv->dev, ch_rx->dma.irq, xrx200_dma_irq, 0,
+			       "xrx200_net_rx", &priv->chan_rx);
+	if (ret) {
+		dev_err(priv->dev, "failed to request RX irq %d\n",
+			ch_rx->dma.irq);
+		goto rx_ring_free;
+	}
+
+	ch_tx->dma.nr = XRX200_DMA_TX;
+	ch_tx->dma.dev = priv->dev;
+	ch_tx->priv = priv;
+
+	ltq_dma_alloc_tx(&ch_tx->dma);
+	ret = devm_request_irq(priv->dev, ch_tx->dma.irq, xrx200_dma_irq, 0,
+			       "xrx200_net_tx", &priv->chan_tx);
+	if (ret) {
+		dev_err(priv->dev, "failed to request TX irq %d\n",
+			ch_tx->dma.irq);
+		goto tx_free;
+	}
+
+	return ret;
+
+tx_free:
+	ltq_dma_free(&ch_tx->dma);
+
+rx_ring_free:
+	/* free the allocated RX ring */
+	for (i = 0; i < LTQ_DESC_NUM; i++) {
+		if (priv->chan_rx.skb[i])
+			dev_kfree_skb_any(priv->chan_rx.skb[i]);
+	}
+
+rx_free:
+	ltq_dma_free(&ch_rx->dma);
+	return ret;
+}
+
+static void xrx200_hw_cleanup(struct xrx200_priv *priv)
+{
+	int i;
+
+	ltq_dma_free(&priv->chan_tx.dma);
+	ltq_dma_free(&priv->chan_rx.dma);
+
+	/* free the allocated RX ring */
+	for (i = 0; i < LTQ_DESC_NUM; i++)
+		dev_kfree_skb_any(priv->chan_rx.skb[i]);
+}
+
+static int xrx200_probe(struct platform_device *pdev)
+{
+	struct device *dev = &pdev->dev;
+	struct device_node *np = dev->of_node;
+	struct resource *res;
+	struct xrx200_priv *priv;
+	struct net_device *net_dev;
+	const u8 *mac;
+	int err;
+
+	/* alloc the network device */
+	net_dev = devm_alloc_etherdev(dev, sizeof(struct xrx200_priv));
+	if (!net_dev)
+		return -ENOMEM;
+
+	priv = netdev_priv(net_dev);
+	priv->net_dev = net_dev;
+	priv->dev = dev;
+
+	net_dev->netdev_ops = &xrx200_netdev_ops;
+	SET_NETDEV_DEV(net_dev, dev);
+	net_dev->min_mtu = ETH_ZLEN;
+	net_dev->max_mtu = XRX200_DMA_DATA_LEN;
+
+	/* load the memory ranges */
+	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
+	if (!res) {
+		dev_err(dev, "failed to get resources\n");
+		return -ENOENT;
+	}
+
+	priv->pmac_reg = devm_ioremap_resource(dev, res);
+	if (IS_ERR(priv->pmac_reg)) {
+		dev_err(dev, "failed to request and remap io ranges\n");
+		return PTR_ERR(priv->pmac_reg);
+	}
+
+	priv->chan_rx.dma.irq = platform_get_irq_byname(pdev, "rx");
+	if (priv->chan_rx.dma.irq < 0) {
+		dev_err(dev, "failed to get RX IRQ, %i\n",
+			priv->chan_rx.dma.irq);
+		return -ENOENT;
+	}
+	priv->chan_tx.dma.irq = platform_get_irq_byname(pdev, "tx");
+	if (priv->chan_tx.dma.irq < 0) {
+		dev_err(dev, "failed to get TX IRQ, %i\n",
+			priv->chan_tx.dma.irq);
+		return -ENOENT;
+	}
+
+	/* get the clock */
+	priv->clk = devm_clk_get(dev, NULL);
+	if (IS_ERR(priv->clk)) {
+		dev_err(dev, "failed to get clock\n");
+		return PTR_ERR(priv->clk);
+	}
+
+	mac = of_get_mac_address(np);
+	if (mac && is_valid_ether_addr(mac))
+		ether_addr_copy(net_dev->dev_addr, mac);
+	else
+		eth_hw_addr_random(net_dev);
+
+	/* bring up the dma engine and IP core */
+	err = xrx200_dma_init(priv);
+	if (err)
+		return err;
+
+	/* set IPG to 12 */
+	xrx200_pmac_mask(priv, PMAC_RX_IPG_MASK, 0xb, PMAC_RX_IPG);
+
+	/* enable status header, enable CRC */
+	xrx200_pmac_mask(priv, 0,
+			 PMAC_HD_CTL_RST | PMAC_HD_CTL_AST | PMAC_HD_CTL_RXSH |
+			 PMAC_HD_CTL_AS | PMAC_HD_CTL_AC | PMAC_HD_CTL_RC,
+			 PMAC_HD_CTL);
+
+	/* setup NAPI */
+	netif_napi_add(net_dev, &priv->chan_rx.napi, xrx200_poll_rx, 32);
+	netif_napi_add(net_dev, &priv->chan_tx.napi, xrx200_tx_housekeeping, 32);
+
+	platform_set_drvdata(pdev, priv);
+
+	err = register_netdev(net_dev);
+	if (err)
+		goto err_uninit_dma;
+	return err;
+
+err_uninit_dma:
+	xrx200_hw_cleanup(priv);
+
+	return err;
+}
+
+static int xrx200_remove(struct platform_device *pdev)
+{
+	struct xrx200_priv *priv = platform_get_drvdata(pdev);
+	struct net_device *net_dev = priv->net_dev;
+
+	/* free stack related instances */
+	netif_stop_queue(net_dev);
+	netif_napi_del(&priv->chan_tx.napi);
+	netif_napi_del(&priv->chan_rx.napi);
+
+	/* remove the actual device */
+	unregister_netdev(net_dev);
+
+	/* shut down hardware */
+	xrx200_hw_cleanup(priv);
+
+	return 0;
+}
+
+static const struct of_device_id xrx200_match[] = {
+	{ .compatible = "lantiq,xrx200-net" },
+	{},
+};
+MODULE_DEVICE_TABLE(of, xrx200_match);
+
+static struct platform_driver xrx200_driver = {
+	.probe = xrx200_probe,
+	.remove = xrx200_remove,
+	.driver = {
+		.name = "lantiq,xrx200-net",
+		.of_match_table = xrx200_match,
+	},
+};
+
+module_platform_driver(xrx200_driver);
+
+MODULE_AUTHOR("John Crispin <john@phrozen.org>");
+MODULE_DESCRIPTION("Lantiq SoC XRX200 ethernet");
+MODULE_LICENSE("GPL");
-- 
2.11.0

* [PATCH v3 net-next 3/6] dt-bindings: net: Add lantiq,xrx200-net DT bindings
From: Hauke Mehrtens @ 2018-09-09 20:16 UTC (permalink / raw)
  To: davem
  Cc: netdev, andrew, vivien.didelot, f.fainelli, john, linux-mips, dev,
	hauke.mehrtens, devicetree, Hauke Mehrtens
In-Reply-To: <20180909201647.32727-1-hauke@hauke-m.de>

This adds the binding for the PMAC core between the CPU and the GSWIP
switch found on the xrx200 / VR9 Lantiq / Intel SoC.

Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de>
Cc: devicetree@vger.kernel.org
---
 .../devicetree/bindings/net/lantiq,xrx200-net.txt   | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/net/lantiq,xrx200-net.txt

diff --git a/Documentation/devicetree/bindings/net/lantiq,xrx200-net.txt b/Documentation/devicetree/bindings/net/lantiq,xrx200-net.txt
new file mode 100644
index 000000000000..8a2fe5200cdc
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/lantiq,xrx200-net.txt
@@ -0,0 +1,21 @@
+Lantiq xRX200 GSWIP PMAC Ethernet driver
+==================================
+
+Required properties:
+
+- compatible	: "lantiq,xrx200-net" for the PMAC of the embedded
+		: GSWIP in the xRX200
+- reg		: memory range of the PMAC core inside of the GSWIP core
+- interrupts	: TX and RX DMA interrupts. Use interrupt-names "tx" for
+		: the TX interrupt and "rx" for the RX interrupt.
+
+Example:
+
+eth0: eth@E10B308 {
+	#address-cells = <1>;
+	#size-cells = <0>;
+	compatible = "lantiq,xrx200-net";
+	reg = <0xE10B308 0x30>;
+	interrupts = <73>, <72>;
+	interrupt-names = "tx", "rx";
+};
-- 
2.11.0

* [PATCH v3 net-next 2/6] net: dsa: Add Lantiq / Intel GSWIP tag support
From: Hauke Mehrtens @ 2018-09-09 20:16 UTC (permalink / raw)
  To: davem
  Cc: netdev, andrew, vivien.didelot, f.fainelli, john, linux-mips, dev,
	hauke.mehrtens, devicetree, Hauke Mehrtens
In-Reply-To: <20180909201647.32727-1-hauke@hauke-m.de>

This handles the tag added by the PMAC on the VRX200 SoC line.

The GSWIP internally uses a GSWIP special tag, which is located after the
Ethernet header. The PMAC, which connects the GSWIP to the CPU, converts
this special tag into the PMAC special tag, which is added in front of
the Ethernet header.

This was tested with GSWIP 2.1 found in the VRX200 SoCs; other GSWIP
versions use slightly different PMAC special tags.

Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
---
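
A note for readers, not part of the patch: the tag layout described in
the commit message can be tried out in user space. The sketch below is a
minimal illustration that assumes only the field positions given by the
macros in tag_gswip.c. The shortened macro names and the helpers
build_tx_tag(), rx_source_port() and gswip_tag_demo() are hypothetical
and exist only for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Field positions copied from the macros in tag_gswip.c (names shortened). */
#define TX_HEADER_LEN      4
#define TX_SLPID_CPU       2          /* byte 0: source port ID */
#define TX_DPID_ELAN       0          /* byte 1: destination group ID */
#define TX_PORT_MAP_EN     (1u << 7)  /* byte 2 */
#define TX_PORT_MAP_SEL    (1u << 6)  /* byte 2 */
#define TX_DPID_EN         (1u << 0)  /* byte 3 */
#define TX_PORT_MAP_SHIFT  1
#define TX_PORT_MAP_MASK   0x7eu      /* byte 3, bits 6..1 */

#define RX_HEADER_LEN      8
#define RX_SPPID_SHIFT     4
#define RX_SPPID_MASK      0x70u      /* byte 7, bits 6..4 */

/* Fill the 4-byte TX tag so the frame leaves through one switch port. */
static void build_tx_tag(uint8_t tag[TX_HEADER_LEN], unsigned int port)
{
	assert(port < 6); /* the port map field is only 6 bits wide */

	tag[0] = TX_SLPID_CPU;
	tag[1] = TX_DPID_ELAN;
	tag[2] = TX_PORT_MAP_EN | TX_PORT_MAP_SEL;
	tag[3] = (uint8_t)(((1u << (port + TX_PORT_MAP_SHIFT)) &
			    TX_PORT_MAP_MASK) | TX_DPID_EN);
}

/* Extract the source port from the last byte of the 8-byte RX tag. */
static unsigned int rx_source_port(const uint8_t tag[RX_HEADER_LEN])
{
	return (tag[7] & RX_SPPID_MASK) >> RX_SPPID_SHIFT;
}

/* Exercise both helpers; returns 0 when the values match the layout. */
int gswip_tag_demo(void)
{
	uint8_t tx[TX_HEADER_LEN];
	uint8_t rx[RX_HEADER_LEN] = { 0 };

	build_tx_tag(tx, 3);
	rx[7] = 5u << RX_SPPID_SHIFT; /* pretend the frame arrived on port 5 */

	return (tx[3] == 0x11 && rx_source_port(rx) == 5) ? 0 : -1;
}
```

For port 3 the TX tag bytes come out as 02 00 c0 11: the port map bit for
port 3 lands in bit 4 of byte 3, and bit 0 enables the destination ID.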
 MAINTAINERS         |   6 +++
 include/net/dsa.h   |   1 +
 net/dsa/Kconfig     |   3 ++
 net/dsa/Makefile    |   1 +
 net/dsa/dsa.c       |   3 ++
 net/dsa/dsa_priv.h  |   3 ++
 net/dsa/tag_gswip.c | 109 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 126 insertions(+)
 create mode 100644 net/dsa/tag_gswip.c

diff --git a/MAINTAINERS b/MAINTAINERS
index a5b256b25905..4b2ee65f6086 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8166,6 +8166,12 @@ S:	Maintained
 F:	net/l3mdev
 F:	include/net/l3mdev.h
 
+LANTIQ / INTEL Ethernet drivers
+M:	Hauke Mehrtens <hauke@hauke-m.de>
+L:	netdev@vger.kernel.org
+S:	Maintained
+F:	net/dsa/tag_gswip.c
+
 LANTIQ MIPS ARCHITECTURE
 M:	John Crispin <john@phrozen.org>
 L:	linux-mips@linux-mips.org
diff --git a/include/net/dsa.h b/include/net/dsa.h
index 461e8a7661b7..23690c44e167 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -35,6 +35,7 @@ enum dsa_tag_protocol {
 	DSA_TAG_PROTO_BRCM_PREPEND,
 	DSA_TAG_PROTO_DSA,
 	DSA_TAG_PROTO_EDSA,
+	DSA_TAG_PROTO_GSWIP,
 	DSA_TAG_PROTO_KSZ,
 	DSA_TAG_PROTO_LAN9303,
 	DSA_TAG_PROTO_MTK,
diff --git a/net/dsa/Kconfig b/net/dsa/Kconfig
index 4183e4ba27a5..48c41918fb35 100644
--- a/net/dsa/Kconfig
+++ b/net/dsa/Kconfig
@@ -38,6 +38,9 @@ config NET_DSA_TAG_DSA
 config NET_DSA_TAG_EDSA
 	bool
 
+config NET_DSA_TAG_GSWIP
+	bool
+
 config NET_DSA_TAG_KSZ
 	bool
 
diff --git a/net/dsa/Makefile b/net/dsa/Makefile
index 9e4d3536f977..6e721f7a2947 100644
--- a/net/dsa/Makefile
+++ b/net/dsa/Makefile
@@ -9,6 +9,7 @@ dsa_core-$(CONFIG_NET_DSA_TAG_BRCM) += tag_brcm.o
 dsa_core-$(CONFIG_NET_DSA_TAG_BRCM_PREPEND) += tag_brcm.o
 dsa_core-$(CONFIG_NET_DSA_TAG_DSA) += tag_dsa.o
 dsa_core-$(CONFIG_NET_DSA_TAG_EDSA) += tag_edsa.o
+dsa_core-$(CONFIG_NET_DSA_TAG_GSWIP) += tag_gswip.o
 dsa_core-$(CONFIG_NET_DSA_TAG_KSZ) += tag_ksz.o
 dsa_core-$(CONFIG_NET_DSA_TAG_LAN9303) += tag_lan9303.o
 dsa_core-$(CONFIG_NET_DSA_TAG_MTK) += tag_mtk.o
diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c
index e63c554e0623..81212109c507 100644
--- a/net/dsa/dsa.c
+++ b/net/dsa/dsa.c
@@ -54,6 +54,9 @@ const struct dsa_device_ops *dsa_device_ops[DSA_TAG_LAST] = {
 #ifdef CONFIG_NET_DSA_TAG_EDSA
 	[DSA_TAG_PROTO_EDSA] = &edsa_netdev_ops,
 #endif
+#ifdef CONFIG_NET_DSA_TAG_GSWIP
+	[DSA_TAG_PROTO_GSWIP] = &gswip_netdev_ops,
+#endif
 #ifdef CONFIG_NET_DSA_TAG_KSZ
 	[DSA_TAG_PROTO_KSZ] = &ksz_netdev_ops,
 #endif
diff --git a/net/dsa/dsa_priv.h b/net/dsa/dsa_priv.h
index 3964c6f7a7c0..824ca07a30aa 100644
--- a/net/dsa/dsa_priv.h
+++ b/net/dsa/dsa_priv.h
@@ -205,6 +205,9 @@ extern const struct dsa_device_ops dsa_netdev_ops;
 /* tag_edsa.c */
 extern const struct dsa_device_ops edsa_netdev_ops;
 
+/* tag_gswip.c */
+extern const struct dsa_device_ops gswip_netdev_ops;
+
 /* tag_ksz.c */
 extern const struct dsa_device_ops ksz_netdev_ops;
 
diff --git a/net/dsa/tag_gswip.c b/net/dsa/tag_gswip.c
new file mode 100644
index 000000000000..49e9b73f1be3
--- /dev/null
+++ b/net/dsa/tag_gswip.c
@@ -0,0 +1,109 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Intel / Lantiq GSWIP V2.0 PMAC tag support
+ *
+ * Copyright (C) 2017 - 2018 Hauke Mehrtens <hauke@hauke-m.de>
+ */
+
+#include <linux/bitops.h>
+#include <linux/etherdevice.h>
+#include <linux/skbuff.h>
+#include <net/dsa.h>
+
+#include "dsa_priv.h"
+
+#define GSWIP_TX_HEADER_LEN		4
+
+/* special tag in TX path header */
+/* Byte 0 */
+#define GSWIP_TX_SLPID_SHIFT		0	/* source port ID */
+#define  GSWIP_TX_SLPID_CPU		2
+#define  GSWIP_TX_SLPID_APP1		3
+#define  GSWIP_TX_SLPID_APP2		4
+#define  GSWIP_TX_SLPID_APP3		5
+#define  GSWIP_TX_SLPID_APP4		6
+#define  GSWIP_TX_SLPID_APP5		7
+
+/* Byte 1 */
+#define GSWIP_TX_CRCGEN_DIS		BIT(7)
+#define GSWIP_TX_DPID_SHIFT		0	/* destination group ID */
+#define  GSWIP_TX_DPID_ELAN		0
+#define  GSWIP_TX_DPID_EWAN		1
+#define  GSWIP_TX_DPID_CPU		2
+#define  GSWIP_TX_DPID_APP1		3
+#define  GSWIP_TX_DPID_APP2		4
+#define  GSWIP_TX_DPID_APP3		5
+#define  GSWIP_TX_DPID_APP4		6
+#define  GSWIP_TX_DPID_APP5		7
+
+/* Byte 2 */
+#define GSWIP_TX_PORT_MAP_EN		BIT(7)
+#define GSWIP_TX_PORT_MAP_SEL		BIT(6)
+#define GSWIP_TX_LRN_DIS		BIT(5)
+#define GSWIP_TX_CLASS_EN		BIT(4)
+#define GSWIP_TX_CLASS_SHIFT		0
+#define GSWIP_TX_CLASS_MASK		GENMASK(3, 0)
+
+/* Byte 3 */
+#define GSWIP_TX_DPID_EN		BIT(0)
+#define GSWIP_TX_PORT_MAP_SHIFT		1
+#define GSWIP_TX_PORT_MAP_MASK		GENMASK(6, 1)
+
+#define GSWIP_RX_HEADER_LEN	8
+
+/* special tag in RX path header */
+/* Byte 7 */
+#define GSWIP_RX_SPPID_SHIFT		4
+#define GSWIP_RX_SPPID_MASK		GENMASK(6, 4)
+
+static struct sk_buff *gswip_tag_xmit(struct sk_buff *skb,
+				      struct net_device *dev)
+{
+	struct dsa_port *dp = dsa_slave_to_port(dev);
+	int err;
+	u8 *gswip_tag;
+
+	err = skb_cow_head(skb, GSWIP_TX_HEADER_LEN);
+	if (err)
+		return NULL;
+
+	skb_push(skb, GSWIP_TX_HEADER_LEN);
+
+	gswip_tag = skb->data;
+	gswip_tag[0] = GSWIP_TX_SLPID_CPU;
+	gswip_tag[1] = GSWIP_TX_DPID_ELAN;
+	gswip_tag[2] = GSWIP_TX_PORT_MAP_EN | GSWIP_TX_PORT_MAP_SEL;
+	gswip_tag[3] = BIT(dp->index + GSWIP_TX_PORT_MAP_SHIFT) & GSWIP_TX_PORT_MAP_MASK;
+	gswip_tag[3] |= GSWIP_TX_DPID_EN;
+
+	return skb;
+}
+
+static struct sk_buff *gswip_tag_rcv(struct sk_buff *skb,
+				     struct net_device *dev,
+				     struct packet_type *pt)
+{
+	int port;
+	u8 *gswip_tag;
+
+	if (unlikely(!pskb_may_pull(skb, GSWIP_RX_HEADER_LEN)))
+		return NULL;
+
+	gswip_tag = skb->data - ETH_HLEN;
+
+	/* Get source port information */
+	port = (gswip_tag[7] & GSWIP_RX_SPPID_MASK) >> GSWIP_RX_SPPID_SHIFT;
+	skb->dev = dsa_master_find_slave(dev, 0, port);
+	if (!skb->dev)
+		return NULL;
+
+	/* remove GSWIP tag */
+	skb_pull_rcsum(skb, GSWIP_RX_HEADER_LEN);
+
+	return skb;
+}
+
+const struct dsa_device_ops gswip_netdev_ops = {
+	.xmit = gswip_tag_xmit,
+	.rcv = gswip_tag_rcv,
+};
-- 
2.11.0

* [PATCH v3 net-next 1/6] MIPS: lantiq: Do not enable IRQs in dma open
From: Hauke Mehrtens @ 2018-09-09 20:16 UTC (permalink / raw)
  To: davem
  Cc: netdev, andrew, vivien.didelot, f.fainelli, john, linux-mips, dev,
	hauke.mehrtens, devicetree, Hauke Mehrtens
In-Reply-To: <20180909201647.32727-1-hauke@hauke-m.de>

When a DMA channel is opened, the IRQ should not get activated
automatically. This allows the caller to pull data out manually without
the help of interrupts, which is needed for a workaround in the vrx200
Ethernet driver.

Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de>
Acked-by: Paul Burton <paul.burton@mips.com>
---
 arch/mips/lantiq/xway/dma.c        | 1 -
 drivers/net/ethernet/lantiq_etop.c | 1 +
 2 files changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/mips/lantiq/xway/dma.c b/arch/mips/lantiq/xway/dma.c
index 664f2f7f55c1..982859f2b2a3 100644
--- a/arch/mips/lantiq/xway/dma.c
+++ b/arch/mips/lantiq/xway/dma.c
@@ -106,7 +106,6 @@ ltq_dma_open(struct ltq_dma_channel *ch)
 	spin_lock_irqsave(&ltq_dma_lock, flag);
 	ltq_dma_w32(ch->nr, LTQ_DMA_CS);
 	ltq_dma_w32_mask(0, DMA_CHAN_ON, LTQ_DMA_CCTRL);
-	ltq_dma_w32_mask(0, 1 << ch->nr, LTQ_DMA_IRNEN);
 	spin_unlock_irqrestore(&ltq_dma_lock, flag);
 }
 EXPORT_SYMBOL_GPL(ltq_dma_open);
diff --git a/drivers/net/ethernet/lantiq_etop.c b/drivers/net/ethernet/lantiq_etop.c
index e08301d833e2..379db19a303c 100644
--- a/drivers/net/ethernet/lantiq_etop.c
+++ b/drivers/net/ethernet/lantiq_etop.c
@@ -439,6 +439,7 @@ ltq_etop_open(struct net_device *dev)
 		if (!IS_TX(i) && (!IS_RX(i)))
 			continue;
 		ltq_dma_open(&ch->dma);
+		ltq_dma_enable_irq(&ch->dma);
 		napi_enable(&ch->napi);
 	}
 	phy_start(dev->phydev);
-- 
2.11.0

* [PATCH v3 net-next 0/6] Add support for Lantiq / Intel vrx200 network
From: Hauke Mehrtens @ 2018-09-09 20:16 UTC (permalink / raw)
  To: davem
  Cc: netdev, andrew, vivien.didelot, f.fainelli, john, linux-mips, dev,
	hauke.mehrtens, devicetree, Hauke Mehrtens

This adds basic support for the GSWIP (Gigabit Switch) found in the
VRX200 SoC.
There are different versions of this IP core used in different SoCs, but
this driver has so far only been tested on the VRX200 SoC line; for other
SoCs it will probably need some adaptations to work.

I also plan to add Layer 2 offloading to the DSA driver and later also
layer 3 offloading which is supported by the PPE HW block.

All these patches should go through the net-next tree.

This depends on the patch "MIPS: lantiq: dma: add dev pointer" which 
should go into 4.19.

Changes since:
v2:
 * Send patch "MIPS: lantiq: dma: add dev pointer" separately
 * all: removed return in register write functions
 * switch: uses phylink
 * switch: uses hardware MDIO auto polling
 * switch: use usleep_range() in MDIO busy check
 * switch: configure MDIO bus to 2.5 MHz
 * switch: disable xMII link when it is not used
 * Ethernet: use NAPI for TX cleanups
 * Ethernet: enable clock in open callback
 * Ethernet: improve skb allocation
 * Ethernet: use net_dev->stats

v1:
 * Add "MIPS: lantiq: dma: add dev pointer"
 * checkpatch fixes in all patches
 * Added binding documentation
 * use readx_poll_timeout function and ETIMEDOUT error code
 * integrate GPHY firmware loading into DSA driver
 * renamed to NET_DSA_LANTIQ_GSWIP
 * removed some unneeded casts
 * added of_device_id.data information about the detected switch
 * fixed John's email address

Hauke Mehrtens (6):
  MIPS: lantiq: Do not enable IRQs in dma open
  net: dsa: Add Lantiq / Intel GSWIP tag support
  dt-bindings: net: Add lantiq,xrx200-net DT bindings
  net: lantiq: Add Lantiq / Intel VRX200 Ethernet driver
  dt-bindings: net: dsa: Add lantiq,xrx200-gswip DT bindings
  net: dsa: Add Lantiq / Intel DSA driver for vrx200

 .../devicetree/bindings/net/dsa/lantiq-gswip.txt   |  141 +++
 .../devicetree/bindings/net/lantiq,xrx200-net.txt  |   21 +
 MAINTAINERS                                        |    9 +
 arch/mips/lantiq/xway/dma.c                        |    1 -
 arch/mips/lantiq/xway/sysctrl.c                    |   14 +-
 drivers/net/dsa/Kconfig                            |    8 +
 drivers/net/dsa/Makefile                           |    1 +
 drivers/net/dsa/lantiq_gswip.c                     | 1169 ++++++++++++++++++++
 drivers/net/dsa/lantiq_pce.h                       |  153 +++
 drivers/net/ethernet/Kconfig                       |    7 +
 drivers/net/ethernet/Makefile                      |    1 +
 drivers/net/ethernet/lantiq_etop.c                 |    1 +
 drivers/net/ethernet/lantiq_xrx200.c               |  564 ++++++++++
 include/net/dsa.h                                  |    1 +
 net/dsa/Kconfig                                    |    3 +
 net/dsa/Makefile                                   |    1 +
 net/dsa/dsa.c                                      |    3 +
 net/dsa/dsa_priv.h                                 |    3 +
 net/dsa/tag_gswip.c                                |  109 ++
 19 files changed, 2202 insertions(+), 8 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/net/dsa/lantiq-gswip.txt
 create mode 100644 Documentation/devicetree/bindings/net/lantiq,xrx200-net.txt
 create mode 100644 drivers/net/dsa/lantiq_gswip.c
 create mode 100644 drivers/net/dsa/lantiq_pce.h
 create mode 100644 drivers/net/ethernet/lantiq_xrx200.c
 create mode 100644 net/dsa/tag_gswip.c

-- 
2.11.0

* [PATCH v2 net] MIPS: lantiq: dma: add dev pointer
From: Hauke Mehrtens @ 2018-09-09 19:26 UTC (permalink / raw)
  To: davem
  Cc: netdev, andrew, f.fainelli, john, linux-mips, hauke.mehrtens,
	paul.burton, Hauke Mehrtens

dma_zalloc_coherent() now crashes if no dev pointer is given.
Add a dev pointer to the ltq_dma_channel structure and fill it in the
driver using it.

This fixes a bug introduced in kernel 4.19.

Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de>
---

no changes since v1.

This should go into kernel 4.19, and I have some other patches adding new
features for kernel 4.20 which depend on this, so I would prefer
that this goes through the net tree.

 arch/mips/include/asm/mach-lantiq/xway/xway_dma.h | 1 +
 arch/mips/lantiq/xway/dma.c                       | 4 ++--
 drivers/net/ethernet/lantiq_etop.c                | 1 +
 3 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/mips/include/asm/mach-lantiq/xway/xway_dma.h b/arch/mips/include/asm/mach-lantiq/xway/xway_dma.h
index 4901833498f7..8441b2698e64 100644
--- a/arch/mips/include/asm/mach-lantiq/xway/xway_dma.h
+++ b/arch/mips/include/asm/mach-lantiq/xway/xway_dma.h
@@ -40,6 +40,7 @@ struct ltq_dma_channel {
 	int desc;			/* the current descriptor */
 	struct ltq_dma_desc *desc_base; /* the descriptor base */
 	int phys;			/* physical addr */
+	struct device *dev;
 };
 
 enum {
diff --git a/arch/mips/lantiq/xway/dma.c b/arch/mips/lantiq/xway/dma.c
index 4b9fbb6744ad..664f2f7f55c1 100644
--- a/arch/mips/lantiq/xway/dma.c
+++ b/arch/mips/lantiq/xway/dma.c
@@ -130,7 +130,7 @@ ltq_dma_alloc(struct ltq_dma_channel *ch)
 	unsigned long flags;
 
 	ch->desc = 0;
-	ch->desc_base = dma_zalloc_coherent(NULL,
+	ch->desc_base = dma_zalloc_coherent(ch->dev,
 				LTQ_DESC_NUM * LTQ_DESC_SIZE,
 				&ch->phys, GFP_ATOMIC);
 
@@ -182,7 +182,7 @@ ltq_dma_free(struct ltq_dma_channel *ch)
 	if (!ch->desc_base)
 		return;
 	ltq_dma_close(ch);
-	dma_free_coherent(NULL, LTQ_DESC_NUM * LTQ_DESC_SIZE,
+	dma_free_coherent(ch->dev, LTQ_DESC_NUM * LTQ_DESC_SIZE,
 		ch->desc_base, ch->phys);
 }
 EXPORT_SYMBOL_GPL(ltq_dma_free);
diff --git a/drivers/net/ethernet/lantiq_etop.c b/drivers/net/ethernet/lantiq_etop.c
index 7a637b51c7d2..e08301d833e2 100644
--- a/drivers/net/ethernet/lantiq_etop.c
+++ b/drivers/net/ethernet/lantiq_etop.c
@@ -274,6 +274,7 @@ ltq_etop_hw_init(struct net_device *dev)
 		struct ltq_etop_chan *ch = &priv->ch[i];
 
 		ch->idx = ch->dma.nr = i;
+		ch->dma.dev = &priv->pdev->dev;
 
 		if (IS_TX(i)) {
 			ltq_dma_alloc_tx(&ch->dma);
-- 
2.11.0

* Re: [PATCH net V2] ip: frags: fix crash in ip_do_fragment()
From: Taehee Yoo @ 2018-09-09 18:54 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, Peter Oskolkov, netdev, Pablo Neira Ayuso,
	Florian Westphal
In-Reply-To: <CANn89i+pBEuH-dg=f+3rxiheQyjJfg2=SqgoRmB4S7VEaVgeow@mail.gmail.com>

2018-09-10 3:24 GMT+09:00 Eric Dumazet <edumazet@google.com>:
> On Sun, Sep 9, 2018 at 10:47 AM Taehee Yoo <ap420073@gmail.com> wrote:
>>
>> A kernel crash occurs when a defragmented packet is fragmented
>> in ip_do_fragment().
>> In the defragment routine, skb_orphan() is called and
>> skb->ip_defrag_offset is set, but skb->sk and
>> skb->ip_defrag_offset are members of the same union, so
>> frag->sk is not NULL.
>> Hence the crash occurs in the skb->sk check in ip_do_fragment() when
>> a defragmented packet is fragmented.
>>
>> test commands:
>>    %iptables -t nat -I POSTROUTING -j MASQUERADE
>>    %hping3 192.168.4.2 -s 1000 -p 2000 -d 60000
>>
>> splat looks like:
>>
>> v2:
>>  - clear skb->sk in the reassembly routine (Eric Dumazet)
>>
>> Fixes: fa0f527358bd ("ip: use rb trees for IP frag queue.")
>> Suggested-by: Eric Dumazet <edumazet@google.com>
>> Signed-off-by: Taehee Yoo <ap420073@gmail.com>
>> ---
>>  net/ipv4/ip_fragment.c                  | 1 +
>>  net/ipv6/netfilter/nf_conntrack_reasm.c | 1 +
>>  2 files changed, 2 insertions(+)
>>
>> diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
>> index 88281fbce88c..e7227128df2c 100644
>> --- a/net/ipv4/ip_fragment.c
>> +++ b/net/ipv4/ip_fragment.c
>> @@ -599,6 +599,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
>>                         nextp = &fp->next;
>>                         fp->prev = NULL;
>>                         memset(&fp->rbnode, 0, sizeof(fp->rbnode));
>> +                       fp->sk = NULL;
>>                         head->data_len += fp->len;
>>                         head->len += fp->len;
>>                         if (head->ip_summed != fp->ip_summed)
>> diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
>> index 2a14d8b65924..8f68a518d9db 100644
>> --- a/net/ipv6/netfilter/nf_conntrack_reasm.c
>> +++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
>> @@ -445,6 +445,7 @@ nf_ct_frag6_reasm(struct frag_queue *fq, struct sk_buff *prev,  struct net_devic
>>                 else if (head->ip_summed == CHECKSUM_COMPLETE)
>>                         head->csum = csum_add(head->csum, fp->csum);
>>                 head->truesize += fp->truesize;
>> +               fp->sk = NULL;
>
> This is not needed. IPv6 paths were not changed by recent commits.
>
>>         }
>>         sub_frag_mem_limit(fq->q.net, head->truesize);
>>
>> --
>> 2.17.1
>>

Hi Eric,
Thank you for the review!

I think the netfilter-side IPv6 change is needed because the netfilter
IPv6 defrag routine also sets fp->ip_defrag_offset, so fp->sk will not
be NULL afterwards.
I also think the crashes in ip_do_fragment() and ip6_fragment() are
actually the same bug.
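
To illustrate the aliasing described above, here is a minimal user-space
sketch. It is not kernel code: struct fake_skb and the fake_*() helpers
are hypothetical stand-ins that only model the fact that skb->sk and
skb->ip_defrag_offset share one union, and it assumes a platform where
NULL is represented as all-zero bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant part of struct sk_buff: the
 * socket pointer and the defrag offset occupy the same storage.
 */
struct fake_skb {
	union {
		void *sk;             /* owning socket, NULL when orphaned */
		int ip_defrag_offset; /* fragment offset kept while queued */
	};
};

/* Models queuing a fragment: orphan it, then record its offset. */
static void fake_defrag_queue(struct fake_skb *skb, int offset)
{
	skb->sk = NULL;                 /* state left behind by orphaning */
	skb->ip_defrag_offset = offset; /* overwrites the same bytes */
}

/* Models the proposed fix: clear the pointer during reassembly. */
static void fake_reasm_finish(struct fake_skb *skb)
{
	skb->sk = NULL;
}

/* Models the later check that expects fragments to have no owner. */
static int fake_frag_has_owner(const struct fake_skb *skb)
{
	return skb->sk != NULL;
}

/* Walks through the sequence; returns 0 if both expectations hold. */
int defrag_alias_demo(void)
{
	struct fake_skb frag;

	fake_defrag_queue(&frag, 1480);
	assert(fake_frag_has_owner(&frag));  /* stale offset looks like a socket */

	fake_reasm_finish(&frag);
	assert(!fake_frag_has_owner(&frag)); /* cleared, so the check passes */

	return 0;
}
```

Any non-zero offset stored through ip_defrag_offset makes the sk view of
the union non-NULL, which is the condition the check in the fragmentation
path trips over unless the field is cleared first.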

My ipv6 test environment is below

PC1<----------------->FW<----------------->PC2
cd::02/64     cd::01/64   ab::01/64      ab::02/64

FW command:
   %ip6tables -t nat -I POSTROUTING -j MASQUERADE

PC2 command:
   %ping6 cd::02 -s 60000

FW crash message:
[  502.676552] kernel BUG at net/ipv6/ip6_output.c:658!
[  502.682641] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
[  502.683545] CPU: 1 PID: 17 Comm: ksoftirqd/1 Not tainted 4.19.0-rc2+ #12
[  502.692231] Hardware name: To be filled by O.E.M. To be filled by O.E.M./Aptio CRB, BIOS 5.6.5 07/08/2015
[  502.692231] RIP: 0010:ip6_fragment+0x16fa/0x3580
[  502.692231] Code: 45 00 f8 0f 85 c3 03 00 00 49 8d 7f 18 48 89 f8 48 c1 e8 03 42 80 3c 28 00 0f 85 23 18 00 00 49 83 7f 18 00 0f 84 22 fe ff ff <0f> 0b 48 85 c0 0f 85 43 02 00 00 49 83 e6 fe 48 b8 00 00 00 00 00
[  502.692231] RSP: 0018:ffff88011561eb58 EFLAGS: 00010202
[  502.692231] RAX: 1ffff1002142e37b RBX: ffff8801062c16c0 RCX: 0000000000000004
[  502.692231] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88010a171bd8
[  502.692231] RBP: ffffed0022ac3d89 R08: ffffed002142e395 R09: ffffed002142e395
[  502.692231] R10: 0000000000000001 R11: ffffed002142e394 R12: 0000000000000028
[  502.692231] R13: dffffc0000000000 R14: ffff88010a171ca4 R15: ffff88010a171bc0
[  502.692231] FS:  0000000000000000(0000) GS:ffff880116600000(0000) knlGS:0000000000000000
[  502.692231] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  502.692231] CR2: 0000000002452c64 CR3: 0000000105fea000 CR4: 00000000001006e0
[  502.692231] Call Trace:
[  502.692231]  ? ip6_forward_finish+0x370/0x370
[  502.692231]  ? ip6_forward+0x34e0/0x34e0
[  502.813859]  ? ip6_mtu+0x24d/0x310
[  502.813859]  ? ip6_negative_advice+0x160/0x160
[  502.813859]  ? rcu_is_watching+0x77/0x120
[  502.813859]  ? nf_nat_ipv6_out+0x206/0x4d0 [nf_nat_ipv6]
[  502.813859]  ? ip6_finish_output+0x35e/0x920
[  502.813859]  ip6_output+0x1e5/0x870
[  502.813859]  ? ip6_finish_output+0x920/0x920
[  502.813859]  ? __lock_acquire+0x4500/0x4500
[  502.813859]  ? __nf_nat_alloc_null_binding+0x20c/0x340 [nf_nat]
[  502.813859]  ? ip6_fragment+0x3580/0x3580
[  502.813859]  ip6_forward+0x1328/0x34e0
[  502.813859]  ? nf_nat_inet_fn+0x446/0x970 [nf_nat]
[  502.813859]  ? ip6_autoflowlabel+0xb0/0xb0
[  502.813859]  ? ip6_route_input+0x639/0xc80
[  502.813859]  ? save_trace+0x300/0x300
[  502.813859]  ? ip6_route_info_create+0x35e0/0x35e0
[  502.813859]  ? ip6_rcv_finish_core.isra.10+0x152/0x520
[  502.813859]  ? nf_hook.constprop.19+0x780/0x780
[  502.813859]  ? nf_nat_ipv6_out+0x4d0/0x4d0 [nf_nat_ipv6]
[  502.813859]  ? ipv6_rcv+0x40b/0x500
[  502.813859]  ? ip6_autoflowlabel+0xb0/0xb0
[  502.813859]  ipv6_rcv+0x40b/0x500
[  502.813859]  ? ip6_rcv_core.isra.14+0x1dd0/0x1dd0
[  502.813859]  ? lock_acquire+0x196/0x470
[  502.813859]  ? ip6_rcv_finish_core.isra.10+0x520/0x520
[  502.813859]  ? ip6_rcv_core.isra.14+0x1dd0/0x1dd0
[  502.813859]  __netif_receive_skb_one_core+0x115/0x1a0
[  502.813859]  ? rcu_is_watching+0x77/0x120
[  502.813859]  ? __netif_receive_skb_core+0x2ac0/0x2ac0
[  502.813859]  ? _rcu_barrier_trace+0x400/0x400
[  502.813859]  ? rcu_pm_notify+0xc0/0xc0
[  502.813859]  netif_receive_skb_internal+0xd8/0x570
[  502.813859]  ? dev_cpu_dead+0x980/0x980
[  502.813859]  ? rcu_read_lock_sched_held+0x114/0x130
[  502.813859]  ? kmem_cache_alloc+0x1ea/0x280
[  502.813859]  ? memset+0x1f/0x40
[  502.813859]  ? rcu_pm_notify+0xc0/0xc0
[  502.813859]  napi_gro_receive+0x344/0x410
[  502.813859]  ? dev_gro_receive+0x29b0/0x29b0
[  502.813859]  ? build_skb+0x63/0x2b0
[  502.813859]  ? eth_gro_receive+0x8f0/0x8f0
[  502.813859]  ? __build_skb+0x3b0/0x3b0
[  502.813859]  igb_poll+0x16e1/0x4650
[  502.813859]  ? igb_alloc_rx_buffers+0xa80/0xa80
[  502.813859]  ? sched_clock_cpu+0x126/0x170
[  502.813859]  ? stop_critical_timings+0x420/0x420
[  502.813859]  ? net_rx_action+0x3f0/0x1440
[  502.813859]  ? net_rx_action+0x3f0/0x1440
[  502.813859]  ? rcu_pm_notify+0xc0/0xc0
[  502.813859]  net_rx_action+0x5db/0x1440
[  502.813859]  ? _raw_spin_unlock+0x24/0x30
[  502.813859]  ? napi_complete_done+0x4c0/0x4c0
[  502.813859]  ? pfifo_fast_peek+0x1c0/0x1c0
[  502.813859]  ? cyc2ns_read_end+0x10/0x10
[  502.813859]  ? save_trace+0x300/0x300
[  502.813859]  ? save_trace+0x300/0x300
[  502.813859]  ? sched_clock_cpu+0x126/0x170
[  502.813859]  ? find_held_lock+0x39/0x1c0
[  502.813859]  ? lock_acquire+0x196/0x470
[  502.813859]  ? check_flags.part.36+0x450/0x450
[  502.813859]  ? check_flags.part.36+0x450/0x450
[  502.813859]  ? __run_timers+0x6ff/0x990
[  502.813859]  ? do_raw_spin_unlock+0xa5/0x330
[  502.813859]  ? do_raw_spin_trylock+0x1a0/0x1a0
[  502.813859]  ? do_raw_spin_trylock+0x101/0x1a0
[  502.813859]  ? _raw_spin_unlock+0x24/0x30
[  502.813859]  ? net_tx_action+0x6cb/0xad0
[  502.813859]  ? dev_queue_xmit_nit+0xe70/0xe70
[  502.813859]  ? _raw_spin_unlock_irq+0x31/0x40
[  502.813859]  ? stop_critical_timings+0x420/0x420
[  502.813859]  ? finish_task_switch+0x183/0x740
[  502.813859]  ? __switch_to_asm+0x30/0x60
[  502.813859]  ? __switch_to_asm+0x24/0x60
[  502.813859]  ? __do_softirq+0x263/0xa11
[  502.813859]  ? __do_softirq+0x263/0xa11
[  502.813859]  ? rcu_pm_notify+0xc0/0xc0
[  502.813859]  __do_softirq+0x2a5/0xa11
[  502.813859]  ? __sched_text_start+0x8/0x8
[  502.813859]  ? __irqentry_text_end+0x1fa51c/0x1fa51c
[  502.813859]  ? smpboot_thread_fn+0xd5/0x700
[  502.813859]  ? tracer_hardirqs_on+0x420/0x420
[  502.813859]  ? trace_hardirqs_on_thunk+0x1a/0x1c
[  502.813859]  ? run_ksoftirqd+0xb/0x50
[  502.813859]  ? trace_hardirqs_off+0x6b/0x210
[  502.813859]  ? trace_hardirqs_on_caller+0x210/0x210
[  502.813859]  ? takeover_tasklets+0x860/0x860
[  502.813859]  ? takeover_tasklets+0x860/0x860
[  502.813859]  run_ksoftirqd+0x24/0x50
[  502.813859]  smpboot_thread_fn+0x3cd/0x700
[  502.813859]  ? sort_range+0x20/0x20
[  502.813859]  ? __kthread_parkme+0xb6/0x180
[  502.813859]  ? sort_range+0x20/0x20
[  502.813859]  kthread+0x322/0x3e0
[  502.813859]  ? kthread_create_worker_on_cpu+0xc0/0xc0
[  502.813859]  ret_from_fork+0x3a/0x50
[  502.813859] Modules linked in: ip6t_MASQUERADE ip6table_nat nf_nat_ipv6 nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6_tables ip_tables x_tables
[  503.279587] ---[ end trace 49127255558e40c3 ]---
[  503.284881] RIP: 0010:ip6_fragment+0x16fa/0x3580
[  503.290150] Code: 45 00 f8 0f 85 c3 03 00 00 49 8d 7f 18 48 89 f8 48 c1 e8 03 42 80 3c 28 00 0f 85 23 18 00 00 49 83 7f 18 00 0f 84 22 fe ff ff <0f> 0b 48 85 c0 0f 85 43 02 00 00 49 83 e6 fe 48 b8 00 00 00 00 00
[  503.311356] RSP: 0018:ffff88011561eb58 EFLAGS: 00010202
[  503.317328] RAX: 1ffff1002142e37b RBX: ffff8801062c16c0 RCX: 0000000000000004
[  503.325438] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88010a171bd8
[  503.333574] RBP: ffffed0022ac3d89 R08: ffffed002142e395 R09: ffffed002142e395
[  503.341686] R10: 0000000000000001 R11: ffffed002142e394 R12: 0000000000000028
[  503.349803] R13: dffffc0000000000 R14: ffff88010a171ca4 R15: ffff88010a171bc0
[  503.357916] FS:  0000000000000000(0000) GS:ffff880116600000(0000) knlGS:0000000000000000
[  503.367102] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  503.373642] CR2: 0000000002452c64 CR3: 0000000105fea000 CR4: 00000000001006e0
[  503.381803] Kernel panic - not syncing: Fatal exception in interrupt
[  503.382721] Kernel Offset: 0xa000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[  503.382721] Rebooting in 5 seconds..

* Re: [PATCH net V2] ip: frags: fix crash in ip_do_fragment()
From: Eric Dumazet @ 2018-09-09 18:24 UTC (permalink / raw)
  To: Taehee Yoo
  Cc: David Miller, Peter Oskolkov, netdev, Pablo Neira Ayuso,
	Florian Westphal
In-Reply-To: <20180909174705.3677-1-ap420073@gmail.com>

On Sun, Sep 9, 2018 at 10:47 AM Taehee Yoo <ap420073@gmail.com> wrote:
>
> A kernel crash occurs when a defragmented packet is fragmented
> in ip_do_fragment().
> In the defragment routine, skb_orphan() is called and
> skb->ip_defrag_offset is set, but skb->sk and
> skb->ip_defrag_offset are members of the same union, so
> frag->sk is not NULL.
> Hence the crash occurs in the skb->sk check in ip_do_fragment() when
> a defragmented packet is fragmented.
>
> test commands:
>    %iptables -t nat -I POSTROUTING -j MASQUERADE
>    %hping3 192.168.4.2 -s 1000 -p 2000 -d 60000
>
> splat looks like:
>
> v2:
>  - clear skb->sk in the reassembly routine (Eric Dumazet)
>
> Fixes: fa0f527358bd ("ip: use rb trees for IP frag queue.")
> Suggested-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Taehee Yoo <ap420073@gmail.com>
> ---
>  net/ipv4/ip_fragment.c                  | 1 +
>  net/ipv6/netfilter/nf_conntrack_reasm.c | 1 +
>  2 files changed, 2 insertions(+)
>
> diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
> index 88281fbce88c..e7227128df2c 100644
> --- a/net/ipv4/ip_fragment.c
> +++ b/net/ipv4/ip_fragment.c
> @@ -599,6 +599,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
>                         nextp = &fp->next;
>                         fp->prev = NULL;
>                         memset(&fp->rbnode, 0, sizeof(fp->rbnode));
> +                       fp->sk = NULL;
>                         head->data_len += fp->len;
>                         head->len += fp->len;
>                         if (head->ip_summed != fp->ip_summed)
> diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
> index 2a14d8b65924..8f68a518d9db 100644
> --- a/net/ipv6/netfilter/nf_conntrack_reasm.c
> +++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
> @@ -445,6 +445,7 @@ nf_ct_frag6_reasm(struct frag_queue *fq, struct sk_buff *prev,  struct net_devic
>                 else if (head->ip_summed == CHECKSUM_COMPLETE)
>                         head->csum = csum_add(head->csum, fp->csum);
>                 head->truesize += fp->truesize;
> +               fp->sk = NULL;

This is not needed. IPv6 paths were not changed by recent commits.

>         }
>         sub_frag_mem_limit(fq->q.net, head->truesize);
>
> --
> 2.17.1
>


* Re: [PATCH v4 3/3] IB/ipoib: Log sysfs 'dev_id' accesses from userspace
From: Arseny Maslennikov @ 2018-09-09 18:11 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: linux-rdma, Doug Ledford, netdev
In-Reply-To: <20180907154359.GA14627@ziepe.ca>

[-- Attachment #1: Type: text/plain, Size: 3223 bytes --]

On Fri, Sep 07, 2018 at 09:43:59AM -0600, Jason Gunthorpe wrote:
> On Thu, Sep 06, 2018 at 05:51:12PM +0300, Arseny Maslennikov wrote:
> > Some tools may currently be using only the deprecated attribute;
> > let's print an elaborate and clear deprecation notice to kmsg.
> > 
> > To do that, we have to replace the whole sysfs file, since we inherit
> > the original one from netdev.
> > 
> > Signed-off-by: Arseny Maslennikov <ar@cs.msu.ru>
> >  drivers/infiniband/ulp/ipoib/ipoib_main.c | 31 +++++++++++++++++++++++
> >  1 file changed, 31 insertions(+)
> > 
> > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> > index 30f840f874b3..74732726ec6f 100644
> > +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> > @@ -2386,6 +2386,35 @@ int ipoib_add_pkey_attr(struct net_device *dev)
> >  	return device_create_file(&dev->dev, &dev_attr_pkey);
> >  }
> >  
> > +/*
> > + * We erroneously exposed the iface's port number in the dev_id
> > + * sysfs field long after dev_port was introduced for that purpose[1],
> > + * and we need to stop everyone from relying on that.
> > + * Let's overload the shower routine for the dev_id file here
> > + * to gently bring the issue up.
> > + *
> > + * [1] https://www.spinics.net/lists/netdev/msg272123.html
> > + */
> > +static ssize_t dev_id_show(struct device *dev,
> > +			   struct device_attribute *attr, char *buf)
> > +{
> > +	struct net_device *ndev = to_net_dev(dev);
> > +
> > +	if (ndev->dev_id == ndev->dev_port)
> > +		netdev_info_once(ndev,
> > +			"\"%s\" wants to know my dev_id. Should it look at dev_port instead? See Documentation/ABI/testing/sysfs-class-net for more info.\n",
> > +			current->comm);
> > +
> > +	return sprintf(buf, "%#x\n", ndev->dev_id);
> > +}
> > +static DEVICE_ATTR_RO(dev_id);
> > +
> > +int ipoib_intercept_dev_id_attr(struct net_device *dev)
> > +{
> > +	device_remove_file(&dev->dev, &dev_attr_dev_id);
> > +	return device_create_file(&dev->dev, &dev_attr_dev_id);
> > +}
> 
> > Isn't this racy with userspace? I.e., what happens if udev is querying
> > the dev_id right here?

udev in particular has not used dev_id at all since 2014, because "why
would we keep using dev_id if it is not the right thing to use?".
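
For tools that still read dev_id, here is a rough userspace sketch of
preferring dev_port and falling back to dev_id only when dev_port cannot
be read. The helper names parse_sysfs_uint() and pick_port_number() are
made up for this illustration, and the caller is assumed to have already
read the attribute text from sysfs; this is not code from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse the text of a numeric sysfs attribute.
 * Base 0 lets strtoul() accept both "0x1" (the "%#x" format used for
 * dev_id in the patch above) and plain decimal text.
 */
static bool parse_sysfs_uint(const char *text, unsigned long *out)
{
	char *end;
	unsigned long val;

	assert(text != NULL && out != NULL);

	errno = 0;
	val = strtoul(text, &end, 0);
	if (errno != 0 || end == text)
		return false;		/* overflow or no digits at all */
	if (*end != '\n' && *end != '\0')
		return false;		/* trailing garbage */

	*out = val;
	return true;
}

/*
 * Hypothetical policy: use dev_port when it is readable, and only
 * fall back to dev_id when dev_port is missing or malformed.
 * A NULL pointer stands for "attribute could not be read".
 */
static bool pick_port_number(const char *dev_port_text,
			     const char *dev_id_text,
			     unsigned long *port)
{
	if (dev_port_text && parse_sysfs_uint(dev_port_text, port))
		return true;
	return dev_id_text && parse_sysfs_uint(dev_id_text, port);
}
```

A tool written this way would stop triggering the notice once dev_port
is available, since it would never open dev_id in that case.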

> 
> Do we know there is no userspace doing this?
> 

Not for sure.

If we move all the sysfs handling stuff we introduce in _add_port():
 - pkey
 - umcast
 - {create,delete}_child
 - connected/datagram mode
to _ndo_init(), which is called by register_netdev before it sends
the netlink message, would that suffice to eliminate the race?
(Sysfs files for {create,delete}_child go to _parent_init() then).

> >  static struct net_device *ipoib_add_port(const char *format,
> >  					 struct ib_device *hca, u8 port)
> >  {
> > @@ -2427,6 +2456,8 @@ static struct net_device *ipoib_add_port(const char *format,
> >  	 */
> >  	ndev->priv_destructor = ipoib_intf_free;
> >  
> > +	if (ipoib_intercept_dev_id_attr(ndev))
> > +		goto sysfs_failed;
> 
> No device_remove_file needed?
> 

As far as I understand, unregister_netdevice takes care of it
while destroying the related kobject.

> Jason

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]


* [PATCH net V2] ip: frags: fix crash in ip_do_fragment()
From: Taehee Yoo @ 2018-09-09 17:47 UTC (permalink / raw)
  To: davem, posk, netdev; +Cc: ap420073, pablo, fw, edumazet

A kernel crash occurs when a defragmented packet is fragmented
in ip_do_fragment().
In the defragment routine, skb_orphan() is called and
skb->ip_defrag_offset is set, but skb->sk and
skb->ip_defrag_offset are members of the same union, so
frag->sk is not NULL.
Hence the crash occurs in the skb->sk check in ip_do_fragment() when
the defragmented packet is fragmented.
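
To illustrate the aliasing described above, here is a minimal userspace
sketch (not kernel code). struct fake_skb, model_defrag_enqueue() and
model_reasm_clear() are hypothetical names made up for this
illustration; only the sk / ip_defrag_offset union comes from the
description above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant part of struct sk_buff. */
struct fake_skb {
	union {
		void *sk;		/* owning socket, NULL when orphaned */
		int ip_defrag_offset;	/* fragment offset while queued */
	};
};

/*
 * Models the defragment path: the skb is orphaned (sk cleared) and
 * then the fragment offset is stored in the same storage.
 */
static void model_defrag_enqueue(struct fake_skb *skb, int offset)
{
	assert(skb != NULL);
	skb->sk = NULL;			/* what skb_orphan() leaves behind */
	skb->ip_defrag_offset = offset;	/* overwrites the bytes of sk */
}

/* Models the fix: clear sk again once reassembly is done. */
static void model_reasm_clear(struct fake_skb *skb)
{
	assert(skb != NULL);
	skb->sk = NULL;
}
```

In this model, any non-zero offset leaves sk looking non-NULL after
model_defrag_enqueue(), which is the state the skb->sk check in
ip_do_fragment() trips over. An offset of 0 would leave sk NULL here.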

test commands:
   %iptables -t nat -I POSTROUTING -j MASQUERADE
   %hping3 192.168.4.2 -s 1000 -p 2000 -d 60000

splat looks like:
[  261.069429] kernel BUG at net/ipv4/ip_output.c:636!
[  261.075753] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
[  261.083854] CPU: 1 PID: 1349 Comm: hping3 Not tainted 4.19.0-rc2+ #3
[  261.100977] RIP: 0010:ip_do_fragment+0x1613/0x2600
[  261.106945] Code: e8 e2 38 e3 fe 4c 8b 44 24 18 48 8b 74 24 08 e9 92 f6 ff ff 80 3c 02 00 0f 85 da 07 00 00 48 8b b5 d0 00 00 00 e9 25 f6 ff ff <0f> 0b 0f 0b 44 8b 54 24 58 4c 8b 4c 24 18 4c 8b 5c 24 60 4c 8b 6c
[  261.127015] RSP: 0018:ffff8801031cf2c0 EFLAGS: 00010202
[  261.134156] RAX: 1ffff1002297537b RBX: ffffed0020639e6e RCX: 0000000000000004
[  261.142156] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880114ba9bd8
[  261.150157] RBP: ffff880114ba8a40 R08: ffffed0022975395 R09: ffffed0022975395
[  261.158157] R10: 0000000000000001 R11: ffffed0022975394 R12: ffff880114ba9ca4
[  261.166159] R13: 0000000000000010 R14: ffff880114ba9bc0 R15: dffffc0000000000
[  261.174169] FS:  00007fbae2199700(0000) GS:ffff88011b400000(0000) knlGS:0000000000000000
[  261.183012] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  261.189013] CR2: 00005579244fe000 CR3: 0000000119bf4000 CR4: 00000000001006e0
[  261.198158] Call Trace:
[  261.199018]  ? dst_output+0x180/0x180
[  261.205011]  ? save_trace+0x300/0x300
[  261.209018]  ? ip_copy_metadata+0xb00/0xb00
[  261.213034]  ? sched_clock_local+0xd4/0x140
[  261.218158]  ? kill_l4proto+0x120/0x120 [nf_conntrack]
[  261.223014]  ? rt_cpu_seq_stop+0x10/0x10
[  261.227014]  ? find_held_lock+0x39/0x1c0
[  261.233008]  ip_finish_output+0x51d/0xb50
[  261.237006]  ? ip_fragment.constprop.56+0x220/0x220
[  261.243011]  ? nf_ct_l4proto_register_one+0x5b0/0x5b0 [nf_conntrack]
[  261.250152]  ? rcu_is_watching+0x77/0x120
[  261.255010]  ? nf_nat_ipv4_out+0x1e/0x2b0 [nf_nat_ipv4]
[  261.261033]  ? nf_hook_slow+0xb1/0x160
[  261.265007]  ip_output+0x1c7/0x710
[  261.269005]  ? ip_mc_output+0x13f0/0x13f0
[  261.273002]  ? __local_bh_enable_ip+0xe9/0x1b0
[  261.278152]  ? ip_fragment.constprop.56+0x220/0x220
[  261.282996]  ? nf_hook_slow+0xb1/0x160
[  261.287007]  raw_sendmsg+0x21f9/0x4420
[  261.291008]  ? dst_output+0x180/0x180
[  261.297003]  ? sched_clock_cpu+0x126/0x170
[  261.301003]  ? find_held_lock+0x39/0x1c0
[  261.306155]  ? stop_critical_timings+0x420/0x420
[  261.311004]  ? check_flags.part.36+0x450/0x450
[  261.315005]  ? _raw_spin_unlock_irq+0x29/0x40
[  261.320995]  ? _raw_spin_unlock_irq+0x29/0x40
[  261.326142]  ? cyc2ns_read_end+0x10/0x10
[  261.330139]  ? raw_bind+0x280/0x280
[  261.334138]  ? sched_clock_cpu+0x126/0x170
[  261.338995]  ? check_flags.part.36+0x450/0x450
[  261.342991]  ? __lock_acquire+0x4500/0x4500
[  261.348994]  ? inet_sendmsg+0x11c/0x500
[  261.352989]  ? dst_output+0x180/0x180
[  261.357012]  inet_sendmsg+0x11c/0x500
[ ... ]

v2:
 - clear skb->sk in the reassembly routine. (Eric Dumazet)

Fixes: fa0f527358bd ("ip: use rb trees for IP frag queue.")
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
---
 net/ipv4/ip_fragment.c                  | 1 +
 net/ipv6/netfilter/nf_conntrack_reasm.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 88281fbce88c..e7227128df2c 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -599,6 +599,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
 			nextp = &fp->next;
 			fp->prev = NULL;
 			memset(&fp->rbnode, 0, sizeof(fp->rbnode));
+			fp->sk = NULL;
 			head->data_len += fp->len;
 			head->len += fp->len;
 			if (head->ip_summed != fp->ip_summed)
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 2a14d8b65924..8f68a518d9db 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -445,6 +445,7 @@ nf_ct_frag6_reasm(struct frag_queue *fq, struct sk_buff *prev,  struct net_devic
 		else if (head->ip_summed == CHECKSUM_COMPLETE)
 			head->csum = csum_add(head->csum, fp->csum);
 		head->truesize += fp->truesize;
+		fp->sk = NULL;
 	}
 	sub_frag_mem_limit(fq->q.net, head->truesize);
 
-- 
2.17.1


* Re: Containers and checkpoint/restart micro-conference at LPC2018
From: Steve French @ 2018-09-09 17:38 UTC (permalink / raw)
  To: James Bottomley
  Cc: stgraber, containers, linux-security-module, linux-fsdevel,
	Network Development
In-Reply-To: <1536507039.3192.5.camel@HansenPartnership.com>

On Sun, Sep 9, 2018 at 10:32 AM James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
>
> On Mon, 2018-08-13 at 12:10 -0400, Stéphane Graber wrote:
> > Details can be found here:
> >
> >   https://discuss.linuxcontainers.org/t/containers-micro-conference-at-linux-plumbers-2018/2417
>
> This website was giving a 503 error when I looked.
>
> However, if you want a discussion on the requirements for shiftfs (and
> whether we still need it), I'm up for that.

I also put in a proposal for a fs miniconference, in case any file
system developers will be at Plumbers.


-- 
Thanks,

Steve


* Re: [PATCH] r8169: inform about CLKRUN protocol issue when behind a CardBus bridge
From: Maciej S. Szmigiero @ 2018-09-09 22:01 UTC (permalink / raw)
  To: David Miller; +Cc: nic_swsd, netdev, linux-kernel
In-Reply-To: <20180909.080907.992434701890231030.davem@davemloft.net>

On 09.09.2018 17:09, David Miller wrote:
> From: "Maciej S. Szmigiero" <mail@maciej.szmigiero.name>
> Date: Thu, 6 Sep 2018 18:10:53 +0200
> 
>> It turns out that at least some r8169 CardBus cards don't operate correctly
>> when CLKRUN protocol is enabled - the symptoms are recurring timeouts
>> during PHY reads / writes and a very high packet drop rate.
>> This is true of at least RTL8169sc/8110sc (XID 18000000) chip in
>> Sunrich C-160 CardBus NIC.
>>
>> Such behavior was observed on two separate laptops, the first one has
>> TI PCIxx12 CardBus bridge, while the second one has Ricoh RL5c476II.
>>
>> Setting CLKRUN_En bit in CONFIG 3 register via an EEPROM write didn't
>> improve things in either case (this is probably why it wasn't set by the
>> card manufacturer).
>> The only way to fix the issue was to disable the CLKRUN protocol either
>> in the CardBus bridge (only possible in the TI one) or in the southbridge.
>>
>> Since the problem takes some time to debug let's warn people that have
>> the suspect configuration (Conventional PCI r8169 NIC behind a CardBus
>> bridge) so they know what they can do if they encounter it.
>>
>> Signed-off-by: Maciej S. Szmigiero <mail@maciej.szmigiero.name>
> 
> I don't know about this.
> 
> Barking at the user in the kernel log about an obscure knob (which btw
> doesn't exist for all cardbus bridges without other patches you are
> posting elsewhere) is rarely effective.
> 
> We should just disable clkrun automatically when we know it causes problems.

Unfortunately, as you wrote above, this workaround is only available on
TI CardBus bridges (and I hope will be available for two Ricoh ones soon,
too), while for other CardBus bridges it is either not implemented or
not available at all.
So we can't reliably just turn it on automatically when needed.
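
For reference, the "suspect configuration" (a conventional PCI NIC
behind a CardBus bridge) can be recognised from the header type of the
upstream bridges. Below is a rough userspace sketch of that check, not
the code from the patch. It assumes the caller has already collected
the header type byte (config space offset 0x0e) of each upstream
bridge; the names is_cardbus_bridge() and nic_behind_cardbus() are
made up for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Header type register layout: bit 7 is the multi-function flag. */
#define HDR_TYPE_MASK		0x7f
#define HDR_TYPE_CARDBUS	0x02	/* PCI-to-CardBus bridge */

/* True if this header type byte describes a CardBus bridge. */
static bool is_cardbus_bridge(uint8_t header_type)
{
	return (header_type & HDR_TYPE_MASK) == HDR_TYPE_CARDBUS;
}

/*
 * Hypothetical helper: hdr_types holds the header type byte of every
 * bridge between the root and the NIC, in any order. Returns true if
 * at least one of them is a CardBus bridge.
 */
static bool nic_behind_cardbus(const uint8_t *hdr_types, size_t n)
{
	size_t i;

	assert(hdr_types != NULL || n == 0);

	for (i = 0; i < n; i++)
		if (is_cardbus_bridge(hdr_types[i]))
			return true;
	return false;
}
```

Whether the check should look at the direct parent only or at every
upstream bridge is a design choice; the sketch assumes the latter.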

BTW, it seems that my RTL8169 card isn't the only model affected.
In fact, the original CLKRUN protocol disabling workaround on TI bridges
was implemented in 2005 because somebody's RTL8139 also had this
problem: https://lkml.org/lkml/2005/2/5/129

The main reason I wanted to add this warning is to save people time
debugging this issue, as it is rather non-obvious.
But if this solution is unacceptable then I hope at least this
description will show up in search results when searching for some
related keywords.

Thanks,
Maciej


* Re: [PATCH net V2 0/7] bug fixes for ENA Ethernet driver
From: David Miller @ 2018-09-09 16:41 UTC (permalink / raw)
  To: netanel
  Cc: netdev, dwmw, zorik, matua, saeedb, msw, aliguori, nafea, evgenys,
	gtzalik
In-Reply-To: <20180909081526.83979-1-netanel@amazon.com>

From: <netanel@amazon.com>
Date: Sun, 9 Sep 2018 08:15:19 +0000

> From: Netanel Belgazal <netanel@amazon.com>
> 
> Netanel Belgazal (7):
>   net: ena: fix surprise unplug NULL dereference kernel crash
>   net: ena: fix driver when PAGE_SIZE == 64kB
>   net: ena: fix device destruction to gracefully free resources
>   net: ena: fix potential double ena_destroy_device()
>   net: ena: fix missing lock during device destruction
>   net: ena: fix missing calls to READ_ONCE
>   net: ena: fix incorrect usage of memory barriers

Series applied, thanks.


