Netdev List


* Re: [PATCH net] failover: eliminate callback hell
From: Michael S. Tsirkin @ 2018-06-06 21:30 UTC
  To: Stephen Hemminger
  Cc: Samudrala, Sridhar, kys, haiyangz, davem, netdev,
	Stephen Hemminger
In-Reply-To: <20180606141620.0b333dff@xeon-e3>

On Wed, Jun 06, 2018 at 02:16:20PM -0700, Stephen Hemminger wrote:
> On Tue, 5 Jun 2018 23:11:37 -0700
> "Samudrala, Sridhar" <sridhar.samudrala@intel.com> wrote:
> 
> > On 6/5/2018 11:00 PM, Stephen Hemminger wrote:
> > > On Tue, 5 Jun 2018 22:39:12 -0700
> > > "Samudrala, Sridhar" <sridhar.samudrala@intel.com> wrote:
> > >  
> > >> On 6/5/2018 8:51 PM, Stephen Hemminger wrote:  
> > >>> On Tue, 5 Jun 2018 16:52:22 -0700
> > >>> "Samudrala, Sridhar" <sridhar.samudrala@intel.com> wrote:
> > >>>     
> > >>>> On 6/5/2018 2:52 PM, Stephen Hemminger wrote:  
> > >>>>> On Tue, 5 Jun 2018 22:38:43 +0300
> > >>>>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > >>>>>        
> > >>>>>>> See:
> > >>>>>>>       https://patchwork.ozlabs.org/patch/851711/  
> > >>>>>> Let me try to summarize that:
> > >>>>>>
> > >>>>>> 	You wanted to speed up the delayed link up.  You had an idea to
> > >>>>>> 	additionally take link up when userspace renames the interface (standby
> > >>>>>> 	one which is also the failover for netvsc).
> > >>>>>>
> > >>>>>> 	But userspace might not do any renames, in which case there will
> > >>>>>> 	still be the delay, and so this never got applied.
> > >>>>>>
> > >>>>>> 	Is this a good summary?
> > >>>>>>
> > >>>>>> Davem said delay should go away completely as it's not robust, and I
> > >>>>>> think I agree.  So I don't think we should make all failover users use
> > >>>>>> delay. IIUC failover kept a delay option especially for netvsc to
> > >>>>>> minimize the surprise factor. Hopefully we can come up with
> > >>>>>> something more robust and drop that option completely.  
> > >>>>> The timeout was the original solution to how to complete setup after
> > >>>>> userspace has had a chance to rename the device. Unfortunately, the whole network
> > >>>>> device initialization (cooperation with udev and userspace) is a mess because
> > >>>>> there is no well defined specification, and there are multiple ways userspace
> > >>>>> does this in old and new distributions.  The timeout has its own issues
> > >>>>> (how long, handling errors during that window, what if userspace modifies other
> > >>>>> device state); and open to finding a better solution.
> > >>>>>
> > >>>>> My point was that if name change can not be relied on (or used) by netvsc,
> > >>>>> then we can't allow it for net_failover either.  
> > >>>> I think the push back was with the usage of the delay, not bringing up the primary/standby
> > >>>> device in the name change event handler.
> > >>>> Can't netvsc use this mechanism instead of depending on the delay?
> > >>>>
> > >>>>     
> > >>> The patch that was rejected for netvsc was about using name change.
> > >>> Also, you can't depend on name change; you still need a timer. Not all distributions
> > >>> change name of devices. Or user has blocked that by udev rules.  
> > >> In the net_failover_slave_register() we do a dev_open() and ignore any failure due to
> > >> EBUSY and do another dev_open() in the name change event handler.
> > >> If the name is not expected to change, i would think the dev_open() at the time of
> > >> register will succeed.  
> > > The problem is your first dev_open will bring device up and lockout
> > > udev from changing the network device name.  
> > 
> > I have tried with/without udev and didn't see any issue with the naming of the primary/standby
> > devices in my testing.
> > 
> > With the 3-netdev failover model, we are only interested in setting the right name for the failover
> > netdev and that is the reason we do SET_NETDEV_DEV on that netdev. Does it really matter if udev fails
> > to rename the lower primary/standby netdevs, i don't think it will matter? The user is not expected
> > to touch the lower netdevs.
> 
> Renaming matters to some users and the udev maintainers. They want the VF to be named enpXXX
> The primary in virtio case needs to be ens3 with some cloud platforms.

Confused. VF can't be a standby, if it's used in a failover it's a
primary, you can't call it both enpXXX and ens3. Could you describe the
use case in a bit more detail?

> 
> I think you need to get rid of triggering off of the name change.

Worth considering down the road since it's a bit of a hack but it's one
we won't have trouble supporting, unlike the delayed uplink.

> Long term, let's open the discussion about allowing network devices to change name when up?
> Maybe with NETIF_LIVENAME_CHANGE flag?

That's probably the cleanest approach assuming it can be made to work
without races. I suspect we can just live with what we have until then.


-- 
MST
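
For readers following the rename discussion, here is a minimal userspace
sketch of the ordering problem Stephen describes above. It assumes that a
rename request on an interface that is already up is refused with -EBUSY,
which is what "lockout udev from changing the network device name" refers
to. The toy_* names are hypothetical and are not kernel APIs; this is a
model of the behaviour under that assumption, not the net_failover code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define TOY_IFNAMSIZ 16

/* Toy stand-in for a network device; not the kernel's struct net_device. */
struct toy_netdev {
	char name[TOY_IFNAMSIZ];
	bool up;
};

/* Bring the device up, as would happen at slave registration time. */
static int toy_dev_open(struct toy_netdev *dev)
{
	dev->up = true;
	return 0;
}

/* Assumption: a rename is refused with -EBUSY while the device is up. */
static int toy_dev_change_name(struct toy_netdev *dev, const char *newname)
{
	assert(strlen(newname) < TOY_IFNAMSIZ);
	if (dev->up)
		return -EBUSY;
	snprintf(dev->name, sizeof(dev->name), "%s", newname);
	return 0;
}
```

If toy_dev_open() runs before udev gets to call toy_dev_change_name(), the
rename fails and the old name stays. If the rename comes first, it succeeds.
That race is why the thread keeps coming back to either a delay, a
name-change trigger, or allowing renames while up.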


* [PATCH v2] hv_netvsc: Fix a network regression after ifdown/ifup
From: Dexuan Cui @ 2018-06-06 21:32 UTC
  To: KY Srinivasan, Haiyang Zhang, Stephen Hemminger,
	'David S. Miller', 'netdev@vger.kernel.org'
  Cc: 'devel@linuxdriverproject.org',
	'linux-kernel@vger.kernel.org', 'olaf@aepfle.de',
	'apw@canonical.com', 'jasowang@redhat.com',
	'vkuznets@redhat.com',
	'marcelo.cerri@canonical.com', Josh Poulson,
	Stephen Zarkos


Recently people reported that the NIC stops working after
"ifdown eth0; ifup eth0". It turns out that in this case the TX queues are
not enabled after the refactoring of the common detach logic. When the NIC
has sub-channels, we usually enable all the TX queues after all
sub-channels are set up: see rndis_set_subchannel() ->
netif_device_attach(). But in the case of "ifdown eth0; ifup eth0", where
the number of channels doesn't change, we must also make sure the TX queues
are enabled. The patch fixes the regression.

Fixes: 7b2ee50c0cd5 ("hv_netvsc: common detach logic")
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Cc: stable@vger.kernel.org
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
---

Changes since v1:
Removed the check and the comment from the code [Stephen Hemminger]

 drivers/net/hyperv/netvsc_drv.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index da07ccd..eb8dccd 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -126,8 +126,10 @@ static int netvsc_open(struct net_device *net)
 	}
 
 	rdev = nvdev->extension;
-	if (!rdev->link_state)
+	if (!rdev->link_state) {
 		netif_carrier_on(net);
+		netif_tx_wake_all_queues(net);
+	}
 
 	if (vf_netdev) {
 		/* Setting synthetic device up transparently sets
-- 
2.7.4
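
A minimal userspace sketch of the behaviour change in the patch above,
assuming that an earlier "ifdown" left every TX queue stopped. The toy_*
names and the struct layout are hypothetical; only the shape of the
"if (!link_state)" branch is taken from the diff.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_QUEUES 8

/* Toy stand-in for the NIC state touched by the open path. */
struct toy_nic {
	int link_state;		/* same test as the patch: act only when zero */
	bool carrier;
	unsigned int num_queues;
	bool queue_stopped[TOY_MAX_QUEUES];
};

/* Model of the detach step: carrier off, every TX queue stopped. */
static void toy_detach(struct toy_nic *nic)
{
	unsigned int i;

	assert(nic->num_queues <= TOY_MAX_QUEUES);
	nic->carrier = false;
	for (i = 0; i < nic->num_queues; i++)
		nic->queue_stopped[i] = true;
}

/* Count the TX queues that are still stopped. */
static unsigned int toy_stopped_queues(const struct toy_nic *nic)
{
	unsigned int i, n = 0;

	for (i = 0; i < nic->num_queues; i++)
		if (nic->queue_stopped[i])
			n++;
	return n;
}

/*
 * Model of the open path. wake_queues == false is the pre-patch branch
 * (carrier only); wake_queues == true is the patched branch.
 */
static void toy_open(struct toy_nic *nic, bool wake_queues)
{
	unsigned int i;

	assert(nic->num_queues <= TOY_MAX_QUEUES);
	if (!nic->link_state) {
		nic->carrier = true;
		if (wake_queues)
			for (i = 0; i < nic->num_queues; i++)
				nic->queue_stopped[i] = false;
	}
}
```

With wake_queues set to false the carrier comes back but the queues stay
stopped, which matches the reported symptom. With it set to true all queues
run again after open.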


* Re: [PATCH net] failover: eliminate callback hell
From: Michael S. Tsirkin @ 2018-06-06 21:47 UTC
  To: Stephen Hemminger
  Cc: Jiri Pirko, kys, haiyangz, davem, sridhar.samudrala, netdev,
	Stephen Hemminger
In-Reply-To: <20180606142447.3c5072d8@xeon-e3>

On Wed, Jun 06, 2018 at 02:24:47PM -0700, Stephen Hemminger wrote:
> On Wed, 6 Jun 2018 15:30:27 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:
> > > Tue, Jun 05, 2018 at 05:42:31AM CEST, stephen@networkplumber.org wrote:  
> > > >The net failover should be a simple library, not a virtual
> > > >object with function callbacks (see callback hell).  
> > > 
> > > Why just a library? It should do common things. I think it should be a
> > > virtual object. Looks like your patch again splits the common
> > > functionality into multiple drivers. That is kind of backwards attitude.
> > > I don't get it. We should rather focus on fixing the mess the
> > > introduction of netvsc-bonding caused and switch netvsc to 3-netdev
> > > model.  
> > 
> > So it seems that at least one benefit for netvsc would be better
> > handling of renames.
> > 
> > Question is how can this change to 3-netdev happen?  Stephen is
> > concerned about risk of breaking some userspace.
> > 
> > Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
> > address, and you said then "why not use existing network namespaces
> > rather than inventing a new abstraction". So how about it then? Do you
> > want to find a way to use namespaces to hide the PV device for netvsc
> > compatibility?
> > 
> 
> Netvsc can't work with 3 dev model. MS has worked with enough distro's and
> startups that all demand eth0 always be present. And VF may come and go.

Well failover seems to maintain this invariant with the 3 dev model.

> After this history, there is a strong motivation not to change how kernel
> behaves. Switching to 3 device model would be perceived as breaking
> existing userspace.

I feel I'm misunderstood. I was asking whether a third device can be
hidden so that userspace does not know that you switched to a 3 device
model. It will think there are 2 devices and will keep working.

If you do that, then there won't be anything that
would be perceived as breaking existing userspace, will there?


> With virtio you can  work it out with the distro's yourself.
> There is no pre-existing semantics to deal with.
> 
> For the virtio, I don't see the need for IFF_HIDDEN.
> With 3-dev model as long as you mark the PV and VF devices
> as slaves, then userspace knows to leave them alone. Assuming userspace
> is already able to deal with team and bond devices.

That's clear enough.

> Any time you introduce new UAPI behavior something breaks.

Not if we do it right.

> On the rename front, I really don't care if VF can be renamed.

OK that's nice.

> And for
> netvsc want to allow the PV device to be renamed.

That's because of the 2 device model, right?  So that explains why even
if the delayed hack is good for the goose it might not be good for the
gander :)

> Udev developers want that
> but have not found a stable/persistent value to expose to userspace
> to allow it.


* Re: [PATCH net] failover: eliminate callback hell
From: Samudrala, Sridhar @ 2018-06-06 21:54 UTC
  To: Stephen Hemminger, Michael S. Tsirkin
  Cc: Jiri Pirko, kys, haiyangz, davem, netdev, Stephen Hemminger
In-Reply-To: <20180606142447.3c5072d8@xeon-e3>



On 6/6/2018 2:24 PM, Stephen Hemminger wrote:
> On Wed, 6 Jun 2018 15:30:27 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>
>> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:
>>> Tue, Jun 05, 2018 at 05:42:31AM CEST, stephen@networkplumber.org wrote:
>>>> The net failover should be a simple library, not a virtual
>>>> object with function callbacks (see callback hell).
>>> Why just a library? It should do common things. I think it should be a
>>> virtual object. Looks like your patch again splits the common
>>> functionality into multiple drivers. That is kind of backwards attitude.
>>> I don't get it. We should rather focus on fixing the mess the
>>> introduction of netvsc-bonding caused and switch netvsc to 3-netdev
>>> model.
>> So it seems that at least one benefit for netvsc would be better
>> handling of renames.
>>
>> Question is how can this change to 3-netdev happen?  Stephen is
>> concerned about risk of breaking some userspace.
>>
>> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
>> address, and you said then "why not use existing network namespaces
>> rather than inventing a new abstraction". So how about it then? Do you
>> want to find a way to use namespaces to hide the PV device for netvsc
>> compatibility?
>>
> Netvsc can't work with 3 dev model. MS has worked with enough distro's and
> startups that all demand eth0 always be present. And VF may come and go.
> After this history, there is a strong motivation not to change how kernel
> behaves. Switching to 3 device model would be perceived as breaking
> existing userspace.

I think it should be possible for netvsc to work with 3 dev model if the only
requirement is that eth0 will always be present. With net_failover, you will
see eth0 and eth0nsby OR with older distros eth0 and eth1.  It may be an issue
if somehow there is userspace requirement that there can be only 2 netdevs, not 3
when VF is plugged.

eth0 will be the net_failover device and eth0nsby/eth1 will be the netvsc device
and the IP address gets configured on eth0. Will this be an issue?


>
> With virtio you can  work it out with the distro's yourself.
> There is no pre-existing semantics to deal with.
>
> For the virtio, I don't see the need for IFF_HIDDEN.
> With 3-dev model as long as you mark the PV and VF devices
> as slaves, then userspace knows to leave them alone. Assuming userspace
> is already able to deal with team and bond devices.
> Any time you introduce new UAPI behavior something breaks.
>
> On the rename front, I really don't care if VF can be renamed. And for
> netvsc want to allow the PV device to be renamed. Udev developers want that
> but have not found a stable/persistent value to expose to userspace
> to allow it.


* Re: [PATCH net] failover: eliminate callback hell
From: Stephen Hemminger @ 2018-06-06 22:21 UTC
  To: Michael S. Tsirkin
  Cc: Samudrala, Sridhar, kys, haiyangz, davem, netdev,
	Stephen Hemminger
In-Reply-To: <20180607002047-mutt-send-email-mst@kernel.org>

On Thu, 7 Jun 2018 00:30:21 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Wed, Jun 06, 2018 at 02:16:20PM -0700, Stephen Hemminger wrote:
> > On Tue, 5 Jun 2018 23:11:37 -0700
> > "Samudrala, Sridhar" <sridhar.samudrala@intel.com> wrote:
> >   
> > > On 6/5/2018 11:00 PM, Stephen Hemminger wrote:  
> > > > On Tue, 5 Jun 2018 22:39:12 -0700
> > > > "Samudrala, Sridhar" <sridhar.samudrala@intel.com> wrote:
> > > >    
> > > >> On 6/5/2018 8:51 PM, Stephen Hemminger wrote:    
> > > >>> On Tue, 5 Jun 2018 16:52:22 -0700
> > > >>> "Samudrala, Sridhar" <sridhar.samudrala@intel.com> wrote:
> > > >>>       
> > > >>>> On 6/5/2018 2:52 PM, Stephen Hemminger wrote:    
> > > >>>>> On Tue, 5 Jun 2018 22:38:43 +0300
> > > >>>>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > >>>>>          
> > > >>>>>>> See:
> > > >>>>>>>       https://patchwork.ozlabs.org/patch/851711/    
> > > >>>>>> Let me try to summarize that:
> > > >>>>>>
> > > >>>>>> 	You wanted to speed up the delayed link up.  You had an idea to
> > > >>>>>> 	additionally take link up when userspace renames the interface (standby
> > > >>>>>> 	one which is also the failover for netvsc).
> > > >>>>>>
> > > >>>>>> 	But userspace might not do any renames, in which case there will
> > > >>>>>> 	still be the delay, and so this never got applied.
> > > >>>>>>
> > > >>>>>> 	Is this a good summary?
> > > >>>>>>
> > > >>>>>> Davem said delay should go away completely as it's not robust, and I
> > > >>>>>> think I agree.  So I don't think we should make all failover users use
> > > >>>>>> delay. IIUC failover kept a delay option especially for netvsc to
> > > >>>>>> minimize the surprise factor. Hopefully we can come up with
> > > >>>>>> something more robust and drop that option completely.    
> > > >>>>> The timeout was the original solution to how to complete setup after
> > > >>>>> userspace has had a chance to rename the device. Unfortunately, the whole network
> > > >>>>> device initialization (cooperation with udev and userspace) is a mess because
> > > >>>>> there is no well defined specification, and there are multiple ways userspace
> > > >>>>> does this in old and new distributions.  The timeout has its own issues
> > > >>>>> (how long, handling errors during that window, what if userspace modifies other
> > > >>>>> device state); and open to finding a better solution.
> > > >>>>>
> > > >>>>> My point was that if name change can not be relied on (or used) by netvsc,
> > > >>>>> then we can't allow it for net_failover either.    
> > > >>>> I think the push back was with the usage of the delay, not bringing up the primary/standby
> > > >>>> device in the name change event handler.
> > > >>>> Can't netvsc use this mechanism instead of depending on the delay?
> > > >>>>
> > > >>>>       
> > > >>> The patch that was rejected for netvsc was about using name change.
> > > >>> Also, you can't depend on name change; you still need a timer. Not all distributions
> > > >>> change name of devices. Or user has blocked that by udev rules.    
> > > >> In the net_failover_slave_register() we do a dev_open() and ignore any failure due to
> > > >> EBUSY and do another dev_open() in the name change event handler.
> > > >> If the name is not expected to change, i would think the dev_open() at the time of
> > > >> register will succeed.    
> > > > The problem is your first dev_open will bring device up and lockout
> > > > udev from changing the network device name.    
> > > 
> > > I have tried with/without udev and didn't see any issue with the naming of the primary/standby
> > > devices in my testing.
> > > 
> > > With the 3-netdev failover model, we are only interested in setting the right name for the failover
> > > netdev and that is the reason we do SET_NETDEV_DEV on that netdev. Does it really matter if udev fails
> > > to rename the lower primary/standby netdevs, i don't think it will matter? The user is not expected
> > > to touch the lower netdevs.  
> > 
> > Renaming matters to some users and the udev maintainers. They want the VF to be named enpXXX
> > The primary in virtio case needs to be ens3 with some cloud platforms.  
> 
> Confused. VF can't be a standby, if it's used in a failover it's a
> primary, you can't call it both enpXXX and ens3. Could you describe the
> use case in a bit more detail?

Sorry, got things backwards.  The primary is VF and it should be possible
to have it renamed based on PCI info.
The standby PV (in 3 dev model) would get renamed by udev to ens3.
So the failover device would just be ethX right?

> > 
> > I think you need to get rid of triggering off of the name change.  
> 
> Worth considering down the road since it's a bit of a hack but it's one
> we won't have trouble supporting, unlike the delayed uplink.

You can't depend on userspace doing rename.

> 
> > Long term, let's open the discussion about allowing network devices to change name when up?
> > Maybe with NETIF_LIVENAME_CHANGE flag?  
> 
> That's probably the cleanest approach assuming it can be made to work
> without races. I suspect we can just live with what we have until then.
> 
> 


* Re: [PATCH net] failover: eliminate callback hell
From: Stephen Hemminger @ 2018-06-06 22:24 UTC
  To: Michael S. Tsirkin
  Cc: Jiri Pirko, kys, haiyangz, davem, sridhar.samudrala, netdev,
	Stephen Hemminger
In-Reply-To: <20180607003407-mutt-send-email-mst@kernel.org>

On Thu, 7 Jun 2018 00:47:52 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Wed, Jun 06, 2018 at 02:24:47PM -0700, Stephen Hemminger wrote:
> > On Wed, 6 Jun 2018 15:30:27 +0300
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >   
> > > On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:  
> > > > Tue, Jun 05, 2018 at 05:42:31AM CEST, stephen@networkplumber.org wrote:    
> > > > >The net failover should be a simple library, not a virtual
> > > > >object with function callbacks (see callback hell).    
> > > > 
> > > > Why just a library? It should do common things. I think it should be a
> > > > virtual object. Looks like your patch again splits the common
> > > > functionality into multiple drivers. That is kind of backwards attitude.
> > > > I don't get it. We should rather focus on fixing the mess the
> > > > introduction of netvsc-bonding caused and switch netvsc to 3-netdev
> > > > model.    
> > > 
> > > So it seems that at least one benefit for netvsc would be better
> > > handling of renames.
> > > 
> > > Question is how can this change to 3-netdev happen?  Stephen is
> > > concerned about risk of breaking some userspace.
> > > 
> > > Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
> > > address, and you said then "why not use existing network namespaces
> > > rather than inventing a new abstraction". So how about it then? Do you
> > > want to find a way to use namespaces to hide the PV device for netvsc
> > > compatibility?
> > >   
> > 
> > Netvsc can't work with 3 dev model. MS has worked with enough distro's and
> > startups that all demand eth0 always be present. And VF may come and go.  
> 
> Well failover seems to maintain this invariant with the 3 dev model.
> 
> > After this history, there is a strong motivation not to change how kernel
> > behaves. Switching to 3 device model would be perceived as breaking
> > existing userspace.  
> 
> I feel I'm misunderstood. I was asking whether a third device can be
> hidden so that userspace does not know that you switched to a 3 device
> model. It will think there are 2 devices and will keep working.
> 
> If you do that, then there won't be anything that
> would be perceived as breaking existing userspace, will there?

DPDK now knows about the netvsc 2 device model and drivers in userspace
depend on it.

> 
> 
> > With virtio you can  work it out with the distro's yourself.
> > There is no pre-existing semantics to deal with.
> > 
> > For the virtio, I don't see the need for IFF_HIDDEN.
> > With 3-dev model as long as you mark the PV and VF devices
> > as slaves, then userspace knows to leave them alone. Assuming userspace
> > is already able to deal with team and bond devices.  
> 
> That's clear enough.
> 
> > Any time you introduce new UAPI behavior something breaks.  
> 
> Not if we do it right.
> 
> > On the rename front, I really don't care if VF can be renamed.  
> 
> OK that's nice.
> 
> > And for
> > netvsc want to allow the PV device to be renamed.  
> 
> That's because of the 2 device model, right?  So that explains why even
> if the delayed hack is good for the goose it might not be good for the
> gander :)

You are bringing up the VF right away. How does the 3-device initialization
state machine work?  Do you give a window for udev to possibly rename the
VF? Do you rely on that?

> 
> > Udev developers want that
> > but have not found a stable/persistent value to expose to userspace
> > to allow it.  


* Re: [PATCH net] failover: eliminate callback hell
From: Stephen Hemminger @ 2018-06-06 22:25 UTC
  To: Samudrala, Sridhar
  Cc: Michael S. Tsirkin, Jiri Pirko, kys, haiyangz, davem, netdev,
	Stephen Hemminger
In-Reply-To: <e20a6cdf-34b4-cbc9-1dc9-75c436d6c2fe@intel.com>

On Wed, 6 Jun 2018 14:54:04 -0700
"Samudrala, Sridhar" <sridhar.samudrala@intel.com> wrote:

> On 6/6/2018 2:24 PM, Stephen Hemminger wrote:
> > On Wed, 6 Jun 2018 15:30:27 +0300
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >  
> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:  
> >>> Tue, Jun 05, 2018 at 05:42:31AM CEST, stephen@networkplumber.org wrote:  
> >>>> The net failover should be a simple library, not a virtual
> >>>> object with function callbacks (see callback hell).  
> >>> Why just a library? It should do common things. I think it should be a
> >>> virtual object. Looks like your patch again splits the common
> >>> functionality into multiple drivers. That is kind of backwards attitude.
> >>> I don't get it. We should rather focus on fixing the mess the
> >>> introduction of netvsc-bonding caused and switch netvsc to 3-netdev
> >>> model.  
> >> So it seems that at least one benefit for netvsc would be better
> >> handling of renames.
> >>
> >> Question is how can this change to 3-netdev happen?  Stephen is
> >> concerned about risk of breaking some userspace.
> >>
> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
> >> address, and you said then "why not use existing network namespaces
> >> rather than inventing a new abstraction". So how about it then? Do you
> >> want to find a way to use namespaces to hide the PV device for netvsc
> >> compatibility?
> >>  
> > Netvsc can't work with 3 dev model. MS has worked with enough distro's and
> > startups that all demand eth0 always be present. And VF may come and go.
> > After this history, there is a strong motivation not to change how kernel
> > behaves. Switching to 3 device model would be perceived as breaking
> > existing userspace.  
> 
> I think it should be possible for netvsc to work with 3 dev model if the only
> requirement is that eth0 will always be present. With net_failover, you will
> see eth0 and eth0nsby OR with older distros eth0 and eth1.  It may be an issue
> if somehow there is userspace requirement that there can be only 2 netdevs, not 3
> when VF is plugged.
> 
> eth0 will be the net_failover device and eth0nsby/eth1 will be the netvsc device
> and the IP address gets configured on eth0. Will this be an issue?

DPDK drivers in 18.05 depend on the 2 device model. Yes, it is a bit of a
mess, but that is the way it is.


* [PATCH] hv_netvsc: Add per-cpu ethtool stats for netvsc
From: Yidong Ren @ 2018-06-06 22:27 UTC
  To: kys, haiyangz, sthemmin, davem; +Cc: devel, linux-kernel, netdev

From: Yidong Ren <yidren@microsoft.com>

This patch implements the following ethtool stats fields for netvsc:
cpu<n>_tx/rx_packets/bytes
cpu<n>_vf_tx/rx_packets/bytes

The corresponding per-cpu counters already exist in the current code.
Exposing these counters will help troubleshoot performance issues.

Signed-off-by: Yidong Ren <yidren@microsoft.com>
---
 drivers/net/hyperv/hyperv_net.h |  11 ++++
 drivers/net/hyperv/netvsc_drv.c | 104 +++++++++++++++++++++++++++++++-
 2 files changed, 113 insertions(+), 2 deletions(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index 960f06141472..f8c798bf9418 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -710,6 +710,17 @@ struct netvsc_ethtool_stats {
 	unsigned long wake_queue;
 };
 
+struct netvsc_ethtool_pcpu_stats {
+	u64     rx_packets;
+	u64     rx_bytes;
+	u64     tx_packets;
+	u64     tx_bytes;
+	u64     vf_rx_packets;
+	u64     vf_rx_bytes;
+	u64     vf_tx_packets;
+	u64     vf_tx_bytes;
+};
+
 struct netvsc_vf_pcpu_stats {
 	u64     rx_packets;
 	u64     rx_bytes;
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index da07ccdf84bf..c43e64606c1a 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -1104,6 +1104,66 @@ static void netvsc_get_vf_stats(struct net_device *net,
 	}
 }
 
+static void netvsc_get_pcpu_stats(struct net_device *net,
+				  struct netvsc_ethtool_pcpu_stats
+					__percpu *pcpu_tot)
+{
+	struct net_device_context *ndev_ctx = netdev_priv(net);
+	struct netvsc_device *nvdev = rcu_dereference_rtnl(ndev_ctx->nvdev);
+	int i;
+
+	// fetch percpu stats of vf
+	for_each_possible_cpu(i) {
+		const struct netvsc_vf_pcpu_stats *stats =
+			per_cpu_ptr(ndev_ctx->vf_stats, i);
+		struct netvsc_ethtool_pcpu_stats *this_tot =
+			per_cpu_ptr(pcpu_tot, i);
+		unsigned int start;
+
+		do {
+			start = u64_stats_fetch_begin_irq(&stats->syncp);
+			this_tot->vf_rx_packets = stats->rx_packets;
+			this_tot->vf_tx_packets = stats->tx_packets;
+			this_tot->vf_rx_bytes = stats->rx_bytes;
+			this_tot->vf_tx_bytes = stats->tx_bytes;
+		} while (u64_stats_fetch_retry_irq(&stats->syncp, start));
+		this_tot->rx_packets = this_tot->vf_rx_packets;
+		this_tot->tx_packets = this_tot->vf_tx_packets;
+		this_tot->rx_bytes   = this_tot->vf_rx_bytes;
+		this_tot->tx_bytes   = this_tot->vf_tx_bytes;
+	}
+
+	// fetch percpu stats of netvsc
+	for (i = 0; i < nvdev->num_chn; i++) {
+		const struct netvsc_channel *nvchan = &nvdev->chan_table[i];
+		const struct netvsc_stats *stats;
+		struct netvsc_ethtool_pcpu_stats *this_tot =
+			per_cpu_ptr(pcpu_tot, nvchan->channel->target_cpu);
+		u64 packets, bytes;
+		unsigned int start;
+
+		stats = &nvchan->tx_stats;
+		do {
+			start = u64_stats_fetch_begin_irq(&stats->syncp);
+			packets = stats->packets;
+			bytes = stats->bytes;
+		} while (u64_stats_fetch_retry_irq(&stats->syncp, start));
+
+		this_tot->tx_bytes	+= bytes;
+		this_tot->tx_packets	+= packets;
+
+		stats = &nvchan->rx_stats;
+		do {
+			start = u64_stats_fetch_begin_irq(&stats->syncp);
+			packets = stats->packets;
+			bytes = stats->bytes;
+		} while (u64_stats_fetch_retry_irq(&stats->syncp, start));
+
+		this_tot->rx_bytes	+= bytes;
+		this_tot->rx_packets	+= packets;
+	}
+}
+
 static void netvsc_get_stats64(struct net_device *net,
 			       struct rtnl_link_stats64 *t)
 {
@@ -1201,6 +1261,23 @@ static const struct {
 	{ "rx_no_memory", offsetof(struct netvsc_ethtool_stats, rx_no_memory) },
 	{ "stop_queue", offsetof(struct netvsc_ethtool_stats, stop_queue) },
 	{ "wake_queue", offsetof(struct netvsc_ethtool_stats, wake_queue) },
+}, pcpu_stats[] = {
+	{ "cpu%u_rx_packets",
+		offsetof(struct netvsc_ethtool_pcpu_stats, rx_packets) },
+	{ "cpu%u_rx_bytes",
+		offsetof(struct netvsc_ethtool_pcpu_stats, rx_bytes) },
+	{ "cpu%u_tx_packets",
+		offsetof(struct netvsc_ethtool_pcpu_stats, tx_packets) },
+	{ "cpu%u_tx_bytes",
+		offsetof(struct netvsc_ethtool_pcpu_stats, tx_bytes) },
+	{ "cpu%u_vf_rx_packets",
+		offsetof(struct netvsc_ethtool_pcpu_stats, vf_rx_packets) },
+	{ "cpu%u_vf_rx_bytes",
+		offsetof(struct netvsc_ethtool_pcpu_stats, vf_rx_bytes) },
+	{ "cpu%u_vf_tx_packets",
+		offsetof(struct netvsc_ethtool_pcpu_stats, vf_tx_packets) },
+	{ "cpu%u_vf_tx_bytes",
+		offsetof(struct netvsc_ethtool_pcpu_stats, vf_tx_bytes) },
 }, vf_stats[] = {
 	{ "vf_rx_packets", offsetof(struct netvsc_vf_pcpu_stats, rx_packets) },
 	{ "vf_rx_bytes",   offsetof(struct netvsc_vf_pcpu_stats, rx_bytes) },
@@ -1212,6 +1289,9 @@ static const struct {
 #define NETVSC_GLOBAL_STATS_LEN	ARRAY_SIZE(netvsc_stats)
 #define NETVSC_VF_STATS_LEN	ARRAY_SIZE(vf_stats)
 
+/* per-CPU statistics (rx/tx packets/bytes, total and VF) */
+#define NETVSC_PCPU_STATS_LEN (num_present_cpus() * ARRAY_SIZE(pcpu_stats))
+
 /* 4 statistics per queue (rx/tx packets/bytes) */
 #define NETVSC_QUEUE_STATS_LEN(dev) ((dev)->num_chn * 4)
 
@@ -1227,6 +1307,7 @@ static int netvsc_get_sset_count(struct net_device *dev, int string_set)
 	case ETH_SS_STATS:
 		return NETVSC_GLOBAL_STATS_LEN
 			+ NETVSC_VF_STATS_LEN
+			+ NETVSC_PCPU_STATS_LEN
 			+ NETVSC_QUEUE_STATS_LEN(nvdev);
 	default:
 		return -EINVAL;
@@ -1241,9 +1322,10 @@ static void netvsc_get_ethtool_stats(struct net_device *dev,
 	const void *nds = &ndc->eth_stats;
 	const struct netvsc_stats *qstats;
 	struct netvsc_vf_pcpu_stats sum;
+	struct netvsc_ethtool_pcpu_stats __percpu *pcpu_sum;
 	unsigned int start;
 	u64 packets, bytes;
-	int i, j;
+	int i, j, cpu;
 
 	if (!nvdev)
 		return;
@@ -1255,6 +1337,17 @@ static void netvsc_get_ethtool_stats(struct net_device *dev,
 	for (j = 0; j < NETVSC_VF_STATS_LEN; j++)
 		data[i++] = *(u64 *)((void *)&sum + vf_stats[j].offset);
 
+	pcpu_sum = alloc_percpu(struct netvsc_ethtool_pcpu_stats);
+	netvsc_get_pcpu_stats(dev, pcpu_sum);
+	for_each_present_cpu(cpu) {
+		struct netvsc_ethtool_pcpu_stats *this_sum =
+			per_cpu_ptr(pcpu_sum, cpu);
+		for (j = 0; j < ARRAY_SIZE(pcpu_stats); j++)
+			data[i++] = *(u64 *)((void *)this_sum
+					     + pcpu_stats[j].offset);
+	}
+	free_percpu(pcpu_sum);
+
 	for (j = 0; j < nvdev->num_chn; j++) {
 		qstats = &nvdev->chan_table[j].tx_stats;
 
@@ -1282,7 +1375,7 @@ static void netvsc_get_strings(struct net_device *dev, u32 stringset, u8 *data)
 	struct net_device_context *ndc = netdev_priv(dev);
 	struct netvsc_device *nvdev = rtnl_dereference(ndc->nvdev);
 	u8 *p = data;
-	int i;
+	int i, cpu;
 
 	if (!nvdev)
 		return;
@@ -1299,6 +1392,13 @@ static void netvsc_get_strings(struct net_device *dev, u32 stringset, u8 *data)
 			p += ETH_GSTRING_LEN;
 		}
 
+		for_each_present_cpu(cpu) {
+			for (i = 0; i < ARRAY_SIZE(pcpu_stats); i++) {
+				sprintf(p, pcpu_stats[i].name, cpu);
+				p += ETH_GSTRING_LEN;
+			}
+		}
+
 		for (i = 0; i < nvdev->num_chn; i++) {
 			sprintf(p, "tx_queue_%u_packets", i);
 			p += ETH_GSTRING_LEN;
-- 
2.17.1

^ permalink raw reply related

* Re: [PATCH net] failover: eliminate callback hell
From: Stephen Hemminger @ 2018-06-06 21:17 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Samudrala, Sridhar, kys, haiyangz, davem, netdev,
	Stephen Hemminger
In-Reply-To: <20180606151019-mutt-send-email-mst@kernel.org>

On Wed, 6 Jun 2018 15:19:30 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Tue, Jun 05, 2018 at 08:51:18PM -0700, Stephen Hemminger wrote:
> > > I think the push back was with the usage of the delay, not bringing up the primary/standby
> > > device in the name change event handler.
> > > Can't netvsc use this mechanism instead of depending on the delay?
> > > 
> > >   
> > 
> > The patch that was rejected for netvsc was about using name change.  
> 
> So failover is now doing exactly what you wanted netvsc to do.  Rather
> than reverting everyone to old behaviour how about using more pieces
> from failover?
> 
> > Also, you can't depend on name change; you still need a timer. Not all distributions
> > change name of devices.  
> 
> So failover chose not to implement the delayed open so far.
> If it does I suspect we'll have to keep it around forever -
> kind of like netvsc seems to be stuck with it.
> But let's see if there's enough actual demand from people running
> ancient distros with latest kernels to even start looking for
> a solution for failover.
> 
> And this kind of behaviour change really should be split out
> so we can discuss it separately.
> 
> > Or user has blocked that by udev rules.  
> 
> Don't do that then?
> 

If you don't want to allow udev to rename the device, then just pull
the name change hook.

^ permalink raw reply

* Re: [net-next PATCH v3 3/5] net: Enable Tx queue selection based on Rx queues
From: Nambiar, Amritha @ 2018-06-06 22:52 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Linux Kernel Network Developers, David S. Miller, Alexander Duyck,
	Willem de Bruijn, Sridhar Samudrala, Eric Dumazet,
	Hannes Frederic Sowa
In-Reply-To: <CALx6S35273zLb-WDs-5r+b_eWoB8jmZfUJvYRDHWNo4fjJLyag@mail.gmail.com>

On 6/5/2018 10:57 AM, Tom Herbert wrote:
> 
> 
> On Tue, Jun 5, 2018 at 1:38 AM, Amritha Nambiar
> <amritha.nambiar@intel.com> wrote:
> 
>     This patch adds support to pick Tx queue based on the Rx queue(s) map
>     configuration set by the admin through the sysfs attribute
>     for each Tx queue. If the user configuration for receive queue(s) map
>     does not apply, then the Tx queue selection falls back to CPU(s) map
>     based selection and finally to hashing.
> 
>     Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
>     Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>     ---
>      include/net/busy_poll.h |    3 ++
>      include/net/sock.h      |   14 +++++++++++
>      net/core/dev.c          |   60
>     ++++++++++++++++++++++++++++++++---------------
>      net/core/sock.c         |    4 +++
>      net/ipv4/tcp_input.c    |    3 ++
>      5 files changed, 65 insertions(+), 19 deletions(-)
> 
>     diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
>     index 71c72a9..fc4fb68 100644
>     --- a/include/net/busy_poll.h
>     +++ b/include/net/busy_poll.h
>     @@ -136,6 +136,9 @@ static inline void sk_mark_napi_id(struct sock
>     *sk, const struct sk_buff *skb)
>      #ifdef CONFIG_NET_RX_BUSY_POLL
>             sk->sk_napi_id = skb->napi_id;
>      #endif
>     +#ifdef CONFIG_XPS
>     +       sk->sk_rx_queue_mapping = skb_get_rx_queue(skb);
>     +#endif
>      }
> 
>      /* variant used for unconnected sockets */
>     diff --git a/include/net/sock.h b/include/net/sock.h
>     index 4f7c584..12313653 100644
>     --- a/include/net/sock.h
>     +++ b/include/net/sock.h
>     @@ -139,6 +139,7 @@ typedef __u64 __bitwise __addrpair;
>       *     @skc_node: main hash linkage for various protocol lookup tables
>       *     @skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol
>       *     @skc_tx_queue_mapping: tx queue number for this connection
>     + *     @skc_rx_queue_mapping: rx queue number for this connection
>       *     @skc_flags: place holder for sk_flags
>       *             %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
>       *             %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
>     @@ -215,6 +216,9 @@ struct sock_common {
>                     struct hlist_nulls_node skc_nulls_node;
>             };
>             int                     skc_tx_queue_mapping;
>     +#ifdef CONFIG_XPS
>     +       int                     skc_rx_queue_mapping;
> 
> 
> Adding an int field to sock_common is still an expensive cost
> for a relatively rare use case. Maybe there should be a CONFIG_XPS_RQS?
> Or maybe skc_tx_queue_mapping and skc_rx_queue_mapping could be shorts
> (so maximum queue mapping would then be 2^16-2).

Thanks for the review, Tom. I will fix up the code incorporating all
your feedback in the next version (v4). I could have a new config option
CONFIG_XPS_RXQS that would be default off, in addition to the CONFIG_XPS
option that's already there. As for changing 'skc_tx_queue_mapping' to a
short, my concern is that the change would become extensive, since there
are a lot of places where this gets filled with int or u32 values.
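
For reference, the "shorts" alternative discussed above can be sketched in
userspace as follows. This is only an illustration under stated assumptions:
the struct, the sentinel macro and both helpers are hypothetical names, not
kernel code. It assumes one 16-bit value is reserved to mean "no queue
recorded", which is why the largest usable queue index would be 2^16-2.

```c
#include <stdint.h>

/* Hypothetical sentinel: the all-ones value means "no queue recorded". */
#define QUEUE_MAPPING_NONE UINT16_MAX

/* Hypothetical stand-in for the two mapping fields stored as shorts. */
struct queue_mapping {
	uint16_t tx;
	uint16_t rx;
};

/*
 * Store a queue index in a 16-bit slot.
 * Returns 0 on success, -1 if the index does not fit below the sentinel.
 */
static int queue_mapping_set(uint16_t *slot, unsigned int queue)
{
	if (queue >= QUEUE_MAPPING_NONE)
		return -1;
	*slot = (uint16_t)queue;
	return 0;
}

/* Read a slot back as an int, reporting an unset slot as -1. */
static int queue_mapping_get(uint16_t slot)
{
	return slot == QUEUE_MAPPING_NONE ? -1 : (int)slot;
}
```

The two 16-bit fields together take the space of one 32-bit int, but every
caller that currently stores an int or u32 would have to go through a
range-checked setter like the one above, which is the extensive part of
the change.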

> 
>     +#endif
>             union {
>                     int             skc_incoming_cpu;
>                     u32             skc_rcv_wnd;
>     @@ -326,6 +330,9 @@ struct sock {
>      #define sk_nulls_node          __sk_common.skc_nulls_node
>      #define sk_refcnt              __sk_common.skc_refcnt
>      #define sk_tx_queue_mapping    __sk_common.skc_tx_queue_mapping
>     +#ifdef CONFIG_XPS
>     +#define sk_rx_queue_mapping    __sk_common.skc_rx_queue_mapping
>     +#endif
> 
>      #define sk_dontcopy_begin      __sk_common.skc_dontcopy_begin
>      #define sk_dontcopy_end                __sk_common.skc_dontcopy_end
>     @@ -1696,6 +1703,13 @@ static inline int sk_tx_queue_get(const
>     struct sock *sk)
>             return sk ? sk->sk_tx_queue_mapping : -1;
>      }
> 
>     +static inline void sk_mark_rx_queue(struct sock *sk, struct sk_buff
>     *skb)
>     +{
>     +#ifdef CONFIG_XPS
>     +       sk->sk_rx_queue_mapping = skb_get_rx_queue(skb);
>     +#endif
>     +}
>     +
>      static inline void sk_set_socket(struct sock *sk, struct socket *sock)
>      {
>             sk_tx_queue_clear(sk);
>     diff --git a/net/core/dev.c b/net/core/dev.c
>     index bba755f..1880e6c 100644
>     --- a/net/core/dev.c
>     +++ b/net/core/dev.c
>     @@ -3479,36 +3479,58 @@ sch_handle_egress(struct sk_buff *skb, int
>     *ret, struct net_device *dev)
>      }
>      #endif /* CONFIG_NET_EGRESS */
> 
>     -static inline int get_xps_queue(struct net_device *dev, struct
>     sk_buff *skb)
>     +#ifdef CONFIG_XPS
>     +static int __get_xps_queue_idx(struct net_device *dev, struct
>     sk_buff *skb,
>     +                              struct xps_dev_maps *dev_maps,
>     unsigned int tci)
>     +{
>     +       struct xps_map *map;
>     +       int queue_index = -1;
>     +
>     +       if (dev->num_tc) {
>     +               tci *= dev->num_tc;
>     +               tci += netdev_get_prio_tc_map(dev, skb->priority);
>     +       }
>     +
>     +       map = rcu_dereference(dev_maps->attr_map[tci]);
>     +       if (map) {
>     +               if (map->len == 1)
>     +                       queue_index = map->queues[0];
>     +               else
>     +                       queue_index = map->queues[reciprocal_scale(
>     +                                               skb_get_hash(skb),
>     map->len)];
>     +               if (unlikely(queue_index >= dev->real_num_tx_queues))
>     +                       queue_index = -1;
>     +       }
>     +       return queue_index;
>     +}
>     +#endif
>     +
>     +static int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
>      {
>      #ifdef CONFIG_XPS
>             struct xps_dev_maps *dev_maps;
>     -       struct xps_map *map;
>     +       struct sock *sk = skb->sk;
>             int queue_index = -1;
>     +       unsigned int tci = 0;
> 
>             if (!static_key_false(&xps_needed))
>                     return -1;
> 
>     +       if (sk && sk->sk_rx_queue_mapping <= dev->num_rx_queues)
>     +               tci = sk->sk_rx_queue_mapping;
> 
> 
> This is only needed if xps_rxqs_map is not null so it should be in
> the block below.
>  
> 
>     +
>             rcu_read_lock();
>     -       dev_maps = rcu_dereference(dev->xps_cpus_map);
>     -       if (dev_maps) {
>     -               unsigned int tci = skb->sender_cpu - 1;
>     +       dev_maps = rcu_dereference(dev->xps_rxqs_map);
>     +       if (dev_maps)
>     +               queue_index = __get_xps_queue_idx(dev, skb,
>     dev_maps, tci);
> 
>     -               if (dev->num_tc) {
>     -                       tci *= dev->num_tc;
>     -                       tci += netdev_get_prio_tc_map(dev,
>     skb->priority);
>     -               }
> 
>     -               map = rcu_dereference(dev_maps->attr_map[tci]);
>     -               if (map) {
>     -                       if (map->len == 1)
>     -                               queue_index = map->queues[0];
>     -                       else
>     -                               queue_index =
>     map->queues[reciprocal_scale(skb_get_hash(skb),
>     -                                                                   
>           map->len)];
>     -                       if (unlikely(queue_index >=
>     dev->real_num_tx_queues))
>     -                               queue_index = -1;
>     -               }
>     +       if (queue_index < 0) {
>     +               tci = skb->sender_cpu - 1;
>     +               dev_maps = rcu_dereference(dev->xps_cpus_map);
>     +               if (dev_maps)
>     +                       queue_index = __get_xps_queue_idx(dev, skb,
>     dev_maps,
>     +                                                         tci);
>             }
>             rcu_read_unlock();
> 
>     diff --git a/net/core/sock.c b/net/core/sock.c
>     index 435a0ba..3c10d31 100644
>     --- a/net/core/sock.c
>     +++ b/net/core/sock.c
>     @@ -2824,6 +2824,10 @@ void sock_init_data(struct socket *sock,
>     struct sock *sk)
>             sk->sk_pacing_rate = ~0U;
>             sk->sk_pacing_shift = 10;
>             sk->sk_incoming_cpu = -1;
>     +
>     +#ifdef CONFIG_XPS
>     +       sk->sk_rx_queue_mapping = -1;
>     +#endif
>             /*
>              * Before updating sk_refcnt, we must commit prior changes
>     to memory
>              * (Documentation/RCU/rculist_nulls.txt for details)
>     diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>     index d5ffb57..cc69f75 100644
>     --- a/net/ipv4/tcp_input.c
>     +++ b/net/ipv4/tcp_input.c
>     @@ -78,6 +78,7 @@
>      #include <linux/errqueue.h>
>      #include <trace/events/tcp.h>
>      #include <linux/static_key.h>
>     +#include <net/busy_poll.h>
> 
>      int sysctl_tcp_max_orphans __read_mostly = NR_FILE;
> 
>     @@ -5574,6 +5575,7 @@ void tcp_finish_connect(struct sock *sk,
>     struct sk_buff *skb)
>             if (skb) {
>                     icsk->icsk_af_ops->sk_rx_dst_set(sk, skb);
>                     security_inet_conn_established(sk, skb);
>     +               sk_mark_napi_id(sk, skb);
>             }
> 
>             tcp_init_transfer(sk, BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB);
>     @@ -6402,6 +6404,7 @@ int tcp_conn_request(struct request_sock_ops
>     *rsk_ops,
>             tcp_rsk(req)->snt_isn = isn;
>             tcp_rsk(req)->txhash = net_tx_rndhash();
>             tcp_openreq_init_rwin(req, sk, dst);
>     +       sk_mark_rx_queue(req_to_sk(req), skb);
>             if (!want_cookie) {
>                     tcp_reqsk_record_syn(sk, req, skb);
>                     fastopen_sk = tcp_try_fastopen(sk, skb, req, &foc, dst);
> 
> 

^ permalink raw reply

* Re: [PATCH] hv_netvsc: Add per-cpu ethtool stats for netvsc
From: Stephen Hemminger @ 2018-06-06 22:53 UTC (permalink / raw)
  To: Yidong Ren; +Cc: sthemmin, netdev, haiyangz, linux-kernel, devel, davem
In-Reply-To: <20180606222700.24732-1-yidren@linuxonhyperv.com>

On Wed,  6 Jun 2018 15:27:00 -0700
Yidong Ren <yidren@linuxonhyperv.com> wrote:

> From: Yidong Ren <yidren@microsoft.com>
> 
> This patch implements following ethtool stats fields for netvsc:
> cpu<n>_tx/rx_packets/bytes
> cpu<n>_vf_tx/rx_packets/bytes
> 
> Corresponding per-cpu counters exist in current code. Exposing these
> counters will help troubleshooting performance issues.
> 
> Signed-off-by: Yidong Ren <yidren@microsoft.com>

This patch would be targeted for net-next (davem's tree);
but net-next is currently closed until 4.19-rc1 is done.

> ---
>  drivers/net/hyperv/hyperv_net.h |  11 ++++
>  drivers/net/hyperv/netvsc_drv.c | 104 +++++++++++++++++++++++++++++++-
>  2 files changed, 113 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
> index 960f06141472..f8c798bf9418 100644
> --- a/drivers/net/hyperv/hyperv_net.h
> +++ b/drivers/net/hyperv/hyperv_net.h
> @@ -710,6 +710,17 @@ struct netvsc_ethtool_stats {
>  	unsigned long wake_queue;
>  };
>  
> +struct netvsc_ethtool_pcpu_stats {
> +	u64     rx_packets;
> +	u64     rx_bytes;
> +	u64     tx_packets;
> +	u64     tx_bytes;
> +	u64     vf_rx_packets;
> +	u64     vf_rx_bytes;
> +	u64     vf_tx_packets;
> +	u64     vf_tx_bytes;
> +};
> +
>  struct netvsc_vf_pcpu_stats {
>  	u64     rx_packets;
>  	u64     rx_bytes;
> diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
> index da07ccdf84bf..c43e64606c1a 100644
> --- a/drivers/net/hyperv/netvsc_drv.c
> +++ b/drivers/net/hyperv/netvsc_drv.c
> @@ -1104,6 +1104,66 @@ static void netvsc_get_vf_stats(struct net_device *net,
>  	}
>  }
>  
> +static void netvsc_get_pcpu_stats(struct net_device *net,
> +				  struct netvsc_ethtool_pcpu_stats
> +					__percpu *pcpu_tot)
> +{
> +	struct net_device_context *ndev_ctx = netdev_priv(net);
> +	struct netvsc_device *nvdev = rcu_dereference_rtnl(ndev_ctx->nvdev);
> +	int i;
> +
> +	// fetch percpu stats of vf

If you ran checkpatch you would see that Linux always uses C style
comments, and not C++ style //

> +	for_each_possible_cpu(i) {
> +		const struct netvsc_vf_pcpu_stats *stats =
> +			per_cpu_ptr(ndev_ctx->vf_stats, i);
> +		struct netvsc_ethtool_pcpu_stats *this_tot =
> +			per_cpu_ptr(pcpu_tot, i);
> +		unsigned int start;
> +
> +		do {
> +			start = u64_stats_fetch_begin_irq(&stats->syncp);
> +			this_tot->vf_rx_packets = stats->rx_packets;
> +			this_tot->vf_tx_packets = stats->tx_packets;
> +			this_tot->vf_rx_bytes = stats->rx_bytes;
> +			this_tot->vf_tx_bytes = stats->tx_bytes;
> +		} while (u64_stats_fetch_retry_irq(&stats->syncp, start));
> +		this_tot->rx_packets = this_tot->vf_rx_packets;
> +		this_tot->tx_packets = this_tot->vf_tx_packets;
> +		this_tot->rx_bytes   = this_tot->vf_rx_bytes;
> +		this_tot->tx_bytes   = this_tot->vf_tx_bytes;
> +	}
> +
> +	// fetch percpu stats of netvsc
> +	for (i = 0; i < nvdev->num_chn; i++) {
> +		const struct netvsc_channel *nvchan = &nvdev->chan_table[i];
> +		const struct netvsc_stats *stats;
> +		struct netvsc_ethtool_pcpu_stats *this_tot =
> +			per_cpu_ptr(pcpu_tot, nvchan->channel->target_cpu);
> +		u64 packets, bytes;
> +		unsigned int start;
> +
> +		stats = &nvchan->tx_stats;
> +		do {
> +			start = u64_stats_fetch_begin_irq(&stats->syncp);
> +			packets = stats->packets;
> +			bytes = stats->bytes;
> +		} while (u64_stats_fetch_retry_irq(&stats->syncp, start));
> +
> +		this_tot->tx_bytes	+= bytes;
> +		this_tot->tx_packets	+= packets;
> +
> +		stats = &nvchan->rx_stats;
> +		do {
> +			start = u64_stats_fetch_begin_irq(&stats->syncp);
> +			packets = stats->packets;
> +			bytes = stats->bytes;
> +		} while (u64_stats_fetch_retry_irq(&stats->syncp, start));
> +
> +		this_tot->rx_bytes	+= bytes;
> +		this_tot->rx_packets	+= packets;
> +	}
> +}
> +
>  static void netvsc_get_stats64(struct net_device *net,
>  			       struct rtnl_link_stats64 *t)
>  {
> @@ -1201,6 +1261,23 @@ static const struct {
>  	{ "rx_no_memory", offsetof(struct netvsc_ethtool_stats, rx_no_memory) },
>  	{ "stop_queue", offsetof(struct netvsc_ethtool_stats, stop_queue) },
>  	{ "wake_queue", offsetof(struct netvsc_ethtool_stats, wake_queue) },
> +}, pcpu_stats[] = {
> +	{ "cpu%u_rx_packets",
> +		offsetof(struct netvsc_ethtool_pcpu_stats, rx_packets) },
> +	{ "cpu%u_rx_bytes",
> +		offsetof(struct netvsc_ethtool_pcpu_stats, rx_bytes) },
> +	{ "cpu%u_tx_packets",
> +		offsetof(struct netvsc_ethtool_pcpu_stats, tx_packets) },
> +	{ "cpu%u_tx_bytes",
> +		offsetof(struct netvsc_ethtool_pcpu_stats, tx_bytes) },
> +	{ "cpu%u_vf_rx_packets",
> +		offsetof(struct netvsc_ethtool_pcpu_stats, vf_rx_packets) },
> +	{ "cpu%u_vf_rx_bytes",
> +		offsetof(struct netvsc_ethtool_pcpu_stats, vf_rx_bytes) },
> +	{ "cpu%u_vf_tx_packets",
> +		offsetof(struct netvsc_ethtool_pcpu_stats, vf_tx_packets) },
> +	{ "cpu%u_vf_tx_bytes",
> +		offsetof(struct netvsc_ethtool_pcpu_stats, vf_tx_bytes) },
>  }, vf_stats[] = {
>  	{ "vf_rx_packets", offsetof(struct netvsc_vf_pcpu_stats, rx_packets) },
>  	{ "vf_rx_bytes",   offsetof(struct netvsc_vf_pcpu_stats, rx_bytes) },
> @@ -1212,6 +1289,9 @@ static const struct {
>  #define NETVSC_GLOBAL_STATS_LEN	ARRAY_SIZE(netvsc_stats)
>  #define NETVSC_VF_STATS_LEN	ARRAY_SIZE(vf_stats)
>  
> +/* statistics per queue (rx/tx packets/bytes) */
> +#define NETVSC_PCPU_STATS_LEN (num_present_cpus() * ARRAY_SIZE(pcpu_stats))
> +
>  /* 4 statistics per queue (rx/tx packets/bytes) */
>  #define NETVSC_QUEUE_STATS_LEN(dev) ((dev)->num_chn * 4)
>  
> @@ -1227,6 +1307,7 @@ static int netvsc_get_sset_count(struct net_device *dev, int string_set)
>  	case ETH_SS_STATS:
>  		return NETVSC_GLOBAL_STATS_LEN
>  			+ NETVSC_VF_STATS_LEN
> +			+ NETVSC_PCPU_STATS_LEN
>  			+ NETVSC_QUEUE_STATS_LEN(nvdev);
>  	default:
>  		return -EINVAL;
> @@ -1241,9 +1322,10 @@ static void netvsc_get_ethtool_stats(struct net_device *dev,
>  	const void *nds = &ndc->eth_stats;
>  	const struct netvsc_stats *qstats;
>  	struct netvsc_vf_pcpu_stats sum;
> +	struct netvsc_ethtool_pcpu_stats __percpu *pcpu_sum;
>  	unsigned int start;
>  	u64 packets, bytes;
> -	int i, j;
> +	int i, j, cpu;
>  
>  	if (!nvdev)
>  		return;
> @@ -1255,6 +1337,17 @@ static void netvsc_get_ethtool_stats(struct net_device *dev,
>  	for (j = 0; j < NETVSC_VF_STATS_LEN; j++)
>  		data[i++] = *(u64 *)((void *)&sum + vf_stats[j].offset);
>  
> +	pcpu_sum = alloc_percpu(struct netvsc_ethtool_pcpu_stats);
> +	netvsc_get_pcpu_stats(dev, pcpu_sum);
> +	for_each_present_cpu(cpu) {
> +		struct netvsc_ethtool_pcpu_stats *this_sum =
> +			per_cpu_ptr(pcpu_sum, cpu);
> +		for (j = 0; j < ARRAY_SIZE(pcpu_stats); j++)
> +			data[i++] = *(u64 *)((void *)this_sum
> +					     + pcpu_stats[j].offset);
> +	}
> +	free_percpu(pcpu_sum);
> +
>  	for (j = 0; j < nvdev->num_chn; j++) {
>  		qstats = &nvdev->chan_table[j].tx_stats;
>  
> @@ -1282,7 +1375,7 @@ static void netvsc_get_strings(struct net_device *dev, u32 stringset, u8 *data)
>  	struct net_device_context *ndc = netdev_priv(dev);
>  	struct netvsc_device *nvdev = rtnl_dereference(ndc->nvdev);
>  	u8 *p = data;
> -	int i;
> +	int i, cpu;
>  
>  	if (!nvdev)
>  		return;
> @@ -1299,6 +1392,13 @@ static void netvsc_get_strings(struct net_device *dev, u32 stringset, u8 *data)
>  			p += ETH_GSTRING_LEN;
>  		}
>  
> +		for_each_present_cpu(cpu) {
> +			for (i = 0; i < ARRAY_SIZE(pcpu_stats); i++) {
> +				sprintf(p, pcpu_stats[i].name, cpu);
> +				p += ETH_GSTRING_LEN;
> +			}
> +		}
> +
>  		for (i = 0; i < nvdev->num_chn; i++) {
>  			sprintf(p, "tx_queue_%u_packets", i);
>  			p += ETH_GSTRING_LEN;

^ permalink raw reply

* [PATCH] ip_tunnel: Fix name string concatenate in __ip_tunnel_create()
From: Sultan Alsawaf @ 2018-06-06 22:56 UTC (permalink / raw)
  To: sultanxda; +Cc: netdev, davem, kuznet, yoshfuji

By passing a limit of 2 bytes to strncat, strncat is limited to writing
fewer bytes than what it's supposed to append to the name here.

Since the bounds are checked on the line above this, just remove the string
bounds checks entirely since they're unneeded.

Signed-off-by: Sultan Alsawaf <sultanxda@gmail.com>
---
 net/ipv4/ip_tunnel.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
index 38d906baf1df..c4f5602308ed 100644
--- a/net/ipv4/ip_tunnel.c
+++ b/net/ipv4/ip_tunnel.c
@@ -261,8 +261,8 @@ static struct net_device *__ip_tunnel_create(struct net *net,
 	} else {
 		if (strlen(ops->kind) > (IFNAMSIZ - 3))
 			goto failed;
-		strlcpy(name, ops->kind, IFNAMSIZ);
-		strncat(name, "%d", 2);
+		strcpy(name, ops->kind);
+		strcat(name, "%d");
 	}
 
 	ASSERT_RTNL();
-- 
2.17.1

^ permalink raw reply related

* Re: [net-next PATCH v3 3/5] net: Enable Tx queue selection based on Rx queues
From: Nambiar, Amritha @ 2018-06-06 23:02 UTC (permalink / raw)
  To: Willem de Bruijn, Samudrala, Sridhar
  Cc: Network Development, David Miller, Alexander Duyck, Eric Dumazet,
	Hannes Frederic Sowa, Tom Herbert
In-Reply-To: <CAF=yD-+QY5u_nbsJv7uqnroLivBaSExghJaq+Cjt3oos0cVR0A@mail.gmail.com>

On 6/6/2018 12:13 PM, Willem de Bruijn wrote:
> On Wed, Jun 6, 2018 at 3:08 PM, Samudrala, Sridhar
> <sridhar.samudrala@intel.com> wrote:
>>
>> On 6/6/2018 11:56 AM, Willem de Bruijn wrote:
>>>
>>> On Tue, Jun 5, 2018 at 4:38 AM, Amritha Nambiar
>>> <amritha.nambiar@intel.com> wrote:
>>>>
>>>> This patch adds support to pick Tx queue based on the Rx queue(s) map
>>>> configuration set by the admin through the sysfs attribute
>>>> for each Tx queue. If the user configuration for receive queue(s) map
>>>> does not apply, then the Tx queue selection falls back to CPU(s) map
>>>> based selection and finally to hashing.
>>>>
>>>> Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
>>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>>> ---
>>>>   int sysctl_tcp_max_orphans __read_mostly = NR_FILE;
>>>>
>>>> @@ -5574,6 +5575,7 @@ void tcp_finish_connect(struct sock *sk, struct
>>>> sk_buff *skb)
>>>>          if (skb) {
>>>>                  icsk->icsk_af_ops->sk_rx_dst_set(sk, skb);
>>>>                  security_inet_conn_established(sk, skb);
>>>> +               sk_mark_napi_id(sk, skb);
>>>>          }
>>>
>>> This and the call below should be in a standalone patch, as the mark
>>> changes are not rxq-xps specific. Is the additional earlier marking really
>>> required?
>>
>>
>> The additional earlier marking in tcp_finish_connect() allows a client app
>> to do SO_INCOMING_NAPI_ID after a connect() call to get the right queue
>> association for a socket.
>>
>> The marking in tcp_conn_request() allows syn-ack to go on the right tx-queue
>> associated with the queue on which syn is received.
> 
> I understand the intent. My question really is whether it is needed.
> Marking has been slightly lossy in this regard in the past, not
> necessarily as an oversight. I don't mean to make that call here,
> but it's worth discussion and its own patch.
> 
Will separate this out into a standalone patch in v4.

^ permalink raw reply

* Re: net: do not allow changing SO_REUSEADDR/SO_REUSEPORT on bound sockets
From: Andrei Vagin @ 2018-06-06 23:25 UTC (permalink / raw)
  To: Maciej Żenczykowski
  Cc: Maciej Żenczykowski, David S . Miller, Eric Dumazet, netdev
In-Reply-To: <20180603174705.51802-1-zenczykowski@gmail.com>

This patch breaks CRIU tests:

===================== Run zdtm/transition/socket-tcp6 in h =====================
Start test
./socket-tcp6 --pidfile=socket-tcp6.pid --outfile=socket-tcp6.out
start time for zdtm/transition/socket-tcp6: 0.90
Run criu dump
Run criu restore
=[log]=> dump/zdtm/transition/socket-tcp6/39/1/restore.log
------------------------ grep Error ------------------------
(00.132120)     39: 	restore rcvlowat 1 for socket
(00.132130)     39: 	restore mark 0 for socket
(00.132149)     39: 		Create fd for 6
(00.132157)     39: Schedule 3 socket for repair off
(00.132183)     39: Error (criu/sockets.c:379): Can't set 1:15 (len 4): Structure needs cleaning
(00.132192)     39: Error (criu/files.c:1243): Unable to open fd=6 id=0x8
(00.132241) Error (criu/cr-restore.c:2391): Failed to wait inprogress tasks
(00.132286) Error (criu/cr-restore.c:2568): Restoring FAILED.
------------------------ ERROR OVER ------------------------
############ Test zdtm/transition/socket-tcp6 FAIL at CRIU restore #############

https://travis-ci.org/avagin/linux/jobs/388989833

We use these options to restore tcp sockets. On the first stage, CRIU creates
all sockets with SO_REUSEADDR and SO_REUSEPORT, then it restores established
and listening sockets. After that criu restores values of SO_REUSEADDR and
SO_REUSEPORT options.
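
The sequence described above can be reduced to a small userspace sketch:
set the option, bind, then clear the option on the bound socket. The helper
name is hypothetical and this is not CRIU's actual code; it only shows the
order of calls. Going by the quoted patch, the last setsockopt() would fail
with EUCLEAN on a kernel that carries it, and succeed otherwise.

```c
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Set optname (SO_REUSEADDR or SO_REUSEPORT), bind, then try to clear
 * the option on the bound socket.
 * Returns 0 if the final setsockopt() succeeds, its errno value if it
 * fails, or -1 if an earlier setup step (socket, first setsockopt,
 * bind) fails.
 */
static int clear_option_after_bind(int optname)
{
	struct sockaddr_in addr;
	int on = 1, off = 0, rc = -1;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	addr.sin_port = 0;	/* let the kernel pick a free port */

	if (setsockopt(fd, SOL_SOCKET, optname, &on, sizeof(on)) == 0 &&
	    bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0) {
		/* This is the step the quoted patch rejects on a bound socket. */
		if (setsockopt(fd, SOL_SOCKET, optname, &off, sizeof(off)) == 0)
			rc = 0;
		else
			rc = errno;
	}
	close(fd);
	return rc;
}
```

Calling clear_option_after_bind(SO_REUSEPORT) corresponds to the "1:15"
(SOL_SOCKET, SO_REUSEPORT) failure in the restore log above.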

On Sun, Jun 03, 2018 at 10:47:05AM -0700, Maciej Żenczykowski wrote:
> From: Maciej Żenczykowski <maze@google.com>
> 
> It is not safe to do so because such sockets are already in the
> hash tables and changing these options can result in invalidating
> the tb->fastreuse(port) caching.
> 
> This can have later far reaching consequences wrt. bind conflict checks
> which rely on these caches (for optimization purposes).
> 
> Not to mention that you can currently end up with two identical
> non-reuseport listening sockets bound to the same local ip:port
> by clearing reuseport on them after they've already both been bound.
> 
> There is unfortunately no EISBOUND error or anything similar,
> and EISCONN seems to be misleading for a bound-but-not-connected
> socket, so use EUCLEAN 'Structure needs cleaning' which AFAICT
> is the closest you can get to meaning 'socket in bad state'.
> (although perhaps EINVAL wouldn't be a bad choice either?)
> 
> This does unfortunately run the risk of breaking buggy
> userspace programs...
> 
> Signed-off-by: Maciej Żenczykowski <maze@google.com>
> Cc: Eric Dumazet <edumazet@google.com>
> 
> Change-Id: I77c2b3429b2fdf42671eee0fa7a8ba721c94963b
> Reviewed-by: Eric Dumazet <edumazet@google.com>
> ---
>  net/core/sock.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 435a0ba85e52..feca4c98f8a0 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -728,9 +728,22 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
>  			sock_valbool_flag(sk, SOCK_DBG, valbool);
>  		break;
>  	case SO_REUSEADDR:
> -		sk->sk_reuse = (valbool ? SK_CAN_REUSE : SK_NO_REUSE);
> +		val = (valbool ? SK_CAN_REUSE : SK_NO_REUSE);
> +		if ((sk->sk_family == PF_INET || sk->sk_family == PF_INET6) &&
> +		    inet_sk(sk)->inet_num &&
> +		    (sk->sk_reuse != val)) {
> +			ret = (sk->sk_state == TCP_ESTABLISHED) ? -EISCONN : -EUCLEAN;
> +			break;
> +		}
> +		sk->sk_reuse = val;
>  		break;
>  	case SO_REUSEPORT:
> +		if ((sk->sk_family == PF_INET || sk->sk_family == PF_INET6) &&
> +		    inet_sk(sk)->inet_num &&
> +		    (sk->sk_reuseport != valbool)) {
> +			ret = (sk->sk_state == TCP_ESTABLISHED) ? -EISCONN : -EUCLEAN;
> +			break;
> +		}
>  		sk->sk_reuseport = valbool;
>  		break;
>  	case SO_TYPE:

^ permalink raw reply

* Re: net: do not allow changing SO_REUSEADDR/SO_REUSEPORT on bound sockets
From: Maciej Żenczykowski @ 2018-06-07  0:25 UTC (permalink / raw)
  To: Andrei Vagin; +Cc: David S . Miller, Eric Dumazet, Linux NetDev
In-Reply-To: <20180606232524.GA3632@outlook.office365.com>

Yes, it does; we found this internally last night and have been debating
what to do about it.

Fundamentally what it points out is that prior to this patch CRIU
could get the host into an inconsistent state.
It inserts all the sockets into the hashtables with SO_REUSEADDR set,
and then (potentially) clears it on some of them...
But the tb cache still thinks it's set on all of them.
So later attempts to bind() a socket with SO_REUSEADDR set can then
succeed even though they should fail (or something like that).

I wonder if we instead need a socket option to basically say 'ignore
all conflicts' that CRIU could set, and then clear post
bind/listen/connect hash table insertion...

Or maybe the transition from 1->0 is valid, but from 0->1 isn't??

Or we need special per-protocol code in the SO_REUSE{ADDR,PORT}
setsockopt handler to recalculate the tb cache?

Anyone have any smart ideas?

^ permalink raw reply

* [PATCH] ip_tunnel: Fix GCC 8 warning in __ip_tunnel_create()
From: Sultan Alsawaf @ 2018-06-07  0:27 UTC (permalink / raw)
  To: sultanxda; +Cc: netdev, davem, kuznet, yoshfuji

By passing a limit of 2 bytes to strncat, GCC 8 thinks that strncat is
limited to writing fewer bytes than what it's supposed to append to the
name here. However, strncat appends n+1 bytes to the destination, so this
is correct even though it doesn't look right.

Since the bounds are checked on the line above this, just remove the string
bounds checks entirely to silence the warning since they're unneeded.

Signed-off-by: Sultan Alsawaf <sultanxda@gmail.com>
---
 net/ipv4/ip_tunnel.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
index 38d906baf1df..c4f5602308ed 100644
--- a/net/ipv4/ip_tunnel.c
+++ b/net/ipv4/ip_tunnel.c
@@ -261,8 +261,8 @@ static struct net_device *__ip_tunnel_create(struct net *net,
 	} else {
 		if (strlen(ops->kind) > (IFNAMSIZ - 3))
 			goto failed;
-		strlcpy(name, ops->kind, IFNAMSIZ);
-		strncat(name, "%d", 2);
+		strcpy(name, ops->kind);
+		strcat(name, "%d");
 	}

 	ASSERT_RTNL();
-- 
2.17.1
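
The strncat() behaviour described in the commit message can be checked with
a small userspace sketch. The helper and NAME_SIZE are hypothetical names
for illustration (NAME_SIZE is set to 16, the value of IFNAMSIZ), and
strcpy() stands in for the kernel's strlcpy(). The point is that a limit of
2 still lets strncat() write three bytes: the two characters of "%d" plus
the terminating NUL.

```c
#include <stddef.h>
#include <string.h>

#define NAME_SIZE 16	/* assumed to match the kernel's IFNAMSIZ */

/*
 * Build "<kind><suffix>" in name, appending with strncat() and a limit
 * equal to strlen(suffix), as the removed code did with "%d" and 2.
 * Returns 0 on success, -1 if the result would not fit in the buffer.
 */
static int build_name(char name[NAME_SIZE], const char *kind,
		      const char *suffix)
{
	size_t limit = strlen(suffix);

	/* With suffix "%d" this is the patch's IFNAMSIZ - 3 check. */
	if (strlen(kind) > NAME_SIZE - 1 - limit)
		return -1;
	strcpy(name, kind);
	/* Appends at most limit characters, then always writes a NUL. */
	strncat(name, suffix, limit);
	return 0;
}
```

Because the length check already guarantees room for the suffix and its
terminator, the bounded and unbounded variants produce the same string,
which is the argument the patch makes for dropping the bounds.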

^ permalink raw reply related

* [PATCH - RFC] rhashtable: add rhashtable_walk_last_seen()
From: NeilBrown @ 2018-06-07  2:45 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Herbert Xu, Thomas Graf, Linux Kernel Network Developers, LKML,
	Tom Herbert
In-Reply-To: <87in6wo636.fsf@notabene.neil.brown.name>

[-- Attachment #1: Type: text/plain, Size: 2863 bytes --]


rhashtable_walk_last_seen() returns the object returned by
the previous rhashtable_walk_next(), providing it is still in the
table (or was during this grace period).
This works even if rhashtable_walk_stop() and rhashtable_walk_start()
have been called since the last rhashtable_walk_next().

If there have been no calls to rhashtable_walk_next(), or if the
object is gone from the table, then NULL is returned.

This can usefully be used in a seq_file ->start() function.
If the pos is the same as was returned by the last ->next() call,
then rhashtable_walk_last_seen() can be used to re-establish the
current location in the table.  If it returns NULL, then
rhashtable_walk_next() should be used.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 include/linux/rhashtable.h |  1 +
 lib/rhashtable.c           | 30 ++++++++++++++++++++++++++++++
 2 files changed, 31 insertions(+)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index 657e37ae314c..d63b472e9d50 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -248,6 +248,7 @@ static inline void rhashtable_walk_start(struct rhashtable_iter *iter)
 
 void *rhashtable_walk_next(struct rhashtable_iter *iter);
 void *rhashtable_walk_peek(struct rhashtable_iter *iter);
+void *rhashtable_walk_last_seen(struct rhashtable_iter *iter);
 void rhashtable_walk_stop(struct rhashtable_iter *iter) __releases(RCU);
 
 void rhashtable_free_and_destroy(struct rhashtable *ht,
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index e0060cb264e9..45f2554399a5 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -949,6 +949,36 @@ void *rhashtable_walk_peek(struct rhashtable_iter *iter)
 }
 EXPORT_SYMBOL_GPL(rhashtable_walk_peek);
 
+/**
+ * rhashtable_walk_last_seen - Return the previously returned object, if available
+ * @iter:	Hash table iterator
+ *
+ * If rhashtable_walk_next() has previously been called and the object
+ * it returned is still in the hash table, that object is returned again,
+ * otherwise %NULL is returned.
+ *
+ * If the last rhashtable_walk_next() call came after the most recent
+ * rhashtable_walk_start() call then the returned object may not, strictly
+ * speaking, still be in the table.  It will be safe to dereference.
+ *
+ * Note that the iterator is not changed.
+ */
+void *rhashtable_walk_last_seen(struct rhashtable_iter *iter)
+{
+	struct rhashtable *ht = iter->ht;
+	struct rhash_head *p = iter->p;
+
+	if (!p)
+		return NULL;
+	if (!iter->p_is_unsafe || ht->rhlist)
+		return p;
+	rht_for_each_rcu(p, iter->walker.tbl, iter->slot)
+		if (p == iter->p)
+			return p;
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(rhashtable_walk_last_seen);
+
 /**
  * rhashtable_walk_stop - Finish a hash table walk
  * @iter:	Hash table iterator
-- 
2.14.0.rc0.dirty



* [PATCH - RFC] rhashtable: implement rhashtable_walk_peek() using rhashtable_walk_last_seen()
From: NeilBrown @ 2018-06-07  2:46 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Herbert Xu, Thomas Graf, Linux Kernel Network Developers, LKML,
	Tom Herbert
In-Reply-To: <871sdjnwkr.fsf@notabene.neil.brown.name>

rhashtable_walk_last_seen() does most of the work that
rhashtable_walk_peek() needs done, so use it.
Also update the documentation for rhashtable_walk_peek() to clarify
the expected use case.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 lib/rhashtable.c | 42 ++++++++++++++++++++----------------------
 1 file changed, 20 insertions(+), 22 deletions(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 45f2554399a5..30bb9ead15f4 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -916,36 +916,34 @@ void *rhashtable_walk_next(struct rhashtable_iter *iter)
 EXPORT_SYMBOL_GPL(rhashtable_walk_next);
 
 /**
- * rhashtable_walk_peek - Return the next object but don't advance the iterator
+ * rhashtable_walk_peek - Return the next object to use in an interrupted walk
  * @iter:	Hash table iterator
  *
- * Returns the next object or NULL when the end of the table is reached.
+ * Returns the "current" object or NULL when the end of the table is reached.
+ * When an rhashtable_walk is interrupted with rhashtable_walk_stop(),
+ * it is often because an object was found that could not be processed
+ * immediately, possibly because there is no more space to encode details
+ * of the object (e.g. when producing a seq_file from the table).
+ * When the walk is restarted, the same object needs to be processed again,
+ * if possible.  The object might have been removed from the table while
+ * the walk was paused, so it might not be available.  In that case, the
+ * normal "next" object should be treated as "current".
  *
- * Returns -EAGAIN if resize event occurred.  Note that the iterator
+ * To support this common case, rhashtable_walk_peek() returns the
+ * appropriate object to process after an interrupted walk, either the
+ * one that was most recently returned, or if that doesn't exist - the
+ * next one.
+ *
+ * Returns -EAGAIN if resize event occurred.  In that case the iterator
  * will rewind back to the beginning and you may continue to use it.
  */
 void *rhashtable_walk_peek(struct rhashtable_iter *iter)
 {
-	struct rhlist_head *list = iter->list;
-	struct rhashtable *ht = iter->ht;
-	struct rhash_head *p = iter->p;
+	void *ret = rhashtable_walk_last_seen(iter);
 
-	if (p)
-		return rht_obj(ht, ht->rhlist ? &list->rhead : p);
-
-	/* No object found in current iter, find next one in the table. */
-
-	if (iter->skip) {
-		/* A nonzero skip value points to the next entry in the table
-		 * beyond that last one that was found. Decrement skip so
-		 * we find the current value. __rhashtable_walk_find_next
-		 * will restore the original value of skip assuming that
-		 * the table hasn't changed.
-		 */
-		iter->skip--;
-	}
-
-	return __rhashtable_walk_find_next(iter);
+	if (!ret)
+		ret = rhashtable_walk_next(iter);
+	return ret;
 }
 EXPORT_SYMBOL_GPL(rhashtable_walk_peek);
 
-- 
2.14.0.rc0.dirty



* Re: [PATCH 03/18] rhashtable: remove nulls_base and related code.
From: NeilBrown @ 2018-06-07  2:49 UTC (permalink / raw)
  To: Thomas Graf, Herbert Xu; +Cc: netdev, linux-kernel
In-Reply-To: <152782824939.30340.13120991612931450792.stgit@noble>

On Fri, Jun 01 2018, NeilBrown wrote:

> This "feature" is unused, undocumented, and untested and so
> doesn't really belong.  Next patch will introduce support
> to detect when a search gets diverted down a different chain,
> which is the common purpose of nulls markers.
>
> This patch actually fixes a bug too.  The table resizing allows a
> table to grow to 2^31 buckets, but the hash is truncated to 27 bits -
> any growth beyond 2^27 is wasteful and ineffective.
>
> This patch results in NULLS_MARKER(0) being used for all chains,
> and leaves the use of rht_is_a_null() to test for it.
>
> Signed-off-by: NeilBrown <neilb@suse.com>
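
The arithmetic behind the 2^27 limit in the quoted commit message can be
checked in userspace. The sketch below copies the two bucket-index
expressions from the quoted diff; the function names and the use of
uint32_t are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define RHT_BASE_BITS		4
/* Base bits plus 1 bit for nulls marker, as in the removed code. */
#define RHT_HASH_RESERVED_SPACE	(RHT_BASE_BITS + 1)

/* Old expression: the low 5 bits of the hash are discarded first. */
uint32_t bucket_index_old(uint32_t hash, uint32_t size)
{
	assert(size && (size & (size - 1)) == 0); /* size is a power of two */
	return (hash >> RHT_HASH_RESERVED_SPACE) & (size - 1);
}

/* New expression: the whole hash is masked by the table size. */
uint32_t bucket_index_new(uint32_t hash, uint32_t size)
{
	assert(size && (size & (size - 1)) == 0); /* size is a power of two */
	return hash & (size - 1);
}
```

For a table of 2^31 buckets and an all-ones hash, the old expression
cannot produce an index above 2^27 - 1, while the new one reaches
2^31 - 1.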

Hi Herbert,
 You've acked a few patches that depend on this one, but not this
 patch itself.  If you could ack this one, I could submit a collection
 of patches for inclusion (after the merge window closes I guess)
 and then have fewer outstanding.
 This assumes you are in-principle happy with the alternative approach I
 took to handling list-nulls.  I got the impression that it was only
 some small details holding that back.

Thanks,
NeilBrown

> ---
>  include/linux/rhashtable-types.h |    2 --
>  include/linux/rhashtable.h       |   33 +++------------------------------
>  lib/rhashtable.c                 |    8 --------
>  lib/test_rhashtable.c            |    5 +----
>  4 files changed, 4 insertions(+), 44 deletions(-)
>
> diff --git a/include/linux/rhashtable-types.h b/include/linux/rhashtable-types.h
> index 9740063ff13b..763d613ce2c2 100644
> --- a/include/linux/rhashtable-types.h
> +++ b/include/linux/rhashtable-types.h
> @@ -50,7 +50,6 @@ typedef int (*rht_obj_cmpfn_t)(struct rhashtable_compare_arg *arg,
>   * @min_size: Minimum size while shrinking
>   * @locks_mul: Number of bucket locks to allocate per cpu (default: 32)
>   * @automatic_shrinking: Enable automatic shrinking of tables
> - * @nulls_base: Base value to generate nulls marker
>   * @hashfn: Hash function (default: jhash2 if !(key_len % 4), or jhash)
>   * @obj_hashfn: Function to hash object
>   * @obj_cmpfn: Function to compare key with object
> @@ -64,7 +63,6 @@ struct rhashtable_params {
>  	u16			min_size;
>  	bool			automatic_shrinking;
>  	u8			locks_mul;
> -	u32			nulls_base;
>  	rht_hashfn_t		hashfn;
>  	rht_obj_hashfn_t	obj_hashfn;
>  	rht_obj_cmpfn_t		obj_cmpfn;
> diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
> index 48754ab07cdf..d9f719af7936 100644
> --- a/include/linux/rhashtable.h
> +++ b/include/linux/rhashtable.h
> @@ -28,25 +28,8 @@
>  #include <linux/rhashtable-types.h>
>  /*
>   * The end of the chain is marked with a special nulls marks which has
> - * the following format:
> - *
> - * +-------+-----------------------------------------------------+-+
> - * | Base  |                      Hash                           |1|
> - * +-------+-----------------------------------------------------+-+
> - *
> - * Base (4 bits) : Reserved to distinguish between multiple tables.
> - *                 Specified via &struct rhashtable_params.nulls_base.
> - * Hash (27 bits): Full hash (unmasked) of first element added to bucket
> - * 1 (1 bit)     : Nulls marker (always set)
> - *
> - * The remaining bits of the next pointer remain unused for now.
> + * the least significant bit set.
>   */
> -#define RHT_BASE_BITS		4
> -#define RHT_HASH_BITS		27
> -#define RHT_BASE_SHIFT		RHT_HASH_BITS
> -
> -/* Base bits plus 1 bit for nulls marker */
> -#define RHT_HASH_RESERVED_SPACE	(RHT_BASE_BITS + 1)
>  
>  /* Maximum chain length before rehash
>   *
> @@ -92,24 +75,14 @@ struct bucket_table {
>  	struct rhash_head __rcu *buckets[] ____cacheline_aligned_in_smp;
>  };
>  
> -static inline unsigned long rht_marker(const struct rhashtable *ht, u32 hash)
> -{
> -	return NULLS_MARKER(ht->p.nulls_base + hash);
> -}
> -
>  #define INIT_RHT_NULLS_HEAD(ptr, ht, hash) \
> -	((ptr) = (typeof(ptr)) rht_marker(ht, hash))
> +	((ptr) = (typeof(ptr)) NULLS_MARKER(0))
>  
>  static inline bool rht_is_a_nulls(const struct rhash_head *ptr)
>  {
>  	return ((unsigned long) ptr & 1);
>  }
>  
> -static inline unsigned long rht_get_nulls_value(const struct rhash_head *ptr)
> -{
> -	return ((unsigned long) ptr) >> 1;
> -}
> -
>  static inline void *rht_obj(const struct rhashtable *ht,
>  			    const struct rhash_head *he)
>  {
> @@ -119,7 +92,7 @@ static inline void *rht_obj(const struct rhashtable *ht,
>  static inline unsigned int rht_bucket_index(const struct bucket_table *tbl,
>  					    unsigned int hash)
>  {
> -	return (hash >> RHT_HASH_RESERVED_SPACE) & (tbl->size - 1);
> +	return hash & (tbl->size - 1);
>  }
>  
>  static inline unsigned int rht_key_get_hash(struct rhashtable *ht,
> diff --git a/lib/rhashtable.c b/lib/rhashtable.c
> index c9fafea7dc6e..688693c919be 100644
> --- a/lib/rhashtable.c
> +++ b/lib/rhashtable.c
> @@ -995,7 +995,6 @@ static u32 rhashtable_jhash2(const void *key, u32 length, u32 seed)
>   *	.key_offset = offsetof(struct test_obj, key),
>   *	.key_len = sizeof(int),
>   *	.hashfn = jhash,
> - *	.nulls_base = (1U << RHT_BASE_SHIFT),
>   * };
>   *
>   * Configuration Example 2: Variable length keys
> @@ -1029,9 +1028,6 @@ int rhashtable_init(struct rhashtable *ht,
>  	    (params->obj_hashfn && !params->obj_cmpfn))
>  		return -EINVAL;
>  
> -	if (params->nulls_base && params->nulls_base < (1U << RHT_BASE_SHIFT))
> -		return -EINVAL;
> -
>  	memset(ht, 0, sizeof(*ht));
>  	mutex_init(&ht->mutex);
>  	spin_lock_init(&ht->lock);
> @@ -1096,10 +1092,6 @@ int rhltable_init(struct rhltable *hlt, const struct rhashtable_params *params)
>  {
>  	int err;
>  
> -	/* No rhlist NULLs marking for now. */
> -	if (params->nulls_base)
> -		return -EINVAL;
> -
>  	err = rhashtable_init(&hlt->ht, params);
>  	hlt->ht.rhlist = true;
>  	return err;
> diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
> index bf92b7aa2a49..b428a9c7522a 100644
> --- a/lib/test_rhashtable.c
> +++ b/lib/test_rhashtable.c
> @@ -83,7 +83,7 @@ static u32 my_hashfn(const void *data, u32 len, u32 seed)
>  {
>  	const struct test_obj_rhl *obj = data;
>  
> -	return (obj->value.id % 10) << RHT_HASH_RESERVED_SPACE;
> +	return (obj->value.id % 10);
>  }
>  
>  static int my_cmpfn(struct rhashtable_compare_arg *arg, const void *obj)
> @@ -99,7 +99,6 @@ static struct rhashtable_params test_rht_params = {
>  	.key_offset = offsetof(struct test_obj, value),
>  	.key_len = sizeof(struct test_obj_val),
>  	.hashfn = jhash,
> -	.nulls_base = (3U << RHT_BASE_SHIFT),
>  };
>  
>  static struct rhashtable_params test_rht_params_dup = {
> @@ -294,8 +293,6 @@ static int __init test_rhltable(unsigned int entries)
>  	if (!obj_in_table)
>  		goto out_free;
>  
> -	/* nulls_base not supported in rhlist interface */
> -	test_rht_params.nulls_base = 0;
>  	err = rhltable_init(&rhlt, &test_rht_params);
>  	if (WARN_ON(err))
>  		goto out_free;


* [v2, 00/10] Support DPAA PTP clock and timestamping
From: Yangbo Lu @ 2018-06-07  3:22 UTC (permalink / raw)
  To: netdev, madalin.bucur, Richard Cochran, Rob Herring, Shawn Guo,
	David S . Miller
  Cc: devicetree, linuxppc-dev, linux-arm-kernel, linux-kernel,
	Yangbo Lu

This patchset adds support for the DPAA FMan PTP clock and HW timestamping.
- Patches #1 to #5 add support for the DPAA FMan 1588 timer in the
  ptp_qoriq driver.
- Patches #6 to #10 add HW timestamping support to the
  DPAA ethernet driver.

Yangbo Lu (10):
  fsl/fman: share the event interrupt
  ptp: support DPAA FMan 1588 timer in ptp_qoriq
  dt-binding: ptp_qoriq: add DPAA FMan support
  powerpc/mpc85xx: move ptp timer out of fman in dts
  arm64: dts: fsl: move ptp timer out of fman
  fsl/fman: add set_tstamp interface
  fsl/fman_port: support getting timestamp field
  fsl/fman: define frame description command UPD
  dpaa_eth: add support for hardware timestamping
  dpaa_eth: add the get_ts_info interface for ethtool

 Documentation/devicetree/bindings/net/fsl-fman.txt |   25 +-----
 .../devicetree/bindings/ptp/ptp-qoriq.txt          |   15 +++-
 arch/arm64/boot/dts/freescale/qoriq-fman3-0.dtsi   |   14 ++-
 arch/powerpc/boot/dts/fsl/qoriq-fman-0.dtsi        |   14 ++-
 arch/powerpc/boot/dts/fsl/qoriq-fman-1.dtsi        |   14 ++-
 arch/powerpc/boot/dts/fsl/qoriq-fman3-0.dtsi       |   14 ++-
 arch/powerpc/boot/dts/fsl/qoriq-fman3-1.dtsi       |   14 ++-
 arch/powerpc/boot/dts/fsl/qoriq-fman3l-0.dtsi      |   14 ++-
 drivers/net/ethernet/freescale/dpaa/dpaa_eth.c     |  101 ++++++++++++++++++-
 drivers/net/ethernet/freescale/dpaa/dpaa_eth.h     |    3 +
 drivers/net/ethernet/freescale/dpaa/dpaa_ethtool.c |   39 ++++++++
 drivers/net/ethernet/freescale/fman/fman.c         |    3 +-
 drivers/net/ethernet/freescale/fman/fman.h         |    1 +
 drivers/net/ethernet/freescale/fman/fman_dtsec.c   |   27 +++++
 drivers/net/ethernet/freescale/fman/fman_dtsec.h   |    1 +
 drivers/net/ethernet/freescale/fman/fman_memac.c   |    5 +
 drivers/net/ethernet/freescale/fman/fman_memac.h   |    1 +
 drivers/net/ethernet/freescale/fman/fman_port.c    |   12 +++
 drivers/net/ethernet/freescale/fman/fman_port.h    |    3 +
 drivers/net/ethernet/freescale/fman/fman_tgec.c    |   21 ++++
 drivers/net/ethernet/freescale/fman/fman_tgec.h    |    1 +
 drivers/net/ethernet/freescale/fman/mac.c          |    3 +
 drivers/net/ethernet/freescale/fman/mac.h          |    1 +
 drivers/ptp/Kconfig                                |    2 +-
 drivers/ptp/ptp_qoriq.c                            |  104 ++++++++++++-------
 include/linux/fsl/ptp_qoriq.h                      |   38 ++++++--
 26 files changed, 375 insertions(+), 115 deletions(-)


* [v2, 01/10] fsl/fman: share the event interrupt
From: Yangbo Lu @ 2018-06-07  3:22 UTC (permalink / raw)
  To: netdev, madalin.bucur, Richard Cochran, Rob Herring, Shawn Guo,
	David S . Miller
  Cc: devicetree, linuxppc-dev, linux-arm-kernel, linux-kernel,
	Yangbo Lu
In-Reply-To: <20180607032256.39802-1-yangbo.lu@nxp.com>

This patch makes the FMan event interrupt shared, because
the 1588 timer driver will also use this interrupt.
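
As background on what sharing implies: with IRQF_SHARED, every handler
registered on the line is called, and each one must check its own device
and return IRQ_NONE when the event is not its own. The userspace model
below is only a sketch of that contract; the toy_* types and functions
are hypothetical and are not kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>

enum toy_irqreturn { TOY_IRQ_NONE = 0, TOY_IRQ_HANDLED = 1 };

typedef enum toy_irqreturn (*toy_handler_t)(void *dev);

/* One registration on the shared line: handler plus its device cookie. */
struct toy_action {
	toy_handler_t handler;
	void *dev;
};

/* Toy device: a pending flag standing in for an event status register. */
struct toy_dev {
	unsigned int pending;
	unsigned int handled;
};

/* Handler that only claims the interrupt if its own device raised it. */
enum toy_irqreturn toy_dev_isr(void *priv)
{
	struct toy_dev *dev = priv;

	if (!dev->pending)
		return TOY_IRQ_NONE;	/* not ours, let the others look */
	dev->pending = 0;		/* acknowledge the event */
	dev->handled++;
	return TOY_IRQ_HANDLED;
}

/* Call every handler on the line; return how many claimed the interrupt. */
int toy_dispatch(const struct toy_action *actions, size_t n)
{
	int claimed = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		assert(actions[i].handler != NULL);
		if (actions[i].handler(actions[i].dev) == TOY_IRQ_HANDLED)
			claimed++;
	}
	return claimed;
}
```

With two toy devices on one line and only the second one pending, a
dispatch is claimed exactly once and the first device is left untouched.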

Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>
---
Changes for v2:
	- None.
---
 drivers/net/ethernet/freescale/fman/fman.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fman/fman.c b/drivers/net/ethernet/freescale/fman/fman.c
index 9530405..c415ac6 100644
--- a/drivers/net/ethernet/freescale/fman/fman.c
+++ b/drivers/net/ethernet/freescale/fman/fman.c
@@ -2801,7 +2801,8 @@ static irqreturn_t fman_irq(int irq, void *handle)
 	of_node_put(muram_node);
 	of_node_put(fm_node);
 
-	err = devm_request_irq(&of_dev->dev, irq, fman_irq, 0, "fman", fman);
+	err = devm_request_irq(&of_dev->dev, irq, fman_irq, IRQF_SHARED,
+			       "fman", fman);
 	if (err < 0) {
 		dev_err(&of_dev->dev, "%s: irq %d allocation failed (error = %d)\n",
 			__func__, irq, err);
-- 
1.7.1


* [v2, 02/10] ptp: support DPAA FMan 1588 timer in ptp_qoriq
From: Yangbo Lu @ 2018-06-07  3:22 UTC (permalink / raw)
  To: netdev, madalin.bucur, Richard Cochran, Rob Herring, Shawn Guo,
	David S . Miller
  Cc: devicetree, linuxppc-dev, linux-arm-kernel, linux-kernel,
	Yangbo Lu
In-Reply-To: <20180607032256.39802-1-yangbo.lu@nxp.com>

This patch adds support for the DPAA (Data Path Acceleration Architecture)
1588 timer by adding the "fsl,fman-ptp-timer" compatible, sharing the
interrupt with FMan, adding the FSL_DPAA_ETH dependency, and fixing
up the register offsets, which differ between eTSEC and FMan.
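
The offset selection can be sketched in userspace as follows. The offset
values are the ones defined in this patch; the struct and function names
are hypothetical and only serve the illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Byte offsets of the four register groups from the mapped base. */
struct toy_ptp_offsets {
	uint32_t ctrl;
	uint32_t alarm;
	uint32_t fiper;
	uint32_t etts;
};

/* Pick the group offsets by compatible string, as the probe code does. */
struct toy_ptp_offsets toy_ptp_offsets_for(const char *compatible)
{
	struct toy_ptp_offsets off;

	assert(compatible != NULL);
	if (strcmp(compatible, "fsl,fman-ptp-timer") == 0) {
		off.ctrl  = 0x80;	/* FMAN_CTRL_REGS_OFFSET */
		off.alarm = 0xb8;	/* FMAN_ALARM_REGS_OFFSET */
		off.fiper = 0xd0;	/* FMAN_FIPER_REGS_OFFSET */
		off.etts  = 0xe0;	/* FMAN_ETTS_REGS_OFFSET */
	} else {
		off.ctrl  = 0x0;	/* CTRL_REGS_OFFSET */
		off.alarm = 0x40;	/* ALARM_REGS_OFFSET */
		off.fiper = 0x80;	/* FIPER_REGS_OFFSET */
		off.etts  = 0xa0;	/* ETTS_REGS_OFFSET */
	}
	return off;
}
```

The spacing between groups is not the same on the two devices (the alarm
group starts 0x40 after the control group on eTSEC but 0x38 after it on
FMan), which is why a single flat register struct no longer fits both.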

Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>
---
Changes for v2:
	- None.
---
 drivers/ptp/Kconfig           |    2 +-
 drivers/ptp/ptp_qoriq.c       |  104 ++++++++++++++++++++++++++---------------
 include/linux/fsl/ptp_qoriq.h |   38 ++++++++++++---
 3 files changed, 98 insertions(+), 46 deletions(-)

diff --git a/drivers/ptp/Kconfig b/drivers/ptp/Kconfig
index 474c988..d137c48 100644
--- a/drivers/ptp/Kconfig
+++ b/drivers/ptp/Kconfig
@@ -43,7 +43,7 @@ config PTP_1588_CLOCK_DTE
 
 config PTP_1588_CLOCK_QORIQ
 	tristate "Freescale QorIQ 1588 timer as PTP clock"
-	depends on GIANFAR
+	depends on GIANFAR || FSL_DPAA_ETH
 	depends on PTP_1588_CLOCK
 	default y
 	help
diff --git a/drivers/ptp/ptp_qoriq.c b/drivers/ptp/ptp_qoriq.c
index 1468a16..c4e3545 100644
--- a/drivers/ptp/ptp_qoriq.c
+++ b/drivers/ptp/ptp_qoriq.c
@@ -39,11 +39,12 @@
 /* Caller must hold qoriq_ptp->lock. */
 static u64 tmr_cnt_read(struct qoriq_ptp *qoriq_ptp)
 {
+	struct qoriq_ptp_registers *regs = &qoriq_ptp->regs;
 	u64 ns;
 	u32 lo, hi;
 
-	lo = qoriq_read(&qoriq_ptp->regs->tmr_cnt_l);
-	hi = qoriq_read(&qoriq_ptp->regs->tmr_cnt_h);
+	lo = qoriq_read(&regs->ctrl_regs->tmr_cnt_l);
+	hi = qoriq_read(&regs->ctrl_regs->tmr_cnt_h);
 	ns = ((u64) hi) << 32;
 	ns |= lo;
 	return ns;
@@ -52,16 +53,18 @@ static u64 tmr_cnt_read(struct qoriq_ptp *qoriq_ptp)
 /* Caller must hold qoriq_ptp->lock. */
 static void tmr_cnt_write(struct qoriq_ptp *qoriq_ptp, u64 ns)
 {
+	struct qoriq_ptp_registers *regs = &qoriq_ptp->regs;
 	u32 hi = ns >> 32;
 	u32 lo = ns & 0xffffffff;
 
-	qoriq_write(&qoriq_ptp->regs->tmr_cnt_l, lo);
-	qoriq_write(&qoriq_ptp->regs->tmr_cnt_h, hi);
+	qoriq_write(&regs->ctrl_regs->tmr_cnt_l, lo);
+	qoriq_write(&regs->ctrl_regs->tmr_cnt_h, hi);
 }
 
 /* Caller must hold qoriq_ptp->lock. */
 static void set_alarm(struct qoriq_ptp *qoriq_ptp)
 {
+	struct qoriq_ptp_registers *regs = &qoriq_ptp->regs;
 	u64 ns;
 	u32 lo, hi;
 
@@ -70,16 +73,18 @@ static void set_alarm(struct qoriq_ptp *qoriq_ptp)
 	ns -= qoriq_ptp->tclk_period;
 	hi = ns >> 32;
 	lo = ns & 0xffffffff;
-	qoriq_write(&qoriq_ptp->regs->tmr_alarm1_l, lo);
-	qoriq_write(&qoriq_ptp->regs->tmr_alarm1_h, hi);
+	qoriq_write(&regs->alarm_regs->tmr_alarm1_l, lo);
+	qoriq_write(&regs->alarm_regs->tmr_alarm1_h, hi);
 }
 
 /* Caller must hold qoriq_ptp->lock. */
 static void set_fipers(struct qoriq_ptp *qoriq_ptp)
 {
+	struct qoriq_ptp_registers *regs = &qoriq_ptp->regs;
+
 	set_alarm(qoriq_ptp);
-	qoriq_write(&qoriq_ptp->regs->tmr_fiper1, qoriq_ptp->tmr_fiper1);
-	qoriq_write(&qoriq_ptp->regs->tmr_fiper2, qoriq_ptp->tmr_fiper2);
+	qoriq_write(&regs->fiper_regs->tmr_fiper1, qoriq_ptp->tmr_fiper1);
+	qoriq_write(&regs->fiper_regs->tmr_fiper2, qoriq_ptp->tmr_fiper2);
 }
 
 /*
@@ -89,16 +94,17 @@ static void set_fipers(struct qoriq_ptp *qoriq_ptp)
 static irqreturn_t isr(int irq, void *priv)
 {
 	struct qoriq_ptp *qoriq_ptp = priv;
+	struct qoriq_ptp_registers *regs = &qoriq_ptp->regs;
 	struct ptp_clock_event event;
 	u64 ns;
 	u32 ack = 0, lo, hi, mask, val;
 
-	val = qoriq_read(&qoriq_ptp->regs->tmr_tevent);
+	val = qoriq_read(&regs->ctrl_regs->tmr_tevent);
 
 	if (val & ETS1) {
 		ack |= ETS1;
-		hi = qoriq_read(&qoriq_ptp->regs->tmr_etts1_h);
-		lo = qoriq_read(&qoriq_ptp->regs->tmr_etts1_l);
+		hi = qoriq_read(&regs->etts_regs->tmr_etts1_h);
+		lo = qoriq_read(&regs->etts_regs->tmr_etts1_l);
 		event.type = PTP_CLOCK_EXTTS;
 		event.index = 0;
 		event.timestamp = ((u64) hi) << 32;
@@ -108,8 +114,8 @@ static irqreturn_t isr(int irq, void *priv)
 
 	if (val & ETS2) {
 		ack |= ETS2;
-		hi = qoriq_read(&qoriq_ptp->regs->tmr_etts2_h);
-		lo = qoriq_read(&qoriq_ptp->regs->tmr_etts2_l);
+		hi = qoriq_read(&regs->etts_regs->tmr_etts2_h);
+		lo = qoriq_read(&regs->etts_regs->tmr_etts2_l);
 		event.type = PTP_CLOCK_EXTTS;
 		event.index = 1;
 		event.timestamp = ((u64) hi) << 32;
@@ -130,16 +136,16 @@ static irqreturn_t isr(int irq, void *priv)
 			hi = ns >> 32;
 			lo = ns & 0xffffffff;
 			spin_lock(&qoriq_ptp->lock);
-			qoriq_write(&qoriq_ptp->regs->tmr_alarm2_l, lo);
-			qoriq_write(&qoriq_ptp->regs->tmr_alarm2_h, hi);
+			qoriq_write(&regs->alarm_regs->tmr_alarm2_l, lo);
+			qoriq_write(&regs->alarm_regs->tmr_alarm2_h, hi);
 			spin_unlock(&qoriq_ptp->lock);
 			qoriq_ptp->alarm_value = ns;
 		} else {
-			qoriq_write(&qoriq_ptp->regs->tmr_tevent, ALM2);
+			qoriq_write(&regs->ctrl_regs->tmr_tevent, ALM2);
 			spin_lock(&qoriq_ptp->lock);
-			mask = qoriq_read(&qoriq_ptp->regs->tmr_temask);
+			mask = qoriq_read(&regs->ctrl_regs->tmr_temask);
 			mask &= ~ALM2EN;
-			qoriq_write(&qoriq_ptp->regs->tmr_temask, mask);
+			qoriq_write(&regs->ctrl_regs->tmr_temask, mask);
 			spin_unlock(&qoriq_ptp->lock);
 			qoriq_ptp->alarm_value = 0;
 			qoriq_ptp->alarm_interval = 0;
@@ -153,7 +159,7 @@ static irqreturn_t isr(int irq, void *priv)
 	}
 
 	if (ack) {
-		qoriq_write(&qoriq_ptp->regs->tmr_tevent, ack);
+		qoriq_write(&regs->ctrl_regs->tmr_tevent, ack);
 		return IRQ_HANDLED;
 	} else
 		return IRQ_NONE;
@@ -169,6 +175,7 @@ static int ptp_qoriq_adjfine(struct ptp_clock_info *ptp, long scaled_ppm)
 	u32 tmr_add;
 	int neg_adj = 0;
 	struct qoriq_ptp *qoriq_ptp = container_of(ptp, struct qoriq_ptp, caps);
+	struct qoriq_ptp_registers *regs = &qoriq_ptp->regs;
 
 	if (scaled_ppm < 0) {
 		neg_adj = 1;
@@ -186,7 +193,7 @@ static int ptp_qoriq_adjfine(struct ptp_clock_info *ptp, long scaled_ppm)
 
 	tmr_add = neg_adj ? tmr_add - diff : tmr_add + diff;
 
-	qoriq_write(&qoriq_ptp->regs->tmr_add, tmr_add);
+	qoriq_write(&regs->ctrl_regs->tmr_add, tmr_add);
 
 	return 0;
 }
@@ -250,6 +257,7 @@ static int ptp_qoriq_enable(struct ptp_clock_info *ptp,
 			      struct ptp_clock_request *rq, int on)
 {
 	struct qoriq_ptp *qoriq_ptp = container_of(ptp, struct qoriq_ptp, caps);
+	struct qoriq_ptp_registers *regs = &qoriq_ptp->regs;
 	unsigned long flags;
 	u32 bit, mask;
 
@@ -266,23 +274,23 @@ static int ptp_qoriq_enable(struct ptp_clock_info *ptp,
 			return -EINVAL;
 		}
 		spin_lock_irqsave(&qoriq_ptp->lock, flags);
-		mask = qoriq_read(&qoriq_ptp->regs->tmr_temask);
+		mask = qoriq_read(&regs->ctrl_regs->tmr_temask);
 		if (on)
 			mask |= bit;
 		else
 			mask &= ~bit;
-		qoriq_write(&qoriq_ptp->regs->tmr_temask, mask);
+		qoriq_write(&regs->ctrl_regs->tmr_temask, mask);
 		spin_unlock_irqrestore(&qoriq_ptp->lock, flags);
 		return 0;
 
 	case PTP_CLK_REQ_PPS:
 		spin_lock_irqsave(&qoriq_ptp->lock, flags);
-		mask = qoriq_read(&qoriq_ptp->regs->tmr_temask);
+		mask = qoriq_read(&regs->ctrl_regs->tmr_temask);
 		if (on)
 			mask |= PP1EN;
 		else
 			mask &= ~PP1EN;
-		qoriq_write(&qoriq_ptp->regs->tmr_temask, mask);
+		qoriq_write(&regs->ctrl_regs->tmr_temask, mask);
 		spin_unlock_irqrestore(&qoriq_ptp->lock, flags);
 		return 0;
 
@@ -313,10 +321,12 @@ static int qoriq_ptp_probe(struct platform_device *dev)
 {
 	struct device_node *node = dev->dev.of_node;
 	struct qoriq_ptp *qoriq_ptp;
+	struct qoriq_ptp_registers *regs;
 	struct timespec64 now;
 	int err = -ENOMEM;
 	u32 tmr_ctrl;
 	unsigned long flags;
+	void __iomem *base;
 
 	qoriq_ptp = kzalloc(sizeof(*qoriq_ptp), GFP_KERNEL);
 	if (!qoriq_ptp)
@@ -351,7 +361,7 @@ static int qoriq_ptp_probe(struct platform_device *dev)
 		pr_err("irq not in device tree\n");
 		goto no_node;
 	}
-	if (request_irq(qoriq_ptp->irq, isr, 0, DRIVER, qoriq_ptp)) {
+	if (request_irq(qoriq_ptp->irq, isr, IRQF_SHARED, DRIVER, qoriq_ptp)) {
 		pr_err("request_irq failed\n");
 		goto no_node;
 	}
@@ -368,12 +378,27 @@ static int qoriq_ptp_probe(struct platform_device *dev)
 
 	spin_lock_init(&qoriq_ptp->lock);
 
-	qoriq_ptp->regs = ioremap(qoriq_ptp->rsrc->start,
-				resource_size(qoriq_ptp->rsrc));
-	if (!qoriq_ptp->regs) {
+	base = ioremap(qoriq_ptp->rsrc->start,
+		       resource_size(qoriq_ptp->rsrc));
+	if (!base) {
 		pr_err("ioremap ptp registers failed\n");
 		goto no_ioremap;
 	}
+
+	qoriq_ptp->base = base;
+
+	if (of_device_is_compatible(node, "fsl,fman-ptp-timer")) {
+		qoriq_ptp->regs.ctrl_regs = base + FMAN_CTRL_REGS_OFFSET;
+		qoriq_ptp->regs.alarm_regs = base + FMAN_ALARM_REGS_OFFSET;
+		qoriq_ptp->regs.fiper_regs = base + FMAN_FIPER_REGS_OFFSET;
+		qoriq_ptp->regs.etts_regs = base + FMAN_ETTS_REGS_OFFSET;
+	} else {
+		qoriq_ptp->regs.ctrl_regs = base + CTRL_REGS_OFFSET;
+		qoriq_ptp->regs.alarm_regs = base + ALARM_REGS_OFFSET;
+		qoriq_ptp->regs.fiper_regs = base + FIPER_REGS_OFFSET;
+		qoriq_ptp->regs.etts_regs = base + ETTS_REGS_OFFSET;
+	}
+
 	getnstimeofday64(&now);
 	ptp_qoriq_settime(&qoriq_ptp->caps, &now);
 
@@ -383,13 +408,14 @@ static int qoriq_ptp_probe(struct platform_device *dev)
 
 	spin_lock_irqsave(&qoriq_ptp->lock, flags);
 
-	qoriq_write(&qoriq_ptp->regs->tmr_ctrl,   tmr_ctrl);
-	qoriq_write(&qoriq_ptp->regs->tmr_add,    qoriq_ptp->tmr_add);
-	qoriq_write(&qoriq_ptp->regs->tmr_prsc,   qoriq_ptp->tmr_prsc);
-	qoriq_write(&qoriq_ptp->regs->tmr_fiper1, qoriq_ptp->tmr_fiper1);
-	qoriq_write(&qoriq_ptp->regs->tmr_fiper2, qoriq_ptp->tmr_fiper2);
+	regs = &qoriq_ptp->regs;
+	qoriq_write(&regs->ctrl_regs->tmr_ctrl,   tmr_ctrl);
+	qoriq_write(&regs->ctrl_regs->tmr_add,    qoriq_ptp->tmr_add);
+	qoriq_write(&regs->ctrl_regs->tmr_prsc,   qoriq_ptp->tmr_prsc);
+	qoriq_write(&regs->fiper_regs->tmr_fiper1, qoriq_ptp->tmr_fiper1);
+	qoriq_write(&regs->fiper_regs->tmr_fiper2, qoriq_ptp->tmr_fiper2);
 	set_alarm(qoriq_ptp);
-	qoriq_write(&qoriq_ptp->regs->tmr_ctrl,   tmr_ctrl|FIPERST|RTPE|TE|FRD);
+	qoriq_write(&regs->ctrl_regs->tmr_ctrl,   tmr_ctrl|FIPERST|RTPE|TE|FRD);
 
 	spin_unlock_irqrestore(&qoriq_ptp->lock, flags);
 
@@ -405,7 +431,7 @@ static int qoriq_ptp_probe(struct platform_device *dev)
 	return 0;
 
 no_clock:
-	iounmap(qoriq_ptp->regs);
+	iounmap(qoriq_ptp->base);
 no_ioremap:
 	release_resource(qoriq_ptp->rsrc);
 no_resource:
@@ -419,12 +445,13 @@ static int qoriq_ptp_probe(struct platform_device *dev)
 static int qoriq_ptp_remove(struct platform_device *dev)
 {
 	struct qoriq_ptp *qoriq_ptp = platform_get_drvdata(dev);
+	struct qoriq_ptp_registers *regs = &qoriq_ptp->regs;
 
-	qoriq_write(&qoriq_ptp->regs->tmr_temask, 0);
-	qoriq_write(&qoriq_ptp->regs->tmr_ctrl,   0);
+	qoriq_write(&regs->ctrl_regs->tmr_temask, 0);
+	qoriq_write(&regs->ctrl_regs->tmr_ctrl,   0);
 
 	ptp_clock_unregister(qoriq_ptp->clock);
-	iounmap(qoriq_ptp->regs);
+	iounmap(qoriq_ptp->base);
 	release_resource(qoriq_ptp->rsrc);
 	free_irq(qoriq_ptp->irq, qoriq_ptp);
 	kfree(qoriq_ptp);
@@ -434,6 +461,7 @@ static int qoriq_ptp_remove(struct platform_device *dev)
 
 static const struct of_device_id match_table[] = {
 	{ .compatible = "fsl,etsec-ptp" },
+	{ .compatible = "fsl,fman-ptp-timer" },
 	{},
 };
 MODULE_DEVICE_TABLE(of, match_table);
diff --git a/include/linux/fsl/ptp_qoriq.h b/include/linux/fsl/ptp_qoriq.h
index b462d9e..dc3dac4 100644
--- a/include/linux/fsl/ptp_qoriq.h
+++ b/include/linux/fsl/ptp_qoriq.h
@@ -11,9 +11,8 @@
 
 /*
  * qoriq ptp registers
- * Generated by regen.tcl on Thu May 13 01:38:57 PM CEST 2010
  */
-struct qoriq_ptp_registers {
+struct ctrl_regs {
 	u32 tmr_ctrl;     /* Timer control register */
 	u32 tmr_tevent;   /* Timestamp event register */
 	u32 tmr_temask;   /* Timer event mask register */
@@ -28,22 +27,47 @@ struct qoriq_ptp_registers {
 	u8  res1[4];
 	u32 tmroff_h;     /* Timer offset high */
 	u32 tmroff_l;     /* Timer offset low */
-	u8  res2[8];
+};
+
+struct alarm_regs {
 	u32 tmr_alarm1_h; /* Timer alarm 1 high register */
 	u32 tmr_alarm1_l; /* Timer alarm 1 high register */
 	u32 tmr_alarm2_h; /* Timer alarm 2 high register */
 	u32 tmr_alarm2_l; /* Timer alarm 2 high register */
-	u8  res3[48];
+};
+
+struct fiper_regs {
 	u32 tmr_fiper1;   /* Timer fixed period interval */
 	u32 tmr_fiper2;   /* Timer fixed period interval */
 	u32 tmr_fiper3;   /* Timer fixed period interval */
-	u8  res4[20];
+};
+
+struct etts_regs {
 	u32 tmr_etts1_h;  /* Timestamp of general purpose external trigger */
 	u32 tmr_etts1_l;  /* Timestamp of general purpose external trigger */
 	u32 tmr_etts2_h;  /* Timestamp of general purpose external trigger */
 	u32 tmr_etts2_l;  /* Timestamp of general purpose external trigger */
 };
 
+struct qoriq_ptp_registers {
+	struct ctrl_regs __iomem *ctrl_regs;
+	struct alarm_regs __iomem *alarm_regs;
+	struct fiper_regs __iomem *fiper_regs;
+	struct etts_regs __iomem *etts_regs;
+};
+
+/* Offset definitions for the four register groups */
+#define CTRL_REGS_OFFSET	0x0
+#define ALARM_REGS_OFFSET	0x40
+#define FIPER_REGS_OFFSET	0x80
+#define ETTS_REGS_OFFSET	0xa0
+
+#define FMAN_CTRL_REGS_OFFSET	0x80
+#define FMAN_ALARM_REGS_OFFSET	0xb8
+#define FMAN_FIPER_REGS_OFFSET	0xd0
+#define FMAN_ETTS_REGS_OFFSET	0xe0
+
+
 /* Bit definitions for the TMR_CTRL register */
 #define ALM1P                 (1<<31) /* Alarm1 output polarity */
 #define ALM2P                 (1<<30) /* Alarm2 output polarity */
@@ -105,10 +129,10 @@ struct qoriq_ptp_registers {
 #define DRIVER		"ptp_qoriq"
 #define DEFAULT_CKSEL	1
 #define N_EXT_TS	2
-#define REG_SIZE	sizeof(struct qoriq_ptp_registers)
 
 struct qoriq_ptp {
-	struct qoriq_ptp_registers __iomem *regs;
+	void __iomem *base;
+	struct qoriq_ptp_registers regs;
 	spinlock_t lock; /* protects regs */
 	struct ptp_clock *clock;
 	struct ptp_clock_info caps;
-- 
1.7.1


* [v2, 03/10] dt-binding: ptp_qoriq: add DPAA FMan support
From: Yangbo Lu @ 2018-06-07  3:22 UTC (permalink / raw)
  To: netdev, madalin.bucur, Richard Cochran, Rob Herring, Shawn Guo,
	David S . Miller
  Cc: devicetree, linuxppc-dev, linux-arm-kernel, linux-kernel,
	Yangbo Lu
In-Reply-To: <20180607032256.39802-1-yangbo.lu@nxp.com>

This patch adds a bindings description for the DPAA
FMan 1588 timer, and removes its description from the
fsl-fman dt-bindings document.

Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>
---
Changes for v2:
	- None.
---
 Documentation/devicetree/bindings/net/fsl-fman.txt |   25 +-------------------
 .../devicetree/bindings/ptp/ptp-qoriq.txt          |   15 +++++++++--
 2 files changed, 13 insertions(+), 27 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/fsl-fman.txt b/Documentation/devicetree/bindings/net/fsl-fman.txt
index df873d1..74603dd 100644
--- a/Documentation/devicetree/bindings/net/fsl-fman.txt
+++ b/Documentation/devicetree/bindings/net/fsl-fman.txt
@@ -356,30 +356,7 @@ ethernet@e0000 {
 ============================================================================
 FMan IEEE 1588 Node
 
-DESCRIPTION
-
-The FMan interface to support IEEE 1588
-
-
-PROPERTIES
-
-- compatible
-		Usage: required
-		Value type: <stringlist>
-		Definition: A standard property.
-		Must include "fsl,fman-ptp-timer".
-
-- reg
-		Usage: required
-		Value type: <prop-encoded-array>
-		Definition: A standard property.
-
-EXAMPLE
-
-ptp-timer@fe000 {
-	compatible = "fsl,fman-ptp-timer";
-	reg = <0xfe000 0x1000>;
-};
+Refer to Documentation/devicetree/bindings/ptp/ptp-qoriq.txt
 
 =============================================================================
 FMan MDIO Node
diff --git a/Documentation/devicetree/bindings/ptp/ptp-qoriq.txt b/Documentation/devicetree/bindings/ptp/ptp-qoriq.txt
index 0f569d8..c5d0e79 100644
--- a/Documentation/devicetree/bindings/ptp/ptp-qoriq.txt
+++ b/Documentation/devicetree/bindings/ptp/ptp-qoriq.txt
@@ -2,7 +2,8 @@
 
 General Properties:
 
-  - compatible   Should be "fsl,etsec-ptp"
+  - compatible   Should be "fsl,etsec-ptp" for eTSEC
+                 Should be "fsl,fman-ptp-timer" for DPAA FMan
   - reg          Offset and length of the register set for the device
   - interrupts   There should be at least two interrupts. Some devices
                  have as many as four PTP related interrupts.
@@ -43,14 +44,22 @@ Clock Properties:
   value, which will be directly written in those bits, that is why,
   according to reference manual, the next clock sources can be used:
 
+  For eTSEC,
   <0> - external high precision timer reference clock (TSEC_TMR_CLK
         input is used for this purpose);
   <1> - eTSEC system clock;
   <2> - eTSEC1 transmit clock;
   <3> - RTC clock input.
 
-  When this attribute is not used, eTSEC system clock will serve as
-  IEEE 1588 timer reference clock.
+  For DPAA FMan,
+  <0> - external high precision timer reference clock (TMR_1588_CLK)
+  <1> - MAC system clock (1/2 FMan clock)
+  <2> - reserved
+  <3> - RTC clock oscillator
+
+  When this attribute is not used, the IEEE 1588 timer reference clock
+  will use the eTSEC system clock (for Gianfar) or the MAC system
+  clock (for DPAA).
 
 Example:
 
-- 
1.7.1

* [v2, 04/10] powerpc/mpc85xx: move ptp timer out of fman in dts
From: Yangbo Lu @ 2018-06-07  3:22 UTC (permalink / raw)
  To: netdev, madalin.bucur, Richard Cochran, Rob Herring, Shawn Guo,
	David S . Miller
  Cc: devicetree, linuxppc-dev, linux-arm-kernel, linux-kernel,
	Yangbo Lu
In-Reply-To: <20180607032256.39802-1-yangbo.lu@nxp.com>

Move the ptp timer node out of the fman node.
Because the ptp timer will be probed by the ptp_qoriq driver,
it should be an independent device to avoid conflicting
memory mappings.

Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>
---
Changes for v2:
	- None.
---
 arch/powerpc/boot/dts/fsl/qoriq-fman-0.dtsi   |   14 ++++++++------
 arch/powerpc/boot/dts/fsl/qoriq-fman-1.dtsi   |   14 ++++++++------
 arch/powerpc/boot/dts/fsl/qoriq-fman3-0.dtsi  |   14 ++++++++------
 arch/powerpc/boot/dts/fsl/qoriq-fman3-1.dtsi  |   14 ++++++++------
 arch/powerpc/boot/dts/fsl/qoriq-fman3l-0.dtsi |   14 ++++++++------
 5 files changed, 40 insertions(+), 30 deletions(-)

diff --git a/arch/powerpc/boot/dts/fsl/qoriq-fman-0.dtsi b/arch/powerpc/boot/dts/fsl/qoriq-fman-0.dtsi
index abd01d4..6b124f7 100644
--- a/arch/powerpc/boot/dts/fsl/qoriq-fman-0.dtsi
+++ b/arch/powerpc/boot/dts/fsl/qoriq-fman-0.dtsi
@@ -37,12 +37,13 @@ fman0: fman@400000 {
 	#size-cells = <1>;
 	cell-index = <0>;
 	compatible = "fsl,fman";
-	ranges = <0 0x400000 0x100000>;
-	reg = <0x400000 0x100000>;
+	ranges = <0 0x400000 0xfe000>;
+	reg = <0x400000 0xfe000>;
 	interrupts = <96 2 0 0>, <16 2 1 1>;
 	clocks = <&clockgen 3 0>;
 	clock-names = "fmanclk";
 	fsl,qman-channel-range = <0x40 0xc>;
+	ptimer-handle = <&ptp_timer0>;
 
 	muram@0 {
 		compatible = "fsl,fman-muram";
@@ -93,9 +94,10 @@ fman0: fman@400000 {
 		reg = <0x87000 0x1000>;
 		status = "disabled";
 	};
+};
 
-	ptp_timer0: ptp-timer@fe000 {
-		compatible = "fsl,fman-ptp-timer";
-		reg = <0xfe000 0x1000>;
-	};
+ptp_timer0: ptp-timer@4fe000 {
+	compatible = "fsl,fman-ptp-timer";
+	reg = <0x4fe000 0x1000>;
+	interrupts = <96 2 0 0>;
 };
diff --git a/arch/powerpc/boot/dts/fsl/qoriq-fman-1.dtsi b/arch/powerpc/boot/dts/fsl/qoriq-fman-1.dtsi
index debea75..b80aaf5 100644
--- a/arch/powerpc/boot/dts/fsl/qoriq-fman-1.dtsi
+++ b/arch/powerpc/boot/dts/fsl/qoriq-fman-1.dtsi
@@ -37,12 +37,13 @@ fman1: fman@500000 {
 	#size-cells = <1>;
 	cell-index = <1>;
 	compatible = "fsl,fman";
-	ranges = <0 0x500000 0x100000>;
-	reg = <0x500000 0x100000>;
+	ranges = <0 0x500000 0xfe000>;
+	reg = <0x500000 0xfe000>;
 	interrupts = <97 2 0 0>, <16 2 1 0>;
 	clocks = <&clockgen 3 1>;
 	clock-names = "fmanclk";
 	fsl,qman-channel-range = <0x60 0xc>;
+	ptimer-handle = <&ptp_timer1>;
 
 	muram@0 {
 		compatible = "fsl,fman-muram";
@@ -93,9 +94,10 @@ fman1: fman@500000 {
 		reg = <0x87000 0x1000>;
 		status = "disabled";
 	};
+};
 
-	ptp_timer1: ptp-timer@fe000 {
-		compatible = "fsl,fman-ptp-timer";
-		reg = <0xfe000 0x1000>;
-	};
+ptp_timer1: ptp-timer@5fe000 {
+	compatible = "fsl,fman-ptp-timer";
+	reg = <0x5fe000 0x1000>;
+	interrupts = <97 2 0 0>;
 };
diff --git a/arch/powerpc/boot/dts/fsl/qoriq-fman3-0.dtsi b/arch/powerpc/boot/dts/fsl/qoriq-fman3-0.dtsi
index 3a20e0d..d3720fd 100644
--- a/arch/powerpc/boot/dts/fsl/qoriq-fman3-0.dtsi
+++ b/arch/powerpc/boot/dts/fsl/qoriq-fman3-0.dtsi
@@ -37,12 +37,13 @@ fman0: fman@400000 {
 	#size-cells = <1>;
 	cell-index = <0>;
 	compatible = "fsl,fman";
-	ranges = <0 0x400000 0x100000>;
-	reg = <0x400000 0x100000>;
+	ranges = <0 0x400000 0xfe000>;
+	reg = <0x400000 0xfe000>;
 	interrupts = <96 2 0 0>, <16 2 1 1>;
 	clocks = <&clockgen 3 0>;
 	clock-names = "fmanclk";
 	fsl,qman-channel-range = <0x800 0x10>;
+	ptimer-handle = <&ptp_timer0>;
 
 	muram@0 {
 		compatible = "fsl,fman-muram";
@@ -98,9 +99,10 @@ fman0: fman@400000 {
 		compatible = "fsl,fman-memac-mdio", "fsl,fman-xmdio";
 		reg = <0xfd000 0x1000>;
 	};
+};
 
-	ptp_timer0: ptp-timer@fe000 {
-		compatible = "fsl,fman-ptp-timer";
-		reg = <0xfe000 0x1000>;
-	};
+ptp_timer0: ptp-timer@4fe000 {
+	compatible = "fsl,fman-ptp-timer";
+	reg = <0x4fe000 0x1000>;
+	interrupts = <96 2 0 0>;
 };
diff --git a/arch/powerpc/boot/dts/fsl/qoriq-fman3-1.dtsi b/arch/powerpc/boot/dts/fsl/qoriq-fman3-1.dtsi
index 82750ac..ae34c20 100644
--- a/arch/powerpc/boot/dts/fsl/qoriq-fman3-1.dtsi
+++ b/arch/powerpc/boot/dts/fsl/qoriq-fman3-1.dtsi
@@ -37,12 +37,13 @@ fman1: fman@500000 {
 	#size-cells = <1>;
 	cell-index = <1>;
 	compatible = "fsl,fman";
-	ranges = <0 0x500000 0x100000>;
-	reg = <0x500000 0x100000>;
+	ranges = <0 0x500000 0xfe000>;
+	reg = <0x500000 0xfe000>;
 	interrupts = <97 2 0 0>, <16 2 1 0>;
 	clocks = <&clockgen 3 1>;
 	clock-names = "fmanclk";
 	fsl,qman-channel-range = <0x820 0x10>;
+	ptimer-handle = <&ptp_timer1>;
 
 	muram@0 {
 		compatible = "fsl,fman-muram";
@@ -98,9 +99,10 @@ fman1: fman@500000 {
 		compatible = "fsl,fman-memac-mdio", "fsl,fman-xmdio";
 		reg = <0xfd000 0x1000>;
 	};
+};
 
-	ptp_timer1: ptp-timer@fe000 {
-		compatible = "fsl,fman-ptp-timer";
-		reg = <0xfe000 0x1000>;
-	};
+ptp_timer1: ptp-timer@5fe000 {
+	compatible = "fsl,fman-ptp-timer";
+	reg = <0x5fe000 0x1000>;
+	interrupts = <97 2 0 0>;
 };
diff --git a/arch/powerpc/boot/dts/fsl/qoriq-fman3l-0.dtsi b/arch/powerpc/boot/dts/fsl/qoriq-fman3l-0.dtsi
index 7f60b60..02f2755 100644
--- a/arch/powerpc/boot/dts/fsl/qoriq-fman3l-0.dtsi
+++ b/arch/powerpc/boot/dts/fsl/qoriq-fman3l-0.dtsi
@@ -37,12 +37,13 @@ fman0: fman@400000 {
 	#size-cells = <1>;
 	cell-index = <0>;
 	compatible = "fsl,fman";
-	ranges = <0 0x400000 0x100000>;
-	reg = <0x400000 0x100000>;
+	ranges = <0 0x400000 0xfe000>;
+	reg = <0x400000 0xfe000>;
 	interrupts = <96 2 0 0>, <16 2 1 1>;
 	clocks = <&clockgen 3 0>;
 	clock-names = "fmanclk";
 	fsl,qman-channel-range = <0x800 0x10>;
+	ptimer-handle = <&ptp_timer0>;
 
 	muram@0 {
 		compatible = "fsl,fman-muram";
@@ -86,9 +87,10 @@ fman0: fman@400000 {
 		compatible = "fsl,fman-memac-mdio", "fsl,fman-xmdio";
 		reg = <0xfd000 0x1000>;
 	};
+};
 
-	ptp_timer0: ptp-timer@fe000 {
-		compatible = "fsl,fman-ptp-timer";
-		reg = <0xfe000 0x1000>;
-	};
+ptp_timer0: ptp-timer@4fe000 {
+	compatible = "fsl,fman-ptp-timer";
+	reg = <0x4fe000 0x1000>;
+	interrupts = <96 2 0 0>;
 };
-- 
1.7.1
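
The address arithmetic behind the patch above can be checked with a small
sketch. It is not part of the patch and makes no claim about how the
drivers map memory; it only uses the base addresses, sizes and offsets
shown in the powerpc dtsi hunks. The helper names (region, overlaps,
check) are hypothetical and exist only for illustration.

```python
# Sketch only: verifies the register ranges from the powerpc dtsi hunks.
# All helper names are hypothetical; the numbers come from the diff.

FMAN_BASES = {"fman0": 0x400000, "fman1": 0x500000}
OLD_FMAN_SIZE = 0x100000   # old "reg" size of the fman node
NEW_FMAN_SIZE = 0xFE000    # new "reg" size of the fman node
TIMER_OFFSET = 0xFE000     # old child offset of ptp-timer inside fman
TIMER_SIZE = 0x1000        # size of the ptp-timer register block


def region(base, size):
    """Return the half-open address interval [base, base + size)."""
    return (base, base + size)


def overlaps(a, b):
    """True if two half-open intervals share at least one address."""
    return a[0] < b[1] and b[0] < a[1]


def check(base):
    """Compare the timer block against the old and new fman ranges."""
    timer = region(base + TIMER_OFFSET, TIMER_SIZE)
    return {
        "timer_base": timer[0],
        "old_overlap": overlaps(region(base, OLD_FMAN_SIZE), timer),
        "new_overlap": overlaps(region(base, NEW_FMAN_SIZE), timer),
    }


if __name__ == "__main__":
    for name, base in FMAN_BASES.items():
        r = check(base)
        print(f"{name}: timer at {r['timer_base']:#x}, "
              f"old overlap={r['old_overlap']}, "
              f"new overlap={r['new_overlap']}")
```

With the old 0x100000 size the fman range contains the timer block; with
the new 0xfe000 size the fman range ends exactly where the standalone
ptp-timer node (0x4fe000 or 0x5fe000) begins.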

* [v2, 05/10] arm64: dts: fsl: move ptp timer out of fman
From: Yangbo Lu @ 2018-06-07  3:22 UTC (permalink / raw)
  To: netdev, madalin.bucur, Richard Cochran, Rob Herring, Shawn Guo,
	David S . Miller
  Cc: devicetree, linuxppc-dev, linux-arm-kernel, linux-kernel,
	Yangbo Lu
In-Reply-To: <20180607032256.39802-1-yangbo.lu@nxp.com>

Move the ptp timer node out of the fman node.
Because the ptp timer will be probed by the ptp_qoriq driver,
it should be an independent device to avoid conflicting
memory mappings.

Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>
---
Changes for v2:
	- Fixed address-cells for ptp-timer.
---
 arch/arm64/boot/dts/freescale/qoriq-fman3-0.dtsi |   14 ++++++++------
 1 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/boot/dts/freescale/qoriq-fman3-0.dtsi b/arch/arm64/boot/dts/freescale/qoriq-fman3-0.dtsi
index 4dd0676..a56a408 100644
--- a/arch/arm64/boot/dts/freescale/qoriq-fman3-0.dtsi
+++ b/arch/arm64/boot/dts/freescale/qoriq-fman3-0.dtsi
@@ -11,13 +11,14 @@ fman0: fman@1a00000 {
 	#size-cells = <1>;
 	cell-index = <0>;
 	compatible = "fsl,fman";
-	ranges = <0x0 0x0 0x1a00000 0x100000>;
-	reg = <0x0 0x1a00000 0x0 0x100000>;
+	ranges = <0x0 0x0 0x1a00000 0xfe000>;
+	reg = <0x0 0x1a00000 0x0 0xfe000>;
 	interrupts = <GIC_SPI 44 IRQ_TYPE_LEVEL_HIGH>,
 		     <GIC_SPI 45 IRQ_TYPE_LEVEL_HIGH>;
 	clocks = <&clockgen 3 0>;
 	clock-names = "fmanclk";
 	fsl,qman-channel-range = <0x800 0x10>;
+	ptimer-handle = <&ptp_timer0>;
 
 	muram@0 {
 		compatible = "fsl,fman-muram";
@@ -73,9 +74,10 @@ fman0: fman@1a00000 {
 		compatible = "fsl,fman-memac-mdio", "fsl,fman-xmdio";
 		reg = <0xfd000 0x1000>;
 	};
+};
 
-	ptp_timer0: ptp-timer@fe000 {
-		compatible = "fsl,fman-ptp-timer";
-		reg = <0xfe000 0x1000>;
-	};
+ptp_timer0: ptp-timer@1afe000 {
+	compatible = "fsl,fman-ptp-timer";
+	reg = <0x0 0x1afe000 0x0 0x1000>;
+	interrupts = <GIC_SPI 44 IRQ_TYPE_LEVEL_HIGH>;
 };
-- 
1.7.1
