netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Petr Machata <petrm@nvidia.com>
To: "Machnikowski, Maciej" <maciej.machnikowski@intel.com>
Cc: Petr Machata <petrm@nvidia.com>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	"intel-wired-lan@lists.osuosl.org"
	<intel-wired-lan@lists.osuosl.org>,
	"richardcochran@gmail.com" <richardcochran@gmail.com>,
	"abyagowi@fb.com" <abyagowi@fb.com>,
	"Nguyen, Anthony L" <anthony.l.nguyen@intel.com>,
	"davem@davemloft.net" <davem@davemloft.net>,
	"kuba@kernel.org" <kuba@kernel.org>,
	"linux-kselftest@vger.kernel.org"
	<linux-kselftest@vger.kernel.org>,
	"idosch@idosch.org" <idosch@idosch.org>,
	"mkubecek@suse.cz" <mkubecek@suse.cz>,
	"saeed@kernel.org" <saeed@kernel.org>,
	"michael.chan@broadcom.com" <michael.chan@broadcom.com>
Subject: Re: [PATCH v2 net-next 6/6] docs: net: Add description of SyncE interfaces
Date: Tue, 9 Nov 2021 15:52:33 +0100	[thread overview]
Message-ID: <87mtmdcrf2.fsf@nvidia.com> (raw)
In-Reply-To: <MW5PR11MB5812B0A4E6227C6896AC12B5EA929@MW5PR11MB5812.namprd11.prod.outlook.com>


Machnikowski, Maciej <maciej.machnikowski@intel.com> writes:

>> Maciej Machnikowski <maciej.machnikowski@intel.com> writes:
>> 
>> > +====================
>> > +Synchronous Ethernet
>> > +====================
>> > +
>> > +Synchronous Ethernet networks use a physical layer clock to syntonize
>> > +the frequency across different network elements.
>> > +
>> > +Basic SyncE node defined in the ITU-T G.8264 consist of an Ethernet
>> > +Equipment Clock (EEC) and a PHY that has dedicated outputs of recovered
>> clocks
>> > +and a dedicated TX clock input that is used as to transmit data to other
>> nodes.
>> > +
>> > +The SyncE capable PHY is able to recover the incomning frequency of the
>> data
>> > +stream on RX lanes and redirect it (sometimes dividing it) to recovered
>> > +clock outputs. In SyncE PHY the TX frequency is directly dependent on the
>> > +input frequency - either on the PHY CLK input, or on a dedicated
>> > +TX clock input.
>> > +
>> > +      ┌───────────┬──────────┐
>> > +      │ RX        │ TX       │
>> > +  1   │ lanes     │ lanes    │ 1
>> > +  ───►├──────┐    │          ├─────►
>> > +  2   │      │    │          │ 2
>> > +  ───►├──┐   │    │          ├─────►
>> > +  3   │  │   │    │          │ 3
>> > +  ───►├─▼▼   ▼    │          ├─────►
>> > +      │ ──────    │          │
>> > +      │ \____/    │          │
>> > +      └──┼──┼─────┴──────────┘
>> > +        1│ 2│        ▲
>> > + RCLK out│  │        │ TX CLK in
>> > +         ▼  ▼        │
>> > +       ┌─────────────┴───┐
>> > +       │                 │
>> > +       │       EEC       │
>> > +       │                 │
>> > +       └─────────────────┘
>> > +
>> > +The EEC can synchronize its frequency to one of the synchronization
>> inputs
>> > +either clocks recovered on traffic interfaces or (in advanced deployments)
>> > +external frequency sources.
>> > +
>> > +Some EEC implementations can select synchronization source through
>> > +priority tables and synchronization status messaging and provide
>> necessary
>> > +filtering and holdover capabilities.
>> > +
>> > +The following interface can be applicable to diffferent packet network
>> types
>> > +following ITU-T G.8261/G.8262 recommendations.
>> > +
>> > +Interface
>> > +=========
>> > +
>> > +The following RTNL messages are used to read/configure SyncE recovered
>> > +clocks.
>> > +
>> > +RTM_GETRCLKRANGE
>> > +-----------------
>> > +Reads the allowed pin index range for the recovered clock outputs.
>> > +This can be aligned to PHY outputs or to EEC inputs, whichever is
>> > +better for a given application.
>> > +Will call the ndo_get_rclk_range function to read the allowed range
>> > +of output pin indexes.
>> > +Will call ndo_get_rclk_range to determine the allowed recovered clock
>> > +range and return them in the IFLA_RCLK_RANGE_MIN_PIN and the
>> > +IFLA_RCLK_RANGE_MAX_PIN attributes
>> > +
>> > +RTM_GETRCLKSTATE
>> > +-----------------
>> > +Read the state of recovered pins that output recovered clock from
>> > +a given port. The message will contain the number of assigned clocks
>> > +(IFLA_RCLK_STATE_COUNT) and an N pin indexes in
>> IFLA_RCLK_STATE_OUT_IDX
>> > +To support multiple recovered clock outputs from the same port, this
>> message
>> > +will return the IFLA_RCLK_STATE_COUNT attribute containing the number
>> of
>> > +active recovered clock outputs (N) and N IFLA_RCLK_STATE_OUT_IDX
>> attributes
>> > +listing the active output indexes.
>> > +This message will call the ndo_get_rclk_range to determine the allowed
>> > +recovered clock indexes and then will loop through them, calling
>> > +the ndo_get_rclk_state for each of them.
>> 
>> Let me make sure I understand the model that you propose. Specifically
>> from the point of view of a multi-port device, because that's my
>> immediate use case.
>> 
>> RTM_GETRCLKRANGE would report number of "pins" that matches the
>> number
>> of lanes in the system. So e.g. a 32-port switch, where each port has 4
>> lanes, would give a range of [1; 128], inclusive. (Or maybe [0; 128) or
>> whatever.)
>> 
>> RTM_GETRCLKSTATE would then return some subset of those pins,
>> depending
>> on which lanes actually managed to establish a connection and carry a
>> valid clock signal. So, say, [1, 2, 3, 4] if the first port has e.g. a
>> 100Gbps established.
>> 
>
> Those 2 will be merged into a single RTM_GETRCLKSTATE that will report
> the state of all available pins for a given port.
>
> Also lanes here should really be ports - will fix in next revision.
>
> But the logic will be: 
> Call the RTM_GETRCLKSTATE. It will return the list of pins and their state
> for a given port. Once you read the range you will send the RTM_SETRCLKSTATE
> to enable the redirection to a given RCLK output from the PHY. If your DPLL/EEC
> is configured to accept it automatically - it's all you need to do and you need to
> wait for the right state of the EEC (locked/locked with HO).

Ha, ok, so the RANGE call goes away, it's all in the RTM_GETRCLKSTATE.

>> > +
>> > +RTM_SETRCLKSTATE
>> > +-----------------
>> > +Sets the redirection of the recovered clock for a given pin. This message
>> > +expects one attribute:
>> > +struct if_set_rclk_msg {
>> > +	__u32 ifindex; /* interface index */
>> > +	__u32 out_idx; /* output index (from a valid range)
>> > +	__u32 flags; /* configuration flags */
>> > +};
>> > +
>> > +Supported flags are:
>> > +SET_RCLK_FLAGS_ENA - if set in flags - the given output will be enabled,
>> > +		     if clear - the output will be disabled.
>> 
>> OK, so here I set up the tracking. ifindex tells me which EEC to
>> configure, out_idx is the pin to track, flags tell me whether to set up
>> the tracking or tear it down. Thus e.g. on port 2, track pin 2, because
>> I somehow know that lane 2 has the best clock.
>
> It's bound to ifindex to know which PHY port you interact with. It has nothing to
> do with the EEC yet.

It has in the sense that I'm configuring "TX CLK in", which leads from
EEC to the port.

>> If the above is broadly correct, I've got some questions.
>> 
>> First, what if more than one out_idx is set? What are drivers / HW meant
>> to do with this? What is the expected behavior?
>
> Expected behavior is deployment specific. You can use different phy recovered
> clock outputs to implement active/passive mode of clock failover.

How? Which one is primary and which one is backup? I just have two
enabled pins...

Wouldn't failover be implementable in a userspace daemon? That would get
a notification from the system that holdover was entered, and can
reconfigure tracking to another pin based on arbitrary rules.

>> Also GETRCLKSTATE and SETRCLKSTATE have a somewhat different scope:
>> one
>> reports which pins carry a clock signal, the other influences tracking.
>> That seems wrong. There also does not seems to be an UAPI to retrieve
>> the tracking settings.
>
> They don't. Get reads the redirection state and SET sets it - nothing more,
> nothing less. In ICE we use EEC pin indexes so that the model translates easier
> to the one when we support DPLL subsystem.
>
>> Second, as a user-space client, how do I know that if ports 1 and 2 both
>> report pin range [A; B], that they both actually share the same
>> underlying EEC? Is there some sort of coordination among the drivers,
>> such that each pin in the system has a unique ID?
>
> For now we don't, as we don't have EEC subsystem. But that can be solved
> by a config file temporarily.

I think it would be better to model this properly from day one.

>> Further, how do I actually know the mapping from ports to pins? E.g. as
>> a user, I might know my master is behind swp1. How do I know what pins
>> correspond to that port? As a user-space tool author, how do I help
>> users to do something like "eec set clock eec0 track swp1"?
>
> That's why driver needs to be smart there and return indexes properly.

What do you mean, properly? Up there you have RTM_GETRCLKRANGE that just
gives me a min and a max. Is there a policy about how to correlate
numbers in that range to... ifindices, netdevice names, devlink port
numbers, I don't know, something?

How do several drivers coordinate this numbering among themselves? Is
there a core kernel authority that manages pin number de/allocations?

>> Additionally, how would things like external GPSs or 1pps be modeled? I
>> guess the driver would know about such interface, and would expose it as
>> a "pin". When the GPS signal locks, the driver starts reporting the pin
>> in the RCLK set. Then it is possible to set up tracking of that pin.
>
> That won't be enabled before we get the DPLL subsystem ready.

It might prove challenging to retrofit an existing netdev-centric
interface into a more generic model. It would be better to model this
properly from day one, and OK, if we can carve out a subset of that
model to implement now, and leave the rest for later, fine. But the
current model does not strike me as having a natural migration path to
something more generic. E.g. reporting the EEC state through the
interfaces attached to that EEC... like, that will have to stay, even at
a time when it is superseded by a better interface.

>> It seems to me it would be easier to understand, and to write user-space
>> tools and drivers for, a model that has EEC as an explicit first-class
>> object. That's where the EEC state naturally belongs, that's where the
>> pin range naturally belongs. Netdevs should have a reference to EEC and
>> pins, not present this information as if they own it. A first-class EEC
>> would also allow to later figure out how to hook up PHC and EEC.
>
> We have the userspace tool, but can’t upstream it until we define
> kernel Interfaces. It's paragraph 22 :(

I'm sure you do, presumably you test this somehow. Still, as a potential
consumer of that interface, I will absolutely poke at it to figure out
how to use it, what it lets me to do, and what won't work.

BTW, what we've done in the past in a situation like this was, here's
the current submission, here's a pointer to a GIT with more stuff we
plan to send later on, here's a pointer to a GIT with the userspace
stuff. I doubt anybody actually looks at that code, ain't nobody got
time for that, but really there's no catch 22.

  reply	other threads:[~2021-11-09 14:52 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-11-05 20:53 [PATCH v2 net-next 0/6] Add RTNL interface for SyncE Maciej Machnikowski
2021-11-05 20:53 ` [PATCH v2 net-next 1/6] ice: add support detecting features based on netlist Maciej Machnikowski
2021-11-05 20:53 ` [PATCH v2 net-next 2/6] rtnetlink: Add new RTM_GETEECSTATE message to get SyncE status Maciej Machnikowski
2021-11-07 13:44   ` Ido Schimmel
2021-11-05 20:53 ` [PATCH v2 net-next 3/6] ice: add support for reading SyncE DPLL state Maciej Machnikowski
2021-11-05 20:53 ` [PATCH v2 net-next 4/6] rtnetlink: Add support for SyncE recovered clock configuration Maciej Machnikowski
2021-11-05 20:53 ` [PATCH v2 net-next 5/6] ice: add support for SyncE recovered clocks Maciej Machnikowski
2021-11-05 20:53 ` [PATCH v2 net-next 6/6] docs: net: Add description of SyncE interfaces Maciej Machnikowski
2021-11-07 14:08   ` Ido Schimmel
2021-11-08  8:35     ` Machnikowski, Maciej
2021-11-08 16:29       ` Ido Schimmel
2021-11-08 17:03         ` Jakub Kicinski
2021-11-09 10:50           ` Machnikowski, Maciej
2021-11-09 10:32         ` Machnikowski, Maciej
2021-11-08 18:00   ` Petr Machata
2021-11-09 10:43     ` Machnikowski, Maciej
2021-11-09 14:52       ` Petr Machata [this message]
2021-11-09 18:19         ` Machnikowski, Maciej
2021-11-10 10:27           ` Petr Machata
2021-11-10 11:19             ` Machnikowski, Maciej
2021-11-10 15:15               ` Petr Machata
2021-11-10 15:50                 ` Machnikowski, Maciej
2021-11-10 21:05                   ` Petr Machata
2021-11-15 10:12                     ` Machnikowski, Maciej
2021-11-15 21:42                       ` Jakub Kicinski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87mtmdcrf2.fsf@nvidia.com \
    --to=petrm@nvidia.com \
    --cc=abyagowi@fb.com \
    --cc=anthony.l.nguyen@intel.com \
    --cc=davem@davemloft.net \
    --cc=idosch@idosch.org \
    --cc=intel-wired-lan@lists.osuosl.org \
    --cc=kuba@kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=maciej.machnikowski@intel.com \
    --cc=michael.chan@broadcom.com \
    --cc=mkubecek@suse.cz \
    --cc=netdev@vger.kernel.org \
    --cc=richardcochran@gmail.com \
    --cc=saeed@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).