mlxsw and rtnl lock

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* mlxsw and rtnl lock
@ 2017-08-25 20:26 David Ahern
  2017-08-26 17:04 ` Ido Schimmel
  0 siblings, 1 reply; 8+ messages in thread
From: David Ahern @ 2017-08-25 20:26 UTC (permalink / raw)
  To: Jiri Pirko, Ido Schimmel; +Cc: netdev@vger.kernel.org

Jiri / Ido:

I was looking at the mlxsw driver and the places it holds the rtnl lock.
There are a lot of them and from an admittedly short review it seems
like the rtnl is protecting changes to mlxsw data structures as opposed
to calling into the core networking stack. This is going to have huge
impacts on scalability when both the kernel programming (user changes)
and the hardware programming require the rtnl.

With regards to the FIB notifier, why add the fib events to a work queue
that is processed asynchronously if processing the work queue requires
the rtnl lock? What is gained by deferring the work since a major side
effect of the work queue is the loss of error propagation back to the
user on the a failure. That is, if the FIB add/replace/append fails in
the h/w for any reason, offload is silently aborted (an entry in the
kernel log is still a silent abort).

Code in question:

fib_table_insert
- call_fib_entry_notifiers
  ...
    + mlxsw_sp_router_fib_event
      * allocate work entry
      * copy fib change data to it
      * take a reference on fib info / rt
      * schedule work

<some time later>

mlxsw_sp_router_fib{4,6}_event_work
- rtnl_lock

- mlxsw_sp_router_fib{4,6}_add
  if (err)
      mlxsw_sp_router_fib_abort    <----- not propagated to the user

- fib_info_put / rt6_release

David

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: mlxsw and rtnl lock
  2017-08-25 20:26 mlxsw and rtnl lock David Ahern
@ 2017-08-26 17:04 ` Ido Schimmel
  2017-08-28 15:55   ` Roopa Prabhu
  2017-08-28 18:00   ` David Ahern
  0 siblings, 2 replies; 8+ messages in thread
From: Ido Schimmel @ 2017-08-26 17:04 UTC (permalink / raw)
  To: David Ahern; +Cc: Jiri Pirko, netdev@vger.kernel.org, mlxsw

On Fri, Aug 25, 2017 at 01:26:07PM -0700, David Ahern wrote:
> Jiri / Ido:
> 
> I was looking at the mlxsw driver and the places it holds the rtnl lock.
> There are a lot of them and from an admittedly short review it seems
> like the rtnl is protecting changes to mlxsw data structures as opposed
> to calling into the core networking stack. This is going to have huge
> impacts on scalability when both the kernel programming (user changes)
> and the hardware programming require the rtnl.

I'm aware of the problem and I intend to look into it. When we started
working on mlxsw about two years ago all the operations were serialized
by rtnl_lock, so when we had to add processing in a different context,
we ended up taking the same lock to protect against changes. But it can
impact scalability as you mentioned.

> With regards to the FIB notifier, why add the fib events to a work queue
> that is processed asynchronously if processing the work queue requires
> the rtnl lock? What is gained by deferring the work since a major side
> effect of the work queue is the loss of error propagation back to the
> user on the a failure. That is, if the FIB add/replace/append fails in
> the h/w for any reason, offload is silently aborted (an entry in the
> kernel log is still a silent abort).

FIB events are received in an atomic context and therefore must be
deferred to a workqueue. The chain was initially blocking, but this had
to change in commit d3f706f68e2f ("ipv4: fib: Convert FIB notification
chain to be atomic") to support dumping of IPv4 routes under RCU. IPv6
events are always sent in an atomic context.

Regarding the silent abort, that's intentional. You can look at the same
code in v4.9 - when the chain was still blocking - and you'll see that
we didn't propagate the error even then. This was discussed in the past
and the conclusion was that user doesn't expect to operation to fail. If
hardware resources are exceeded, we let the kernel take care of the
forwarding instead.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: mlxsw and rtnl lock
  2017-08-26 17:04 ` Ido Schimmel
@ 2017-08-28 15:55   ` Roopa Prabhu
  2017-08-28 18:00   ` David Ahern
  1 sibling, 0 replies; 8+ messages in thread
From: Roopa Prabhu @ 2017-08-28 15:55 UTC (permalink / raw)
  To: Ido Schimmel; +Cc: David Ahern, Jiri Pirko, netdev@vger.kernel.org, mlxsw

On Sat, Aug 26, 2017 at 10:04 AM, Ido Schimmel <idosch@idosch.org> wrote:
> On Fri, Aug 25, 2017 at 01:26:07PM -0700, David Ahern wrote:
>> Jiri / Ido:
>>
>>
[snip]
>
> Regarding the silent abort, that's intentional. You can look at the same
> code in v4.9 - when the chain was still blocking - and you'll see that
> we didn't propagate the error even then. This was discussed in the past
> and the conclusion was that user doesn't expect to operation to fail. If
> hardware resources are exceeded, we let the kernel take care of the
> forwarding instead.

History here is that the initial fib offload hooks were added during
rocker days.
subsequently we have had many discussions (on offload policies) to see
if we can remove that
check and report the error back to the user (routing daemon in this
case) since the cpu kernel
cannot handle 100G or more traffic on such platforms. Basically the
cpu kernel cannot take care
of forwarding for such platforms. I believe this came up during the
last switchdev BOF as well.
The current silent abort is in there because kernel needs to maintain
its semantics for hardware offload.
Which is a valid position but we will need to find a way to make it
work for switch platforms.

IIUC, the only place where switchdev offload returns error to the user is
'bridge and bond membership offload'

I know fib and bridge fdb offload don't propagate errors to the user
(maybe they used to before moving to notifiers ?. need to check
history on this).

Are tc hardware offload errors propagated to the user ?

It is a bit inconsistent today, some errors are propagated to the user
and some are not.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: mlxsw and rtnl lock
  2017-08-26 17:04 ` Ido Schimmel
  2017-08-28 15:55   ` Roopa Prabhu
@ 2017-08-28 18:00   ` David Ahern
  2017-08-29  6:10     ` Arkadi Sharshevsky
  2017-08-29 19:16     ` Arkadi Sharshevsky
  1 sibling, 2 replies; 8+ messages in thread
From: David Ahern @ 2017-08-28 18:00 UTC (permalink / raw)
  To: Ido Schimmel; +Cc: Jiri Pirko, netdev@vger.kernel.org, mlxsw

On 8/26/17 11:04 AM, Ido Schimmel wrote:
> Regarding the silent abort, that's intentional. You can look at the same
> code in v4.9 - when the chain was still blocking - and you'll see that
> we didn't propagate the error even then. This was discussed in the past
> and the conclusion was that user doesn't expect to operation to fail. If
> hardware resources are exceeded, we let the kernel take care of the
> forwarding instead.
> 

In addition to Roopa's comments... The silent abort is not a good user
experience. Right now it's add a network address or route, cross fingers
and hope it does not overflow some limit (nexthop, ecmp, neighbor,
prefix, etc) that triggers the offload abort.

The mlxsw driver queries for some limits (e.g., max rifs) but I don't
see any query related to current usage, and there is no API to pass any
of that data to user space so user space has no programmatic way to
handle this. I realize you are aware of this limitation. The point is to
emphasize the need to resolve this.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: mlxsw and rtnl lock
  2017-08-28 18:00   ` David Ahern
@ 2017-08-29  6:10     ` Arkadi Sharshevsky
  2017-08-29 20:04       ` David Ahern
  2017-08-29 19:16     ` Arkadi Sharshevsky
  1 sibling, 1 reply; 8+ messages in thread
From: Arkadi Sharshevsky @ 2017-08-29  6:10 UTC (permalink / raw)
  To: David Ahern, Ido Schimmel; +Cc: Jiri Pirko, netdev@vger.kernel.org, mlxsw



On 08/28/2017 09:00 PM, David Ahern wrote:
> On 8/26/17 11:04 AM, Ido Schimmel wrote:
>> Regarding the silent abort, that's intentional. You can look at the same
>> code in v4.9 - when the chain was still blocking - and you'll see that
>> we didn't propagate the error even then. This was discussed in the past
>> and the conclusion was that user doesn't expect to operation to fail. If
>> hardware resources are exceeded, we let the kernel take care of the
>> forwarding instead.
>>
> 
> In addition to Roopa's comments... The silent abort is not a good user
> experience. Right now it's add a network address or route, cross fingers
> and hope it does not overflow some limit (nexthop, ecmp, neighbor,
> prefix, etc) that triggers the offload abort.
> 
> The mlxsw driver queries for some limits (e.g., max rifs) but I don't
> see any query related to current usage, and there is no API to pass any
> of that data to user space so user space has no programmatic way to
> handle this. I realize you are aware of this limitation. The point is to
> emphasize the need to resolve this.
> 

We actually thought about providing he user some tools to understand
the ASIC's limitations by introducing the 'resource' object to devlink.

By linking dpipe tables to resources the user can understand which
hardware processes share a common resource, furthermore this resources
usage could be observed. By this more visibility can be obtained.

Its not a remedy for the silent abort, but, maybe a notification
can be sent from devlink in case of abort that some resources is
full.

This proposition was sent as RFC several weeks ago.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: mlxsw and rtnl lock
  2017-08-29  6:10     ` Arkadi Sharshevsky
@ 2017-08-29 20:04       ` David Ahern
  2017-08-29 21:18         ` Arkadi Sharshevsky
  0 siblings, 1 reply; 8+ messages in thread
From: David Ahern @ 2017-08-29 20:04 UTC (permalink / raw)
  To: Arkadi Sharshevsky, Ido Schimmel
  Cc: Jiri Pirko, netdev@vger.kernel.org, mlxsw

On 8/29/17 12:10 AM, Arkadi Sharshevsky wrote:
> 
> 
> On 08/28/2017 09:00 PM, David Ahern wrote:
>> On 8/26/17 11:04 AM, Ido Schimmel wrote:
>>> Regarding the silent abort, that's intentional. You can look at the same
>>> code in v4.9 - when the chain was still blocking - and you'll see that
>>> we didn't propagate the error even then. This was discussed in the past
>>> and the conclusion was that user doesn't expect to operation to fail. If
>>> hardware resources are exceeded, we let the kernel take care of the
>>> forwarding instead.
>>>
>>
>> In addition to Roopa's comments... The silent abort is not a good user
>> experience. Right now it's add a network address or route, cross fingers
>> and hope it does not overflow some limit (nexthop, ecmp, neighbor,
>> prefix, etc) that triggers the offload abort.
>>
>> The mlxsw driver queries for some limits (e.g., max rifs) but I don't
>> see any query related to current usage, and there is no API to pass any
>> of that data to user space so user space has no programmatic way to
>> handle this. I realize you are aware of this limitation. The point is to
>> emphasize the need to resolve this.
>>
> 
> We actually thought about providing he user some tools to understand
> the ASIC's limitations by introducing the 'resource' object to devlink.
> 
> By linking dpipe tables to resources the user can understand which
> hardware processes share a common resource, furthermore this resources
> usage could be observed. By this more visibility can be obtained.
> 
> Its not a remedy for the silent abort, but, maybe a notification
> can be sent from devlink in case of abort that some resources is
> full.
> 
> This proposition was sent as RFC several weeks ago.
> 

Do you have patches (kernel and devlink) for the proposal?

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: mlxsw and rtnl lock
  2017-08-29 20:04       ` David Ahern
@ 2017-08-29 21:18         ` Arkadi Sharshevsky
  0 siblings, 0 replies; 8+ messages in thread
From: Arkadi Sharshevsky @ 2017-08-29 21:18 UTC (permalink / raw)
  To: David Ahern, Ido Schimmel; +Cc: Jiri Pirko, netdev@vger.kernel.org, mlxsw



On 08/29/2017 11:04 PM, David Ahern wrote:
> On 8/29/17 12:10 AM, Arkadi Sharshevsky wrote:
>>
>>
>> On 08/28/2017 09:00 PM, David Ahern wrote:
>>> On 8/26/17 11:04 AM, Ido Schimmel wrote:
>>>> Regarding the silent abort, that's intentional. You can look at the same
>>>> code in v4.9 - when the chain was still blocking - and you'll see that
>>>> we didn't propagate the error even then. This was discussed in the past
>>>> and the conclusion was that user doesn't expect to operation to fail. If
>>>> hardware resources are exceeded, we let the kernel take care of the
>>>> forwarding instead.
>>>>
>>>
>>> In addition to Roopa's comments... The silent abort is not a good user
>>> experience. Right now it's add a network address or route, cross fingers
>>> and hope it does not overflow some limit (nexthop, ecmp, neighbor,
>>> prefix, etc) that triggers the offload abort.
>>>
>>> The mlxsw driver queries for some limits (e.g., max rifs) but I don't
>>> see any query related to current usage, and there is no API to pass any
>>> of that data to user space so user space has no programmatic way to
>>> handle this. I realize you are aware of this limitation. The point is to
>>> emphasize the need to resolve this.
>>>
>>
>> We actually thought about providing he user some tools to understand
>> the ASIC's limitations by introducing the 'resource' object to devlink.
>>
>> By linking dpipe tables to resources the user can understand which
>> hardware processes share a common resource, furthermore this resources
>> usage could be observed. By this more visibility can be obtained.
>>
>> Its not a remedy for the silent abort, but, maybe a notification
>> can be sent from devlink in case of abort that some resources is
>> full.
>>
>> This proposition was sent as RFC several weeks ago.
>>
> 
> Do you have patches (kernel and devlink) for the proposal?
> 

No, only the design RFC which describe the UAPI, devlink
commands and the devlink/driver interactions.

I wanted to receive some feedback before the coding.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: mlxsw and rtnl lock
  2017-08-28 18:00   ` David Ahern
  2017-08-29  6:10     ` Arkadi Sharshevsky
@ 2017-08-29 19:16     ` Arkadi Sharshevsky
  1 sibling, 0 replies; 8+ messages in thread
From: Arkadi Sharshevsky @ 2017-08-29 19:16 UTC (permalink / raw)
  To: David Ahern, Ido Schimmel; +Cc: Jiri Pirko, netdev@vger.kernel.org, mlxsw



On 08/28/2017 09:00 PM, David Ahern wrote:
> On 8/26/17 11:04 AM, Ido Schimmel wrote:
>> Regarding the silent abort, that's intentional. You can look at the same
>> code in v4.9 - when the chain was still blocking - and you'll see that
>> we didn't propagate the error even then. This was discussed in the past
>> and the conclusion was that user doesn't expect to operation to fail. If
>> hardware resources are exceeded, we let the kernel take care of the
>> forwarding instead.
>>
> 
> In addition to Roopa's comments... The silent abort is not a good user
> experience. Right now it's add a network address or route, cross fingers
> and hope it does not overflow some limit (nexthop, ecmp, neighbor,
> prefix, etc) that triggers the offload abort.
> 
> The mlxsw driver queries for some limits (e.g., max rifs) but I don't
> see any query related to current usage, and there is no API to pass any
> of that data to user space so user space has no programmatic way to
> handle this. I realize you are aware of this limitation. The point is to

The first dpipe table that was introduced was the erif table.
which gathers L3 statistics.

The rifs are actually also fixed size resource, so maybe it is
more correct to introduce it as 'resources' and connect it to
the erif table. That way you will be able to obtain current
usage, and receive notification when it will be drained out.

> emphasize the need to resolve this.
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2017-08-29 21:19 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-08-25 20:26 mlxsw and rtnl lock David Ahern
2017-08-26 17:04 ` Ido Schimmel
2017-08-28 15:55   ` Roopa Prabhu
2017-08-28 18:00   ` David Ahern
2017-08-29  6:10     ` Arkadi Sharshevsky
2017-08-29 20:04       ` David Ahern
2017-08-29 21:18         ` Arkadi Sharshevsky
2017-08-29 19:16     ` Arkadi Sharshevsky

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).