* mlxsw and rtnl lock
@ 2017-08-25 20:26 David Ahern
2017-08-26 17:04 ` Ido Schimmel
0 siblings, 1 reply; 8+ messages in thread
From: David Ahern @ 2017-08-25 20:26 UTC (permalink / raw)
To: Jiri Pirko, Ido Schimmel; +Cc: netdev@vger.kernel.org
Jiri / Ido:
I was looking at the mlxsw driver and the places it holds the rtnl lock.
There are a lot of them and from an admittedly short review it seems
like the rtnl is protecting changes to mlxsw data structures as opposed
to calling into the core networking stack. This is going to have huge
impacts on scalability when both the kernel programming (user changes)
and the hardware programming require the rtnl.
With regards to the FIB notifier, why add the fib events to a work queue
that is processed asynchronously if processing the work queue requires
the rtnl lock? What is gained by deferring the work since a major side
effect of the work queue is the loss of error propagation back to the
user on the a failure. That is, if the FIB add/replace/append fails in
the h/w for any reason, offload is silently aborted (an entry in the
kernel log is still a silent abort).
Code in question:
fib_table_insert
- call_fib_entry_notifiers
...
+ mlxsw_sp_router_fib_event
* allocate work entry
* copy fib change data to it
* take a reference on fib info / rt
* schedule work
<some time later>
mlxsw_sp_router_fib{4,6}_event_work
- rtnl_lock
- mlxsw_sp_router_fib{4,6}_add
if (err)
mlxsw_sp_router_fib_abort <----- not propagated to the user
- fib_info_put / rt6_release
David
^ permalink raw reply [flat|nested] 8+ messages in thread* Re: mlxsw and rtnl lock 2017-08-25 20:26 mlxsw and rtnl lock David Ahern @ 2017-08-26 17:04 ` Ido Schimmel 2017-08-28 15:55 ` Roopa Prabhu 2017-08-28 18:00 ` David Ahern 0 siblings, 2 replies; 8+ messages in thread From: Ido Schimmel @ 2017-08-26 17:04 UTC (permalink / raw) To: David Ahern; +Cc: Jiri Pirko, netdev@vger.kernel.org, mlxsw On Fri, Aug 25, 2017 at 01:26:07PM -0700, David Ahern wrote: > Jiri / Ido: > > I was looking at the mlxsw driver and the places it holds the rtnl lock. > There are a lot of them and from an admittedly short review it seems > like the rtnl is protecting changes to mlxsw data structures as opposed > to calling into the core networking stack. This is going to have huge > impacts on scalability when both the kernel programming (user changes) > and the hardware programming require the rtnl. I'm aware of the problem and I intend to look into it. When we started working on mlxsw about two years ago all the operations were serialized by rtnl_lock, so when we had to add processing in a different context, we ended up taking the same lock to protect against changes. But it can impact scalability as you mentioned. > With regards to the FIB notifier, why add the fib events to a work queue > that is processed asynchronously if processing the work queue requires > the rtnl lock? What is gained by deferring the work since a major side > effect of the work queue is the loss of error propagation back to the > user on the a failure. That is, if the FIB add/replace/append fails in > the h/w for any reason, offload is silently aborted (an entry in the > kernel log is still a silent abort). FIB events are received in an atomic context and therefore must be deferred to a workqueue. The chain was initially blocking, but this had to change in commit d3f706f68e2f ("ipv4: fib: Convert FIB notification chain to be atomic") to support dumping of IPv4 routes under RCU. IPv6 events are always sent in an atomic context. Regarding the silent abort, that's intentional. You can look at the same code in v4.9 - when the chain was still blocking - and you'll see that we didn't propagate the error even then. This was discussed in the past and the conclusion was that user doesn't expect to operation to fail. If hardware resources are exceeded, we let the kernel take care of the forwarding instead. ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: mlxsw and rtnl lock 2017-08-26 17:04 ` Ido Schimmel @ 2017-08-28 15:55 ` Roopa Prabhu 2017-08-28 18:00 ` David Ahern 1 sibling, 0 replies; 8+ messages in thread From: Roopa Prabhu @ 2017-08-28 15:55 UTC (permalink / raw) To: Ido Schimmel; +Cc: David Ahern, Jiri Pirko, netdev@vger.kernel.org, mlxsw On Sat, Aug 26, 2017 at 10:04 AM, Ido Schimmel <idosch@idosch.org> wrote: > On Fri, Aug 25, 2017 at 01:26:07PM -0700, David Ahern wrote: >> Jiri / Ido: >> >> [snip] > > Regarding the silent abort, that's intentional. You can look at the same > code in v4.9 - when the chain was still blocking - and you'll see that > we didn't propagate the error even then. This was discussed in the past > and the conclusion was that user doesn't expect to operation to fail. If > hardware resources are exceeded, we let the kernel take care of the > forwarding instead. History here is that the initial fib offload hooks were added during rocker days. subsequently we have had many discussions (on offload policies) to see if we can remove that check and report the error back to the user (routing daemon in this case) since the cpu kernel cannot handle 100G or more traffic on such platforms. Basically the cpu kernel cannot take care of forwarding for such platforms. I believe this came up during the last switchdev BOF as well. The current silent abort is in there because kernel needs to maintain its semantics for hardware offload. Which is a valid position but we will need to find a way to make it work for switch platforms. IIUC, the only place where switchdev offload returns error to the user is 'bridge and bond membership offload' I know fib and bridge fdb offload don't propagate errors to the user (maybe they used to before moving to notifiers ?. need to check history on this). Are tc hardware offload errors propagated to the user ? It is a bit inconsistent today, some errors are propagated to the user and some are not. ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: mlxsw and rtnl lock 2017-08-26 17:04 ` Ido Schimmel 2017-08-28 15:55 ` Roopa Prabhu @ 2017-08-28 18:00 ` David Ahern 2017-08-29 6:10 ` Arkadi Sharshevsky 2017-08-29 19:16 ` Arkadi Sharshevsky 1 sibling, 2 replies; 8+ messages in thread From: David Ahern @ 2017-08-28 18:00 UTC (permalink / raw) To: Ido Schimmel; +Cc: Jiri Pirko, netdev@vger.kernel.org, mlxsw On 8/26/17 11:04 AM, Ido Schimmel wrote: > Regarding the silent abort, that's intentional. You can look at the same > code in v4.9 - when the chain was still blocking - and you'll see that > we didn't propagate the error even then. This was discussed in the past > and the conclusion was that user doesn't expect to operation to fail. If > hardware resources are exceeded, we let the kernel take care of the > forwarding instead. > In addition to Roopa's comments... The silent abort is not a good user experience. Right now it's add a network address or route, cross fingers and hope it does not overflow some limit (nexthop, ecmp, neighbor, prefix, etc) that triggers the offload abort. The mlxsw driver queries for some limits (e.g., max rifs) but I don't see any query related to current usage, and there is no API to pass any of that data to user space so user space has no programmatic way to handle this. I realize you are aware of this limitation. The point is to emphasize the need to resolve this. ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: mlxsw and rtnl lock 2017-08-28 18:00 ` David Ahern @ 2017-08-29 6:10 ` Arkadi Sharshevsky 2017-08-29 20:04 ` David Ahern 2017-08-29 19:16 ` Arkadi Sharshevsky 1 sibling, 1 reply; 8+ messages in thread From: Arkadi Sharshevsky @ 2017-08-29 6:10 UTC (permalink / raw) To: David Ahern, Ido Schimmel; +Cc: Jiri Pirko, netdev@vger.kernel.org, mlxsw On 08/28/2017 09:00 PM, David Ahern wrote: > On 8/26/17 11:04 AM, Ido Schimmel wrote: >> Regarding the silent abort, that's intentional. You can look at the same >> code in v4.9 - when the chain was still blocking - and you'll see that >> we didn't propagate the error even then. This was discussed in the past >> and the conclusion was that user doesn't expect to operation to fail. If >> hardware resources are exceeded, we let the kernel take care of the >> forwarding instead. >> > > In addition to Roopa's comments... The silent abort is not a good user > experience. Right now it's add a network address or route, cross fingers > and hope it does not overflow some limit (nexthop, ecmp, neighbor, > prefix, etc) that triggers the offload abort. > > The mlxsw driver queries for some limits (e.g., max rifs) but I don't > see any query related to current usage, and there is no API to pass any > of that data to user space so user space has no programmatic way to > handle this. I realize you are aware of this limitation. The point is to > emphasize the need to resolve this. > We actually thought about providing he user some tools to understand the ASIC's limitations by introducing the 'resource' object to devlink. By linking dpipe tables to resources the user can understand which hardware processes share a common resource, furthermore this resources usage could be observed. By this more visibility can be obtained. Its not a remedy for the silent abort, but, maybe a notification can be sent from devlink in case of abort that some resources is full. This proposition was sent as RFC several weeks ago. ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: mlxsw and rtnl lock 2017-08-29 6:10 ` Arkadi Sharshevsky @ 2017-08-29 20:04 ` David Ahern 2017-08-29 21:18 ` Arkadi Sharshevsky 0 siblings, 1 reply; 8+ messages in thread From: David Ahern @ 2017-08-29 20:04 UTC (permalink / raw) To: Arkadi Sharshevsky, Ido Schimmel Cc: Jiri Pirko, netdev@vger.kernel.org, mlxsw On 8/29/17 12:10 AM, Arkadi Sharshevsky wrote: > > > On 08/28/2017 09:00 PM, David Ahern wrote: >> On 8/26/17 11:04 AM, Ido Schimmel wrote: >>> Regarding the silent abort, that's intentional. You can look at the same >>> code in v4.9 - when the chain was still blocking - and you'll see that >>> we didn't propagate the error even then. This was discussed in the past >>> and the conclusion was that user doesn't expect to operation to fail. If >>> hardware resources are exceeded, we let the kernel take care of the >>> forwarding instead. >>> >> >> In addition to Roopa's comments... The silent abort is not a good user >> experience. Right now it's add a network address or route, cross fingers >> and hope it does not overflow some limit (nexthop, ecmp, neighbor, >> prefix, etc) that triggers the offload abort. >> >> The mlxsw driver queries for some limits (e.g., max rifs) but I don't >> see any query related to current usage, and there is no API to pass any >> of that data to user space so user space has no programmatic way to >> handle this. I realize you are aware of this limitation. The point is to >> emphasize the need to resolve this. >> > > We actually thought about providing he user some tools to understand > the ASIC's limitations by introducing the 'resource' object to devlink. > > By linking dpipe tables to resources the user can understand which > hardware processes share a common resource, furthermore this resources > usage could be observed. By this more visibility can be obtained. > > Its not a remedy for the silent abort, but, maybe a notification > can be sent from devlink in case of abort that some resources is > full. > > This proposition was sent as RFC several weeks ago. > Do you have patches (kernel and devlink) for the proposal? ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: mlxsw and rtnl lock 2017-08-29 20:04 ` David Ahern @ 2017-08-29 21:18 ` Arkadi Sharshevsky 0 siblings, 0 replies; 8+ messages in thread From: Arkadi Sharshevsky @ 2017-08-29 21:18 UTC (permalink / raw) To: David Ahern, Ido Schimmel; +Cc: Jiri Pirko, netdev@vger.kernel.org, mlxsw On 08/29/2017 11:04 PM, David Ahern wrote: > On 8/29/17 12:10 AM, Arkadi Sharshevsky wrote: >> >> >> On 08/28/2017 09:00 PM, David Ahern wrote: >>> On 8/26/17 11:04 AM, Ido Schimmel wrote: >>>> Regarding the silent abort, that's intentional. You can look at the same >>>> code in v4.9 - when the chain was still blocking - and you'll see that >>>> we didn't propagate the error even then. This was discussed in the past >>>> and the conclusion was that user doesn't expect to operation to fail. If >>>> hardware resources are exceeded, we let the kernel take care of the >>>> forwarding instead. >>>> >>> >>> In addition to Roopa's comments... The silent abort is not a good user >>> experience. Right now it's add a network address or route, cross fingers >>> and hope it does not overflow some limit (nexthop, ecmp, neighbor, >>> prefix, etc) that triggers the offload abort. >>> >>> The mlxsw driver queries for some limits (e.g., max rifs) but I don't >>> see any query related to current usage, and there is no API to pass any >>> of that data to user space so user space has no programmatic way to >>> handle this. I realize you are aware of this limitation. The point is to >>> emphasize the need to resolve this. >>> >> >> We actually thought about providing he user some tools to understand >> the ASIC's limitations by introducing the 'resource' object to devlink. >> >> By linking dpipe tables to resources the user can understand which >> hardware processes share a common resource, furthermore this resources >> usage could be observed. By this more visibility can be obtained. >> >> Its not a remedy for the silent abort, but, maybe a notification >> can be sent from devlink in case of abort that some resources is >> full. >> >> This proposition was sent as RFC several weeks ago. >> > > Do you have patches (kernel and devlink) for the proposal? > No, only the design RFC which describe the UAPI, devlink commands and the devlink/driver interactions. I wanted to receive some feedback before the coding. ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: mlxsw and rtnl lock 2017-08-28 18:00 ` David Ahern 2017-08-29 6:10 ` Arkadi Sharshevsky @ 2017-08-29 19:16 ` Arkadi Sharshevsky 1 sibling, 0 replies; 8+ messages in thread From: Arkadi Sharshevsky @ 2017-08-29 19:16 UTC (permalink / raw) To: David Ahern, Ido Schimmel; +Cc: Jiri Pirko, netdev@vger.kernel.org, mlxsw On 08/28/2017 09:00 PM, David Ahern wrote: > On 8/26/17 11:04 AM, Ido Schimmel wrote: >> Regarding the silent abort, that's intentional. You can look at the same >> code in v4.9 - when the chain was still blocking - and you'll see that >> we didn't propagate the error even then. This was discussed in the past >> and the conclusion was that user doesn't expect to operation to fail. If >> hardware resources are exceeded, we let the kernel take care of the >> forwarding instead. >> > > In addition to Roopa's comments... The silent abort is not a good user > experience. Right now it's add a network address or route, cross fingers > and hope it does not overflow some limit (nexthop, ecmp, neighbor, > prefix, etc) that triggers the offload abort. > > The mlxsw driver queries for some limits (e.g., max rifs) but I don't > see any query related to current usage, and there is no API to pass any > of that data to user space so user space has no programmatic way to > handle this. I realize you are aware of this limitation. The point is to The first dpipe table that was introduced was the erif table. which gathers L3 statistics. The rifs are actually also fixed size resource, so maybe it is more correct to introduce it as 'resources' and connect it to the erif table. That way you will be able to obtain current usage, and receive notification when it will be drained out. > emphasize the need to resolve this. > ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2017-08-29 21:19 UTC | newest] Thread overview: 8+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2017-08-25 20:26 mlxsw and rtnl lock David Ahern 2017-08-26 17:04 ` Ido Schimmel 2017-08-28 15:55 ` Roopa Prabhu 2017-08-28 18:00 ` David Ahern 2017-08-29 6:10 ` Arkadi Sharshevsky 2017-08-29 20:04 ` David Ahern 2017-08-29 21:18 ` Arkadi Sharshevsky 2017-08-29 19:16 ` Arkadi Sharshevsky
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).