* switchdev fib offload issues
@ 2016-04-18 15:47 Jiri Pirko
2016-04-18 16:38 ` Ilan Tayari
` (3 more replies)
0 siblings, 4 replies; 9+ messages in thread
From: Jiri Pirko @ 2016-04-18 15:47 UTC (permalink / raw)
To: netdev
Cc: davem, idosch, eladr, yotamg, ogerlitz, roopa, nikolay, jhs,
john.fastabend, rami.rosen, gospo, stephen, sfeldma, dsa,
f.fainelli, andrew, vivien.didelot, tgraf, aduyck
Hi all.
The current situation of fib offloading is not good, I believe we need
to make some changes, therefore I'm writing this email. Please read,
think and comment.
Currently what we have is that for every fib entry inserted into a table,
there is a call to switchdev:
fib_table_insert->switchdev_fib_ipv4_add
Driver then pushes fib entry down to HW. So far good.
However, if for any reason the switchdev add operation fails, there is an
abort function called (switchdev_fib_ipv4_abort). This function does two
things which are both unfortunate in many usecases:
1) evicts all fib entries from HW leaving all processing done in kernel
- For Spectrum ASIC this means that all traffic running at 100G between
all ports is immediately downgraded to ~1-3Gbits
- Also this happens silently, user knows nothing about anything went wrong,
only forwarding performance suddenly sucks.
2) sets net->ipv4.fib_offload_disabled = true
- That results in no other fib entry being offloaded, forever,
until net is removed and added again, machine reboot is required
in case if init_ns
These 2 issues makes fib offload completely unusable. So I propose
to start thinking about fixing this.
I believe that although the current behaviour might be good for default,
user should be able to change it by setting a different policy. This
policy will allow to propagate offload error to user.
Note that user already has to handle fib add errors which are independent
on particular fib entry. That is a case of insufficient memory (-ENOBUFS).
In fact, when offload fails, that is most likely also due to insufficient
resources in HW.
Proposed solutions (ideas):
1) per-netns. Add a procfs file:
/proc/sys/net/ipv4/route/fib_offload_error_policy
with values: "evict" - default, current behaviour
"fail" - propagate offload error to user
The policy value would be stored in struct net.
2) per-VRF/table
When user creates a VRF master, he specifies a table ID
this VRF is going to use. I propose to extend this so
he can pass a policy ("evict"/"fail").
The policy value would be stored in struct fib_table or
struct fib6_table. The problem is that vfr only saves
table ID, allocates dst but does not actually create
table. That might be created later. But I think this
could be resolved.
3) per-VFR/master_netdev
In this case, the policy would be also set during
the creation of VFR master. From user perspective,
this looks same as 2)
The policy value would be stored in struct net_vrf (vrf private).
Thoughts?
Thanks!
Jiri
^ permalink raw reply [flat|nested] 9+ messages in thread
* RE: switchdev fib offload issues
2016-04-18 15:47 switchdev fib offload issues Jiri Pirko
@ 2016-04-18 16:38 ` Ilan Tayari
2016-04-18 17:11 ` David Ahern
2016-04-18 16:59 ` David Ahern
` (2 subsequent siblings)
3 siblings, 1 reply; 9+ messages in thread
From: Ilan Tayari @ 2016-04-18 16:38 UTC (permalink / raw)
To: Jiri Pirko; +Cc: netdev@vger.kernel.org
Hi Jiri,
> Thoughts?
One general thought comes to mind, regardless of your proposed solutions:
You'll need to take care to propagate the error properly in case of routes created by the kernel, such as local routes.
Local routes are created when an IP address is added to an interface, so in such a case would the whole add-ip-address operation be rolled back cleanly?
Ilan.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: switchdev fib offload issues
2016-04-18 15:47 switchdev fib offload issues Jiri Pirko
2016-04-18 16:38 ` Ilan Tayari
@ 2016-04-18 16:59 ` David Ahern
2016-04-19 7:02 ` Jiri Pirko
2016-04-18 17:17 ` Hannes Frederic Sowa
2016-04-18 17:52 ` David Miller
3 siblings, 1 reply; 9+ messages in thread
From: David Ahern @ 2016-04-18 16:59 UTC (permalink / raw)
To: Jiri Pirko, netdev
Cc: davem, idosch, eladr, yotamg, ogerlitz, roopa, nikolay, jhs,
john.fastabend, rami.rosen, gospo, stephen, sfeldma, f.fainelli,
andrew, vivien.didelot, tgraf, aduyck
On 4/18/16 9:47 AM, Jiri Pirko wrote:
> Proposed solutions (ideas):
> 1) per-netns. Add a procfs file:
> /proc/sys/net/ipv4/route/fib_offload_error_policy
> with values: "evict" - default, current behaviour
> "fail" - propagate offload error to user
> The policy value would be stored in struct net.
>
> 2) per-VRF/table
> When user creates a VRF master, he specifies a table ID
> this VRF is going to use. I propose to extend this so
> he can pass a policy ("evict"/"fail").
> The policy value would be stored in struct fib_table or
> struct fib6_table. The problem is that vfr only saves
> table ID, allocates dst but does not actually create
> table. That might be created later. But I think this
> could be resolved.
Yes, we have a local patch where I do create the table for IPv6. Can do
that for IPv4 as well. Some other clean ups are needed in this area -
like the ability to delete a table
>
> 3) per-VFR/master_netdev
> In this case, the policy would be also set during
> the creation of VFR master. From user perspective,
> this looks same as 2)
> The policy value would be stored in struct net_vrf (vrf private).
The VRF device is really only used for guiding lookups, not inserting
routes.
A per table/VRF policy (option 2) seems more appropriate.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: switchdev fib offload issues
2016-04-18 16:38 ` Ilan Tayari
@ 2016-04-18 17:11 ` David Ahern
0 siblings, 0 replies; 9+ messages in thread
From: David Ahern @ 2016-04-18 17:11 UTC (permalink / raw)
To: Ilan Tayari, Jiri Pirko; +Cc: netdev@vger.kernel.org
On 4/18/16 10:38 AM, Ilan Tayari wrote:
> Hi Jiri,
>
>> Thoughts?
>
> One general thought comes to mind, regardless of your proposed solutions:
>
> You'll need to take care to propagate the error properly in case of routes created by the kernel, such as local routes.
> Local routes are created when an IP address is added to an interface, so in such a case would the whole add-ip-address operation be rolled back cleanly?
local routes are added on an 'admin up' too which is completely
disconnected from adding the address to the interface.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: switchdev fib offload issues
2016-04-18 15:47 switchdev fib offload issues Jiri Pirko
2016-04-18 16:38 ` Ilan Tayari
2016-04-18 16:59 ` David Ahern
@ 2016-04-18 17:17 ` Hannes Frederic Sowa
2016-04-21 13:00 ` Roopa Prabhu
2016-04-18 17:52 ` David Miller
3 siblings, 1 reply; 9+ messages in thread
From: Hannes Frederic Sowa @ 2016-04-18 17:17 UTC (permalink / raw)
To: Jiri Pirko, netdev
Cc: davem, idosch, eladr, yotamg, ogerlitz, roopa, nikolay, jhs,
john.fastabend, rami.rosen, gospo, stephen, sfeldma, dsa,
f.fainelli, andrew, vivien.didelot, tgraf, aduyck
Hi Jiri,
On 18.04.2016 17:47, Jiri Pirko wrote:
> Proposed solutions (ideas):
> 1) per-netns. Add a procfs file:
> /proc/sys/net/ipv4/route/fib_offload_error_policy
> with values: "evict" - default, current behaviour
> "fail" - propagate offload error to user
> The policy value would be stored in struct net.
>
> 2) per-VRF/table
> When user creates a VRF master, he specifies a table ID
> this VRF is going to use. I propose to extend this so
> he can pass a policy ("evict"/"fail").
> The policy value would be stored in struct fib_table or
> struct fib6_table. The problem is that vfr only saves
> table ID, allocates dst but does not actually create
> table. That might be created later. But I think this
> could be resolved.
>
> 3) per-VFR/master_netdev
> In this case, the policy would be also set during
> the creation of VFR master. From user perspective,
> this looks same as 2)
> The policy value would be stored in struct net_vrf (vrf private).
I agree that a fail policy is probably the way forward regarding the
issues you outlined.
One question though:
Shouldn't the policy by an attribute of the switch, e.g. configurable by
devlink (maybe also not the right place)? Not sure how user space can
otherwise make correct assumptions about the state of the switch and
initiate proper countermeasures (e.g. reducing the smallest prefix
length installed to hardware).
Bye,
Hannes
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: switchdev fib offload issues
2016-04-18 15:47 switchdev fib offload issues Jiri Pirko
` (2 preceding siblings ...)
2016-04-18 17:17 ` Hannes Frederic Sowa
@ 2016-04-18 17:52 ` David Miller
2016-04-19 7:21 ` Jiri Pirko
3 siblings, 1 reply; 9+ messages in thread
From: David Miller @ 2016-04-18 17:52 UTC (permalink / raw)
To: jiri
Cc: netdev, idosch, eladr, yotamg, ogerlitz, roopa, nikolay, jhs,
john.fastabend, rami.rosen, gospo, stephen, sfeldma, dsa,
f.fainelli, andrew, vivien.didelot, tgraf, aduyck
From: Jiri Pirko <jiri@resnulli.us>
Date: Mon, 18 Apr 2016 17:47:57 +0200
> However, if for any reason the switchdev add operation fails, there is an
> abort function called (switchdev_fib_ipv4_abort). This function does two
> things which are both unfortunate in many usecases:
> 1) evicts all fib entries from HW leaving all processing done in kernel
> - For Spectrum ASIC this means that all traffic running at 100G between
> all ports is immediately downgraded to ~1-3Gbits
> - Also this happens silently, user knows nothing about anything went wrong,
> only forwarding performance suddenly sucks.
>
> 2) sets net->ipv4.fib_offload_disabled = true
> - That results in no other fib entry being offloaded, forever,
> until net is removed and added again, machine reboot is required
> in case if init_ns
>
> These 2 issues makes fib offload completely unusable. So I propose
> to start thinking about fixing this.
>
> I believe that although the current behaviour might be good for default,
> user should be able to change it by setting a different policy. This
> policy will allow to propagate offload error to user.
There were many length discussions about this.
It is extremely hard to load a partial table into the chip and
have it work correctly. This is because with longest matching
prefix you have to pull out the least specific routing entires
and process them in software.
Also, there is no communication about what makes an entry not be
insertable or not. So it may be the case that the shorter prefixes
all fit into the table, because those can be compressed and take
up less table space in the chip.
So it's extremely hard to know when "room" is available again. Room
for what? One 16-bit prefixed route? Or room for one arbitrarily
prefixed route? Which is it?
The user shouldn't need to know anything about this, and I will
be strongly against any design which puts the onus on the user
to configure a table that will fit into the chip.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: switchdev fib offload issues
2016-04-18 16:59 ` David Ahern
@ 2016-04-19 7:02 ` Jiri Pirko
0 siblings, 0 replies; 9+ messages in thread
From: Jiri Pirko @ 2016-04-19 7:02 UTC (permalink / raw)
To: David Ahern
Cc: netdev, davem, idosch, eladr, yotamg, ogerlitz, roopa, nikolay,
jhs, john.fastabend, rami.rosen, gospo, stephen, sfeldma,
f.fainelli, andrew, vivien.didelot, tgraf, aduyck
Mon, Apr 18, 2016 at 06:59:37PM CEST, dsa@cumulusnetworks.com wrote:
>On 4/18/16 9:47 AM, Jiri Pirko wrote:
>>Proposed solutions (ideas):
>>1) per-netns. Add a procfs file:
>> /proc/sys/net/ipv4/route/fib_offload_error_policy
>> with values: "evict" - default, current behaviour
>> "fail" - propagate offload error to user
>> The policy value would be stored in struct net.
>>
>>2) per-VRF/table
>> When user creates a VRF master, he specifies a table ID
>> this VRF is going to use. I propose to extend this so
>> he can pass a policy ("evict"/"fail").
>> The policy value would be stored in struct fib_table or
>> struct fib6_table. The problem is that vfr only saves
>> table ID, allocates dst but does not actually create
>> table. That might be created later. But I think this
>> could be resolved.
>
>Yes, we have a local patch where I do create the table for IPv6. Can do that
>for IPv4 as well. Some other clean ups are needed in this area - like the
>ability to delete a table
>
>>
>>3) per-VFR/master_netdev
>> In this case, the policy would be also set during
>> the creation of VFR master. From user perspective,
>> this looks same as 2)
>> The policy value would be stored in struct net_vrf (vrf private).
>
>The VRF device is really only used for guiding lookups, not inserting routes.
>
>A per table/VRF policy (option 2) seems more appropriate.
Right. Option 2 also seems better to me. Thanks.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: switchdev fib offload issues
2016-04-18 17:52 ` David Miller
@ 2016-04-19 7:21 ` Jiri Pirko
0 siblings, 0 replies; 9+ messages in thread
From: Jiri Pirko @ 2016-04-19 7:21 UTC (permalink / raw)
To: David Miller
Cc: netdev, idosch, eladr, yotamg, ogerlitz, roopa, nikolay, jhs,
john.fastabend, rami.rosen, gospo, stephen, sfeldma, dsa,
f.fainelli, andrew, vivien.didelot, tgraf, aduyck
Mon, Apr 18, 2016 at 07:52:27PM CEST, davem@davemloft.net wrote:
>From: Jiri Pirko <jiri@resnulli.us>
>Date: Mon, 18 Apr 2016 17:47:57 +0200
>
>> However, if for any reason the switchdev add operation fails, there is an
>> abort function called (switchdev_fib_ipv4_abort). This function does two
>> things which are both unfortunate in many usecases:
>> 1) evicts all fib entries from HW leaving all processing done in kernel
>> - For Spectrum ASIC this means that all traffic running at 100G between
>> all ports is immediately downgraded to ~1-3Gbits
>> - Also this happens silently, user knows nothing about anything went wrong,
>> only forwarding performance suddenly sucks.
>>
>> 2) sets net->ipv4.fib_offload_disabled = true
>> - That results in no other fib entry being offloaded, forever,
>> until net is removed and added again, machine reboot is required
>> in case if init_ns
>>
>> These 2 issues makes fib offload completely unusable. So I propose
>> to start thinking about fixing this.
>>
>> I believe that although the current behaviour might be good for default,
>> user should be able to change it by setting a different policy. This
>> policy will allow to propagate offload error to user.
>
>There were many length discussions about this.
I know. I wouln't start this one if the 2 issues I described did not exist.
>
>It is extremely hard to load a partial table into the chip and
>have it work correctly. This is because with longest matching
>prefix you have to pull out the least specific routing entires
>and process them in software.
Agreed. My my policy change did not change this behavious. the tables
would still be in since. Only capacity of kernel table would be limited
by capacity of HW table for that policy.
>
>Also, there is no communication about what makes an entry not be
>insertable or not. So it may be the case that the shorter prefixes
>all fit into the table, because those can be compressed and take
>up less table space in the chip.
>
>So it's extremely hard to know when "room" is available again. Room
>for what? One 16-bit prefixed route? Or room for one arbitrarily
>prefixed route? Which is it?
>
>The user shouldn't need to know anything about this, and I will
>be strongly against any design which puts the onus on the user
>to configure a table that will fit into the chip.
I agree. User should not care and should not speculate if some rule fits
and some other does not. He should just try to add *anything* and see if
the operation was successful or not.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: switchdev fib offload issues
2016-04-18 17:17 ` Hannes Frederic Sowa
@ 2016-04-21 13:00 ` Roopa Prabhu
0 siblings, 0 replies; 9+ messages in thread
From: Roopa Prabhu @ 2016-04-21 13:00 UTC (permalink / raw)
To: Hannes Frederic Sowa
Cc: Jiri Pirko, netdev, davem, idosch, eladr, yotamg, ogerlitz,
nikolay, jhs, john.fastabend, rami.rosen, gospo, stephen, sfeldma,
dsa, f.fainelli, andrew, vivien.didelot, tgraf, aduyck
On 4/18/16, 10:17 AM, Hannes Frederic Sowa wrote:
> Hi Jiri,
>
> On 18.04.2016 17:47, Jiri Pirko wrote:
>> Proposed solutions (ideas):
>> 1) per-netns. Add a procfs file:
>> /proc/sys/net/ipv4/route/fib_offload_error_policy
>> with values: "evict" - default, current behaviour
>> "fail" - propagate offload error to user
>> The policy value would be stored in struct net.
> >
>> 2) per-VRF/table
>> When user creates a VRF master, he specifies a table ID
>> this VRF is going to use. I propose to extend this so
>> he can pass a policy ("evict"/"fail").
>> The policy value would be stored in struct fib_table or
>> struct fib6_table. The problem is that vfr only saves
>> table ID, allocates dst but does not actually create
>> table. That might be created later. But I think this
>> could be resolved.
>>
>> 3) per-VFR/master_netdev
>> In this case, the policy would be also set during
>> the creation of VFR master. From user perspective,
>> this looks same as 2)
>> The policy value would be stored in struct net_vrf (vrf private).
>
> I agree that a fail policy is probably the way forward regarding the issues you outlined.
>
> One question though:
>
> Shouldn't the policy by an attribute of the switch, e.g. configurable by devlink (maybe also not the right place)? Not sure how user space can otherwise make correct assumptions about the state of the switch and initiate proper countermeasures (e.g. reducing the smallest prefix length installed to hardware).
>
I am with hannes here. If we introduce a policy, I think it should be global or per switchdev instance (possibly via devlink).
This would be a system policy (set via the administrator) and the user or app does not need to know about it.
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2016-04-21 13:00 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-04-18 15:47 switchdev fib offload issues Jiri Pirko
2016-04-18 16:38 ` Ilan Tayari
2016-04-18 17:11 ` David Ahern
2016-04-18 16:59 ` David Ahern
2016-04-19 7:02 ` Jiri Pirko
2016-04-18 17:17 ` Hannes Frederic Sowa
2016-04-21 13:00 ` Roopa Prabhu
2016-04-18 17:52 ` David Miller
2016-04-19 7:21 ` Jiri Pirko
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).