Netdev List
 help / color / mirror / Atom feed
* [net 0/3][pull request] Intel Wired LAN Driver Updates 2015-01-06
From: Jeff Kirsher @ 2015-01-06  8:44 UTC (permalink / raw)
  To: davem; +Cc: Jeff Kirsher, netdev, nhorman, sassmann, jogreene

This series contains fixes to i40e only.

Jesse provides a fix for when the driver was polling with interrupts
disabled the hardware would occasionally not write back descriptors.
His fix causes the driver to detect this situation and force an interrupt
to fire which will flush the stuck descriptor.

Anjali provides a couple of fixes, the first corrects an issue where
the receive port checksum error counter was incrementing incorrectly with
UDP encapsulated tunneled traffic.  The second fix resolves an issue where
the driver was examining the outer protocol layer to set the inner protocol
layer checksum offload.  In the case of TCP over IPv6 over an IPv4 based
VXLAN, the inner checksum offloads would be set to look for IPv4/UDP
instead of IPv6/TCP, so fixed the issue so that the driver will look at
the proper layer for encapsulation offload settings.

The following are changes since commit 7ce67a38f799d1fb332f672b117efbadedaa5352:
  net: ethernet: cpsw: fix hangs with interrupts
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net master

Anjali Singhai (2):
  i40e: Fix Rx checksum error counter
  i40e: Fix bug with TCP over IPv6 over VXLAN

Jesse Brandeburg (1):
  i40e: fix un-necessary Tx hangs

 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 83 ++++++++++++++++++-----------
 1 file changed, 53 insertions(+), 30 deletions(-)

-- 
1.9.3

^ permalink raw reply

* Re: [PATCH] Revert "ipw2200: select CFG80211_WEXT"
From: Johannes Berg @ 2015-01-06  8:23 UTC (permalink / raw)
  To: Paul Bolle
  Cc: Arend van Spriel, Linus Torvalds, Marcel Holtmann,
	Stanislav Yakovlev, Kalle Valo, Jiri Kosina, linux-wireless,
	Network Development, Linux Kernel Mailing List
In-Reply-To: <1420495519.14308.29.camel@x220>

On Mon, 2015-01-05 at 23:05 +0100, Paul Bolle wrote:
> On Mon, 2015-01-05 at 19:57 +0100, Johannes Berg wrote:
> > Multiple other groups of ioctls could be converted in similar patches,
> > until at the end you can completely remove ipw_wx_handlers and rely
> > entirely on cfg80211's wext compatibility.
> > 
> > So far the theory - in practice nobody cared enough to start working on
> > any of these drivers, let alone actually has the hardware today.
> 
> So my suggestion to make ipw2200 no longer use cfg80211_wext_giwname()
> would actually be backwards. What's actually needed, in theory, is to
> use more of what's provided under CFG80211_WEXT (and, I guess, less of
> what's provided under WIRELESS_EXT). Did I get that right?

Yes, though I'd argue there are multiple levels of truth here.

Yours is the theoretical, hopefully-forward-looking one where we still
expect the driver to actually be modified to take advantage of the new
frameworks (which is independent of wext support towards userspace). In
that scenario, yes, it should use more until it uses all, and then it
can stop concerning itself with wext (which would be a win because
driver/wext interaction is always finicky) (*)

Then there's the other level that you were looking at earlier - simply
removing all of this again from this driver because nobody is going to
work on it. That'd actually make sense and shrink the driver footprint
(no need to load cfg80211 for using almost nothing of it) but this is no
longer feasible since it would again break userspace API - as I said you
can today discover the presence of this device/driver with nl80211 -
that would be broken by removing cfg80211 from here.

And then there's the practical third version that's likely to happen -
i.e. nothing :-)

johannes

(*) FWIW, if done to all drivers it would allow shrinking cfg80211-wext
by a lot because all the EXPORT_SYMBOL would no longer be needed - but
come to think of it we can fix that differently.

^ permalink raw reply

* Re: [PATCH] Revert "ipw2200: select CFG80211_WEXT"
From: Johannes Berg @ 2015-01-06  8:19 UTC (permalink / raw)
  To: Arend van Spriel
  Cc: Paul Bolle, Linus Torvalds, Marcel Holtmann, Stanislav Yakovlev,
	Kalle Valo, Jiri Kosina, linux-wireless, Network Development,
	Linux Kernel Mailing List
In-Reply-To: <54AB0C75.1090204-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>

On Mon, 2015-01-05 at 23:13 +0100, Arend van Spriel wrote:
> On 01/05/15 23:05, Paul Bolle wrote:
> > On Mon, 2015-01-05 at 19:57 +0100, Johannes Berg wrote:
> >> Multiple other groups of ioctls could be converted in similar patches,
> >> until at the end you can completely remove ipw_wx_handlers and rely
> >> entirely on cfg80211's wext compatibility.
> >>
> >> So far the theory - in practice nobody cared enough to start working on
> >> any of these drivers, let alone actually has the hardware today.
> >
> > So my suggestion to make ipw2200 no longer use cfg80211_wext_giwname()
> > would actually be backwards. What's actually needed, in theory, is to
> > use more of what's provided under CFG80211_WEXT (and, I guess, less of
> > what's provided under WIRELESS_EXT). Did I get that right?
> 
> Yes, but as Johannes indicated it needs consideration what to group in 
> the patches.

Oh, that's not strictly necessary - that was just so it would actually
work after each single patch. If you wanted to, you could do it all in a
single huge patch as well :-) I was explaining though why we did the
cfg80211-wext code the way it is - which enables making smaller changes
that don't break the driver.

johannes

--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] Revert "ipw2200: select CFG80211_WEXT"
From: Kalle Valo @ 2015-01-06  8:03 UTC (permalink / raw)
  To: Paul Bolle
  Cc: Stanislav Yakovlev, Jiri Kosina, Linus Torvalds,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1420297188.2397.3.camel-IWqWACnzNjzz+pZb47iToQ@public.gmane.org>

Paul Bolle <pebolle-IWqWACnzNjzz+pZb47iToQ@public.gmane.org> writes:

> This reverts commit dddd60220f41775e634258efd1b54c6fa81ce706.
>
> The raison d'être of commit dddd60220f41 ("ipw2200: select
> CFG80211_WEXT") was reverted in commit 2d36e008739e ("Revert "cfg80211:
> make WEXT compatibility unselectable""). So revert this commit too.
>
> Signed-off-by: Paul Bolle <pebolle-IWqWACnzNjzz+pZb47iToQ@public.gmane.org>

Based on the discussion I have dropped this patch from my queue.

-- 
Kalle Valo
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] GMAC: fix simple_return.cocci warnings
From: David Miller @ 2015-01-06  7:52 UTC (permalink / raw)
  To: roger.chen
  Cc: joe, fengguang.wu, kbuild-all, peppe.cavallaro, netdev,
	linux-kernel
In-Reply-To: <54AB8B23.3080303@rock-chips.com>

From: Roger <roger.chen@rock-chips.com>
Date: Tue, 06 Jan 2015 15:13:39 +0800

> What should I do now?

Nothing, I'm simply not applying this patch.

^ permalink raw reply

* Re: [net-next PATCH v1 08/11] net: rocker: add get flow API operation
From: Scott Feldman @ 2015-01-06  7:40 UTC (permalink / raw)
  To: John Fastabend
  Cc: Thomas Graf, Jiří Pírko, Jamal Hadi Salim,
	simon.horman@netronome.com, Netdev, David S. Miller,
	Andy Gospodarek
In-Reply-To: <20141231194852.31070.72727.stgit@nitbit.x32>

On Wed, Dec 31, 2014 at 11:48 AM, John Fastabend
<john.fastabend@gmail.com> wrote:
> Add operations to get flows. I wouldn't mind cleaning this code
> up a bit but my first attempt to do this used macros which shortered
> the code up but when I was done I decided it just made the code
> unreadable and unmaintainable.
>
> I might think about it a bit more but this implementation albeit
> a bit long and repeatative is easier to understand IMO.

Dang, you put a lot of work into this one.

Something doesn't feel right though.  In this case, rocker driver just
happened to have cached all the flow/group stuff in hash tables in
software, so you don't need to query thru to the device to extract the
if_flow info.  What doesn't feel right is all the work need in the
driver.  For each and every driver.  get_flows needs to go above
driver, somehow.

Seems the caller of if_flow already knows the flows pushed down with
add_flows/del_flows, and with the err handling can't mess it up.

Is one use-case for get_flows to recover from a fatal OS/driver crash,
and to rely on hardware to recover flow set?  In this rocker example,
that's not going to work because driver didn't get thru to device to
get_flows.  I think I'd like to know more about the use-cases of
get_flows.

-scott

^ permalink raw reply

* Re: [net-next PATCH v1 05/11] net: rocker: add set flow rules
From: Scott Feldman @ 2015-01-06  7:23 UTC (permalink / raw)
  To: John Fastabend
  Cc: Thomas Graf, Jiří Pírko, Jamal Hadi Salim,
	simon.horman@netronome.com, Netdev, David S. Miller,
	Andy Gospodarek
In-Reply-To: <20141231194735.31070.55480.stgit@nitbit.x32>

On Wed, Dec 31, 2014 at 11:47 AM, John Fastabend
<john.fastabend@gmail.com> wrote:
> Implement set flow operations for existing rocker tables.
>
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> ---
>  drivers/net/ethernet/rocker/rocker.c          |  517 +++++++++++++++++++++++++
>  drivers/net/ethernet/rocker/rocker_pipeline.h |    3
>  2 files changed, 519 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/rocker/rocker.c b/drivers/net/ethernet/rocker/rocker.c
> index 4c6787a..c40c58d 100644
> --- a/drivers/net/ethernet/rocker/rocker.c
> +++ b/drivers/net/ethernet/rocker/rocker.c
> @@ -3806,6 +3806,520 @@ static struct net_flow_hdr_node **rocker_get_hgraph(struct net_device *d)
>  {
>         return rocker_header_nodes;
>  }
> +
> +static int is_valid_net_flow_action_arg(struct net_flow_action *a, int id)
> +{
> +       struct net_flow_action_arg *args = a->args;
> +       int i;
> +
> +       for (i = 0; args[i].type != NET_FLOW_ACTION_ARG_TYPE_NULL; i++) {
> +               if (a->args[i].type == NET_FLOW_ACTION_ARG_TYPE_NULL ||
> +                   args[i].type != a->args[i].type)
> +                       return -EINVAL;
> +       }
> +
> +       return 0;
> +}
> +
> +static int is_valid_net_flow_action(struct net_flow_action *a, int *actions)
> +{
> +       int i;
> +
> +       for (i = 0; actions[i]; i++) {
> +               if (actions[i] == a->uid)
> +                       return is_valid_net_flow_action_arg(a, a->uid);
> +       }
> +       return -EINVAL;
> +}
> +
> +static int is_valid_net_flow_match(struct net_flow_field_ref *f,
> +                                  struct net_flow_field_ref *fields)
> +{
> +       int i;
> +
> +       for (i = 0; fields[i].header; i++) {
> +               if (f->header == fields[i].header &&
> +                   f->field == fields[i].field)
> +                       return 0;
> +       }
> +
> +       return -EINVAL;
> +}
> +
> +int is_valid_net_flow(struct net_flow_table *table, struct net_flow_flow *flow)
> +{
> +       struct net_flow_field_ref *fields = table->matches;
> +       int *actions = table->actions;
> +       int i, err;
> +
> +       for (i = 0; flow->actions[i].uid; i++) {
> +               err = is_valid_net_flow_action(&flow->actions[i], actions);
> +               if (err)
> +                       return -EINVAL;
> +       }
> +
> +       for (i = 0; flow->matches[i].header; i++) {
> +               err = is_valid_net_flow_match(&flow->matches[i], fields);
> +               if (err)
> +                       return -EINVAL;
> +       }
> +
> +       return 0;
> +}

All the above doesn't look rocker-specific...up-level?

> +
> +static u32 rocker_goto_value(u32 id)
> +{
> +       switch (id) {
> +       case ROCKER_FLOW_TABLE_ID_INGRESS_PORT:
> +               return ROCKER_OF_DPA_TABLE_ID_INGRESS_PORT;
> +       case ROCKER_FLOW_TABLE_ID_VLAN:
> +               return ROCKER_OF_DPA_TABLE_ID_VLAN;
> +       case ROCKER_FLOW_TABLE_ID_TERMINATION_MAC:
> +               return ROCKER_OF_DPA_TABLE_ID_TERMINATION_MAC;
> +       case ROCKER_FLOW_TABLE_ID_UNICAST_ROUTING:
> +               return ROCKER_OF_DPA_TABLE_ID_UNICAST_ROUTING;
> +       case ROCKER_FLOW_TABLE_ID_MULTICAST_ROUTING:
> +               return ROCKER_OF_DPA_TABLE_ID_MULTICAST_ROUTING;
> +       case ROCKER_FLOW_TABLE_ID_BRIDGING:
> +               return ROCKER_OF_DPA_TABLE_ID_BRIDGING;
> +       case ROCKER_FLOW_TABLE_ID_ACL_POLICY:
> +               return ROCKER_OF_DPA_TABLE_ID_ACL_POLICY;
> +       default:
> +               return 0;
> +       }
> +}

Could the OF-DPA table IDs be used in the flow table defs?  I think I
remember your answer was no because OF-DPA uses INGRESS_PORT ID == 0,
and 0 is a special value for if_flow tables.  Bummer.

> +
> +static int rocker_flow_set_ig_port(struct net_device *dev,
> +                                  struct net_flow_flow *flow)
> +{
> +       struct rocker_port *rocker_port = netdev_priv(dev);
> +       enum rocker_of_dpa_table_id goto_tbl;
> +       u32 in_lport_mask = 0xffff0000;
> +       u32 in_lport = 0;

why initialize these two?

> +       int err, flags = 0;
> +
> +       err = is_valid_net_flow(&ingress_port_table, flow);
> +       if (err)
> +               return err;
> +
> +       /* ingress port table only supports one field/mask/action this
> +        * simplifies the key construction and we can assume the values
> +        * are the correct types/mask/action by valid check above. The
> +        * user could pass multiple match/actions in a message with the
> +        * same field multiple times currently the valid test does not
> +        * catch this and we just use the first specified.
> +        */
> +       in_lport = flow->matches[0].value_u32;
> +       in_lport_mask = flow->matches[0].mask_u32;
> +       goto_tbl = rocker_goto_value(flow->actions[0].args[0].value_u16);
> +
> +       err = rocker_flow_tbl_ig_port(rocker_port, flags,
> +                                     in_lport, in_lport_mask,
> +                                     goto_tbl);
> +       return err;
> +}
> +
> +static int rocker_flow_set_vlan(struct net_device *dev,
> +                               struct net_flow_flow *flow)
> +{
> +       enum rocker_of_dpa_table_id goto_tbl;
> +       struct rocker_port *rocker_port = netdev_priv(dev);

rocker style thing: put rocker_port decl first (sorry for being so pedantic).

> +       int i, err = 0, flags = 0;
> +       u32 in_lport;
> +       __be16 vlan_id, vlan_id_mask, new_vlan_id;
> +       bool untagged, have_in_lport = false;
> +
> +       err = is_valid_net_flow(&vlan_table, flow);
> +       if (err)
> +               return err;
> +
> +       goto_tbl = ROCKER_OF_DPA_TABLE_ID_TERMINATION_MAC;
> +
> +       /* If user does not specify vid match default to any */
> +       vlan_id = 1;

htons()?

Not sure.  Rocker convention is vlan_id is network-order, but some
places you'll see vid and that's host-order.

> +       vlan_id_mask = 0;
> +
> +       for (i = 0; flow->matches && flow->matches[i].instance; i++) {
> +               switch (flow->matches[i].instance) {
> +               case HEADER_INSTANCE_IN_LPORT:
> +                       in_lport = flow->matches[i].value_u32;
> +                       have_in_lport = true;
> +                       break;
> +               case HEADER_INSTANCE_VLAN_OUTER:
> +                       if (flow->matches[i].field != HEADER_VLAN_VID)
> +                               break;
> +
> +                       vlan_id = htons(flow->matches[i].value_u16);
> +                       vlan_id_mask = htons(flow->matches[i].mask_u16);
> +                       break;
> +               default:
> +                       return -EINVAL;
> +               }
> +       }
> +
> +       /* If user does not specify a new vlan id use default vlan id */
> +       new_vlan_id = rocker_port_vid_to_vlan(rocker_port, vlan_id, &untagged);
> +
> +       for (i = 0; flow->actions && flow->actions[i].uid; i++) {
> +               struct net_flow_action_arg *arg = &flow->actions[i].args[0];
> +
> +               switch (flow->actions[i].uid) {
> +               case ACTION_SET_GOTO_TABLE:
> +                       goto_tbl = rocker_goto_value(arg->value_u16);
> +                       break;
> +               case ACTION_SET_VLAN_ID:
> +                       new_vlan_id = htons(arg->value_u16);
> +                       if (new_vlan_id)
> +                               untagged = false;
> +                       break;
> +               }
> +       }
> +
> +       if (!have_in_lport)
> +               return -EINVAL;

This can be moved up, before second for loop

> +
> +       err = rocker_flow_tbl_vlan(rocker_port, flags, in_lport,
> +                                  vlan_id, vlan_id_mask, goto_tbl,
> +                                  untagged, new_vlan_id);
> +       return err;
> +}
> +
> +static int rocker_flow_set_term_mac(struct net_device *dev,
> +                                   struct net_flow_flow *flow)
> +{
> +       struct rocker_port *rocker_port = netdev_priv(dev);
> +       __be16 vlan_id, vlan_id_mask, ethtype = 0;
> +       const u8 *eth_dst, *eth_dst_mask;
> +       u32 in_lport, in_lport_mask;
> +       int i, err = 0, flags = 0;
> +       bool copy_to_cpu;
> +
> +       eth_dst = NULL;
> +       eth_dst_mask = NULL;
> +

Needed?

> +       err = is_valid_net_flow(&term_mac_table, flow);
> +       if (err)
> +               return err;
> +
> +       /* If user does not specify vid match default to any */
> +       vlan_id = rocker_port->internal_vlan_id;
> +       vlan_id_mask = 0;
> +
> +       /* If user does not specify in_lport match default to any */
> +       in_lport = rocker_port->lport;
> +       in_lport_mask = 0;
> +
> +       /* If user does not specify a mac address match any */
> +       eth_dst = rocker_port->dev->dev_addr;
> +       eth_dst_mask = zero_mac;
> +
> +       for (i = 0; flow->matches && flow->matches[i].instance; i++) {
> +               switch (flow->matches[i].instance) {
> +               case HEADER_INSTANCE_IN_LPORT:
> +                       in_lport = flow->matches[i].value_u32;
> +                       in_lport_mask = flow->matches[i].mask_u32;
> +                       break;
> +               case HEADER_INSTANCE_VLAN_OUTER:
> +                       if (flow->matches[i].field != HEADER_VLAN_VID)
> +                               break;
> +
> +                       vlan_id = htons(flow->matches[i].value_u16);
> +                       vlan_id_mask = htons(flow->matches[i].mask_u16);
> +                       break;
> +               case HEADER_INSTANCE_ETHERNET:
> +                       switch (flow->matches[i].field) {
> +                       case HEADER_ETHERNET_DST_MAC:
> +                               eth_dst = (u8 *)&flow->matches[i].value_u64;
> +                               eth_dst_mask = (u8 *)&flow->matches[i].mask_u64;
> +                               break;
> +                       case HEADER_ETHERNET_ETHERTYPE:
> +                               ethtype = htons(flow->matches[i].value_u16);
> +                               break;
> +                       default:
> +                               return -EINVAL;
> +                       }
> +                       break;
> +               default:
> +                       return -EINVAL;
> +               }
> +       }
> +
> +       if (!ethtype)
> +               return -EINVAL;
> +
> +       /* By default do not copy to cpu */
> +       copy_to_cpu = false;
> +
> +       for (i = 0; flow->actions && flow->actions[i].uid; i++) {
> +               switch (flow->actions[i].uid) {
> +               case ACTION_COPY_TO_CPU:
> +                       copy_to_cpu = true;
> +                       break;
> +               default:
> +                       return -EINVAL;
> +               }
> +       }
> +
> +       err = rocker_flow_tbl_term_mac(rocker_port, in_lport, in_lport_mask,
> +                                      ethtype, eth_dst, eth_dst_mask,
> +                                      vlan_id, vlan_id_mask,
> +                                      copy_to_cpu, flags);
> +       return err;
> +}
> +
> +static int rocker_flow_set_ucast_routing(struct net_device *dev,
> +                                        struct net_flow_flow *flow)
> +{
> +       return -EOPNOTSUPP;
> +}
> +
> +static int rocker_flow_set_mcast_routing(struct net_device *dev,
> +                                        struct net_flow_flow *flow)
> +{
> +       return -EOPNOTSUPP;
> +}
> +
> +static int rocker_flow_set_bridge(struct net_device *dev,
> +                                 struct net_flow_flow *flow)
> +{
> +       enum rocker_of_dpa_table_id goto_tbl;
> +       struct rocker_port *rocker_port = netdev_priv(dev);
> +       u32 in_lport, in_lport_mask, group_id, tunnel_id;
> +       __be16 vlan_id, vlan_id_mask;
> +       const u8 *eth_dst, *eth_dst_mask;
> +       int i, err = 0, flags = 0;
> +       bool copy_to_cpu;
> +
> +       err = is_valid_net_flow(&bridge_table, flow);
> +       if (err)
> +               return err;
> +
> +       goto_tbl = ROCKER_OF_DPA_TABLE_ID_ACL_POLICY;
> +
> +       /* If user does not specify vid match default to any */
> +       vlan_id = rocker_port->internal_vlan_id;
> +       vlan_id_mask = 0;
> +
> +       /* If user does not specify in_lport match default to any */
> +       in_lport = rocker_port->lport;
> +       in_lport_mask = 0;
> +
> +       /* If user does not specify a mac address match any */
> +       eth_dst = rocker_port->dev->dev_addr;
> +       eth_dst_mask = NULL;
> +
> +       /* Do not support for tunnel_id yet. */
> +       tunnel_id = 0;
> +
> +       for (i = 0; flow->matches && flow->matches[i].instance; i++) {
> +               switch (flow->matches[i].instance) {
> +               case HEADER_INSTANCE_IN_LPORT:
> +                       in_lport = flow->matches[i].value_u32;
> +                       in_lport_mask = flow->matches[i].mask_u32;
> +                       break;
> +               case HEADER_INSTANCE_VLAN_OUTER:
> +                       if (flow->matches[i].field != HEADER_VLAN_VID)
> +                               break;
> +
> +                       vlan_id = htons(flow->matches[i].value_u16);
> +                       vlan_id_mask = htons(flow->matches[i].mask_u16);
> +                       break;
> +               case HEADER_INSTANCE_ETHERNET:
> +                       switch (flow->matches[i].field) {
> +                       case HEADER_ETHERNET_DST_MAC:
> +                               eth_dst = (u8 *)&flow->matches[i].value_u64;
> +                               eth_dst_mask = (u8 *)&flow->matches[i].mask_u64;
> +                               break;
> +                       default:
> +                               return -EINVAL;
> +                       }
> +                       break;
> +               default:
> +                       return -EINVAL;
> +               }
> +       }
> +
> +       /* By default do not copy to cpu and skip group assignment */
> +       copy_to_cpu = false;
> +       group_id = ROCKER_GROUP_NONE;
> +
> +       for (i = 0; flow->actions && flow->actions[i].uid; i++) {
> +               struct net_flow_action_arg *arg = &flow->actions[i].args[0];
> +
> +               switch (flow->actions[i].uid) {
> +               case ACTION_SET_GOTO_TABLE:
> +                       goto_tbl = rocker_goto_value(arg->value_u16);
> +                       break;
> +               case ACTION_COPY_TO_CPU:
> +                       copy_to_cpu = true;
> +                       break;
> +               case ACTION_SET_GROUP_ID:
> +                       group_id = arg->value_u32;
> +                       break;
> +               default:
> +                       return -EINVAL;
> +               }
> +       }
> +
> +       /* Ignoring eth_dst_mask it seems to cause a EINVAL return code */
> +       err = rocker_flow_tbl_bridge(rocker_port, flags,
> +                                    eth_dst, eth_dst_mask,
> +                                    vlan_id, tunnel_id,
> +                                    goto_tbl, group_id, copy_to_cpu);
> +       return err;
> +}
> +
> +static int rocker_flow_set_acl(struct net_device *dev,
> +                              struct net_flow_flow *flow)
> +{
> +       struct rocker_port *rocker_port = netdev_priv(dev);
> +       u32 in_lport, in_lport_mask, group_id, tunnel_id;
> +       __be16 vlan_id, vlan_id_mask, ethtype = 0;
> +       const u8 *eth_dst, *eth_src, *eth_dst_mask, *eth_src_mask;
> +       u8 protocol, protocol_mask, dscp, dscp_mask;
> +       int i, err = 0, flags = 0;
> +
> +       err = is_valid_net_flow(&bridge_table, flow);
> +       if (err)
> +               return err;
> +
> +       /* If user does not specify vid match default to any */
> +       vlan_id = rocker_port->internal_vlan_id;
> +       vlan_id_mask = 0;
> +
> +       /* If user does not specify in_lport match default to any */
> +       in_lport = rocker_port->lport;
> +       in_lport_mask = 0;
> +
> +       /* If user does not specify a mac address match any */
> +       eth_dst = rocker_port->dev->dev_addr;
> +       eth_src = zero_mac;
> +       eth_dst_mask = NULL;
> +       eth_src_mask = NULL;
> +
> +       /* If user does not set protocol/dscp mask them out */
> +       protocol = 0;
> +       dscp = 0;
> +       protocol_mask = 0;
> +       dscp_mask = 0;
> +
> +       /* Do not support for tunnel_id yet. */
> +       tunnel_id = 0;
> +
> +       for (i = 0; flow->matches && flow->matches[i].instance; i++) {
> +               switch (flow->matches[i].instance) {
> +               case HEADER_INSTANCE_IN_LPORT:
> +                       in_lport = flow->matches[i].value_u32;
> +                       in_lport_mask = flow->matches[i].mask_u32;
> +                       break;
> +               case HEADER_INSTANCE_VLAN_OUTER:
> +                       if (flow->matches[i].field != HEADER_VLAN_VID)
> +                               break;
> +
> +                       vlan_id = htons(flow->matches[i].value_u16);
> +                       vlan_id_mask = htons(flow->matches[i].mask_u16);
> +                       break;
> +               case HEADER_INSTANCE_ETHERNET:
> +                       switch (flow->matches[i].field) {
> +                       case HEADER_ETHERNET_SRC_MAC:
> +                               eth_src = (u8 *)&flow->matches[i].value_u64;
> +                               eth_src_mask = (u8 *)&flow->matches[i].mask_u64;
> +                               break;
> +                       case HEADER_ETHERNET_DST_MAC:
> +                               eth_dst = (u8 *)&flow->matches[i].value_u64;
> +                               eth_dst_mask = (u8 *)&flow->matches[i].mask_u64;
> +                               break;
> +                       case HEADER_ETHERNET_ETHERTYPE:
> +                               ethtype = htons(flow->matches[i].value_u16);
> +                               break;
> +                       default:
> +                               return -EINVAL;
> +                       }
> +                       break;
> +               case HEADER_INSTANCE_IPV4:
> +                       switch (flow->matches[i].field) {
> +                       case HEADER_IPV4_PROTOCOL:
> +                               protocol = flow->matches[i].value_u8;
> +                               protocol_mask = flow->matches[i].mask_u8;
> +                               break;
> +                       case HEADER_IPV4_DSCP:
> +                               dscp = flow->matches[i].value_u8;
> +                               dscp_mask = flow->matches[i].mask_u8;
> +                               break;
> +                       default:
> +                               return -EINVAL;
> +                       }
> +               default:
> +                       return -EINVAL;
> +               }
> +       }
> +
> +       /* By default do not copy to cpu and skip group assignment */
> +       group_id = ROCKER_GROUP_NONE;
> +
> +       for (i = 0; flow->actions && flow->actions[i].uid; i++) {
> +               switch (flow->actions[i].uid) {
> +               case ACTION_SET_GROUP_ID:
> +                       group_id = flow->actions[i].args[0].value_u32;
> +                       break;
> +               default:
> +                       return -EINVAL;
> +               }
> +       }
> +
> +       err = rocker_flow_tbl_acl(rocker_port, flags,
> +                                 in_lport, in_lport_mask,
> +                                 eth_src, eth_src_mask,
> +                                 eth_dst, eth_dst_mask, ethtype,
> +                                 vlan_id, vlan_id_mask,
> +                                 protocol, protocol_mask,
> +                                 dscp, dscp_mask,
> +                                 group_id);
> +       return err;
> +}
> +
> +static int rocker_set_flows(struct net_device *dev,
> +                           struct net_flow_flow *flow)
> +{
> +       int err = -EINVAL;
> +
> +       if (!flow->matches || !flow->actions)
> +               return -EINVAL;
> +
> +       switch (flow->table_id) {
> +       case ROCKER_FLOW_TABLE_ID_INGRESS_PORT:
> +               err = rocker_flow_set_ig_port(dev, flow);
> +               break;
> +       case ROCKER_FLOW_TABLE_ID_VLAN:
> +               err = rocker_flow_set_vlan(dev, flow);
> +               break;
> +       case ROCKER_FLOW_TABLE_ID_TERMINATION_MAC:
> +               err = rocker_flow_set_term_mac(dev, flow);
> +               break;
> +       case ROCKER_FLOW_TABLE_ID_UNICAST_ROUTING:
> +               err = rocker_flow_set_ucast_routing(dev, flow);
> +               break;
> +       case ROCKER_FLOW_TABLE_ID_MULTICAST_ROUTING:
> +               err = rocker_flow_set_mcast_routing(dev, flow);
> +               break;
> +       case ROCKER_FLOW_TABLE_ID_BRIDGING:
> +               err = rocker_flow_set_bridge(dev, flow);
> +               break;
> +       case ROCKER_FLOW_TABLE_ID_ACL_POLICY:
> +               err = rocker_flow_set_acl(dev, flow);
> +               break;
> +       default:
> +               break;
> +       }
> +
> +       return err;
> +}
> +
> +static int rocker_del_flows(struct net_device *dev,
> +                           struct net_flow_flow *flow)
> +{
> +       return -EOPNOTSUPP;
> +}
>  #endif
>
>  static const struct net_device_ops rocker_port_netdev_ops = {
> @@ -3828,6 +4342,9 @@ static const struct net_device_ops rocker_port_netdev_ops = {
>         .ndo_flow_get_actions           = rocker_get_actions,
>         .ndo_flow_get_tbl_graph         = rocker_get_tgraph,
>         .ndo_flow_get_hdr_graph         = rocker_get_hgraph,
> +
> +       .ndo_flow_set_flows             = rocker_set_flows,
> +       .ndo_flow_del_flows             = rocker_del_flows,
>  #endif
>  };

Looks good overall to me

> diff --git a/drivers/net/ethernet/rocker/rocker_pipeline.h b/drivers/net/ethernet/rocker/rocker_pipeline.h
> index 9544339..701e139 100644
> --- a/drivers/net/ethernet/rocker/rocker_pipeline.h
> +++ b/drivers/net/ethernet/rocker/rocker_pipeline.h
> @@ -527,6 +527,7 @@ enum rocker_flow_table_id_space {
>         ROCKER_FLOW_TABLE_ID_VLAN,
>         ROCKER_FLOW_TABLE_ID_TERMINATION_MAC,
>         ROCKER_FLOW_TABLE_ID_UNICAST_ROUTING,
> +       ROCKER_FLOW_TABLE_ID_MULTICAST_ROUTING,
>         ROCKER_FLOW_TABLE_ID_BRIDGING,
>         ROCKER_FLOW_TABLE_ID_ACL_POLICY,
>         ROCKER_FLOW_TABLE_NULL = 0,
> @@ -588,7 +589,7 @@ struct net_flow_table acl_table = {
>
>  struct net_flow_table null_table = {
>         .name = "",
> -       .uid = 0,
> +       .uid = ROCKER_FLOW_TABLE_NULL,
>         .source = 0,
>         .size = 0,
>         .matches = NULL,
>

Move these changes to previous patch?

^ permalink raw reply

* Re: [PATCH] GMAC: fix simple_return.cocci warnings
From: Joe Perches @ 2015-01-06  7:26 UTC (permalink / raw)
  To: Roger
  Cc: David Miller, fengguang.wu, kbuild-all, peppe.cavallaro, netdev,
	linux-kernel
In-Reply-To: <54AB8B23.3080303@rock-chips.com>

On Tue, 2015-01-06 at 15:13 +0800, Roger wrote:
> What should I do now?

I think it would be better to change
"int gmac_clk_enable" to "void gmac_clk_enable"
(it always returns 0)

This function should simply call gmac_clk_enable
and return 0;

> >>> --- a/drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c
> >>> +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c
> >>> @@ -422,11 +422,7 @@ static int rk_gmac_init(struct platform_
> >>>   	if (ret)
> >>>   		return ret;
> >>>   
> >>> -	ret = gmac_clk_enable(bsp_priv, true);
> >>> -	if (ret)
> >>> -		return ret;
> >>> -
> >>> -	return 0;
> >>> +	return gmac_clk_enable(bsp_priv, true);
> >> I think this change is not particularly better.
> >>
> >> When the pattern is multiply repeated like:
> >   ...
> >> I think it's better to not change the last
> >> test in the sequence just to minimize overall
> >> line count.
> > I think it's a wash and that both ways are about the same to me.
> >
> > I won't apply this, sorry.

^ permalink raw reply

* [PATCH net-next 5/5] tipc: convert tipc reference table to use generic rhashtable
From: Ying Xue @ 2015-01-06  7:23 UTC (permalink / raw)
  To: tgraf; +Cc: jon.maloy, netdev, Paul.Gortmaker, tipc-discussion, davem
In-Reply-To: <1420529003-22244-1-git-send-email-ying.xue@windriver.com>

As tipc reference table is statically allocated, its memory size
requested on stack initialization stage is quite big even if the
maximum port number is just restricted to 8191 currently, however,
the number already becomes insufficient in practice. But if the
maximum ports is allowed to its theory value - 2^32, its consumed
memory size will reach a ridiculously unacceptable value. Apart from
this, heavy tipc users spend a considerable amount of time in
tipc_sk_get() due to the read-lock on ref_table_lock.

If tipc reference table is converted with generic rhashtable, above
mentioned both disadvantages would be resolved respectively: making
use of the new resizable hash table can avoid locking on the lookup;
smaller memory size is required at initial stage, for example, 256
hash bucket slots are requested at the beginning phase instead of
allocating the entire 8191 slots in old mode. The hash table will
grow if entries exceeds 75% of table size up to a total table size
of 1M, and it will automatically shrink if usage falls below 30%,
but the minimum table size is allowed down to 256.

Also converts ref_table_lock to a separate mutex to protect hash table
mutations on write side. Lastly defers the release of the socket
reference using call_rcu() to allow using an RCU read-side protected
call to rhashtable_lookup().

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Erik Hugne <erik.hugne@ericsson.com>
Cc: Thomas Graf <tgraf@suug.ch>
---
 net/tipc/Kconfig  |   12 --
 net/tipc/config.c |   24 +--
 net/tipc/core.c   |   10 +-
 net/tipc/core.h   |    3 -
 net/tipc/socket.c |  480 +++++++++++++++++++----------------------------------
 net/tipc/socket.h |    4 +-
 6 files changed, 180 insertions(+), 353 deletions(-)

diff --git a/net/tipc/Kconfig b/net/tipc/Kconfig
index c890848..91c8a8e 100644
--- a/net/tipc/Kconfig
+++ b/net/tipc/Kconfig
@@ -20,18 +20,6 @@ menuconfig TIPC
 
 	  If in doubt, say N.
 
-config TIPC_PORTS
-	int "Maximum number of ports in a node"
-	depends on TIPC
-	range 127 65535
-	default "8191"
-	help
-	  Specifies how many ports can be supported by a node.
-	  Can range from 127 to 65535 ports; default is 8191.
-
-	  Setting this to a smaller value saves some memory,
-	  setting it to higher allows for more ports.
-
 config TIPC_MEDIA_IB
 	bool "InfiniBand media type support"
 	depends on TIPC && INFINIBAND_IPOIB
diff --git a/net/tipc/config.c b/net/tipc/config.c
index 876f4c6..0b3a90e 100644
--- a/net/tipc/config.c
+++ b/net/tipc/config.c
@@ -183,22 +183,6 @@ static struct sk_buff *cfg_set_own_addr(void)
 	return tipc_cfg_reply_error_string("cannot change to network mode");
 }
 
-static struct sk_buff *cfg_set_max_ports(void)
-{
-	u32 value;
-
-	if (!TLV_CHECK(req_tlv_area, req_tlv_space, TIPC_TLV_UNSIGNED))
-		return tipc_cfg_reply_error_string(TIPC_CFG_TLV_ERROR);
-	value = ntohl(*(__be32 *)TLV_DATA(req_tlv_area));
-	if (value == tipc_max_ports)
-		return tipc_cfg_reply_none();
-	if (value < 127 || value > 65535)
-		return tipc_cfg_reply_error_string(TIPC_CFG_INVALID_VALUE
-						   " (max ports must be 127-65535)");
-	return tipc_cfg_reply_error_string(TIPC_CFG_NOT_SUPPORTED
-		" (cannot change max ports while TIPC is active)");
-}
-
 static struct sk_buff *cfg_set_netid(void)
 {
 	u32 value;
@@ -285,15 +269,9 @@ struct sk_buff *tipc_cfg_do_cmd(u32 orig_node, u16 cmd, const void *request_area
 	case TIPC_CMD_SET_NODE_ADDR:
 		rep_tlv_buf = cfg_set_own_addr();
 		break;
-	case TIPC_CMD_SET_MAX_PORTS:
-		rep_tlv_buf = cfg_set_max_ports();
-		break;
 	case TIPC_CMD_SET_NETID:
 		rep_tlv_buf = cfg_set_netid();
 		break;
-	case TIPC_CMD_GET_MAX_PORTS:
-		rep_tlv_buf = tipc_cfg_reply_unsigned(tipc_max_ports);
-		break;
 	case TIPC_CMD_GET_NETID:
 		rep_tlv_buf = tipc_cfg_reply_unsigned(tipc_net_id);
 		break;
@@ -317,6 +295,8 @@ struct sk_buff *tipc_cfg_do_cmd(u32 orig_node, u16 cmd, const void *request_area
 	case TIPC_CMD_SET_REMOTE_MNG:
 	case TIPC_CMD_GET_REMOTE_MNG:
 	case TIPC_CMD_DUMP_LOG:
+	case TIPC_CMD_SET_MAX_PORTS:
+	case TIPC_CMD_GET_MAX_PORTS:
 		rep_tlv_buf = tipc_cfg_reply_error_string(TIPC_CFG_NOT_SUPPORTED
 							  " (obsolete command)");
 		break;
diff --git a/net/tipc/core.c b/net/tipc/core.c
index a5737b8..71b2ada 100644
--- a/net/tipc/core.c
+++ b/net/tipc/core.c
@@ -34,6 +34,8 @@
  * POSSIBILITY OF SUCH DAMAGE.
  */
 
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
 #include "core.h"
 #include "name_table.h"
 #include "subscr.h"
@@ -47,7 +49,6 @@ int tipc_random __read_mostly;
 
 /* configurable TIPC parameters */
 u32 tipc_own_addr __read_mostly;
-int tipc_max_ports __read_mostly;
 int tipc_net_id __read_mostly;
 int sysctl_tipc_rmem[3] __read_mostly;	/* min/default/max */
 
@@ -84,9 +85,9 @@ static void tipc_core_stop(void)
 	tipc_netlink_stop();
 	tipc_subscr_stop();
 	tipc_nametbl_stop();
-	tipc_sk_ref_table_stop();
 	tipc_socket_stop();
 	tipc_unregister_sysctl();
+	tipc_sk_rht_destroy();
 }
 
 /**
@@ -98,7 +99,7 @@ static int tipc_core_start(void)
 
 	get_random_bytes(&tipc_random, sizeof(tipc_random));
 
-	err = tipc_sk_ref_table_init(tipc_max_ports, tipc_random);
+	err = tipc_sk_rht_init();
 	if (err)
 		goto out_reftbl;
 
@@ -138,7 +139,7 @@ out_socket:
 out_netlink:
 	tipc_nametbl_stop();
 out_nametbl:
-	tipc_sk_ref_table_stop();
+	tipc_sk_rht_destroy();
 out_reftbl:
 	return err;
 }
@@ -150,7 +151,6 @@ static int __init tipc_init(void)
 	pr_info("Activated (version " TIPC_MOD_VER ")\n");
 
 	tipc_own_addr = 0;
-	tipc_max_ports = CONFIG_TIPC_PORTS;
 	tipc_net_id = 4711;
 
 	sysctl_tipc_rmem[0] = TIPC_CONN_OVERLOAD_LIMIT >> 4 <<
diff --git a/net/tipc/core.h b/net/tipc/core.h
index 8460213..56fe422 100644
--- a/net/tipc/core.h
+++ b/net/tipc/core.h
@@ -37,8 +37,6 @@
 #ifndef _TIPC_CORE_H
 #define _TIPC_CORE_H
 
-#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
-
 #include <linux/tipc.h>
 #include <linux/tipc_config.h>
 #include <linux/tipc_netlink.h>
@@ -79,7 +77,6 @@ int tipc_snprintf(char *buf, int len, const char *fmt, ...);
  * Global configuration variables
  */
 extern u32 tipc_own_addr __read_mostly;
-extern int tipc_max_ports __read_mostly;
 extern int tipc_net_id __read_mostly;
 extern int sysctl_tipc_rmem[3] __read_mostly;
 extern int sysctl_tipc_named_timeout __read_mostly;
diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 4731cad..3335ccf 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -34,22 +34,25 @@
  * POSSIBILITY OF SUCH DAMAGE.
  */
 
+#include <linux/rhashtable.h>
+#include <linux/jhash.h>
 #include "core.h"
 #include "name_table.h"
 #include "node.h"
 #include "link.h"
-#include <linux/export.h>
 #include "config.h"
 #include "socket.h"
 
-#define SS_LISTENING	-1	/* socket is listening */
-#define SS_READY	-2	/* socket is connectionless */
+#define SS_LISTENING		-1	/* socket is listening */
+#define SS_READY		-2	/* socket is connectionless */
 
-#define CONN_TIMEOUT_DEFAULT  8000	/* default connect timeout = 8s */
-#define CONN_PROBING_INTERVAL 3600000	/* [ms] => 1 h */
-#define TIPC_FWD_MSG	      1
-#define TIPC_CONN_OK          0
-#define TIPC_CONN_PROBING     1
+#define CONN_TIMEOUT_DEFAULT	8000	/* default connect timeout = 8s */
+#define CONN_PROBING_INTERVAL	3600000	/* [ms] => 1 h */
+#define TIPC_FWD_MSG		1
+#define TIPC_CONN_OK		0
+#define TIPC_CONN_PROBING	1
+#define TIPC_MAX_PORT		0xffffffff
+#define TIPC_MIN_PORT		1
 
 /**
  * struct tipc_sock - TIPC socket structure
@@ -59,7 +62,7 @@
  * @conn_instance: TIPC instance used when connection was established
  * @published: non-zero if port has one or more associated names
  * @max_pkt: maximum packet size "hint" used when building messages sent by port
- * @ref: unique reference to port in TIPC object registry
+ * @portid: unique port identity in TIPC socket hash table
  * @phdr: preformatted message header used when sending messages
  * @port_list: adjacent ports in TIPC's global list of ports
  * @publications: list of publications for port
@@ -74,6 +77,8 @@
  * @link_cong: non-zero if owner must sleep because of link congestion
  * @sent_unacked: # messages sent by socket, and not yet acked by peer
  * @rcv_unacked: # messages read by user, but not yet acked back to peer
+ * @node: hash table node
+ * @rcu: rcu struct for tipc_sock
  */
 struct tipc_sock {
 	struct sock sk;
@@ -82,7 +87,7 @@ struct tipc_sock {
 	u32 conn_instance;
 	int published;
 	u32 max_pkt;
-	u32 ref;
+	u32 portid;
 	struct tipc_msg phdr;
 	struct list_head sock_list;
 	struct list_head publications;
@@ -95,6 +100,8 @@ struct tipc_sock {
 	bool link_cong;
 	uint sent_unacked;
 	uint rcv_unacked;
+	struct rhash_head node;
+	struct rcu_head rcu;
 };
 
 static int tipc_backlog_rcv(struct sock *sk, struct sk_buff *skb);
@@ -103,16 +110,14 @@ static void tipc_write_space(struct sock *sk);
 static int tipc_release(struct socket *sock);
 static int tipc_accept(struct socket *sock, struct socket *new_sock, int flags);
 static int tipc_wait_for_sndmsg(struct socket *sock, long *timeo_p);
-static void tipc_sk_timeout(unsigned long ref);
+static void tipc_sk_timeout(unsigned long portid);
 static int tipc_sk_publish(struct tipc_sock *tsk, uint scope,
 			   struct tipc_name_seq const *seq);
 static int tipc_sk_withdraw(struct tipc_sock *tsk, uint scope,
 			    struct tipc_name_seq const *seq);
-static u32 tipc_sk_ref_acquire(struct tipc_sock *tsk);
-static void tipc_sk_ref_discard(u32 ref);
-static struct tipc_sock *tipc_sk_get(u32 ref);
-static struct tipc_sock *tipc_sk_get_next(u32 *ref);
-static void tipc_sk_put(struct tipc_sock *tsk);
+static struct tipc_sock *tipc_sk_lookup(u32 portid);
+static int tipc_sk_insert(struct tipc_sock *tsk);
+static void tipc_sk_remove(struct tipc_sock *tsk);
 
 static const struct proto_ops packet_ops;
 static const struct proto_ops stream_ops;
@@ -174,6 +179,9 @@ static const struct nla_policy tipc_nl_sock_policy[TIPC_NLA_SOCK_MAX + 1] = {
  *   - port reference
  */
 
+/* Protects tipc socket hash table mutations */
+static struct rhashtable tipc_sk_rht;
+
 static u32 tsk_peer_node(struct tipc_sock *tsk)
 {
 	return msg_destnode(&tsk->phdr);
@@ -305,7 +313,6 @@ static int tipc_sk_create(struct net *net, struct socket *sock,
 	struct sock *sk;
 	struct tipc_sock *tsk;
 	struct tipc_msg *msg;
-	u32 ref;
 
 	/* Validate arguments */
 	if (unlikely(protocol != 0))
@@ -339,24 +346,22 @@ static int tipc_sk_create(struct net *net, struct socket *sock,
 		return -ENOMEM;
 
 	tsk = tipc_sk(sk);
-	ref = tipc_sk_ref_acquire(tsk);
-	if (!ref) {
-		pr_warn("Socket create failed; reference table exhausted\n");
-		return -ENOMEM;
-	}
 	tsk->max_pkt = MAX_PKT_DEFAULT;
-	tsk->ref = ref;
 	INIT_LIST_HEAD(&tsk->publications);
 	msg = &tsk->phdr;
 	tipc_msg_init(msg, TIPC_LOW_IMPORTANCE, TIPC_NAMED_MSG,
 		      NAMED_H_SIZE, 0);
-	msg_set_origport(msg, ref);
 
 	/* Finish initializing socket data structures */
 	sock->ops = ops;
 	sock->state = state;
 	sock_init_data(sock, sk);
-	k_init_timer(&tsk->timer, (Handler)tipc_sk_timeout, ref);
+	if (tipc_sk_insert(tsk)) {
+		pr_warn("Socket create failed; port numbrer exhausted\n");
+		return -EINVAL;
+	}
+	msg_set_origport(msg, tsk->portid);
+	k_init_timer(&tsk->timer, (Handler)tipc_sk_timeout, tsk->portid);
 	sk->sk_backlog_rcv = tipc_backlog_rcv;
 	sk->sk_rcvbuf = sysctl_tipc_rmem[1];
 	sk->sk_data_ready = tipc_data_ready;
@@ -442,6 +447,13 @@ int tipc_sock_accept_local(struct socket *sock, struct socket **newsock,
 	return ret;
 }
 
+static void tipc_sk_callback(struct rcu_head *head)
+{
+	struct tipc_sock *tsk = container_of(head, struct tipc_sock, rcu);
+
+	sock_put(&tsk->sk);
+}
+
 /**
  * tipc_release - destroy a TIPC socket
  * @sock: socket to destroy
@@ -491,7 +503,7 @@ static int tipc_release(struct socket *sock)
 			    (sock->state == SS_CONNECTED)) {
 				sock->state = SS_DISCONNECTING;
 				tsk->connected = 0;
-				tipc_node_remove_conn(dnode, tsk->ref);
+				tipc_node_remove_conn(dnode, tsk->portid);
 			}
 			if (tipc_msg_reverse(skb, &dnode, TIPC_ERR_NO_PORT))
 				tipc_link_xmit_skb(skb, dnode, 0);
@@ -499,16 +511,16 @@ static int tipc_release(struct socket *sock)
 	}
 
 	tipc_sk_withdraw(tsk, 0, NULL);
-	tipc_sk_ref_discard(tsk->ref);
 	k_cancel_timer(&tsk->timer);
+	tipc_sk_remove(tsk);
 	if (tsk->connected) {
 		skb = tipc_msg_create(TIPC_CRITICAL_IMPORTANCE, TIPC_CONN_MSG,
 				      SHORT_H_SIZE, 0, dnode, tipc_own_addr,
 				      tsk_peer_port(tsk),
-				      tsk->ref, TIPC_ERR_NO_PORT);
+				      tsk->portid, TIPC_ERR_NO_PORT);
 		if (skb)
-			tipc_link_xmit_skb(skb, dnode, tsk->ref);
-		tipc_node_remove_conn(dnode, tsk->ref);
+			tipc_link_xmit_skb(skb, dnode, tsk->portid);
+		tipc_node_remove_conn(dnode, tsk->portid);
 	}
 	k_term_timer(&tsk->timer);
 
@@ -518,7 +530,8 @@ static int tipc_release(struct socket *sock)
 	/* Reject any messages that accumulated in backlog queue */
 	sock->state = SS_DISCONNECTING;
 	release_sock(sk);
-	sock_put(sk);
+
+	call_rcu(&tsk->rcu, tipc_sk_callback);
 	sock->sk = NULL;
 
 	return 0;
@@ -611,7 +624,7 @@ static int tipc_getname(struct socket *sock, struct sockaddr *uaddr,
 		addr->addr.id.ref = tsk_peer_port(tsk);
 		addr->addr.id.node = tsk_peer_node(tsk);
 	} else {
-		addr->addr.id.ref = tsk->ref;
+		addr->addr.id.ref = tsk->portid;
 		addr->addr.id.node = tipc_own_addr;
 	}
 
@@ -946,7 +959,7 @@ static int tipc_sendmsg(struct kiocb *iocb, struct socket *sock,
 	}
 
 new_mtu:
-	mtu = tipc_node_get_mtu(dnode, tsk->ref);
+	mtu = tipc_node_get_mtu(dnode, tsk->portid);
 	__skb_queue_head_init(&head);
 	rc = tipc_msg_build(mhdr, m, 0, dsz, mtu, &head);
 	if (rc < 0)
@@ -955,7 +968,7 @@ new_mtu:
 	do {
 		skb = skb_peek(&head);
 		TIPC_SKB_CB(skb)->wakeup_pending = tsk->link_cong;
-		rc = tipc_link_xmit(&head, dnode, tsk->ref);
+		rc = tipc_link_xmit(&head, dnode, tsk->portid);
 		if (likely(rc >= 0)) {
 			if (sock->state != SS_READY)
 				sock->state = SS_CONNECTING;
@@ -1028,7 +1041,7 @@ static int tipc_send_stream(struct kiocb *iocb, struct socket *sock,
 	struct tipc_msg *mhdr = &tsk->phdr;
 	struct sk_buff_head head;
 	DECLARE_SOCKADDR(struct sockaddr_tipc *, dest, m->msg_name);
-	u32 ref = tsk->ref;
+	u32 portid = tsk->portid;
 	int rc = -EINVAL;
 	long timeo;
 	u32 dnode;
@@ -1067,7 +1080,7 @@ next:
 		goto exit;
 	do {
 		if (likely(!tsk_conn_cong(tsk))) {
-			rc = tipc_link_xmit(&head, dnode, ref);
+			rc = tipc_link_xmit(&head, dnode, portid);
 			if (likely(!rc)) {
 				tsk->sent_unacked++;
 				sent += send;
@@ -1076,7 +1089,7 @@ next:
 				goto next;
 			}
 			if (rc == -EMSGSIZE) {
-				tsk->max_pkt = tipc_node_get_mtu(dnode, ref);
+				tsk->max_pkt = tipc_node_get_mtu(dnode, portid);
 				goto next;
 			}
 			if (rc != -ELINKCONG)
@@ -1130,8 +1143,8 @@ static void tipc_sk_finish_conn(struct tipc_sock *tsk, u32 peer_port,
 	tsk->probing_state = TIPC_CONN_OK;
 	tsk->connected = 1;
 	k_start_timer(&tsk->timer, tsk->probing_interval);
-	tipc_node_add_conn(peer_node, tsk->ref, peer_port);
-	tsk->max_pkt = tipc_node_get_mtu(peer_node, tsk->ref);
+	tipc_node_add_conn(peer_node, tsk->portid, peer_port);
+	tsk->max_pkt = tipc_node_get_mtu(peer_node, tsk->portid);
 }
 
 /**
@@ -1238,7 +1251,7 @@ static void tipc_sk_send_ack(struct tipc_sock *tsk, uint ack)
 	if (!tsk->connected)
 		return;
 	skb = tipc_msg_create(CONN_MANAGER, CONN_ACK, INT_H_SIZE, 0, dnode,
-			      tipc_own_addr, peer_port, tsk->ref, TIPC_OK);
+			      tipc_own_addr, peer_port, tsk->portid, TIPC_OK);
 	if (!skb)
 		return;
 	msg = buf_msg(skb);
@@ -1552,7 +1565,7 @@ static int filter_connect(struct tipc_sock *tsk, struct sk_buff **buf)
 				tsk->connected = 0;
 				/* let timer expire on it's own */
 				tipc_node_remove_conn(tsk_peer_node(tsk),
-						      tsk->ref);
+						      tsk->portid);
 			}
 			retval = TIPC_OK;
 		}
@@ -1743,7 +1756,7 @@ int tipc_sk_rcv(struct sk_buff *skb)
 	u32 dnode;
 
 	/* Validate destination and message */
-	tsk = tipc_sk_get(dport);
+	tsk = tipc_sk_lookup(dport);
 	if (unlikely(!tsk)) {
 		rc = tipc_msg_eval(skb, &dnode);
 		goto exit;
@@ -1763,7 +1776,7 @@ int tipc_sk_rcv(struct sk_buff *skb)
 			rc = -TIPC_ERR_OVERLOAD;
 	}
 	spin_unlock_bh(&sk->sk_lock.slock);
-	tipc_sk_put(tsk);
+	sock_put(sk);
 	if (likely(!rc))
 		return 0;
 exit:
@@ -2050,20 +2063,20 @@ restart:
 				goto restart;
 			}
 			if (tipc_msg_reverse(skb, &dnode, TIPC_CONN_SHUTDOWN))
-				tipc_link_xmit_skb(skb, dnode, tsk->ref);
-			tipc_node_remove_conn(dnode, tsk->ref);
+				tipc_link_xmit_skb(skb, dnode, tsk->portid);
+			tipc_node_remove_conn(dnode, tsk->portid);
 		} else {
 			dnode = tsk_peer_node(tsk);
 			skb = tipc_msg_create(TIPC_CRITICAL_IMPORTANCE,
 					      TIPC_CONN_MSG, SHORT_H_SIZE,
 					      0, dnode, tipc_own_addr,
 					      tsk_peer_port(tsk),
-					      tsk->ref, TIPC_CONN_SHUTDOWN);
-			tipc_link_xmit_skb(skb, dnode, tsk->ref);
+					      tsk->portid, TIPC_CONN_SHUTDOWN);
+			tipc_link_xmit_skb(skb, dnode, tsk->portid);
 		}
 		tsk->connected = 0;
 		sock->state = SS_DISCONNECTING;
-		tipc_node_remove_conn(dnode, tsk->ref);
+		tipc_node_remove_conn(dnode, tsk->portid);
 		/* fall through */
 
 	case SS_DISCONNECTING:
@@ -2084,14 +2097,14 @@ restart:
 	return res;
 }
 
-static void tipc_sk_timeout(unsigned long ref)
+static void tipc_sk_timeout(unsigned long portid)
 {
 	struct tipc_sock *tsk;
 	struct sock *sk;
 	struct sk_buff *skb = NULL;
 	u32 peer_port, peer_node;
 
-	tsk = tipc_sk_get(ref);
+	tsk = tipc_sk_lookup(portid);
 	if (!tsk)
 		return;
 
@@ -2108,20 +2121,20 @@ static void tipc_sk_timeout(unsigned long ref)
 		/* Previous probe not answered -> self abort */
 		skb = tipc_msg_create(TIPC_CRITICAL_IMPORTANCE, TIPC_CONN_MSG,
 				      SHORT_H_SIZE, 0, tipc_own_addr,
-				      peer_node, ref, peer_port,
+				      peer_node, portid, peer_port,
 				      TIPC_ERR_NO_PORT);
 	} else {
 		skb = tipc_msg_create(CONN_MANAGER, CONN_PROBE, INT_H_SIZE,
 				      0, peer_node, tipc_own_addr,
-				      peer_port, ref, TIPC_OK);
+				      peer_port, portid, TIPC_OK);
 		tsk->probing_state = TIPC_CONN_PROBING;
 		k_start_timer(&tsk->timer, tsk->probing_interval);
 	}
 	bh_unlock_sock(sk);
 	if (skb)
-		tipc_link_xmit_skb(skb, peer_node, ref);
+		tipc_link_xmit_skb(skb, peer_node, portid);
 exit:
-	tipc_sk_put(tsk);
+	sock_put(sk);
 }
 
 static int tipc_sk_publish(struct tipc_sock *tsk, uint scope,
@@ -2132,12 +2145,12 @@ static int tipc_sk_publish(struct tipc_sock *tsk, uint scope,
 
 	if (tsk->connected)
 		return -EINVAL;
-	key = tsk->ref + tsk->pub_count + 1;
-	if (key == tsk->ref)
+	key = tsk->portid + tsk->pub_count + 1;
+	if (key == tsk->portid)
 		return -EADDRINUSE;
 
 	publ = tipc_nametbl_publish(seq->type, seq->lower, seq->upper,
-				    scope, tsk->ref, key);
+				    scope, tsk->portid, key);
 	if (unlikely(!publ))
 		return -EINVAL;
 
@@ -2188,9 +2201,9 @@ static int tipc_sk_show(struct tipc_sock *tsk, char *buf,
 		ret = tipc_snprintf(buf, len, "<%u.%u.%u:%u>:",
 				    tipc_zone(tipc_own_addr),
 				    tipc_cluster(tipc_own_addr),
-				    tipc_node(tipc_own_addr), tsk->ref);
+				    tipc_node(tipc_own_addr), tsk->portid);
 	else
-		ret = tipc_snprintf(buf, len, "%-10u:", tsk->ref);
+		ret = tipc_snprintf(buf, len, "%-10u:", tsk->portid);
 
 	if (tsk->connected) {
 		u32 dport = tsk_peer_port(tsk);
@@ -2224,13 +2237,15 @@ static int tipc_sk_show(struct tipc_sock *tsk, char *buf,
 
 struct sk_buff *tipc_sk_socks_show(void)
 {
+	const struct bucket_table *tbl;
+	struct rhash_head *pos;
 	struct sk_buff *buf;
 	struct tlv_desc *rep_tlv;
 	char *pb;
 	int pb_len;
 	struct tipc_sock *tsk;
 	int str_len = 0;
-	u32 ref = 0;
+	int i;
 
 	buf = tipc_cfg_reply_alloc(TLV_SPACE(ULTRA_STRING_MAX_LEN));
 	if (!buf)
@@ -2239,14 +2254,18 @@ struct sk_buff *tipc_sk_socks_show(void)
 	pb = TLV_DATA(rep_tlv);
 	pb_len = ULTRA_STRING_MAX_LEN;
 
-	tsk = tipc_sk_get_next(&ref);
-	for (; tsk; tsk = tipc_sk_get_next(&ref)) {
-		lock_sock(&tsk->sk);
-		str_len += tipc_sk_show(tsk, pb + str_len,
-					pb_len - str_len, 0);
-		release_sock(&tsk->sk);
-		tipc_sk_put(tsk);
+	rcu_read_lock();
+	tbl = rht_dereference_rcu((&tipc_sk_rht)->tbl, &tipc_sk_rht);
+	for (i = 0; i < tbl->size; i++) {
+		rht_for_each_entry_rcu(tsk, pos, tbl, i, node) {
+			spin_lock_bh(&tsk->sk.sk_lock.slock);
+			str_len += tipc_sk_show(tsk, pb + str_len,
+						pb_len - str_len, 0);
+			spin_unlock_bh(&tsk->sk.sk_lock.slock);
+		}
 	}
+	rcu_read_unlock();
+
 	str_len += 1;	/* for "\0" */
 	skb_put(buf, TLV_SPACE(str_len));
 	TLV_SET(rep_tlv, TIPC_TLV_ULTRA_STRING, NULL, str_len);
@@ -2259,255 +2278,91 @@ struct sk_buff *tipc_sk_socks_show(void)
  */
 void tipc_sk_reinit(void)
 {
+	const struct bucket_table *tbl;
+	struct rhash_head *pos;
+	struct tipc_sock *tsk;
 	struct tipc_msg *msg;
-	u32 ref = 0;
-	struct tipc_sock *tsk = tipc_sk_get_next(&ref);
+	int i;
 
-	for (; tsk; tsk = tipc_sk_get_next(&ref)) {
-		lock_sock(&tsk->sk);
-		msg = &tsk->phdr;
-		msg_set_prevnode(msg, tipc_own_addr);
-		msg_set_orignode(msg, tipc_own_addr);
-		release_sock(&tsk->sk);
-		tipc_sk_put(tsk);
+	rcu_read_lock();
+	tbl = rht_dereference_rcu((&tipc_sk_rht)->tbl, &tipc_sk_rht);
+	for (i = 0; i < tbl->size; i++) {
+		rht_for_each_entry_rcu(tsk, pos, tbl, i, node) {
+			spin_lock_bh(&tsk->sk.sk_lock.slock);
+			msg = &tsk->phdr;
+			msg_set_prevnode(msg, tipc_own_addr);
+			msg_set_orignode(msg, tipc_own_addr);
+			spin_unlock_bh(&tsk->sk.sk_lock.slock);
+		}
 	}
+	rcu_read_unlock();
 }
 
-/**
- * struct reference - TIPC socket reference entry
- * @tsk: pointer to socket associated with reference entry
- * @ref: reference value for socket (combines instance & array index info)
- */
-struct reference {
-	struct tipc_sock *tsk;
-	u32 ref;
-};
-
-/**
- * struct tipc_ref_table - table of TIPC socket reference entries
- * @entries: pointer to array of reference entries
- * @capacity: array index of first unusable entry
- * @init_point: array index of first uninitialized entry
- * @first_free: array index of first unused socket reference entry
- * @last_free: array index of last unused socket reference entry
- * @index_mask: bitmask for array index portion of reference values
- * @start_mask: initial value for instance value portion of reference values
- */
-struct ref_table {
-	struct reference *entries;
-	u32 capacity;
-	u32 init_point;
-	u32 first_free;
-	u32 last_free;
-	u32 index_mask;
-	u32 start_mask;
-};
-
-/* Socket reference table consists of 2**N entries.
- *
- * State	Socket ptr	Reference
- * -----        ----------      ---------
- * In use        non-NULL       XXXX|own index
- *				(XXXX changes each time entry is acquired)
- * Free            NULL         YYYY|next free index
- *				(YYYY is one more than last used XXXX)
- * Uninitialized   NULL         0
- *
- * Entry 0 is not used; this allows index 0 to denote the end of the free list.
- *
- * Note that a reference value of 0 does not necessarily indicate that an
- * entry is uninitialized, since the last entry in the free list could also
- * have a reference value of 0 (although this is unlikely).
- */
-
-static struct ref_table tipc_ref_table;
-
-static DEFINE_RWLOCK(ref_table_lock);
-
-/**
- * tipc_ref_table_init - create reference table for sockets
- */
-int tipc_sk_ref_table_init(u32 req_sz, u32 start)
+static struct tipc_sock *tipc_sk_lookup(u32 portid)
 {
-	struct reference *table;
-	u32 actual_sz;
-
-	/* account for unused entry, then round up size to a power of 2 */
-
-	req_sz++;
-	for (actual_sz = 16; actual_sz < req_sz; actual_sz <<= 1) {
-		/* do nothing */
-	};
-
-	/* allocate table & mark all entries as uninitialized */
-	table = vzalloc(actual_sz * sizeof(struct reference));
-	if (table == NULL)
-		return -ENOMEM;
-
-	tipc_ref_table.entries = table;
-	tipc_ref_table.capacity = req_sz;
-	tipc_ref_table.init_point = 1;
-	tipc_ref_table.first_free = 0;
-	tipc_ref_table.last_free = 0;
-	tipc_ref_table.index_mask = actual_sz - 1;
-	tipc_ref_table.start_mask = start & ~tipc_ref_table.index_mask;
+	struct tipc_sock *tsk;
 
-	return 0;
-}
+	rcu_read_lock();
+	tsk = rhashtable_lookup(&tipc_sk_rht, &portid);
+	if (tsk)
+		sock_hold(&tsk->sk);
+	rcu_read_unlock();
 
-/**
- * tipc_ref_table_stop - destroy reference table for sockets
- */
-void tipc_sk_ref_table_stop(void)
-{
-	if (!tipc_ref_table.entries)
-		return;
-	vfree(tipc_ref_table.entries);
-	tipc_ref_table.entries = NULL;
+	return tsk;
 }
 
-/* tipc_ref_acquire - create reference to a socket
- *
- * Register an socket pointer in the reference table.
- * Returns a unique reference value that is used from then on to retrieve the
- * socket pointer, or to determine if the socket has been deregistered.
- */
-u32 tipc_sk_ref_acquire(struct tipc_sock *tsk)
+static int tipc_sk_insert(struct tipc_sock *tsk)
 {
-	u32 index;
-	u32 index_mask;
-	u32 next_plus_upper;
-	u32 ref = 0;
-	struct reference *entry;
-
-	if (unlikely(!tsk)) {
-		pr_err("Attempt to acquire ref. to non-existent obj\n");
-		return 0;
-	}
-	if (unlikely(!tipc_ref_table.entries)) {
-		pr_err("Ref. table not found in acquisition attempt\n");
-		return 0;
-	}
-
-	/* Take a free entry, if available; otherwise initialize a new one */
-	write_lock_bh(&ref_table_lock);
-	index = tipc_ref_table.first_free;
-	entry = &tipc_ref_table.entries[index];
+	u32 remaining = (TIPC_MAX_PORT - TIPC_MIN_PORT) + 1;
+	u32 portid = prandom_u32() % remaining + TIPC_MIN_PORT;
 
-	if (likely(index)) {
-		index = tipc_ref_table.first_free;
-		entry = &tipc_ref_table.entries[index];
-		index_mask = tipc_ref_table.index_mask;
-		next_plus_upper = entry->ref;
-		tipc_ref_table.first_free = next_plus_upper & index_mask;
-		ref = (next_plus_upper & ~index_mask) + index;
-		entry->tsk = tsk;
-	} else if (tipc_ref_table.init_point < tipc_ref_table.capacity) {
-		index = tipc_ref_table.init_point++;
-		entry = &tipc_ref_table.entries[index];
-		ref = tipc_ref_table.start_mask + index;
+	while (remaining--) {
+		portid++;
+		if ((portid < TIPC_MIN_PORT) || (portid > TIPC_MAX_PORT))
+			portid = TIPC_MIN_PORT;
+		tsk->portid = portid;
+		sock_hold(&tsk->sk);
+		if (rhashtable_lookup_insert(&tipc_sk_rht, &tsk->node))
+			return 0;
+		sock_put(&tsk->sk);
 	}
 
-	if (ref) {
-		entry->ref = ref;
-		entry->tsk = tsk;
-	}
-	write_unlock_bh(&ref_table_lock);
-	return ref;
+	return -1;
 }
 
-/* tipc_sk_ref_discard - invalidate reference to an socket
- *
- * Disallow future references to an socket and free up the entry for re-use.
- */
-void tipc_sk_ref_discard(u32 ref)
+static void tipc_sk_remove(struct tipc_sock *tsk)
 {
-	struct reference *entry;
-	u32 index;
-	u32 index_mask;
-
-	if (unlikely(!tipc_ref_table.entries)) {
-		pr_err("Ref. table not found during discard attempt\n");
-		return;
-	}
-
-	index_mask = tipc_ref_table.index_mask;
-	index = ref & index_mask;
-	entry = &tipc_ref_table.entries[index];
-
-	write_lock_bh(&ref_table_lock);
+	struct sock *sk = &tsk->sk;
 
-	if (unlikely(!entry->tsk)) {
-		pr_err("Attempt to discard ref. to non-existent socket\n");
-		goto exit;
-	}
-	if (unlikely(entry->ref != ref)) {
-		pr_err("Attempt to discard non-existent reference\n");
-		goto exit;
+	if (rhashtable_remove(&tipc_sk_rht, &tsk->node)) {
+		WARN_ON(atomic_read(&sk->sk_refcnt) == 1);
+		__sock_put(sk);
 	}
-
-	/* Mark entry as unused; increment instance part of entry's
-	 *   reference to invalidate any subsequent references
-	 */
-
-	entry->tsk = NULL;
-	entry->ref = (ref & ~index_mask) + (index_mask + 1);
-
-	/* Append entry to free entry list */
-	if (unlikely(tipc_ref_table.first_free == 0))
-		tipc_ref_table.first_free = index;
-	else
-		tipc_ref_table.entries[tipc_ref_table.last_free].ref |= index;
-	tipc_ref_table.last_free = index;
-exit:
-	write_unlock_bh(&ref_table_lock);
 }
 
-/* tipc_sk_get - find referenced socket and return pointer to it
- */
-struct tipc_sock *tipc_sk_get(u32 ref)
+int tipc_sk_rht_init(void)
 {
-	struct reference *entry;
-	struct tipc_sock *tsk;
+	struct rhashtable_params rht_params = {
+		.nelem_hint = 256,
+		.head_offset = offsetof(struct tipc_sock, node),
+		.key_offset = offsetof(struct tipc_sock, portid),
+		.key_len = sizeof(u32), /* portid */
+		.hashfn = jhash,
+		.max_shift = 20, /* 1M */
+		.min_shift = 8,  /* 256 */
+		.grow_decision = rht_grow_above_75,
+		.shrink_decision = rht_shrink_below_30,
+	};
 
-	if (unlikely(!tipc_ref_table.entries))
-		return NULL;
-	read_lock_bh(&ref_table_lock);
-	entry = &tipc_ref_table.entries[ref & tipc_ref_table.index_mask];
-	tsk = entry->tsk;
-	if (likely(tsk && (entry->ref == ref)))
-		sock_hold(&tsk->sk);
-	else
-		tsk = NULL;
-	read_unlock_bh(&ref_table_lock);
-	return tsk;
+	return rhashtable_init(&tipc_sk_rht, &rht_params);
 }
 
-/* tipc_sk_get_next - lock & return next socket after referenced one
-*/
-struct tipc_sock *tipc_sk_get_next(u32 *ref)
+void tipc_sk_rht_destroy(void)
 {
-	struct reference *entry;
-	struct tipc_sock *tsk = NULL;
-	uint index = *ref & tipc_ref_table.index_mask;
+	/* Wait for socket readers to complete */
+	synchronize_net();
 
-	read_lock_bh(&ref_table_lock);
-	while (++index < tipc_ref_table.capacity) {
-		entry = &tipc_ref_table.entries[index];
-		if (!entry->tsk)
-			continue;
-		tsk = entry->tsk;
-		sock_hold(&tsk->sk);
-		*ref = entry->ref;
-		break;
-	}
-	read_unlock_bh(&ref_table_lock);
-	return tsk;
-}
-
-static void tipc_sk_put(struct tipc_sock *tsk)
-{
-	sock_put(&tsk->sk);
+	rhashtable_destroy(&tipc_sk_rht);
 }
 
 /**
@@ -2829,7 +2684,7 @@ static int __tipc_nl_add_sk(struct sk_buff *skb, struct netlink_callback *cb,
 	attrs = nla_nest_start(skb, TIPC_NLA_SOCK);
 	if (!attrs)
 		goto genlmsg_cancel;
-	if (nla_put_u32(skb, TIPC_NLA_SOCK_REF, tsk->ref))
+	if (nla_put_u32(skb, TIPC_NLA_SOCK_REF, tsk->portid))
 		goto attr_msg_cancel;
 	if (nla_put_u32(skb, TIPC_NLA_SOCK_ADDR, tipc_own_addr))
 		goto attr_msg_cancel;
@@ -2859,22 +2714,29 @@ int tipc_nl_sk_dump(struct sk_buff *skb, struct netlink_callback *cb)
 {
 	int err;
 	struct tipc_sock *tsk;
-	u32 prev_ref = cb->args[0];
-	u32 ref = prev_ref;
-
-	tsk = tipc_sk_get_next(&ref);
-	for (; tsk; tsk = tipc_sk_get_next(&ref)) {
-		lock_sock(&tsk->sk);
-		err = __tipc_nl_add_sk(skb, cb, tsk);
-		release_sock(&tsk->sk);
-		tipc_sk_put(tsk);
-		if (err)
-			break;
+	const struct bucket_table *tbl;
+	struct rhash_head *pos;
+	u32 prev_portid = cb->args[0];
+	u32 portid = prev_portid;
+	int i;
 
-		prev_ref = ref;
+	rcu_read_lock();
+	tbl = rht_dereference_rcu((&tipc_sk_rht)->tbl, &tipc_sk_rht);
+	for (i = 0; i < tbl->size; i++) {
+		rht_for_each_entry_rcu(tsk, pos, tbl, i, node) {
+			spin_lock_bh(&tsk->sk.sk_lock.slock);
+			portid = tsk->portid;
+			err = __tipc_nl_add_sk(skb, cb, tsk);
+			spin_unlock_bh(&tsk->sk.sk_lock.slock);
+			if (err)
+				break;
+
+			prev_portid = portid;
+		}
 	}
+	rcu_read_unlock();
 
-	cb->args[0] = prev_ref;
+	cb->args[0] = prev_portid;
 
 	return skb->len;
 }
@@ -2962,12 +2824,12 @@ static int __tipc_nl_list_sk_publ(struct sk_buff *skb,
 int tipc_nl_publ_dump(struct sk_buff *skb, struct netlink_callback *cb)
 {
 	int err;
-	u32 tsk_ref = cb->args[0];
+	u32 tsk_portid = cb->args[0];
 	u32 last_publ = cb->args[1];
 	u32 done = cb->args[2];
 	struct tipc_sock *tsk;
 
-	if (!tsk_ref) {
+	if (!tsk_portid) {
 		struct nlattr **attrs;
 		struct nlattr *sock[TIPC_NLA_SOCK_MAX + 1];
 
@@ -2984,13 +2846,13 @@ int tipc_nl_publ_dump(struct sk_buff *skb, struct netlink_callback *cb)
 		if (!sock[TIPC_NLA_SOCK_REF])
 			return -EINVAL;
 
-		tsk_ref = nla_get_u32(sock[TIPC_NLA_SOCK_REF]);
+		tsk_portid = nla_get_u32(sock[TIPC_NLA_SOCK_REF]);
 	}
 
 	if (done)
 		return 0;
 
-	tsk = tipc_sk_get(tsk_ref);
+	tsk = tipc_sk_lookup(tsk_portid);
 	if (!tsk)
 		return -EINVAL;
 
@@ -2999,9 +2861,9 @@ int tipc_nl_publ_dump(struct sk_buff *skb, struct netlink_callback *cb)
 	if (!err)
 		done = 1;
 	release_sock(&tsk->sk);
-	tipc_sk_put(tsk);
+	sock_put(&tsk->sk);
 
-	cb->args[0] = tsk_ref;
+	cb->args[0] = tsk_portid;
 	cb->args[1] = last_publ;
 	cb->args[2] = done;
 
diff --git a/net/tipc/socket.h b/net/tipc/socket.h
index d340893..c7d46d06 100644
--- a/net/tipc/socket.h
+++ b/net/tipc/socket.h
@@ -46,8 +46,8 @@ int tipc_sk_rcv(struct sk_buff *buf);
 struct sk_buff *tipc_sk_socks_show(void);
 void tipc_sk_mcast_rcv(struct sk_buff *buf);
 void tipc_sk_reinit(void);
-int tipc_sk_ref_table_init(u32 requested_size, u32 start);
-void tipc_sk_ref_table_stop(void);
+int tipc_sk_rht_init(void);
+void tipc_sk_rht_destroy(void);
 int tipc_nl_sk_dump(struct sk_buff *skb, struct netlink_callback *cb);
 int tipc_nl_publ_dump(struct sk_buff *skb, struct netlink_callback *cb);
 
-- 
1.7.9.5


------------------------------------------------------------------------------
Dive into the World of Parallel Programming! The Go Parallel Website,
sponsored by Intel and developed in partnership with Slashdot Media, is your
hub for all things parallel software development, from weekly thought
leadership blogs to news, videos, case studies, tutorials and more. Take a
look and join the conversation now. http://goparallel.sourceforge.net

^ permalink raw reply related

* [PATCH net-next 4/5] rhashtable: involve rhashtable_lookup_insert routine
From: Ying Xue @ 2015-01-06  7:23 UTC (permalink / raw)
  To: tgraf; +Cc: jon.maloy, netdev, Paul.Gortmaker, tipc-discussion, davem
In-Reply-To: <1420529003-22244-1-git-send-email-ying.xue@windriver.com>

Involve a new function called rhashtable_lookup_insert() which makes
lookup and insertion atomic under bucket lock protection, helping us
avoid to introduce an extra lock when we search and insert an object
into hash table.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
 include/linux/rhashtable.h |    1 +
 lib/rhashtable.c           |   98 ++++++++++++++++++++++++++++++++++++--------
 2 files changed, 83 insertions(+), 16 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index de1459c7..73c913f 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -168,6 +168,7 @@ int rhashtable_shrink(struct rhashtable *ht);
 void *rhashtable_lookup(struct rhashtable *ht, const void *key);
 void *rhashtable_lookup_compare(struct rhashtable *ht, const void *key,
 				bool (*compare)(void *, void *), void *arg);
+bool rhashtable_lookup_insert(struct rhashtable *ht, struct rhash_head *obj);
 
 void rhashtable_destroy(struct rhashtable *ht);
 
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index f5288b1..c77b472 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -504,8 +504,27 @@ static void rhashtable_wakeup_worker(struct rhashtable *ht)
 		schedule_delayed_work(&ht->run_work, 0);
 }
 
+static void __rhashtable_insert(struct rhashtable *ht, struct rhash_head *obj,
+				struct bucket_table *tbl, u32 hash)
+{
+	struct rhash_head *head = rht_dereference_bucket(tbl->buckets[hash],
+							 tbl, hash);
+
+	if (rht_is_a_nulls(head))
+		INIT_RHT_NULLS_HEAD(obj->next, ht, hash);
+	else
+		RCU_INIT_POINTER(obj->next, head);
+
+	rcu_assign_pointer(tbl->buckets[hash], obj);
+
+	atomic_inc(&ht->nelems);
+
+	/* Only grow the table if no resizing is currently in progress. */
+	rhashtable_wakeup_worker(ht);
+}
+
 /**
- * rhashtable_insert - insert object into hash hash table
+ * rhashtable_insert - insert object into hash table
  * @ht:		hash table
  * @obj:	pointer to hash head inside object
  *
@@ -522,7 +541,6 @@ static void rhashtable_wakeup_worker(struct rhashtable *ht)
 void rhashtable_insert(struct rhashtable *ht, struct rhash_head *obj)
 {
 	struct bucket_table *tbl;
-	struct rhash_head *head;
 	spinlock_t *lock;
 	unsigned hash;
 
@@ -533,20 +551,9 @@ void rhashtable_insert(struct rhashtable *ht, struct rhash_head *obj)
 	lock = bucket_lock(tbl, hash);
 
 	spin_lock_bh(lock);
-	head = rht_dereference_bucket(tbl->buckets[hash], tbl, hash);
-	if (rht_is_a_nulls(head))
-		INIT_RHT_NULLS_HEAD(obj->next, ht, hash);
-	else
-		RCU_INIT_POINTER(obj->next, head);
-
-	rcu_assign_pointer(tbl->buckets[hash], obj);
+	__rhashtable_insert(ht, obj, tbl, hash);
 	spin_unlock_bh(lock);
 
-	atomic_inc(&ht->nelems);
-
-	/* Only grow the table if no resizing is currently in progress. */
-	rhashtable_wakeup_worker(ht);
-
 	rcu_read_unlock();
 }
 EXPORT_SYMBOL_GPL(rhashtable_insert);
@@ -560,7 +567,7 @@ EXPORT_SYMBOL_GPL(rhashtable_insert);
  * walk the bucket chain upon removal. The removal operation is thus
  * considerable slow if the hash table is not correctly sized.
  *
- * Will automatically shrink the table via rhashtable_expand() if the the
+ * Will automatically shrink the table via rhashtable_expand() if the
  * shrink_decision function specified at rhashtable_init() returns true.
  *
  * The caller must ensure that no concurrent table mutations occur. It is
@@ -641,7 +648,7 @@ static bool rhashtable_compare(void *ptr, void *arg)
  * for a entry with an identical key. The first matching entry is returned.
  *
  * This lookup function may only be used for fixed key hash table (key_len
- * paramter set). It will BUG() if used inappropriately.
+ * parameter set). It will BUG() if used inappropriately.
  *
  * Lookups may occur in parallel with hashtable mutations and resizing.
  */
@@ -702,6 +709,65 @@ restart:
 }
 EXPORT_SYMBOL_GPL(rhashtable_lookup_compare);
 
+/**
+ * rhashtable_lookup_insert - lookup and insert object into hash table
+ * @ht:		hash table
+ * @obj:	pointer to hash head inside object
+ *
+ * Locks down the bucket chain in both the old and new table if a resize
+ * is in progress to ensure that writers can't remove from the old table
+ * and can't insert to the new table during the atomic operation of search
+ * and insertion. Searches for duplicates in both the old and new table if
+ * a resize is in progress.
+ *
+ * This lookup function may only be used for fixed key hash table (key_len
+ * parameter set). It will BUG() if used inappropriately.
+ *
+ * It is safe to call this function from atomic context.
+ *
+ * Will trigger an automatic deferred table resizing if the size grows
+ * beyond the watermark indicated by grow_decision() which can be passed
+ * to rhashtable_init().
+ */
+bool rhashtable_lookup_insert(struct rhashtable *ht, struct rhash_head *obj)
+{
+	struct bucket_table *new_tbl, *old_tbl;
+	spinlock_t *new_bucket_lock, *old_bucket_lock;
+	u32 new_hash, old_hash;
+	bool success = true;
+
+	BUG_ON(!ht->p.key_len);
+
+	rcu_read_lock();
+
+	old_tbl = rht_dereference_rcu(ht->tbl, ht);
+	old_hash = head_hashfn(ht, old_tbl, obj);
+	old_bucket_lock = bucket_lock(old_tbl, old_hash);
+	spin_lock_bh(old_bucket_lock);
+
+	new_tbl = rht_dereference_rcu(ht->future_tbl, ht);
+	new_hash = head_hashfn(ht, new_tbl, obj);
+	new_bucket_lock = bucket_lock(new_tbl, new_hash);
+	if (unlikely(old_tbl != new_tbl))
+		spin_lock_bh_nested(new_bucket_lock, RHT_LOCK_NESTED);
+
+	if (rhashtable_lookup(ht, rht_obj(ht, obj) + ht->p.key_offset)) {
+		success = false;
+		goto exit;
+	}
+
+	__rhashtable_insert(ht, obj, new_tbl, new_hash);
+
+exit:
+	if (unlikely(old_tbl != new_tbl))
+		spin_unlock_bh(new_bucket_lock);
+	spin_unlock_bh(old_bucket_lock);
+	rcu_read_unlock();
+
+	return success;
+}
+EXPORT_SYMBOL_GPL(rhashtable_lookup_insert);
+
 static size_t rounded_hashtable_size(struct rhashtable_params *params)
 {
 	return max(roundup_pow_of_two(params->nelem_hint * 4 / 3),
-- 
1.7.9.5


------------------------------------------------------------------------------
Dive into the World of Parallel Programming! The Go Parallel Website,
sponsored by Intel and developed in partnership with Slashdot Media, is your
hub for all things parallel software development, from weekly thought
leadership blogs to news, videos, case studies, tutorials and more. Take a
look and join the conversation now. http://goparallel.sourceforge.net

^ permalink raw reply related

* [PATCH net-next 3/5] rhashtable: use future table size to make expansion decision
From: Ying Xue @ 2015-01-06  7:23 UTC (permalink / raw)
  To: tgraf; +Cc: jon.maloy, netdev, Paul.Gortmaker, tipc-discussion, davem
In-Reply-To: <1420529003-22244-1-git-send-email-ying.xue@windriver.com>

Should use future table size instead of old table size to decide
whether hash table is worth being expanded.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Cc: Thomas Graf <tgraf@suug.ch>
---
 lib/rhashtable.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 6eda22f..f5288b1 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -478,13 +478,14 @@ EXPORT_SYMBOL_GPL(rhashtable_shrink);
 static void rht_deferred_worker(struct work_struct *work)
 {
 	struct rhashtable *ht;
-	struct bucket_table *tbl;
+	struct bucket_table *tbl, *new_tbl;
 
 	ht = container_of(work, struct rhashtable, run_work.work);
 	mutex_lock(&ht->mutex);
 	tbl = rht_dereference(ht->tbl, ht);
+	new_tbl = rht_dereference(ht->future_tbl, ht);
 
-	if (ht->p.grow_decision && ht->p.grow_decision(ht, tbl->size))
+	if (ht->p.grow_decision && ht->p.grow_decision(ht, new_tbl->size))
 		rhashtable_expand(ht);
 	else if (ht->p.shrink_decision && ht->p.shrink_decision(ht, tbl->size))
 		rhashtable_shrink(ht);
-- 
1.7.9.5


------------------------------------------------------------------------------
Dive into the World of Parallel Programming! The Go Parallel Website,
sponsored by Intel and developed in partnership with Slashdot Media, is your
hub for all things parallel software development, from weekly thought
leadership blogs to news, videos, case studies, tutorials and more. Take a
look and join the conversation now. http://goparallel.sourceforge.net

^ permalink raw reply related

* [PATCH net-next 2/5] rhashtable: introduce rhashtable_wakeup_worker helper function
From: Ying Xue @ 2015-01-06  7:23 UTC (permalink / raw)
  To: tgraf; +Cc: jon.maloy, netdev, Paul.Gortmaker, tipc-discussion, davem
In-Reply-To: <1420529003-22244-1-git-send-email-ying.xue@windriver.com>

Introduce rhashtable_wakeup_worker() helper function to reduce
duplicated code where to wake up worker.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Cc: Thomas Graf <tgraf@suug.ch>
---
 lib/rhashtable.c |   20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index f2fdd7a..6eda22f 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -492,6 +492,17 @@ static void rht_deferred_worker(struct work_struct *work)
 	mutex_unlock(&ht->mutex);
 }
 
+static void rhashtable_wakeup_worker(struct rhashtable *ht)
+{
+	struct bucket_table *tbl = rht_dereference_rcu(ht->tbl, ht);
+	struct bucket_table *new_tbl = rht_dereference_rcu(ht->future_tbl, ht);
+
+	if (ht->tbl != ht->future_tbl &&
+	    ((ht->p.grow_decision && ht->p.grow_decision(ht, new_tbl->size)) ||
+	    (ht->p.shrink_decision && ht->p.shrink_decision(ht, tbl->size))))
+		schedule_delayed_work(&ht->run_work, 0);
+}
+
 /**
  * rhashtable_insert - insert object into hash hash table
  * @ht:		hash table
@@ -533,9 +544,7 @@ void rhashtable_insert(struct rhashtable *ht, struct rhash_head *obj)
 	atomic_inc(&ht->nelems);
 
 	/* Only grow the table if no resizing is currently in progress. */
-	if (ht->tbl != ht->future_tbl &&
-	    ht->p.grow_decision && ht->p.grow_decision(ht, tbl->size))
-		schedule_delayed_work(&ht->run_work, 0);
+	rhashtable_wakeup_worker(ht);
 
 	rcu_read_unlock();
 }
@@ -584,10 +593,7 @@ restart:
 
 		spin_unlock_bh(lock);
 
-		if (ht->tbl != ht->future_tbl &&
-		    ht->p.shrink_decision &&
-		    ht->p.shrink_decision(ht, tbl->size))
-			schedule_delayed_work(&ht->run_work, 0);
+		rhashtable_wakeup_worker(ht);
 
 		rcu_read_unlock();
 
-- 
1.7.9.5


------------------------------------------------------------------------------
Dive into the World of Parallel Programming! The Go Parallel Website,
sponsored by Intel and developed in partnership with Slashdot Media, is your
hub for all things parallel software development, from weekly thought
leadership blogs to news, videos, case studies, tutorials and more. Take a
look and join the conversation now. http://goparallel.sourceforge.net

^ permalink raw reply related

* [PATCH net-next 1/5] rhashtable: optimize rhashtable_lookup routine
From: Ying Xue @ 2015-01-06  7:23 UTC (permalink / raw)
  To: tgraf; +Cc: jon.maloy, netdev, Paul.Gortmaker, tipc-discussion, davem
In-Reply-To: <1420529003-22244-1-git-send-email-ying.xue@windriver.com>

Define an internal compare function and relevant compare argument,
and then make use of rhashtable_lookup_compare() to lookup key in
hash table, reducing duplicated code between rhashtable_lookup()
and rhashtable_lookup_compare().

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Cc: Thomas Graf <tgraf@suug.ch>
---
 lib/rhashtable.c |   41 ++++++++++++++++++-----------------------
 1 file changed, 18 insertions(+), 23 deletions(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index cbad192..f2fdd7a 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -612,6 +612,19 @@ restart:
 }
 EXPORT_SYMBOL_GPL(rhashtable_remove);
 
+struct rhashtable_compare_arg {
+	struct rhashtable *ht;
+	const void *key;
+};
+
+static bool rhashtable_compare(void *ptr, void *arg)
+{
+	struct rhashtable_compare_arg *x = arg;
+	struct rhashtable *ht = x->ht;
+
+	return !memcmp(ptr + ht->p.key_offset, x->key, ht->p.key_len);
+}
+
 /**
  * rhashtable_lookup - lookup key in hash table
  * @ht:		hash table
@@ -627,32 +640,14 @@ EXPORT_SYMBOL_GPL(rhashtable_remove);
  */
 void *rhashtable_lookup(struct rhashtable *ht, const void *key)
 {
-	const struct bucket_table *tbl, *old_tbl;
-	struct rhash_head *he;
-	u32 hash;
+	struct rhashtable_compare_arg arg = {
+		.ht = ht,
+		.key = key,
+	};
 
 	BUG_ON(!ht->p.key_len);
 
-	rcu_read_lock();
-	old_tbl = rht_dereference_rcu(ht->tbl, ht);
-	tbl = rht_dereference_rcu(ht->future_tbl, ht);
-	hash = key_hashfn(ht, key, ht->p.key_len);
-restart:
-	rht_for_each_rcu(he, tbl, rht_bucket_index(tbl, hash)) {
-		if (memcmp(rht_obj(ht, he) + ht->p.key_offset, key,
-			   ht->p.key_len))
-			continue;
-		rcu_read_unlock();
-		return rht_obj(ht, he);
-	}
-
-	if (unlikely(tbl != old_tbl)) {
-		tbl = old_tbl;
-		goto restart;
-	}
-
-	rcu_read_unlock();
-	return NULL;
+	return rhashtable_lookup_compare(ht, key, &rhashtable_compare, &arg);
 }
 EXPORT_SYMBOL_GPL(rhashtable_lookup);
 
-- 
1.7.9.5


------------------------------------------------------------------------------
Dive into the World of Parallel Programming! The Go Parallel Website,
sponsored by Intel and developed in partnership with Slashdot Media, is your
hub for all things parallel software development, from weekly thought
leadership blogs to news, videos, case studies, tutorials and more. Take a
look and join the conversation now. http://goparallel.sourceforge.net

^ permalink raw reply related

* [PATCH net-next 0/5] Involve rhashtable_lookup_insert routine
From: Ying Xue @ 2015-01-06  7:23 UTC (permalink / raw)
  To: tgraf; +Cc: jon.maloy, netdev, Paul.Gortmaker, tipc-discussion, davem

The series aims to involve rhashtable_lookup_insert() to guarantee
that the process of lookup and insertion of an object from/into hash
table is finished atomically, allowing rhashtable's users not to
introduce an extra lock during search and insertion. For example,
tipc socket is the first user benefiting from this enhancement. 

But before rhashtable_lookup_insert() is involved, the following
optimizations need to be first done:
- simplify rhashtable_lookup by reusing rhashtable_lookup_compare()
- introduce rhashtable_wakeup_worker() to further reduce duplicated
  code in patch #2
- fix an issue in patch #3
- involve rhashtable_lookup_insert(). But in this version, we firstly
  use rhashtable_lookup() to search duplicate key in both old and new
  bucket table; secondly introduce another __rhashtable_insert() helper
  function to reduce the duplicated code between rhashtable_insert()
  and rhashtable_lookup_insert().
- add patch #5 into the series as it depends on above patches. But in
  this version, no change is made comparing with its previous version.

Ying Xue (5):
  rhashtable: optimize rhashtable_lookup routine
  rhashtable: introduce rhashtable_wakeup_worker helper function
  rhashtable: use future table size to make expansion decision
  rhashtable: involve rhashtable_lookup_insert routine
  tipc: convert tipc reference table to use generic rhashtable

 include/linux/rhashtable.h |    1 +
 lib/rhashtable.c           |  162 ++++++++++-----
 net/tipc/Kconfig           |   12 --
 net/tipc/config.c          |   24 +--
 net/tipc/core.c            |   10 +-
 net/tipc/core.h            |    3 -
 net/tipc/socket.c          |  480 ++++++++++++++++----------------------------
 net/tipc/socket.h          |    4 +-
 8 files changed, 296 insertions(+), 400 deletions(-)

-- 
1.7.9.5


------------------------------------------------------------------------------
Dive into the World of Parallel Programming! The Go Parallel Website,
sponsored by Intel and developed in partnership with Slashdot Media, is your
hub for all things parallel software development, from weekly thought
leadership blogs to news, videos, case studies, tutorials and more. Take a
look and join the conversation now. http://goparallel.sourceforge.net

^ permalink raw reply

* Re: [PATCH] GMAC: fix simple_return.cocci warnings
From: Roger @ 2015-01-06  7:13 UTC (permalink / raw)
  To: David Miller, joe
  Cc: fengguang.wu, kbuild-all, peppe.cavallaro, netdev, linux-kernel
In-Reply-To: <20150104.222022.55122203599464788.davem@davemloft.net>

Hi! David

What should I do now?

On 2015/1/5 11:20, David Miller wrote:
> From: Joe Perches <joe@perches.com>
> Date: Fri, 02 Jan 2015 16:46:45 -0800
>
>> On Sat, 2015-01-03 at 08:25 +0800, kbuild test robot wrote:
>>> drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c:425:1-4: WARNING: end returns can be simpified
>>>
>>>   Simplify a trivial if-return sequence.  Possibly combine with a
>>>   preceding function call.
>>> Generated by: scripts/coccinelle/misc/simple_return.cocci
>>>
>>> CC: Roger Chen <roger.chen@rock-chips.com>
>>> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
>>> ---
>>>
>>>   dwmac-rk.c |    6 +-----
>>>   1 file changed, 1 insertion(+), 5 deletions(-)
>>>
>>> --- a/drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c
>>> +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c
>>> @@ -422,11 +422,7 @@ static int rk_gmac_init(struct platform_
>>>   	if (ret)
>>>   		return ret;
>>>   
>>> -	ret = gmac_clk_enable(bsp_priv, true);
>>> -	if (ret)
>>> -		return ret;
>>> -
>>> -	return 0;
>>> +	return gmac_clk_enable(bsp_priv, true);
>> I think this change is not particularly better.
>>
>> When the pattern is multiply repeated like:
>   ...
>> I think it's better to not change the last
>> test in the sequence just to minimize overall
>> line count.
> I think it's a wash and that both ways are about the same to me.
>
> I won't apply this, sorry.
>
>
>

^ permalink raw reply

* Re: [net-next PATCH v1 04/11] rocker: add pipeline model for rocker switch
From: Scott Feldman @ 2015-01-06  7:01 UTC (permalink / raw)
  To: John Fastabend
  Cc: Thomas Graf, Jiří Pírko, Jamal Hadi Salim,
	simon.horman@netronome.com, Netdev, David S. Miller,
	Andy Gospodarek
In-Reply-To: <20141231194709.31070.16657.stgit@nitbit.x32>

On Wed, Dec 31, 2014 at 11:47 AM, John Fastabend
<john.fastabend@gmail.com> wrote:
> This adds rocker support for the net_flow_get_* operations. With this
> we can interrogate rocker.
>
> Here we see that for static configurations enabling the get operations
> is simply a matter of defining a pipeline model and returning the
> structures for the core infrastructure to encapsulate into netlink
> messages.
>
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> ---
>  drivers/net/ethernet/rocker/rocker.c          |   35 +
>  drivers/net/ethernet/rocker/rocker_pipeline.h |  673 +++++++++++++++++++++++++
>  2 files changed, 708 insertions(+)
>  create mode 100644 drivers/net/ethernet/rocker/rocker_pipeline.h
>
> diff --git a/drivers/net/ethernet/rocker/rocker.c b/drivers/net/ethernet/rocker/rocker.c
> index fded127..4c6787a 100644
> --- a/drivers/net/ethernet/rocker/rocker.c
> +++ b/drivers/net/ethernet/rocker/rocker.c
> @@ -36,6 +36,7 @@
>  #include <generated/utsrelease.h>
>
>  #include "rocker.h"
> +#include "rocker_pipeline.h"
>
>  static const char rocker_driver_name[] = "rocker";
>
> @@ -3780,6 +3781,33 @@ static int rocker_port_switch_port_stp_update(struct net_device *dev, u8 state)
>         return rocker_port_stp_update(rocker_port, state);
>  }
>
> +#ifdef CONFIG_NET_FLOW_TABLES

Can this #ifdef test be moved out of driver?  The if_flow core code
can stub out operations if CONFIG_NET_FLOW_TABLES isn't defined.

> +static struct net_flow_table **rocker_get_tables(struct net_device *d)
> +{
> +       return rocker_table_list;
> +}
> +
> +static struct net_flow_header **rocker_get_headers(struct net_device *d)
> +{
> +       return rocker_header_list;
> +}
> +
> +static struct net_flow_action **rocker_get_actions(struct net_device *d)
> +{
> +       return rocker_action_list;
> +}
> +
> +static struct net_flow_tbl_node **rocker_get_tgraph(struct net_device *d)
> +{
> +       return rocker_table_nodes;
> +}
> +
> +static struct net_flow_hdr_node **rocker_get_hgraph(struct net_device *d)
> +{
> +       return rocker_header_nodes;
> +}
> +#endif
> +
>  static const struct net_device_ops rocker_port_netdev_ops = {
>         .ndo_open                       = rocker_port_open,
>         .ndo_stop                       = rocker_port_stop,
> @@ -3794,6 +3822,13 @@ static const struct net_device_ops rocker_port_netdev_ops = {
>         .ndo_bridge_getlink             = rocker_port_bridge_getlink,
>         .ndo_switch_parent_id_get       = rocker_port_switch_parent_id_get,
>         .ndo_switch_port_stp_update     = rocker_port_switch_port_stp_update,
> +#ifdef CONFIG_NET_FLOW_TABLES

same comment here

> +       .ndo_flow_get_tables            = rocker_get_tables,
> +       .ndo_flow_get_headers           = rocker_get_headers,
> +       .ndo_flow_get_actions           = rocker_get_actions,
> +       .ndo_flow_get_tbl_graph         = rocker_get_tgraph,
> +       .ndo_flow_get_hdr_graph         = rocker_get_hgraph,
> +#endif
>  };
>
>  /********************
> diff --git a/drivers/net/ethernet/rocker/rocker_pipeline.h b/drivers/net/ethernet/rocker/rocker_pipeline.h
> new file mode 100644
> index 0000000..9544339
> --- /dev/null
> +++ b/drivers/net/ethernet/rocker/rocker_pipeline.h

Add standard header info...copyright/license.

> @@ -0,0 +1,673 @@
> +#ifndef _MY_PIPELINE_H_
> +#define _MY_PIPELINE_H_

_ROCKER_PIPELINE_H_

> +
> +#include <linux/if_flow.h>
> +
> +/* header definition */
> +#define HEADER_ETHERNET_SRC_MAC 1
> +#define HEADER_ETHERNET_DST_MAC 2
> +#define HEADER_ETHERNET_ETHERTYPE 3

Use enum?

> +struct net_flow_field ethernet_fields[3] = {

ethernet_fields[] = ...   // let compiler size it

> +       { .name = "src_mac", .uid = HEADER_ETHERNET_SRC_MAC, .bitwidth = 48},
> +       { .name = "dst_mac", .uid = HEADER_ETHERNET_DST_MAC, .bitwidth = 48},
> +       { .name = "ethertype",
> +         .uid = HEADER_ETHERNET_ETHERTYPE,
> +         .bitwidth = 16},
> +};
> +
> +#define HEADER_ETHERNET 1
> +struct net_flow_header ethernet = {
> +       .name = "ethernet",
> +       .uid = HEADER_ETHERNET,
> +       .field_sz = 3,

ARRAY_SIZE()?

> +       .fields = ethernet_fields,
> +};
> +
> +#define HEADER_VLAN_PCP 1
> +#define HEADER_VLAN_CFI 2
> +#define HEADER_VLAN_VID 3
> +#define HEADER_VLAN_ETHERTYPE 4
> +struct net_flow_field vlan_fields[4] = {

[] = ...

> +       { .name = "pcp", .uid = HEADER_VLAN_PCP, .bitwidth = 3,},
> +       { .name = "cfi", .uid = HEADER_VLAN_CFI, .bitwidth = 1,},
> +       { .name = "vid", .uid = HEADER_VLAN_VID, .bitwidth = 12,},
> +       { .name = "ethertype", .uid = HEADER_VLAN_ETHERTYPE, .bitwidth = 16,},
> +};
> +
> +#define HEADER_VLAN 2
> +struct net_flow_header vlan = {
> +       .name = "vlan",
> +       .uid = HEADER_VLAN,
> +       .field_sz = 4,

ARRAY_SIZE()

> +       .fields = vlan_fields,
> +};
> +
> +#define HEADER_IPV4_VERSION 1
> +#define HEADER_IPV4_IHL 2
> +#define HEADER_IPV4_DSCP 3
> +#define HEADER_IPV4_ECN 4
> +#define HEADER_IPV4_LENGTH 5
> +#define HEADER_IPV4_IDENTIFICATION 6
> +#define HEADER_IPV4_FLAGS 7
> +#define HEADER_IPV4_FRAGMENT_OFFSET 8
> +#define HEADER_IPV4_TTL 9
> +#define HEADER_IPV4_PROTOCOL 10
> +#define HEADER_IPV4_CSUM 11
> +#define HEADER_IPV4_SRC_IP 12
> +#define HEADER_IPV4_DST_IP 13
> +#define HEADER_IPV4_OPTIONS 14
> +struct net_flow_field ipv4_fields[14] = {
> +       { .name = "version",
> +         .uid = HEADER_IPV4_VERSION,
> +         .bitwidth = 4,},
> +       { .name = "ihl",
> +         .uid = HEADER_IPV4_IHL,
> +         .bitwidth = 4,},
> +       { .name = "dscp",
> +         .uid = HEADER_IPV4_DSCP,
> +         .bitwidth = 6,},
> +       { .name = "ecn",
> +         .uid = HEADER_IPV4_ECN,
> +         .bitwidth = 2,},
> +       { .name = "length",
> +         .uid = HEADER_IPV4_LENGTH,
> +         .bitwidth = 8,},
> +       { .name = "identification",
> +         .uid = HEADER_IPV4_IDENTIFICATION,
> +         .bitwidth = 8,},
> +       { .name = "flags",
> +         .uid = HEADER_IPV4_FLAGS,
> +         .bitwidth = 3,},
> +       { .name = "fragment_offset",
> +         .uid = HEADER_IPV4_FRAGMENT_OFFSET,
> +         .bitwidth = 13,},
> +       { .name = "ttl",
> +         .uid = HEADER_IPV4_TTL,
> +         .bitwidth = 1,},
> +       { .name = "protocol",
> +         .uid = HEADER_IPV4_PROTOCOL,
> +         .bitwidth = 8,},
> +       { .name = "csum",
> +         .uid = HEADER_IPV4_CSUM,
> +         .bitwidth = 8,},
> +       { .name = "src_ip",
> +         .uid = HEADER_IPV4_SRC_IP,
> +         .bitwidth = 32,},
> +       { .name = "dst_ip",
> +         .uid = HEADER_IPV4_DST_IP,
> +         .bitwidth = 32,},
> +       { .name = "options",
> +         .uid = HEADER_IPV4_OPTIONS,
> +         .bitwidth = -1,},
> +};
> +
> +#define HEADER_IPV4 3
> +struct net_flow_header ipv4 = {
> +       .name = "ipv4",
> +       .uid = HEADER_IPV4,
> +       .field_sz = 14,
> +       .fields = ipv4_fields,
> +};
> +
> +#define HEADER_METADATA_IN_LPORT 1
> +#define HEADER_METADATA_GOTO_TBL 2
> +#define HEADER_METADATA_GROUP_ID 3
> +struct net_flow_field metadata_fields[3] = {
> +       { .name = "in_lport",
> +         .uid = HEADER_METADATA_IN_LPORT,
> +         .bitwidth = 32,},
> +       { .name = "goto_tbl",
> +         .uid = HEADER_METADATA_GOTO_TBL,
> +         .bitwidth = 16,},
> +       { .name = "group_id",
> +         .uid = HEADER_METADATA_GROUP_ID,
> +         .bitwidth = 32,},
> +};
> +
> +#define HEADER_METADATA 4
> +struct net_flow_header metadata_t = {
> +       .name = "metadata_t",
> +       .uid = HEADER_METADATA,
> +       .field_sz = 3,
> +       .fields = metadata_fields,
> +};
> +
> +struct net_flow_header null_hdr = {.name = "",
> +                                  .uid = 0,
> +                                  .field_sz = 0,
> +                                  .fields = NULL};
> +
> +struct net_flow_header *rocker_header_list[8] = {

[] = ...

Not sure where the [8] comes from

> +       &ethernet,
> +       &vlan,
> +       &ipv4,
> +       &metadata_t,
> +       &null_hdr,

Seems just setting last entry to NULL would be cleaner:

struct foo *foo[] = {
    &one,
    &two,
    &three,
    NULL,
};

> +};
> +
> +/* action definitions */
> +struct net_flow_action_arg null_args[1] = {
> +       {
> +               .name = "",
> +               .type = NET_FLOW_ACTION_ARG_TYPE_NULL,
> +       },
> +};
> +
> +struct net_flow_action null_action = {
> +       .name = "", .uid = 0, .args = NULL,
> +};
> +
> +struct net_flow_action_arg set_goto_table_args[2] = {
> +       {
> +               .name = "table",
> +               .type = NET_FLOW_ACTION_ARG_TYPE_U16,
> +               .value_u16 = 0,
> +       },
> +       {
> +               .name = "",
> +               .type = NET_FLOW_ACTION_ARG_TYPE_NULL,
> +       },
> +};
> +
> +#define ACTION_SET_GOTO_TABLE 1
> +struct net_flow_action set_goto_table = {
> +       .name = "set_goto_table",
> +       .uid = ACTION_SET_GOTO_TABLE,
> +       .args = set_goto_table_args,
> +};
> +
> +struct net_flow_action_arg set_vlan_id_args[2] = {
> +       {
> +               .name = "vlan_id",
> +               .type = NET_FLOW_ACTION_ARG_TYPE_U16,
> +               .value_u16 = 0,
> +       },
> +       {
> +               .name = "",
> +               .type = NET_FLOW_ACTION_ARG_TYPE_NULL,
> +       },
> +};
> +
> +#define ACTION_SET_VLAN_ID 2
> +struct net_flow_action set_vlan_id = {
> +       .name = "set_vlan_id",
> +       .uid = ACTION_SET_VLAN_ID,
> +       .args = set_vlan_id_args,
> +};
> +
> +/* TBD: what is the untagged bool about in vlan table */
> +#define ACTION_COPY_TO_CPU 3
> +struct net_flow_action copy_to_cpu = {
> +       .name = "copy_to_cpu",
> +       .uid = ACTION_COPY_TO_CPU,
> +       .args = null_args,
> +};
> +
> +struct net_flow_action_arg set_group_id_args[2] = {
> +       {
> +               .name = "group_id",
> +               .type = NET_FLOW_ACTION_ARG_TYPE_U32,
> +               .value_u32 = 0,
> +       },
> +       {
> +               .name = "",
> +               .type = NET_FLOW_ACTION_ARG_TYPE_NULL,
> +       },
> +};
> +
> +#define ACTION_SET_GROUP_ID 4
> +struct net_flow_action set_group_id = {
> +       .name = "set_group_id",
> +       .uid = ACTION_SET_GROUP_ID,
> +       .args = set_group_id_args,
> +};
> +
> +#define ACTION_POP_VLAN 5
> +struct net_flow_action pop_vlan = {
> +       .name = "pop_vlan",
> +       .uid = ACTION_POP_VLAN,
> +       .args = null_args,
> +};
> +
> +struct net_flow_action_arg set_eth_src_args[2] = {
> +       {
> +               .name = "eth_src",
> +               .type = NET_FLOW_ACTION_ARG_TYPE_U64,
> +               .value_u64 = 0,
> +       },
> +       {
> +               .name = "",
> +               .type = NET_FLOW_ACTION_ARG_TYPE_NULL,
> +       },
> +};
> +
> +#define ACTION_SET_ETH_SRC 6
> +struct net_flow_action set_eth_src = {
> +       .name = "set_eth_src",
> +       .uid = ACTION_SET_ETH_SRC,
> +       .args = set_eth_src_args,
> +};
> +
> +struct net_flow_action_arg set_eth_dst_args[2] = {
> +       {
> +               .name = "eth_dst",
> +               .type = NET_FLOW_ACTION_ARG_TYPE_U64,
> +               .value_u64 = 0,
> +       },
> +       {
> +               .name = "",
> +               .type = NET_FLOW_ACTION_ARG_TYPE_NULL,
> +       },
> +};
> +
> +#define ACTION_SET_ETH_DST 7
> +struct net_flow_action set_eth_dst = {
> +       .name = "set_eth_dst",
> +       .uid = ACTION_SET_ETH_DST,
> +       .args = set_eth_dst_args,
> +};
> +
> +struct net_flow_action_arg set_out_port_args[2] = {
> +       {
> +               .name = "set_out_port",
> +               .type = NET_FLOW_ACTION_ARG_TYPE_U32,
> +               .value_u32 = 0,
> +       },
> +       {
> +               .name = "",
> +               .type = NET_FLOW_ACTION_ARG_TYPE_NULL,
> +       },
> +};
> +
> +#define ACTION_SET_OUT_PORT 8
> +struct net_flow_action set_out_port = {
> +       .name = "set_out_port",
> +       .uid = ACTION_SET_OUT_PORT,
> +       .args = set_out_port_args,
> +};
> +
> +struct net_flow_action *rocker_action_list[8] = {
> +       &set_goto_table,
> +       &set_vlan_id,
> +       &copy_to_cpu,
> +       &set_group_id,
> +       &pop_vlan,
> +       &set_eth_src,
> +       &set_eth_dst,
> +       &null_action,
> +};
> +
> +/* headers graph */
> +#define HEADER_INSTANCE_ETHERNET 1
> +#define HEADER_INSTANCE_VLAN_OUTER 2
> +#define HEADER_INSTANCE_IPV4 3
> +#define HEADER_INSTANCE_IN_LPORT 4
> +#define HEADER_INSTANCE_GOTO_TABLE 5
> +#define HEADER_INSTANCE_GROUP_ID 6
> +
> +struct net_flow_jump_table parse_ethernet[3] = {
> +       {
> +               .field = {
> +                  .header = HEADER_ETHERNET,
> +                  .field = HEADER_ETHERNET_ETHERTYPE,
> +                  .type = NET_FLOW_FIELD_REF_ATTR_TYPE_U16,
> +                  .value_u16 = 0x0800,

How is htons/ntohs conversions happening here?

Since these are network header fields, seems you want htons(0x0800).

> +               },
> +               .node = HEADER_INSTANCE_IPV4,
> +       },
> +       {
> +               .field = {
> +                  .header = HEADER_ETHERNET,
> +                  .field = HEADER_ETHERNET_ETHERTYPE,
> +                  .type = NET_FLOW_FIELD_REF_ATTR_TYPE_U16,
> +                  .value_u16 = 0x8100,
> +               },
> +               .node = HEADER_INSTANCE_VLAN_OUTER,
> +       },
> +       {
> +               .field = {0},
> +               .node = 0,
> +       },

just use NULL,

> +};
> +
> +int ethernet_headers[2] = {HEADER_ETHERNET, 0};
> +
> +struct net_flow_hdr_node ethernet_header_node = {
> +       .name = "ethernet",
> +       .uid = HEADER_INSTANCE_ETHERNET,
> +       .hdrs = ethernet_headers,
> +       .jump = parse_ethernet,
> +};
> +
> +struct net_flow_jump_table parse_vlan[2] = {
> +       {
> +               .field = {
> +                  .header = HEADER_VLAN,
> +                  .field = HEADER_VLAN_ETHERTYPE,
> +                  .type = NET_FLOW_FIELD_REF_ATTR_TYPE_U16,
> +                  .value_u16 = 0x0800,
> +               },
> +               .node = HEADER_INSTANCE_IPV4,
> +       },
> +       {
> +               .field = {0},
> +               .node = 0,
> +       },
> +};
> +
> +int vlan_headers[2] = {HEADER_VLAN, 0};
> +struct net_flow_hdr_node vlan_header_node = {
> +       .name = "vlan",
> +       .uid = HEADER_INSTANCE_VLAN_OUTER,
> +       .hdrs = vlan_headers,
> +       .jump = parse_vlan,
> +};
> +
> +struct net_flow_jump_table terminal_headers[2] = {
> +       {
> +               .field = {0},
> +               .node = NET_FLOW_JUMP_TABLE_DONE,
> +       },
> +       {
> +               .field = {0},
> +               .node = 0,
> +       },
> +};
> +
> +int ipv4_headers[2] = {HEADER_IPV4, 0};
> +struct net_flow_hdr_node ipv4_header_node = {
> +       .name = "ipv4",
> +       .uid = HEADER_INSTANCE_IPV4,
> +       .hdrs = ipv4_headers,
> +       .jump = terminal_headers,
> +};
> +
> +int metadata_headers[2] = {HEADER_METADATA, 0};
> +struct net_flow_hdr_node in_lport_header_node = {
> +       .name = "in_lport",
> +       .uid = HEADER_INSTANCE_IN_LPORT,
> +       .hdrs = metadata_headers,
> +       .jump = terminal_headers,
> +};
> +
> +struct net_flow_hdr_node goto_table_header_node = {
> +       .name = "goto_table",
> +       .uid = HEADER_INSTANCE_GOTO_TABLE,
> +       .hdrs = metadata_headers,
> +       .jump = terminal_headers,
> +};
> +
> +struct net_flow_hdr_node group_id_header_node = {
> +       .name = "group_id",
> +       .uid = HEADER_INSTANCE_GROUP_ID,
> +       .hdrs = metadata_headers,
> +       .jump = terminal_headers,
> +};
> +
> +struct net_flow_hdr_node null_header = {.name = "", .uid = 0,};
> +
> +struct net_flow_hdr_node *rocker_header_nodes[7] = {
> +       &ethernet_header_node,
> +       &vlan_header_node,
> +       &ipv4_header_node,
> +       &in_lport_header_node,
> +       &goto_table_header_node,
> +       &group_id_header_node,
> +       &null_header,
> +};
> +
> +/* table definition */
> +struct net_flow_field_ref matches_ig_port[2] = {
> +       { .instance = HEADER_INSTANCE_IN_LPORT,
> +         .header = HEADER_METADATA,
> +         .field = HEADER_METADATA_IN_LPORT,
> +         .mask_type = NET_FLOW_MASK_TYPE_LPM},

Need other mask type, not LPM.


> +struct net_flow_table *rocker_table_list[7] = {
> +       &ingress_port_table,
> +       &vlan_table,
> +       &term_mac_table,
> +       &ucast_routing_table,
> +       &bridge_table,
> +       &acl_table,
> +       &null_table,
> +};

cool stuff

> +
> +/* Define the table graph layout */
> +struct net_flow_jump_table table_node_ig_port_next[2] = {
> +       { .field = {0}, .node = ROCKER_FLOW_TABLE_ID_VLAN},
> +       { .field = {0}, .node = 0},
> +};
> +
> +struct net_flow_tbl_node table_node_ingress_port = {
> +       .uid = ROCKER_FLOW_TABLE_ID_INGRESS_PORT,
> +       .jump = table_node_ig_port_next};
> +
> +struct net_flow_jump_table table_node_vlan_next[2] = {
> +       { .field = {0}, .node = ROCKER_FLOW_TABLE_ID_TERMINATION_MAC},
> +       { .field = {0}, .node = 0},
> +};
> +
> +struct net_flow_tbl_node table_node_vlan = {
> +       .uid = ROCKER_FLOW_TABLE_ID_VLAN,
> +       .jump = table_node_vlan_next};
> +
> +struct net_flow_jump_table table_node_term_mac_next[2] = {
> +       { .field = {0}, .node = ROCKER_FLOW_TABLE_ID_UNICAST_ROUTING},
> +       { .field = {0}, .node = 0},
> +};
> +
> +struct net_flow_tbl_node table_node_term_mac = {
> +       .uid = ROCKER_FLOW_TABLE_ID_TERMINATION_MAC,
> +       .jump = table_node_term_mac_next};
> +
> +struct net_flow_jump_table table_node_bridge_next[2] = {
> +       { .field = {0}, .node = ROCKER_FLOW_TABLE_ID_ACL_POLICY},
> +       { .field = {0}, .node = 0},
> +};
> +
> +struct net_flow_tbl_node table_node_bridge = {
> +       .uid = ROCKER_FLOW_TABLE_ID_BRIDGING,
> +       .jump = table_node_bridge_next};
> +
> +struct net_flow_jump_table table_node_ucast_routing_next[2] = {
> +       { .field = {0}, .node = ROCKER_FLOW_TABLE_ID_ACL_POLICY},
> +       { .field = {0}, .node = 0},
> +};
> +
> +struct net_flow_tbl_node table_node_ucast_routing = {
> +       .uid = ROCKER_FLOW_TABLE_ID_UNICAST_ROUTING,
> +       .jump = table_node_ucast_routing_next};
> +
> +struct net_flow_jump_table table_node_acl_next[1] = {
> +       { .field = {0}, .node = 0},
> +};
> +
> +struct net_flow_tbl_node table_node_acl = {
> +       .uid = ROCKER_FLOW_TABLE_ID_ACL_POLICY,
> +       .jump = table_node_acl_next};
> +
> +struct net_flow_tbl_node table_node_nil = {.uid = 0, .jump = NULL};
> +
> +struct net_flow_tbl_node *rocker_table_nodes[7] = {
> +       &table_node_ingress_port,
> +       &table_node_vlan,
> +       &table_node_term_mac,
> +       &table_node_ucast_routing,
> +       &table_node_bridge,
> +       &table_node_acl,
> +       &table_node_nil,
> +};

Cool...getting tired but will review this again in v2

> +#endif /*_MY_PIPELINE_H*/

ROCKER

>

^ permalink raw reply

* Re: [PATCH net-next v3 0/5]: ixgbevf: Allow querying VFs RSS indirection table and key
From: Gleb Natapov @ 2015-01-06  6:55 UTC (permalink / raw)
  To: Greg Rose; +Cc: Vlad Zolotarov, netdev, avi, jeffrey.t.kirsher
In-Reply-To: <CALgkqUojkfTwhsoAnPZdJv-oMg7PMB5m8Q2=k5=QqXKxSJgq7w@mail.gmail.com>

On Mon, Jan 05, 2015 at 03:54:52PM -0800, Greg Rose wrote:
> On Mon, Jan 5, 2015 at 6:15 AM, Vlad Zolotarov
> <vladz@cloudius-systems.com> wrote:
> > Add the ethtool ops to VF driver to allow querying the RSS indirection table
> > and RSS Random Key.
> >
> >  - PF driver: Add new VF-PF channel commands.
> >  - VF driver: Utilize these new commands and add the corresponding
> >               ethtool callbacks.
> >
> > New in v3:
> >    - Added a missing support for x550 devices.
> >    - Mask the indirection table values according to PSRTYPE[n].RQPL.
> >    - Minimized the number of added VF-PF commands.
> >
> > New in v2:
> >    - Added a detailed description to patches 4 and 5.
> >
> > New in v1 (compared to RFC):
> >    - Use "if-else" statement instead of a "switch-case" for a single option case.
> >      More specifically: in cases where the newly added API version is the only one
> >      allowed. We may consider using a "switch-case" back again when the list of
> >      allowed API versions in these specific places grows up.
> >
> > Vlad Zolotarov (5):
> >   ixgbe: Add a RETA query command to VF-PF channel API
> >   ixgbevf: Add a RETA query code
> >   ixgbe: Add GET_RSS_KEY command to VF-PF channel commands set
> >   ixgbevf: Add RSS Key query code
> >   ixgbevf: Add the appropriate ethtool ops to query RSS indirection
> >     table and key
> >
> >  drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h      |  10 ++
> >  drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c    |  91 +++++++++++++++
> >  drivers/net/ethernet/intel/ixgbevf/ethtool.c      |  43 +++++++
> >  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |   4 +-
> >  drivers/net/ethernet/intel/ixgbevf/mbx.h          |  10 ++
> >  drivers/net/ethernet/intel/ixgbevf/vf.c           | 132 ++++++++++++++++++++++
> >  drivers/net/ethernet/intel/ixgbevf/vf.h           |   2 +
> >  7 files changed, 291 insertions(+), 1 deletion(-)
> 
> I've given this code a review and I don't see a way to
> set a policy in the PF driver as to whether this request should be
> allowed or not.  We cannot enable this query by default - it is a
> security risk. To make this acceptable you need to do a
> couple of things.
> 
Can you please elaborate on the security risk this information poses?
Is toeplitz hash function cryptographically strong enough so that VF
cannot reconstruct the hash key from hash result provided in packet
descriptor? The abstract of this paper [1] claims it is not, but I do
not have access to the full article unfortunately hence the question.

[1] http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=5503170&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D5503170

--
			Gleb.

^ permalink raw reply

* Re: [net-next PATCH v1 01/11] net: flow_table: create interface for hw match/action tables
From: Scott Feldman @ 2015-01-06  6:40 UTC (permalink / raw)
  To: John Fastabend
  Cc: Thomas Graf, Jiří Pírko, Jamal Hadi Salim,
	simon.horman@netronome.com, Netdev, David S. Miller,
	Andy Gospodarek
In-Reply-To: <54AB7AD3.7050704@gmail.com>

On Mon, Jan 5, 2015 at 10:04 PM, John Fastabend
<john.fastabend@gmail.com> wrote:

>>> + * @uid unique identifier for table
>>> + * @source uid of parent table
>>
>>
>> Is parent table the table previous in the pipeline?  If so, what if
>> you can get to table from N different parent tables, what goes in
>> source?
>
>
> No, you can get the layout of tables from the table graph ops.
>
> Source is used when a single tcam or other implementation mechanism
> is sliced into a set of tables. The current rocker world doesn't use
> this very much at the moment because its static and I just assumed
> every table came out of the same virtual hardware namespace.
>
> A simple example world would be to come up with a set of large virtual
> TCAMs. Any given TCAM maybe sliced into a set of tables. Users may
> organize these either via some out of band configuration at init or
> power on time. In the rocker case we could specify this when we load
> qemu. For now it is just informational. But if we start allowing users
> to create delete tables at runtime it is important to "know" where the
> slices are being allocated/free'd from. The source gives you this
> information.
>
> The hardware devices I'm working on have multiple sources we can
> allocate/free tables from. The source values would provide a way to
> track down which tables are in which hardware namespaces.
>
> Hope that helps?

Got it, thanks.

Can source be encoded in tbl_id?


>> hdr or header?  pick one, probably hdr.
>
>
> hdr is shorter and doesn't lose any clarity IMO I'll use net_flow_hdr
> and net_flow_hdr_node
>
>>
>>> +       struct net_flow_tbl_node **(*ndo_flow_get_tbl_graph)(struct
>>> net_device *dev);
>>
>>
>> move this up next to get_tables
>
>
> sure also what do you think tbl instead of table.

+1 for tbl.

^ permalink raw reply

* RE: [PATCH v4] can: Convert to runtime_pm
From: Appana Durga Kedareswara Rao @ 2015-01-06  6:23 UTC (permalink / raw)
  To: Soren Brinkmann, mkl@pengutronix.de
  Cc: wg@grandegger.com, mkl@pengutronix.de, Michal Simek,
	grant.likely@linaro.org, linux-can@vger.kernel.org,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20141223224308.GC4611@xsjandreislx>

Hi Marc,

        Could you please  provide your feedback on this patch ?

@Soren : Will update the driver with your comments during next version of the patch.

Regards,
Kedar.

-----Original Message-----
From: Sören Brinkmann [mailto:soren.brinkmann@xilinx.com]
Sent: Wednesday, December 24, 2014 4:13 AM
To: Appana Durga Kedareswara Rao
Cc: wg@grandegger.com; mkl@pengutronix.de; Michal Simek; grant.likely@linaro.org; linux-can@vger.kernel.org; netdev@vger.kernel.org; linux-kernel@vger.kernel.org; Appana Durga Kedareswara Rao
Subject: Re: [PATCH v4] can: Convert to runtime_pm

On Tue, 2014-12-23 at 05:55PM +0530, Kedareswara rao Appana wrote:
> Instead of enabling/disabling clocks at several locations in the
> driver, use the runtime_pm framework. This consolidates the actions
> for runtime PM in the appropriate callbacks and makes the driver more
> readable and mantainable.
>
> Signed-off-by: Soren Brinkmann <soren.brinkmann@xilinx.com>
> Signed-off-by: Kedareswara rao Appana <appanad@xilinx.com>
> ---
> Chnages for v4:
>  - Updated with the review comments.
> Changes for v3:
>   - Converted the driver to use runtime_pm.
> Changes for v2:
>   - Removed the struct platform_device* from suspend/resume
>     as suggest by Lothar.
>
>  drivers/net/can/xilinx_can.c |  123
> +++++++++++++++++++++++++-----------------
>  1 files changed, 74 insertions(+), 49 deletions(-)
>
> diff --git a/drivers/net/can/xilinx_can.c
> b/drivers/net/can/xilinx_can.c index 6c67643..c71f683 100644
> --- a/drivers/net/can/xilinx_can.c
> +++ b/drivers/net/can/xilinx_can.c
> @@ -32,6 +32,7 @@
>  #include <linux/can/dev.h>
>  #include <linux/can/error.h>
>  #include <linux/can/led.h>
> +#include <linux/pm_runtime.h>
>
>  #define DRIVER_NAME  "xilinx_can"
>
> @@ -138,7 +139,7 @@ struct xcan_priv {
>       u32 (*read_reg)(const struct xcan_priv *priv, enum xcan_reg reg);
>       void (*write_reg)(const struct xcan_priv *priv, enum xcan_reg reg,
>                       u32 val);
> -     struct net_device *dev;
> +     struct device *dev;
>       void __iomem *reg_base;
>       unsigned long irq_flags;
>       struct clk *bus_clk;
> @@ -842,6 +843,13 @@ static int xcan_open(struct net_device *ndev)
>       struct xcan_priv *priv = netdev_priv(ndev);
>       int ret;
>
> +     ret = pm_runtime_get_sync(priv->dev);
> +     if (ret < 0) {
> +             netdev_err(ndev, "%s: pm_runtime_get failed\r(%d)\n\r",

Does this create the intended output? I haven't seen '\r' anywhere else.
Shouldn't this simply be:
        netdev_err(ndev, "%s: pm_runtime_get failed (%d)\n",

[...]
> @@ -934,27 +927,20 @@ static int xcan_get_berr_counter(const struct net_device *ndev,
>       struct xcan_priv *priv = netdev_priv(ndev);
>       int ret;
>
> -     ret = clk_prepare_enable(priv->can_clk);
> -     if (ret)
> -             goto err;
> -
> -     ret = clk_prepare_enable(priv->bus_clk);
> -     if (ret)
> -             goto err_clk;
> +     ret = pm_runtime_get_sync(priv->dev);
> +     if (ret < 0) {
> +             netdev_err(ndev, "%s: pm_runtime_get failed\r(%d)\n\r",

ditto

        Sören


This email and any attachments are intended for the sole use of the named recipient(s) and contain(s) confidential information that may be proprietary, privileged or copyrighted under applicable law. If you are not the intended recipient, do not read, copy, or forward this email message or any attachments. Delete this email message and any attachments immediately.


^ permalink raw reply

* Re: [net-next PATCH v1 02/11] net: flow_table: add flow, delete flow
From: Scott Feldman @ 2015-01-06  6:19 UTC (permalink / raw)
  To: John Fastabend
  Cc: Thomas Graf, Jiří Pírko, Jamal Hadi Salim,
	simon.horman@netronome.com, Netdev, David S. Miller,
	Andy Gospodarek
In-Reply-To: <20141231194615.31070.18038.stgit@nitbit.x32>

On Wed, Dec 31, 2014 at 11:46 AM, John Fastabend
<john.fastabend@gmail.com> wrote:
> Now that the device capabilities are exposed we can add support to
> add and delete flows from the tables.
>
> The two operations are
>
> table_set_flows :
>
>   The set flow operations is used to program a set of flows into a
>   hardware device table. The message is consumed via netlink encoded
>   message which is then decoded into a null terminated  array of
>   flow entry structures. A flow entry structure is defined as
>
>      struct net_flow_flow {
>                           int table_id;
>                           int uid;
>                           int priority;
>                           struct net_flow_field_ref *matches;
>                           struct net_flow_action *actions;
>      }
>
>   The table id is the _uid_ returned from 'get_tables' operatoins.
>   Matches is a set of match criteria for packets with a logical AND
>   operation done on the set so packets match the entire criteria.
>   Actions provide a set of actions to perform when the flow rule is
>   hit. Both matches and actions are null terminated arrays.
>
>   The flows are configured in hardware using an ndo op. We do not
>   provide a commit operation at the moment and expect hardware
>   commits the flows one at a time. Future work may require a commit
>   operation to tell the hardware we are done loading flow rules. On
>   some hardware this will help bulk updates.
>
>   Its possible for hardware to return an error from a flow set
>   operation. This can occur for many reasons both transient and
>   resource constraints. We have different error handling strategies
>   built in and listed here,
>
>     *_ERROR_ABORT      abort on first error with errmsg
>
>     *_ERROR_CONTINUE   continue programming flows no errmsg
>
>     *_ERROR_ABORT_LOG  abort on first error and return flow that
>                        failed to user space in reply msg
>
>     *_ERROR_CONT_LOG   continue programming flows and return a list
>                        of flows that failed to user space in a reply
>                        msg.
>
>   notably missing is a rollback error strategy. I don't have a
>   use for this in software yet but the strategy can be added with
>   *_ERROR_ROLLBACK for example.
>
> table_del_flows
>
>   The delete flow operation uses the same structures and error
>   handling strategies as the table_set_flows operations. Although on
>   delete messges ommit the matches/actions arrays because they are
>   not needed to lookup the flow.
>
> Also thanks to Simon Horman for fixes and other help.
>
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> ---
>  include/linux/if_flow.h      |   21 ++
>  include/linux/netdevice.h    |    8 +
>  include/uapi/linux/if_flow.h |   49 ++++
>  net/core/flow_table.c        |  501 ++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 579 insertions(+)
>
> diff --git a/include/linux/if_flow.h b/include/linux/if_flow.h
> index 1b6c1ea..20fa752 100644
> --- a/include/linux/if_flow.h
> +++ b/include/linux/if_flow.h
> @@ -90,4 +90,25 @@ struct net_flow_tbl_node {
>         __u32 flags;
>         struct net_flow_jump_table *jump;
>  };
> +
> +/**
> + * @struct net_flow_flow
> + * @brief describes the match/action entry
> + *
> + * @uid unique identifier for flow
> + * @priority priority to execute flow match/action in table

What is the convention on priority?  0 is lowest priority or highest?

> + * @match null terminated set of match uids match criteria
> + * @actoin null terminated set of action uids to apply to match
> + *
> + * Flows must match all entries in match set.
> + */
> +struct net_flow_flow {
> +       int table_id;
> +       int uid;
> +       int priority;
> +       struct net_flow_field_ref *matches;
> +       struct net_flow_action *actions;
> +};
> +
> +int net_flow_put_flow(struct sk_buff *skb, struct net_flow_flow *flow);
>  #endif
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 3c3c856..be8d4e4 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1197,6 +1197,14 @@ struct net_device_ops {
>         struct net_flow_header  **(*ndo_flow_get_headers)(struct net_device *dev);
>         struct net_flow_hdr_node **(*ndo_flow_get_hdr_graph)(struct net_device *dev);
>         struct net_flow_tbl_node **(*ndo_flow_get_tbl_graph)(struct net_device *dev);
> +       int                     (*ndo_flow_get_flows)(struct sk_buff *skb,
> +                                                     struct net_device *dev,
> +                                                     int table,
> +                                                     int min, int max);
> +       int                     (*ndo_flow_set_flows)(struct net_device *dev,
> +                                                     struct net_flow_flow *f);
> +       int                     (*ndo_flow_del_flows)(struct net_device *dev,
> +                                                     struct net_flow_flow *f);

Need doc for these in BIG comment block above this struct.  Same for
ndo_flow_xxx added in previous patch.

>  #endif
>  };
>
> diff --git a/include/uapi/linux/if_flow.h b/include/uapi/linux/if_flow.h
> index 2acdb38..125cdc6 100644
> --- a/include/uapi/linux/if_flow.h
> +++ b/include/uapi/linux/if_flow.h
> @@ -329,6 +329,48 @@ enum {
>  #define NET_FLOW_TABLE_GRAPH_MAX (__NET_FLOW_TABLE_GRAPH_MAX - 1)
>
>  enum {
> +       NET_FLOW_NET_FLOW_UNSPEC,
> +       NET_FLOW_FLOW,
> +       __NET_FLOW_NET_FLOW_MAX,
> +};
> +#define NET_FLOW_NET_FLOW_MAX (__NET_FLOW_NET_FLOW_MAX - 1)
> +
> +enum {
> +       NET_FLOW_TABLE_FLOWS_UNSPEC,
> +       NET_FLOW_TABLE_FLOWS_TABLE,
> +       NET_FLOW_TABLE_FLOWS_MINPRIO,
> +       NET_FLOW_TABLE_FLOWS_MAXPRIO,
> +       NET_FLOW_TABLE_FLOWS_FLOWS,
> +       __NET_FLOW_TABLE_FLOWS_MAX,
> +};
> +#define NET_FLOW_TABLE_FLOWS_MAX (__NET_FLOW_TABLE_FLOWS_MAX - 1)
> +
> +enum {

NET_FLOW_FLOWS_ERROR_UNSPEC?

> +       /* Abort with normal errmsg */
> +       NET_FLOW_FLOWS_ERROR_ABORT,
> +       /* Ignore errors and continue without logging */
> +       NET_FLOW_FLOWS_ERROR_CONTINUE,
> +       /* Abort and reply with invalid flow fields */
> +       NET_FLOW_FLOWS_ERROR_ABORT_LOG,
> +       /* Continue and reply with list of invalid flows */
> +       NET_FLOW_FLOWS_ERROR_CONT_LOG,
> +       __NET_FLOWS_FLOWS_ERROR_MAX,
> +};
> +#define NET_FLOWS_FLOWS_ERROR_MAX (__NET_FLOWS_FLOWS_ERROR_MAX - 1)
> +
> +enum {
> +       NET_FLOW_ATTR_UNSPEC,
> +       NET_FLOW_ATTR_ERROR,
> +       NET_FLOW_ATTR_TABLE,
> +       NET_FLOW_ATTR_UID,
> +       NET_FLOW_ATTR_PRIORITY,
> +       NET_FLOW_ATTR_MATCHES,
> +       NET_FLOW_ATTR_ACTIONS,
> +       __NET_FLOW_ATTR_MAX,
> +};
> +#define NET_FLOW_ATTR_MAX (__NET_FLOW_ATTR_MAX - 1)
> +
> +enum {
>         NET_FLOW_IDENTIFIER_IFINDEX, /* net_device ifindex */
>  };
>
> @@ -343,6 +385,9 @@ enum {
>         NET_FLOW_HEADER_GRAPH,
>         NET_FLOW_TABLE_GRAPH,
>
> +       NET_FLOW_FLOWS,
> +       NET_FLOW_FLOWS_ERROR,
> +
>         __NET_FLOW_MAX,
>         NET_FLOW_MAX = (__NET_FLOW_MAX - 1),
>  };
> @@ -354,6 +399,10 @@ enum {
>         NET_FLOW_TABLE_CMD_GET_HDR_GRAPH,
>         NET_FLOW_TABLE_CMD_GET_TABLE_GRAPH,
>
> +       NET_FLOW_TABLE_CMD_GET_FLOWS,
> +       NET_FLOW_TABLE_CMD_SET_FLOWS,
> +       NET_FLOW_TABLE_CMD_DEL_FLOWS,
> +
>         __NET_FLOW_CMD_MAX,
>         NET_FLOW_CMD_MAX = (__NET_FLOW_CMD_MAX - 1),
>  };
> diff --git a/net/core/flow_table.c b/net/core/flow_table.c
> index ec3f06d..f4cf293 100644
> --- a/net/core/flow_table.c
> +++ b/net/core/flow_table.c
> @@ -774,6 +774,489 @@ static int net_flow_cmd_get_table_graph(struct sk_buff *skb,
>         return genlmsg_reply(msg, info);
>  }
>
> +static struct sk_buff *net_flow_build_flows_msg(struct net_device *dev,
> +                                               u32 portid, int seq, u8 cmd,
> +                                               int min, int max, int table)
> +{
> +       struct genlmsghdr *hdr;
> +       struct nlattr *flows;
> +       struct sk_buff *skb;
> +       int err = -ENOBUFS;
> +
> +       skb = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);

Does netlink msg size limit the number of flows we can return?  If I
have 10K unique prefix routes, resulting in 10K flows in my L3 table,
all at same priority, can it be dumped?

I'm wondering (out loud) if get_flows is something that should be
supported at the driver level, or should it be managed above the
driver.  In other words, whoever calls set_flows and del_flows knows
what's what and could return get_flows rather than calling down to
driver/device to get_flows.  Unless driver/device could create flows
independent of set_flows and del_flows.

> +       if (!skb)
> +               return ERR_PTR(-ENOBUFS);
> +
> +       hdr = genlmsg_put(skb, portid, seq, &net_flow_nl_family, 0, cmd);
> +       if (!hdr)
> +               goto out;
> +
> +       if (nla_put_u32(skb,
> +                       NET_FLOW_IDENTIFIER_TYPE,
> +                       NET_FLOW_IDENTIFIER_IFINDEX) ||
> +           nla_put_u32(skb, NET_FLOW_IDENTIFIER, dev->ifindex)) {
> +               err = -ENOBUFS;
> +               goto out;
> +       }
> +
> +       flows = nla_nest_start(skb, NET_FLOW_FLOWS);
> +       if (!flows) {
> +               err = -EMSGSIZE;
> +               goto out;
> +       }
> +
> +       err = dev->netdev_ops->ndo_flow_get_flows(skb, dev, table, min, max);
> +       if (err < 0)
> +               goto out_cancel;
> +
> +       nla_nest_end(skb, flows);
> +
> +       err = genlmsg_end(skb, hdr);
> +       if (err < 0)
> +               goto out;
> +
> +       return skb;
> +out_cancel:
> +       nla_nest_cancel(skb, flows);
> +out:
> +       nlmsg_free(skb);
> +       return ERR_PTR(err);
> +}
> +
> +static const
> +struct nla_policy net_flow_table_flows_policy[NET_FLOW_TABLE_FLOWS_MAX + 1] = {
> +       [NET_FLOW_TABLE_FLOWS_TABLE]   = { .type = NLA_U32,},
> +       [NET_FLOW_TABLE_FLOWS_MINPRIO] = { .type = NLA_U32,},
> +       [NET_FLOW_TABLE_FLOWS_MAXPRIO] = { .type = NLA_U32,},
> +       [NET_FLOW_TABLE_FLOWS_FLOWS]   = { .type = NLA_NESTED,},
> +};
> +
> +static int net_flow_table_cmd_get_flows(struct sk_buff *skb,
> +                                       struct genl_info *info)
> +{
> +       struct nlattr *tb[NET_FLOW_TABLE_FLOWS_MAX+1];
> +       int table, min = -1, max = -1;
> +       struct net_device *dev;
> +       struct sk_buff *msg;
> +       int err = -EINVAL;
> +
> +       dev = net_flow_get_dev(info);
> +       if (!dev)
> +               return -EINVAL;
> +
> +       if (!dev->netdev_ops->ndo_flow_get_flows) {
> +               dev_put(dev);
> +               return -EOPNOTSUPP;
> +       }
> +
> +       if (!info->attrs[NET_FLOW_IDENTIFIER_TYPE] ||
> +           !info->attrs[NET_FLOW_IDENTIFIER] ||
> +           !info->attrs[NET_FLOW_FLOWS])
> +               goto out;
> +
> +       err = nla_parse_nested(tb, NET_FLOW_TABLE_FLOWS_MAX,
> +                              info->attrs[NET_FLOW_FLOWS],
> +                              net_flow_table_flows_policy);
> +       if (err)
> +               goto out;
> +
> +       if (!tb[NET_FLOW_TABLE_FLOWS_TABLE])
> +               goto out;
> +
> +       table = nla_get_u32(tb[NET_FLOW_TABLE_FLOWS_TABLE]);
> +
> +       if (tb[NET_FLOW_TABLE_FLOWS_MINPRIO])
> +               min = nla_get_u32(tb[NET_FLOW_TABLE_FLOWS_MINPRIO]);
> +       if (tb[NET_FLOW_TABLE_FLOWS_MAXPRIO])
> +               max = nla_get_u32(tb[NET_FLOW_TABLE_FLOWS_MAXPRIO]);

Just curious, what is the intended use of min/max prio?  Was it to
reduce number of flows returned in one get_flows call?


> +       msg = net_flow_build_flows_msg(dev,
> +                                      info->snd_portid,
> +                                      info->snd_seq,
> +                                      NET_FLOW_TABLE_CMD_GET_FLOWS,
> +                                      min, max, table);
> +       dev_put(dev);
> +
> +       if (IS_ERR(msg))
> +               return PTR_ERR(msg);
> +
> +       return genlmsg_reply(msg, info);
> +out:
> +       dev_put(dev);
> +       return err;
> +}
> +
> +static struct sk_buff *net_flow_start_errmsg(struct net_device *dev,
> +                                            struct genlmsghdr **hdr,
> +                                            u32 portid, int seq, u8 cmd)
> +{
> +       struct genlmsghdr *h;
> +       struct sk_buff *skb;
> +
> +       skb = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
> +       if (!skb)
> +               return ERR_PTR(-EMSGSIZE);
> +
> +       h = genlmsg_put(skb, portid, seq, &net_flow_nl_family, 0, cmd);
> +       if (!h)
> +               return ERR_PTR(-EMSGSIZE);
> +
> +       if (nla_put_u32(skb,
> +                       NET_FLOW_IDENTIFIER_TYPE,
> +                       NET_FLOW_IDENTIFIER_IFINDEX) ||
> +           nla_put_u32(skb, NET_FLOW_IDENTIFIER, dev->ifindex))
> +               return ERR_PTR(-EMSGSIZE);
> +
> +       *hdr = h;
> +       return skb;
> +}
> +
> +static struct sk_buff *net_flow_end_flow_errmsg(struct sk_buff *skb,
> +                                               struct genlmsghdr *hdr)
> +{
> +       int err;
> +
> +       err = genlmsg_end(skb, hdr);
> +       if (err < 0) {
> +               nlmsg_free(skb);
> +               return ERR_PTR(err);
> +       }
> +
> +       return skb;
> +}
> +
> +static int net_flow_put_flow_action(struct sk_buff *skb,
> +                                   struct net_flow_action *a)
> +{
> +       struct nlattr *action, *sigs;
> +       int i, err = 0;
> +
> +       action = nla_nest_start(skb, NET_FLOW_ACTION);
> +       if (!action)
> +               return -EMSGSIZE;
> +
> +       if (nla_put_u32(skb, NET_FLOW_ACTION_ATTR_UID, a->uid))
> +               return -EMSGSIZE;
> +
> +       if (!a->args)
> +               goto done;
> +
> +       for (i = 0; a->args[i].type; i++) {
> +               sigs = nla_nest_start(skb, NET_FLOW_ACTION_ATTR_SIGNATURE);
> +               if (!sigs) {
> +                       nla_nest_cancel(skb, action);
> +                       return -EMSGSIZE;
> +               }
> +
> +               err = net_flow_put_act_types(skb, a[i].args);
> +               if (err) {
> +                       nla_nest_cancel(skb, action);
> +                       nla_nest_cancel(skb, sigs);

order seems backwards.  i think you can just cancel outer.

> +                       return err;
> +               }
> +               nla_nest_end(skb, sigs);
> +       }
> +
> +done:
> +       nla_nest_end(skb, action);
> +       return 0;
> +}
> +
> +int net_flow_put_flow(struct sk_buff *skb, struct net_flow_flow *flow)
> +{
> +       struct nlattr *flows, *matches;
> +       struct nlattr *actions = NULL; /* must be null to unwind */
> +       int err, j, i = 0;
> +
> +       flows = nla_nest_start(skb, NET_FLOW_FLOW);
> +       if (!flows)
> +               goto put_failure;
> +
> +       if (nla_put_u32(skb, NET_FLOW_ATTR_TABLE, flow->table_id) ||
> +           nla_put_u32(skb, NET_FLOW_ATTR_UID, flow->uid) ||
> +           nla_put_u32(skb, NET_FLOW_ATTR_PRIORITY, flow->priority))
> +               goto flows_put_failure;
> +
> +       if (flow->matches) {
> +               matches = nla_nest_start(skb, NET_FLOW_ATTR_MATCHES);
> +               if (!matches)
> +                       goto flows_put_failure;
> +
> +               for (j = 0; flow->matches && flow->matches[j].header; j++) {

for(match = flow->matches; match->header; match++)

> +                       struct net_flow_field_ref *f = &flow->matches[j];
> +
> +                       if (!f->header)
> +                               continue;

Already checked in for loop?

> +
> +                       nla_put(skb, NET_FLOW_FIELD_REF, sizeof(*f), f);

no err check

> +               }
> +               nla_nest_end(skb, matches);
> +       }
> +
> +       if (flow->actions) {
> +               actions = nla_nest_start(skb, NET_FLOW_ATTR_ACTIONS);
> +               if (!actions)
> +                       goto flows_put_failure;
> +
> +               for (i = 0; flow->actions && flow->actions[i].uid; i++) {
> +                       err = net_flow_put_flow_action(skb, &flow->actions[i]);
> +                       if (err) {
> +                               nla_nest_cancel(skb, actions);

just cancel outer (I think)

> +                               goto flows_put_failure;
> +                       }
> +               }
> +               nla_nest_end(skb, actions);
> +       }
> +
> +       nla_nest_end(skb, flows);
> +       return 0;
> +
> +flows_put_failure:
> +       nla_nest_cancel(skb, flows);
> +put_failure:
> +       return -EMSGSIZE;
> +}
> +EXPORT_SYMBOL(net_flow_put_flow);
> +
> +static int net_flow_get_field(struct net_flow_field_ref *field,
> +                             struct nlattr *nla)
> +{
> +       if (nla_type(nla) != NET_FLOW_FIELD_REF)
> +               return -EINVAL;
> +
> +       if (nla_len(nla) < sizeof(*field))
> +               return -EINVAL;
> +
> +       *field = *(struct net_flow_field_ref *)nla_data(nla);
> +       return 0;
> +}

maybe return struct net_flow_field_ref * to simplify return logic.

> +
> +static int net_flow_get_action(struct net_flow_action *a, struct nlattr *attr)
> +{
> +       struct nlattr *act[NET_FLOW_ACTION_ATTR_MAX+1];
> +       struct nlattr *args;
> +       int rem;
> +       int err, count = 0;
> +
> +       if (nla_type(attr) != NET_FLOW_ACTION) {
> +               pr_warn("%s: expected NET_FLOW_ACTION\n", __func__);
> +               return 0;
> +       }
> +
> +       err = nla_parse_nested(act, NET_FLOW_ACTION_ATTR_MAX,
> +                              attr, net_flow_action_policy);
> +       if (err < 0)
> +               return err;
> +
> +       if (!act[NET_FLOW_ACTION_ATTR_UID] ||
> +           !act[NET_FLOW_ACTION_ATTR_SIGNATURE])
> +               return -EINVAL;
> +
> +       a->uid = nla_get_u32(act[NET_FLOW_ACTION_ATTR_UID]);
> +
> +       nla_for_each_nested(args, act[NET_FLOW_ACTION_ATTR_SIGNATURE], rem)
> +               count++; /* unoptimized max possible */
> +
> +       a->args = kcalloc(count + 1,
> +                         sizeof(struct net_flow_action_arg),
> +                         GFP_KERNEL);

kcalloc failure?

> +       count = 0;
> +
> +       nla_for_each_nested(args, act[NET_FLOW_ACTION_ATTR_SIGNATURE], rem) {
> +               if (nla_type(args) != NET_FLOW_ACTION_ARG)
> +                       continue;
> +
> +               if (nla_len(args) < sizeof(struct net_flow_action_arg)) {
> +                       kfree(a->args);
> +                       return -EINVAL;
> +               }
> +
> +               a->args[count] = *(struct net_flow_action_arg *)nla_data(args);

count++?

^ permalink raw reply

* Re: [PATCH 1/1] bridge: remove BR_GROUPFWD_RESTRICTED for arbitrary forwarding of reserved addresses
From: Stephen Hemminger @ 2015-01-06  6:10 UTC (permalink / raw)
  To: Bernhard Thaler; +Cc: netdev, bridge, davem
In-Reply-To: <1420505776-26827-1-git-send-email-bernhard.thaler@wvnet.at>

On Tue,  6 Jan 2015 01:56:15 +0100
Bernhard Thaler <bernhard.thaler@wvnet.at> wrote:

> BR_GROUPFWD_RESTRICTED bitmask restricts users from setting values to
> /sys/class/net/brX/bridge/group_fwd_mask that allow forwarding of
> some IEEE 802.1D Table 7-10 Reserved addresses:
> 	(MAC Control) 802.3		01-80-C2-00-00-01
> 	(Link Aggregation) 802.3	01-80-C2-00-00-02
> 	802.1AB LLDP			01-80-C2-00-00-0E
> BR_GROUPFWD_RESTRICTED may have been set as an extra protection against
> forwarding these control frames as forwarding 802.1X PAE (01-80-C2-00-00-03)
> in 802.1X setups satisfies most common use-cases.
> Other situations, such as placing a software based bridge as a "TAP" between two
> devices may require to forward e.g. LLDP frames while debugging network problems
> or actively changing/filtering traffic with ebtables.
> 
> This patch allows to set e.g.:
> 	echo 65535 > /sys/class/net/brX/bridge/group_fwd_mask
> which sets no restrictions on the forwardable reserved addresses.
> 
> - the default value 0 will still comply with 802.1D and not forward any
>   reserved addresses
> - values such as 8 for forwarding 802.1X related frames will behave the
>   same way as with BR_GROUPFWD_RESTRICTED currently in place, so backward
>   compatibility to current scripts using group_fwd_masks shoudl be possible
> 
> Administrators and network engineers however will be able to arbitrarily
> forward any reserved addresses without BR_GROUPFWD_RESTRICTED. This will
> be non-standard compliant behavior, but forwarding of any reserved address
> right from the beginning is. Users should be aware of this anyway and
> know what/why they are doing when setting values such as 65535, 32768, 16384,
> 4, 2 for group_fwd_mask
> 
> This patch was tested on a bridge with two interfaces created with bridge-utils.
> 
> Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at>

I am ok with forwarding LLDP because some people need it.
But allowing forwarding STP or PAUSE frames is bad.

We don't let people do things that break networks. Other examples
already exist like set all 0 ethernet addresses, or the restrictions
on allowing net 127 in IP addresses.

^ permalink raw reply

* Re: [net-next PATCH v1 01/11] net: flow_table: create interface for hw match/action tables
From: John Fastabend @ 2015-01-06  6:04 UTC (permalink / raw)
  To: Scott Feldman
  Cc: Thomas Graf, Jiří Pírko, Jamal Hadi Salim,
	simon.horman@netronome.com, Netdev, David S. Miller,
	Andy Gospodarek
In-Reply-To: <CAE4R7bBeCbq1yiRHQaQSX4JwKMWxn2hMoOptyOZe3Ak=9WzrVw@mail.gmail.com>

[...]

>> +/**
>> + * @struct net_flow_action
>> + * @brief a description of a endpoint defined action
>> + *
>> + * @name printable name
>> + * @uid unique action identifier
>> + * @types NET_FLOW_ACTION_TYPE_NULL terminated list of action types
>
> s/types/args?
>

yep typo fixed in upcoming v2.

>> + */
>> +struct net_flow_action {
>> +       char name[NET_FLOW_NAMSIZ];
>> +       int uid;
>> +       struct net_flow_action_arg *args;
>> +};
>> +
>> +/**
>> + * @struct net_flow_table
>> + * @brief define flow table with supported match/actions
>> + *
>> + * @uid unique identifier for table
>> + * @source uid of parent table
>
> Is parent table the table previous in the pipeline?  If so, what if
> you can get to table from N different parent tables, what goes in
> source?

No, you can get the layout of tables from the table graph ops.

Source is used when a single tcam or other implementation mechanism
is sliced into a set of tables. The current rocker world doesn't use
this very much at the moment because its static and I just assumed
every table came out of the same virtual hardware namespace.

A simple example world would be to come up with a set of large virtual
TCAMs. Any given TCAM maybe sliced into a set of tables. Users may
organize these either via some out of band configuration at init or
power on time. In the rocker case we could specify this when we load
qemu. For now it is just informational. But if we start allowing users
to create delete tables at runtime it is important to "know" where the
slices are being allocated/free'd from. The source gives you this
information.

The hardware devices I'm working on have multiple sources we can
allocate/free tables from. The source values would provide a way to
track down which tables are in which hardware namespaces.

Hope that helps?

>
>> + * @size max number of entries for table or -1 for unbounded
>> + * @matches null terminated set of supported match types given by match uid
>> + * @actions null terminated set of supported action types given by action uid
>> + * @flows set of flows
>> + */
>> +struct net_flow_table {
>> +       char name[NET_FLOW_NAMSIZ];
>> +       int uid;
>> +       int source;
>> +       int size;
>> +       struct net_flow_field_ref *matches;
>> +       int *actions;
>> +};
>> +
>> +/* net_flow_hdr_node: node in a header graph of header fields.
>> + *
>> + * @uid : unique id of the graph node
>> + * @flwo_header_ref : identify the hdrs that can handled by this node
>
> s/flwo_header_ref/hdrs?
>
>> + * @net_flow_jump_table : give a case jump statement
>
> s/net_flow_jump_table/jump

yep thanks.

>
>> + */
>> +struct net_flow_hdr_node {
>> +       char name[NET_FLOW_NAMSIZ];
>> +       int uid;
>> +       int *hdrs;
>> +       struct net_flow_jump_table *jump;
>> +};
>> +
>> +struct net_flow_tbl_node {
>> +       int uid;
>> +       __u32 flags;
>> +       struct net_flow_jump_table *jump;
>> +};
>> +#endif
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index 29c92ee..3c3c856 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -52,6 +52,11 @@
>>   #include <linux/neighbour.h>
>>   #include <uapi/linux/netdevice.h>
>>
>> +#ifdef CONFIG_NET_FLOW_TABLES
>> +#include <linux/if_flow.h>
>> +#include <uapi/linux/if_flow.h>
>
> linux/if_flow.h already included uapi file
>

fixed.

>> +#endif
>> +
>>   struct netpoll_info;
>>   struct device;
>>   struct phy_device;
>> @@ -1186,6 +1191,13 @@ struct net_device_ops {
>>          int                     (*ndo_switch_port_stp_update)(struct net_device *dev,
>>                                                                u8 state);
>>   #endif
>> +#ifdef CONFIG_NET_FLOW_TABLES
>> +       struct net_flow_action  **(*ndo_flow_get_actions)(struct net_device *dev);
>> +       struct net_flow_table   **(*ndo_flow_get_tables)(struct net_device *dev);
>> +       struct net_flow_header  **(*ndo_flow_get_headers)(struct net_device *dev);
>> +       struct net_flow_hdr_node **(*ndo_flow_get_hdr_graph)(struct net_device *dev);
>
> hdr or header?  pick one, probably hdr.

hdr is shorter and doesn't lose any clarity IMO I'll use net_flow_hdr
and net_flow_hdr_node

>
>> +       struct net_flow_tbl_node **(*ndo_flow_get_tbl_graph)(struct net_device *dev);
>
> move this up next to get_tables

sure also what do you think tbl instead of table.

>
>> +#endif
>>   };
>>
>>   /**
>> diff --git a/include/uapi/linux/if_flow.h b/include/uapi/linux/if_flow.h
>> new file mode 100644
>> index 0000000..2acdb38
>> --- /dev/null
>> +++ b/include/uapi/linux/if_flow.h
>> @@ -0,0 +1,363 @@
>> +/*
>> + * include/uapi/linux/if_flow.h - Flow table interface for Switch devices
>> + * Copyright (c) 2014 John Fastabend <john.r.fastabend@intel.com>
>> + *
>> + *
>> + * This program is free software; you can redistribute it and/or modify it
>> + * under the terms and conditions of the GNU General Public License,
>> + * version 2, as published by the Free Software Foundation.
>> + *
>> + * This program is distributed in the hope it will be useful, but WITHOUT
>> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
>> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
>> + * more details.
>> + *
>> + * The full GNU General Public License is included in this distribution in
>> + * the file called "COPYING".
>> + *
>> + * Author: John Fastabend <john.r.fastabend@intel.com>
>> + */
>> +
>> +/* Netlink description:
>> + *
>> + * Table definition used to describe running tables. The following
>> + * describes the netlink message returned from a flow API messages.
>
> message?
>

That sentence is a bit awkward all around. Changed it to

"The following describes the netlink format used by the flow API."

maybe that is better.

>
>> +
>> +enum {
>> +       NET_FLOW_MASK_TYPE_UNSPEC,
>> +       NET_FLOW_MASK_TYPE_EXACT,
>> +       NET_FLOW_MASK_TYPE_LPM,
>
> As discussed in another thread, need third mask type that's not LPM;
> e.g. 0b0101.
>

yep.

>
>> +#define NET_FLOW_TABLE_GRAPH_NODE_MAX (__NET_FLOW_TABLE_GRAPH_NODE_MAX - 1)
>> +
>> +enum {
>> +       NET_FLOW_TABLE_GRAPH_UNSPEC,
>> +       NET_FLOW_TABLE_GRAPH_NODE,
>> +       __NET_FLOW_TABLE_GRAPH_MAX,
>> +};
>> +#define NET_FLOW_TABLE_GRAPH_MAX (__NET_FLOW_TABLE_GRAPH_MAX - 1)
>> +
>> +enum {
>> +       NET_FLOW_IDENTIFIER_IFINDEX, /* net_device ifindex */
>
> Maybe add an NET_FLOW_IDENTIFIER_UNSPEC so NET_FLOW_IDENTIFIER_IFINDEX
> isn't zero.
>

agreed I tend to like being able to test things with if (foo) { ... }

[...]

>> +
>> +static int net_flow_put_action(struct sk_buff *skb, struct net_flow_action *a)
>> +{
>> +       struct net_flow_action_arg *this;
>> +       struct nlattr *nest;
>> +       int err, args = 0;
>> +
>> +       if (a->name && nla_put_string(skb, NET_FLOW_ACTION_ATTR_NAME, a->name))
>> +               return -EMSGSIZE;
>> +
>> +       if (nla_put_u32(skb, NET_FLOW_ACTION_ATTR_UID, a->uid))
>> +               return -EMSGSIZE;
>> +
>> +       if (!a->args)
>> +               return 0;
>> +
>> +       for (this = &a->args[0]; strlen(this->name) > 0; this++)
>> +               args++;
>> +
>
> Since you only need to know that there are > 0 args, but don't need
> the actual count, can you simplify test with something like:
>

good catch, this is a hold over from some code I rewrote I'll clean
this up like,


  static int net_flow_put_action(struct sk_buff *skb, struct 
net_flow_action *a)
  {
          struct net_flow_action_arg *this;
          struct nlattr *nest;
          int err;

          if (a->name && nla_put_string(skb, NET_FLOW_ACTION_ATTR_NAME, 
a->name))
                  return -EMSGSIZE;

          if (nla_put_u32(skb, NET_FLOW_ACTION_ATTR_UID, a->uid))
                  return -EMSGSIZE;

          if (a->args && a->args[0].type) {
                  nest = nla_nest_start(skb, 
NET_FLOW_ACTION_ATTR_SIGNATURE);
                  if (!nest)
                          return -EMSGSIZE;

                  err = net_flow_put_act_types(skb, a->args);
                  if (err) {
                          nla_nest_cancel(skb, nest);
                          return err;
                  }
                  nla_nest_end(skb, nest);
        }

         return 0;
  }

I think that should probably work. Of course I'll compile it and test
it.


>     bool has_args = strlen(a->args->name) > 0;
>
> or
>
>    bool has_args = !!a->args->type;
>
>> +       if (args) {
>> +               nest = nla_nest_start(skb, NET_FLOW_ACTION_ATTR_SIGNATURE);
>> +               if (!nest)
>> +                       goto nest_put_failure;
>
> Maybe just return -EMSGSIZE here and skip goto.
>
>> +
>> +               err = net_flow_put_act_types(skb, a->args);
>> +               if (err) {
>> +                       nla_nest_cancel(skb, nest);
>> +                       return err;
>> +               }
>> +               nla_nest_end(skb, nest);
>> +       }
>> +
>> +       return 0;
>> +nest_put_failure:
>> +       return -EMSGSIZE;
>> +}
>> +
>> +static int net_flow_put_actions(struct sk_buff *skb,
>> +                               struct net_flow_action **acts)
>> +{
>> +       struct nlattr *actions;
>> +       int err, i;
>> +
>> +       actions = nla_nest_start(skb, NET_FLOW_ACTIONS);
>> +       if (!actions)
>> +               return -EMSGSIZE;
>> +
>> +       for (i = 0; acts[i]->uid; i++) {
>
> Using for(act = acts; act->udi; act++) will make code a little nicer.
>

not entirely convinced its any nicer that way but sure I'll convert it.

[...]

>> +static int net_flow_cmd_get_actions(struct sk_buff *skb,
>> +                                   struct genl_info *info)
>> +{
>> +       struct net_flow_action **a;
>> +       struct net_device *dev;
>> +       struct sk_buff *msg;
>> +
>> +       dev = net_flow_get_dev(info);
>> +       if (!dev)
>> +               return -EINVAL;
>> +
>> +       if (!dev->netdev_ops->ndo_flow_get_actions) {
>> +               dev_put(dev);
>> +               return -EOPNOTSUPP;
>> +       }
>> +
>> +       a = dev->netdev_ops->ndo_flow_get_actions(dev);
>> +       if (!a)
>> +               return -EBUSY;
>
> Is it assumed ndo_flow_get_actions() returns a pointer to a static
> list of actions?  What if the device wants to give up a dynamic list
> of actions?  I'm trying to understand the lifetime of pointer 'a'.
> What would cause -EBUSY condition?
>

Ah this is a good point. At the moment if a driver dynamically changes
a structure then its going to break because there is no locking
involved. I think the best way to do this is to use RCU here. We can
return rcu dereferenced pointers and then drivers will need to wait a
grace period before free'ing the old pointer. To simplify drivers we
can do this from helper calls and document the semantics.

Currently rocker is static so we don't have any issues. If no one minds
I would like to do this in a follow up series.

>> +
>> +       msg = net_flow_build_actions_msg(a, dev,
>> +                                        info->snd_portid,
>> +                                        info->snd_seq,
>> +                                        NET_FLOW_TABLE_CMD_GET_ACTIONS);
>> +       dev_put(dev);
>> +
>> +       if (IS_ERR(msg))
>> +               return PTR_ERR(msg);
>> +
>> +       return genlmsg_reply(msg, info);
>> +}
>> +
>> +static int net_flow_put_table(struct net_device *dev,
>> +                             struct sk_buff *skb,
>> +                             struct net_flow_table *t)
>> +{
>> +       struct nlattr *matches, *actions;
>> +       int i;
>> +
>> +       if (nla_put_string(skb, NET_FLOW_TABLE_ATTR_NAME, t->name) ||
>> +           nla_put_u32(skb, NET_FLOW_TABLE_ATTR_UID, t->uid) ||
>> +           nla_put_u32(skb, NET_FLOW_TABLE_ATTR_SOURCE, t->source) ||
>> +           nla_put_u32(skb, NET_FLOW_TABLE_ATTR_SIZE, t->size))
>> +               return -EMSGSIZE;
>> +
>> +       matches = nla_nest_start(skb, NET_FLOW_TABLE_ATTR_MATCHES);
>> +       if (!matches)
>> +               return -EMSGSIZE;
>> +
>> +       for (i = 0; t->matches[i].instance; i++)
>
> pointer-based loop better than i-based?  my personal preference, i guess.

hmm I guess I tended to write these with indices. I might leave them
for now but can change them if the consensus is pointer loops are easier
to read.

>
>> +               nla_put(skb, NET_FLOW_FIELD_REF,
>> +                       sizeof(struct net_flow_field_ref),
>> +                       &t->matches[i]);
>> +       nla_nest_end(skb, matches);
>> +


-- 
John Fastabend         Intel Corporation

^ permalink raw reply

* [PATCH v4 net-next 2/2 tuntap: Increase the number of queues in tun.
From: Pankaj Gupta @ 2015-01-06  5:39 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: davem, jasowang, mst, dgibson, vfalico, edumazet, vyasevic, hkchu,
	wuzhy, xemul, therbert, bhutchings, xii, stephen, jiri,
	sergei.shtylyov, Pankaj Gupta
In-Reply-To: <1420522756-15127-1-git-send-email-pagupta@redhat.com>

Networking under kvm works best if we allocate a per-vCPU RX and TX
queue in a virtual NIC. This requires a per-vCPU queue on the host side.

It is now safe to increase the maximum number of queues.
Preceding patch: 'net: allow large number of rx queues'
made sure this won't cause failures due to high order memory
allocations. Increase it to 256: this is the max number of vCPUs
KVM supports.

Size of tun_struct changes from 8512 to 10496 after this patch. This keeps 
pages allocated for tun_struct before and after the patch to 3. 

Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
Reviewed-by: David Gibson <dgibson@redhat.com>
---
 drivers/net/tun.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index e3fa65a..a19dc5f8 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -113,10 +113,11 @@ struct tap_filter {
 	unsigned char	addr[FLT_EXACT_COUNT][ETH_ALEN];
 };
 
-/* DEFAULT_MAX_NUM_RSS_QUEUES were chosen to let the rx/tx queues allocated for
- * the netdevice to be fit in one page. So we can make sure the success of
- * memory allocation. TODO: increase the limit. */
-#define MAX_TAP_QUEUES DEFAULT_MAX_NUM_RSS_QUEUES
+/* MAX_TAP_QUEUES 256 is chosen to allow rx/tx queues to be equal
+ * to max number of vCPUS in guest. Also, we are making sure here
+ * queue memory allocation do not fail.
+ */
+#define MAX_TAP_QUEUES 256
 #define MAX_TAP_FLOWS  4096
 
 #define TUN_FLOW_EXPIRE (3 * HZ)
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH v4 net-next 1/2] net: allow large number of rx queues
From: Pankaj Gupta @ 2015-01-06  5:39 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: davem, jasowang, mst, dgibson, vfalico, edumazet, vyasevic, hkchu,
	wuzhy, xemul, therbert, bhutchings, xii, stephen, jiri,
	sergei.shtylyov, Pankaj Gupta
In-Reply-To: <1420522756-15127-1-git-send-email-pagupta@redhat.com>

netif_alloc_rx_queues() uses kcalloc() to allocate memory
for "struct netdev_queue *_rx" array.
If we are doing large rx queue allocation kcalloc() might
fail, so this patch does a fallback to vzalloc().
Similar implementation is done for tx queue allocation in
netif_alloc_netdev_queues().

We avoid failure of high order memory allocation
with the help of vzalloc(), this allows us to do large
rx and tx queue allocation which in turn helps us to
increase the number of queues in tun.

As vmalloc() adds overhead on a critical network path,
__GFP_REPEAT flag is used with kzalloc() to do this fallback
only when really needed.

Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: David Gibson <dgibson@redhat.com>
---
 net/core/dev.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 1ab168e..954591a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6145,13 +6145,16 @@ static int netif_alloc_rx_queues(struct net_device *dev)
 {
 	unsigned int i, count = dev->num_rx_queues;
 	struct netdev_rx_queue *rx;
+	size_t sz = count * sizeof(*rx);
 
 	BUG_ON(count < 1);
 
-	rx = kcalloc(count, sizeof(struct netdev_rx_queue), GFP_KERNEL);
-	if (!rx)
-		return -ENOMEM;
-
+	rx = kzalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+	if (!rx) {
+		rx = vzalloc(sz);
+		if (!rx)
+			return -ENOMEM;
+	}
 	dev->_rx = rx;
 
 	for (i = 0; i < count; i++)
@@ -6781,7 +6784,7 @@ void free_netdev(struct net_device *dev)
 
 	netif_free_tx_queues(dev);
 #ifdef CONFIG_SYSFS
-	kfree(dev->_rx);
+	kvfree(dev->_rx);
 #endif
 
 	kfree(rcu_dereference_protected(dev->ingress_queue, 1));
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH v4 net-net 0/2] Increase the limit of tuntap queues
From: Pankaj Gupta @ 2015-01-06  5:39 UTC (permalink / raw)
  To: linux-kernel, netdev
  Cc: davem, jasowang, mst, dgibson, vfalico, edumazet, vyasevic, hkchu,
	wuzhy, xemul, therbert, bhutchings, xii, stephen, jiri,
	sergei.shtylyov, Pankaj Gupta

Networking under KVM works best if we allocate a per-vCPU rx and tx
queue in a virtual NIC. This requires a per-vCPU queue on the host side.
Modern physical NICs have multiqueue support for large number of queues.
To scale vNIC to run multiple queues parallel to maximum number of vCPU's
we need to increase number of queues support in tuntap.   

Changes from v3:

PATCH1: Michael.S.Tsirkin - Some cleanups and updated commit message.

Changes from v2:
PATCH 3: David Miller     - flex array adds extra level of indirection
                            for preallocated array.(dropped, as flow array
			    is allocated using kzalloc with failover to zalloc). 
Changes from v1:
PATCH 2: David Miller     - sysctl changes to limit number of queues 
                            not required for unprivileged users(dropped).

Changes from RFC
PATCH 1: Sergei Shtylyov  - Add an empty line after declarations.
PATCH 2: Jiri Pirko -       Do not introduce new module paramaters.
	 Michael.S.Tsirkin- We can use sysctl for limiting max number
                            of queues.

This series is to increase the number of tuntap queues. Original work is being 
done by 'jasowang@redhat.com'. I am taking this 'https://lkml.org/lkml/2013/6/19/29' 
patch series as a reference. As per discussion in the patch series:

There were two reasons which prevented us from increasing number of tun queues:

- The netdev_queue array in netdevice were allocated through kmalloc, which may 
  cause a high order memory allocation too when we have several queues. 
  E.g. sizeof(netdev_queue) is 320, which means a high order allocation would 
  happens when the device has more than 16 queues.

- We store the hash buckets in tun_struct which results a very large size of
  tun_struct, this high order memory allocation fail easily when the memory is
  fragmented.

The patch 60877a32bce00041528576e6b8df5abe9251fa73 increases the number of tx 
queues. Memory allocation fallback to vzalloc() when kmalloc() fails.

This series tries to address following issues:

- Increase the number of netdev_queue queues for rx similarly its done for tx 
  queues by falling back to vzalloc() when memory allocation with kmalloc() fails.

- Increase number of queues to 256, maximum number is equal to maximum number 
  of vCPUS allowed in a guest.

I have also done testing with multiple parallel Netperf sessions for different 
combination of queues and CPU's. It seems to be working fine without much increase 
in cpu load with increase in number of queues. I also see good increase in throughput
with increase in number of queues. Though i had limitation of 8 physical CPU's. 

For this test: Two Hosts(Host1 & Host2) are directly connected with cable  
Host1 is running Guest1. Data is sent from Host2 to Guest1 via Host1.

Host kernel: 3.19.0-rc2+, AMD Opteron(tm) Processor 6320
NIC : Emulex Corporation OneConnect 10Gb NIC (be3)

Patch Applied  %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle  throughput
Single Queue, 2 vCPU's
-------------
Before Patch :all    0.19    0.00    0.16    0.07    0.04    0.10    0.00    0.18    0.00   99.26  57864.18
After  Patch :all    0.99    0.00    0.64    0.69    0.07    0.26    0.00    1.58    0.00   95.77  57735.77

With 2 Queues, 2 vCPU's    
---------------
Before Patch :all    0.19    0.00    0.19    0.10    0.04    0.11    0.00    0.28    0.00   99.08  63083.09
After  Patch :all    0.87    0.00    0.73    0.78    0.09    0.35    0.00    2.04    0.00   95.14  62917.03

With 4 Queues, 4 vCPU's
--------------
Before Patch :all    0.20    0.00    0.21    0.11    0.04    0.12    0.00    0.32    0.00   99.00  80865.06
After  Patch :all    0.71    0.00    0.93    0.85    0.11    0.51    0.00    2.62    0.00   94.27  86463.19

With 8 Queues, 8 vCPU's
--------------
Before Patch :all    0.19    0.00    0.18    0.09    0.04    0.11    0.00    0.23    0.00   99.17  86795.31
After  Patch :all    0.65    0.00    1.18    0.93    0.13    0.68    0.00    3.38    0.00   93.05  89459.93

With 16 Queues, 8 vCPU's
--------------
After  Patch :all    0.61    0.00    1.59    0.97    0.18    0.92    0.00    4.32    0.00   91.41  120951.60

Patches Summary:
  net: allow large number of rx queues
  tuntap: Increase the number of queues in tun

 drivers/net/tun.c |    9 +++++----
 net/core/dev.c    |   13 ++++++++-----
 2 files changed, 13 insertions(+), 9 deletions(-)

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox