From: Mark Gray <mark.d.gray@redhat.com>
To: Pravin Shelar <pravin.ovn@gmail.com>
Cc: ovs dev <dev@openvswitch.org>,
	Linux Kernel Network Developers <netdev@vger.kernel.org>,
	Flavio Leitner <fbl@sysclose.org>,
	dan.carpenter@oracle.com
Subject: Re: [PATCH net-next] openvswitch: Introduce per-cpu upcall dispatch
Date: Thu, 15 Jul 2021 12:55:56 +0100
Message-ID: <f14e1e3d-5908-dfc8-dcb1-3fe5903dbf19@redhat.com>
In-Reply-To: <CAOrHB_A0BcA0OOGmceYFyS2V72858tW-WWX_i9WSEhz63O7Scg@mail.gmail.com>

On 15/07/2021 05:45, Pravin Shelar wrote:
> On Wed, Jun 30, 2021 at 2:53 AM Mark Gray <mark.d.gray@redhat.com> wrote:
>>
>> The Open vSwitch kernel module uses the upcall mechanism to send
>> packets from kernel space to user space when it misses in the kernel
>> space flow table. The upcall sends packets via a Netlink socket.
>> Currently, a Netlink socket is created for every vport. In this way,
>> there is a 1:1 mapping between a vport and a Netlink socket.
>> When a packet is received by a vport, if it needs to be sent to
>> user space, it is sent via the corresponding Netlink socket.
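
[ For context: the per-vport dispatch being replaced selects a socket by
hashing the packet, so that all packets of a flow reach the same socket.
Below is a simplified sketch of ovs_vport_find_upcall_portid(); the
in-tree version distributes the hash with reciprocal_scale() rather than
a plain modulo. ]

	static u32 find_upcall_portid(const struct vport *vport,
				      struct sk_buff *skb)
	{
		struct vport_portids *ids;

		ids = rcu_dereference(vport->upcall_portids);
		if (!ids || !ids->n_ids)
			return 0;	/* no userspace listener registered */

		/* The flow hash keeps a given flow pinned to one socket. */
		return ids->ids[skb_get_hash(skb) % ids->n_ids];
	}
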
>>
>> This mechanism, with various iterations of the corresponding user
>> space code, has seen some limitations and issues:
>>
>> * On systems with a large number of vports, there is a correspondingly
>> large number of Netlink sockets, which can limit scaling.
>> (https://bugzilla.redhat.com/show_bug.cgi?id=1526306)
>> * Packet reordering on upcalls.
>> (https://bugzilla.redhat.com/show_bug.cgi?id=1844576)
>> * A thundering herd issue.
>> (https://bugzilla.redhat.com/show_bug.cgi?id=1834444)
>>
>> This patch introduces an alternative, feature-negotiated, upcall
>> mode using a per-cpu dispatch rather than a per-vport dispatch.
>>
>> In this mode, the Netlink socket to be used for the upcall is
>> selected based on the CPU of the thread that is executing the upcall.
>> In this way, it resolves the issues above as follows:
>>
>> a) The number of Netlink sockets scales with the number of CPUs
>> rather than the number of vports.
>> b) Per-flow ordering is maintained as packets from a given flow are
>> steered to the same CPU by mechanisms such as RSS, and each CPU's
>> upcalls are handled by a single user space thread.
>> c) Packets from a flow can only wake up one user space thread.
>>
>> The corresponding user space code can be found at:
>> https://mail.openvswitch.org/pipermail/ovs-dev/2021-April/382618.html
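
[ For reference, the userspace half of this negotiation amounts to
opening one Netlink socket per CPU and handing their PIDs to the
kernel. A minimal sketch follows, assuming hypothetical nl_*() helpers
and an upcall_socks[] array standing in for the caller's netlink
plumbing (the linked ovs-dev series has the real code); only the
feature flag and attribute layout come from this patch. ]

	#include <errno.h>
	#include <stdint.h>
	#include <stdlib.h>
	#include <linux/openvswitch.h>

	int dp_enable_per_cpu_dispatch(struct nl_sock *txn_sock,
				       struct nl_sock **upcall_socks,
				       int dp_ifindex, int n_cpus)
	{
		struct nl_msg *msg;
		uint32_t *pids;
		int i, err;

		pids = calloc(n_cpus, sizeof(*pids));
		if (!pids)
			return -ENOMEM;

		/* Record each upcall socket's autobound Netlink PID,
		 * indexed by CPU id.
		 */
		for (i = 0; i < n_cpus; i++)
			pids[i] = nl_sock_pid(upcall_socks[i]);	/* hypothetical */

		msg = nl_dp_request(dp_ifindex, OVS_DP_CMD_SET);	/* hypothetical */
		nl_put_u32(msg, OVS_DP_ATTR_USER_FEATURES,
			   OVS_DP_F_DISPATCH_UPCALL_PER_CPU);
		nl_put_unspec(msg, OVS_DP_ATTR_PER_CPU_PIDS,
			      pids, n_cpus * sizeof(*pids));
		err = nl_transact(txn_sock, msg);			/* hypothetical */

		free(pids);
		return err;
	}
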
>>
>> Bugzilla: https://bugzilla.redhat.com/1844576
>> Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
>> ---
>>
>> Notes:
>>     v1 - Reworked based on Flavio's comments:
>>          * Fixed handling of userspace action case
>>          * Renamed 'struct dp_portids'
>>          * Fixed handling of return from kmalloc()
>>          * Removed check for dispatch type from ovs_dp_get_upcall_portid()
>>        - Reworked based on Dan's comments:
>>          * Fixed handling of return from kmalloc()
>>        - Reworked based on Pravin's comments:
>>          * Fixed handling of userspace action case
>>        - Added kfree() in destroy_dp_rcu() to cleanup netlink port ids
>>
> Patch looks good to me. I have the following minor comments.
>
>>  include/uapi/linux/openvswitch.h |  8 ++++
>>  net/openvswitch/actions.c        |  6 ++-
>>  net/openvswitch/datapath.c       | 70 +++++++++++++++++++++++++++++++-
>>  net/openvswitch/datapath.h       | 20 +++++++++
>>  4 files changed, 101 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
>> index 8d16744edc31..6571b57b2268 100644
>> --- a/include/uapi/linux/openvswitch.h
>> +++ b/include/uapi/linux/openvswitch.h
>> @@ -70,6 +70,8 @@ enum ovs_datapath_cmd {
>>   * set on the datapath port (for OVS_ACTION_ATTR_MISS).  Only valid on
>>   * %OVS_DP_CMD_NEW requests. A value of zero indicates that upcalls should
>>   * not be sent.
>> + * @OVS_DP_ATTR_PER_CPU_PIDS: Per-cpu array of PIDs for upcalls when
>> + * the %OVS_DP_F_DISPATCH_UPCALL_PER_CPU feature is set.
>>   * @OVS_DP_ATTR_STATS: Statistics about packets that have passed through the
>>   * datapath.  Always present in notifications.
>>   * @OVS_DP_ATTR_MEGAFLOW_STATS: Statistics about mega flow masks usage for the
>> @@ -87,6 +89,9 @@ enum ovs_datapath_attr {
>>         OVS_DP_ATTR_USER_FEATURES,      /* OVS_DP_F_*  */
>>         OVS_DP_ATTR_PAD,
>>         OVS_DP_ATTR_MASKS_CACHE_SIZE,
>> +       OVS_DP_ATTR_PER_CPU_PIDS,   /* Netlink PIDs to receive upcalls in per-cpu
>> +                                    * dispatch mode
>> +                                    */
>>         __OVS_DP_ATTR_MAX
>>  };
>>
>> @@ -127,6 +132,9 @@ struct ovs_vport_stats {
>>  /* Allow tc offload recirc sharing */
>>  #define OVS_DP_F_TC_RECIRC_SHARING     (1 << 2)
>>
>> +/* Allow per-cpu dispatch of upcalls */
>> +#define OVS_DP_F_DISPATCH_UPCALL_PER_CPU       (1 << 3)
>> +
>>  /* Fixed logical ports. */
>>  #define OVSP_LOCAL      ((__u32)0)
>>
>> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
>> index ef15d9eb4774..f79679746c62 100644
>> --- a/net/openvswitch/actions.c
>> +++ b/net/openvswitch/actions.c
>> @@ -924,7 +924,11 @@ static int output_userspace(struct datapath *dp, struct sk_buff *skb,
>>                         break;
>>
>>                 case OVS_USERSPACE_ATTR_PID:
>> -                       upcall.portid = nla_get_u32(a);
>> +                       if (dp->user_features & OVS_DP_F_DISPATCH_UPCALL_PER_CPU)
>> +                               upcall.portid =
>> +                                  ovs_dp_get_upcall_portid(dp, smp_processor_id());
>> +                       else
>> +                               upcall.portid = nla_get_u32(a);
>>                         break;
>>
>>                 case OVS_USERSPACE_ATTR_EGRESS_TUN_PORT: {
>> diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
>> index bc164b35e67d..8d54fa323543 100644
>> --- a/net/openvswitch/datapath.c
>> +++ b/net/openvswitch/datapath.c
>> @@ -166,6 +166,7 @@ static void destroy_dp_rcu(struct rcu_head *rcu)
>>         free_percpu(dp->stats_percpu);
>>         kfree(dp->ports);
>>         ovs_meters_exit(dp);
>> +       kfree(dp->upcall_portids);
>>         kfree(dp);
>>  }
>>
>> @@ -239,7 +240,12 @@ void ovs_dp_process_packet(struct sk_buff *skb, struct sw_flow_key *key)
>>
>>                 memset(&upcall, 0, sizeof(upcall));
>>                 upcall.cmd = OVS_PACKET_CMD_MISS;
>> -               upcall.portid = ovs_vport_find_upcall_portid(p, skb);
>> +
>> +               if (dp->user_features & OVS_DP_F_DISPATCH_UPCALL_PER_CPU)
>> +                       upcall.portid = ovs_dp_get_upcall_portid(dp, smp_processor_id());
>> +               else
>> +                       upcall.portid = ovs_vport_find_upcall_portid(p, skb);
>> +
>>                 upcall.mru = OVS_CB(skb)->mru;
>>                 error = ovs_dp_upcall(dp, skb, key, &upcall, 0);
>>                 if (unlikely(error))
>> @@ -1594,16 +1600,67 @@ static void ovs_dp_reset_user_features(struct sk_buff *skb,
>>
>>  DEFINE_STATIC_KEY_FALSE(tc_recirc_sharing_support);
>>
>> +int ovs_dp_set_upcall_portids(struct datapath *dp,
>> +                             const struct nlattr *ids)
> this can be static function.

Yes

> 
>> +{
>> +       struct dp_nlsk_pids *old, *dp_nlsk_pids;
>> +
>> +       if (!nla_len(ids) || nla_len(ids) % sizeof(u32))
>> +               return -EINVAL;
>> +
>> +       old = ovsl_dereference(dp->upcall_portids);
>> +
>> +       dp_nlsk_pids = kmalloc(sizeof(*dp_nlsk_pids) + nla_len(ids),
>> +                              GFP_KERNEL);
>> +       if (!dp_nlsk_pids)
>> +               return -ENOMEM;
>> +
>> +       dp_nlsk_pids->n_pids = nla_len(ids) / sizeof(u32);
>> +       nla_memcpy(dp_nlsk_pids->pids, ids, nla_len(ids));
>> +
>> +       rcu_assign_pointer(dp->upcall_portids, dp_nlsk_pids);
>> +
>> +       kfree_rcu(old, rcu);
>> +
>> +       return 0;
>> +}
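
[ Note on the update pattern above: publishing the new array with
rcu_assign_pointer() and deferring the free of the old one with
kfree_rcu() means in-flight upcalls can keep reading the old PID table
until a grace period elapses, so the swap never races with readers. ]
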
>> +
>> +u32 ovs_dp_get_upcall_portid(const struct datapath *dp, uint32_t cpu_id)
> same here, it can be static.

This one cannot be static as it is called from actions.c.
> 
>> +{
>> +       struct dp_nlsk_pids *dp_nlsk_pids;
>> +
>> +       dp_nlsk_pids = rcu_dereference_ovsl(dp->upcall_portids);
> I don't think this function is only called under ovs-lock, so we can
> change it to rcu_dereference().

I had a quick look through the code and I think you are right, so I will
change it.
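
[ For context, from net/openvswitch/datapath.h: rcu_dereference_ovsl()
merely widens the lockdep check to "RCU read side or ovs_mutex held",
so switching to plain rcu_dereference() only drops the ovs_mutex
exemption; the packet-processing callers already run under
rcu_read_lock().

	#define ovsl_dereference(p)					\
		rcu_dereference_protected(p, lockdep_ovsl_is_held())
	#define rcu_dereference_ovsl(p)					\
		rcu_dereference_check(p, lockdep_ovsl_is_held())
]
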

> 
>> +
>> +       if (dp_nlsk_pids) {
>> +               if (cpu_id < dp_nlsk_pids->n_pids) {
>> +                       return dp_nlsk_pids->pids[cpu_id];
>> +               } else if (dp_nlsk_pids->n_pids > 0 && cpu_id >= dp_nlsk_pids->n_pids) {
>> +                       /* If the number of netlink PIDs is mismatched with the number of
>> +                        * CPUs as seen by the kernel, log this and wrap the upcall back
>> +                        * into the registered sockets (cpu_id % n_pids) in order to not
>> +                        * drop packets
>> +                        */
>> +                       pr_info_ratelimited("cpu_id mismatch with handler threads");
>> +                       return dp_nlsk_pids->pids[cpu_id % dp_nlsk_pids->n_pids];
>> +               } else {
>> +                       return 0;
>> +               }
>> +       } else {
>> +               return 0;
>> +       }
>> +}
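
[ A worked example of the fallback above: if userspace registered 4
PIDs but the upcall executes on CPU 5, the packet is delivered to
pids[5 % 4] = pids[1] rather than being dropped. ]
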
>> +
> 


