Re: [PATCH RFC net-next] openvswitch: Queue upcalls to userspace in per-port round-robin order


From: Ben Pfaff <blp-LZ6Gd1LRuIk@public.gmane.org>
To: Matteo Croce <mcroce-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: dev-yBygre7rU0TnMu66kgdUjQ@public.gmane.org,
	jpettit-pghWNbHTmq7QT0dZR+AlfA@public.gmane.org,
	netdev <netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
	Jiri Benc <jbenc-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>,
	Stefano Brivio <sbrivio-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Subject: Re: [PATCH RFC net-next] openvswitch: Queue upcalls to userspace in per-port round-robin order
Date: Tue, 31 Jul 2018 15:06:57 -0700
Message-ID: <20180731220657.GC29662@ovn.org>
In-Reply-To: <CAGnkfhyxQSz=8OsgTsjR3NfZ2FPwv+FjPZNPEY5VHZRsEiQ68w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Tue, Jul 31, 2018 at 07:43:34PM +0000, Matteo Croce wrote:
> On Mon, Jul 16, 2018 at 4:54 PM Matteo Croce <mcroce-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> >
> > On Tue, Jul 10, 2018 at 6:31 PM Pravin Shelar <pshelar-LZ6Gd1LRuIk@public.gmane.org> wrote:
> > >
> > > On Wed, Jul 4, 2018 at 7:23 AM, Matteo Croce <mcroce-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > > > From: Stefano Brivio <sbrivio-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > > >
> > > > Open vSwitch sends to userspace all received packets that have
> > > > no associated flow (thus doing an "upcall"). Then the userspace
> > > > program creates a new flow and determines the actions to apply
> > > > based on its configuration.
> > > >
> > > > When a single port generates a high rate of upcalls, it can
> > > > prevent other ports from dispatching their own upcalls. vswitchd
> > > > overcomes this problem by creating many netlink sockets for each
> > > > port, but it quickly exceeds any reasonable maximum number of
> > > > open files when dealing with huge amounts of ports.
> > > >
> > > > This patch queues all the upcalls into a list, ordering them in
> > > > a per-port round-robin fashion, and schedules a deferred work to
> > > > queue them to userspace.
> > > >
> > > > The algorithm to queue upcalls in a round-robin fashion,
> > > > provided by Stefano, is based on these two rules:
> > > >  - upcalls for a given port must be inserted after all the other
> > > >    occurrences of upcalls for the same port already in the queue,
> > > >    in order to avoid out-of-order upcalls for a given port
> > > >  - insertion happens once the highest upcall count for any given
> > > >    port (excluding the one currently at hand) is greater than the
> > > >    count for the port we're queuing to -- if this condition is
> > > >    never true, upcall is queued at the tail. This results in a
> > > >    per-port round-robin order.
> > > >
> > > > In order to implement a fair round-robin behaviour, a variable
> > > > queueing delay is introduced. This will be zero if the upcalls
> > > > rate is below a given threshold, and grows linearly with the
> > > > queue utilisation (i.e. upcalls rate) otherwise.
> > > >
> > > > This ensures fairness among ports under load and with few
> > > > netlink sockets.
> > > >
> > > Thanks for the patch.
> > > This patch adds the following overhead to upcall handling:
> > > 1. kmalloc.
> > > 2. global spin-lock.
> > > 3. context switch to single worker thread.
> > > I think this could become a bottleneck on most multi-core systems.
> > > You have mentioned issues with the existing fairness mechanism. Can you
> > > elaborate on those? I think we could improve that before implementing
> > > heavyweight fairness in upcall handling.
> >
> > Hi Pravin,
> >
> > vswitchd allocates N * P netlink sockets, where N is the number of
> > online CPU cores, and P the number of ports.
> > With some setups, this number can grow quite fast, also exceeding the
> > system maximum file descriptor limit.
> > I've seen a 48 core server failing with -EMFILE when trying to create
> > more than 65535 netlink sockets needed for handling 1800+ ports.
> >
> > I made a previous attempt to reduce the sockets to one per CPU, but
> > this was discussed and rejected on ovs-dev because it would remove
> > fairness among ports[1].
> > I think that the current approach of opening a huge number of sockets
> > doesn't really work (it certainly doesn't scale). It still needs some
> > queueing logic (either in kernel or user space) if we really want to
> > be sure that low-traffic ports get their upcall quota when other
> > ports are handling far more traffic.
> >
> > If you are concerned about the kmalloc or the spinlock, we can address
> > them with kmem_cache or two copies of the list and RCU. I'll be happy to
> > discuss the implementation details, as long as we all agree that the
> > current implementation doesn't scale well and has an issue.
> >
> > [1] https://mail.openvswitch.org/pipermail/ovs-dev/2018-February/344279.html
> >
> > --
> > Matteo Croce
> > per aspera ad upstream
> 
> Hi all,
> 
> Any ideas on how to solve the file descriptor limit hit by the netlink sockets?
> I see this issue very often, and raising the FD limit to 400k
> does not seem like the right way to solve it.
> Any other suggestions on how to improve the patch, or solve the problem
> in a different way?

This is an awkward problem to try to solve with sockets because of the
nature of sockets, which are strictly first-in first-out.  What you
really want is something closer to the algorithm that we use in
ovs-vswitchd to send packets to an OpenFlow controller.  When the
channel becomes congested, then for each packet to be sent to the
controller, OVS appends it to a queue associated with its input port.
(This could be done on a more granular basis than just port.)  If the
maximum amount of queued packets is reached, then OVS discards a packet
from the longest queue.  When space becomes available in the channel,
OVS round-robins through the queues to send a packet.  This achieves
pretty good fairness but it can't be done with sockets because you can't
drop a packet that is already queued to one.
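
To make the scheme above concrete, here is a rough sketch in Python.  It
is not the ovs-vswitchd code.  The names FairQueue and max_packets are
made up for illustration, and dropping the oldest packet of the longest
queue is an assumption, since the description only says that a packet
from the longest queue is discarded.

```python
from collections import OrderedDict, deque


class FairQueue:
    """Per-port queues with longest-queue drop and round-robin dequeue.

    Illustrative only: the class name, max_packets, and the choice to
    drop the oldest packet of the longest queue are assumptions.
    """

    def __init__(self, max_packets):
        self.max_packets = max_packets  # global cap on queued packets (>= 1)
        self.queues = OrderedDict()     # port -> deque, in rotation order
        self.count = 0                  # total packets queued across ports

    def enqueue(self, port, packet):
        """Queue a packet for its input port; return what was dropped, if any."""
        dropped = None
        if self.count >= self.max_packets and self.queues:
            # At the limit: make room by discarding from the longest queue.
            longest = max(self.queues, key=lambda p: len(self.queues[p]))
            dropped = (longest, self.queues[longest].popleft())
            self.count -= 1
            if not self.queues[longest]:
                del self.queues[longest]
        self.queues.setdefault(port, deque()).append(packet)
        self.count += 1
        return dropped

    def dequeue(self):
        """Called when the channel has space: take one packet, round-robin."""
        if not self.queues:
            return None
        port, queue = next(iter(self.queues.items()))
        packet = queue.popleft()
        self.count -= 1
        # Move the port to the back of the rotation if it still has packets.
        del self.queues[port]
        if queue:
            self.queues[port] = queue
        return port, packet
```

The point of the sketch is that a busy port can only fill its own queue,
and once the limit is hit it is the busy port that loses packets, so a
quiet port still gets its turn on the next dequeue.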

My current thought is that any fairness scheme we implement directly in
the kernel is going to need to evolve over time.  Maybe we could do
something flexible with BPF and maps, instead of hard-coding it.
