From: Ido Schimmel <idosch@idosch.org>
To: Jesse Hathaway <jesse@mbuki-mvuki.org>
Cc: netdev@vger.kernel.org
Subject: Re: Race condition in route lookup
Date: Thu, 10 Oct 2019 11:31:02 +0300
Message-ID: <20191010083102.GA1336@splinter>
In-Reply-To: <CANSNSoV1M9stB7CnUcEhsz3FHi4NV_yrBtpYsZ205+rqnvMbvA@mail.gmail.com>

On Wed, Oct 09, 2019 at 11:00:07AM -0500, Jesse Hathaway wrote:
> We have been experiencing a route lookup race condition on our internet facing
> Linux routers. I have been able to reproduce the issue, but would love more
> help in isolating the cause.
> 
> Looking up a route found in the main table returns `*` rather than the directly
> connected interface about once for every 10-20 million requests. From my
> reading of the iproute2 source code an asterisk is indicative of the kernel
> returning an interface index of 0 rather than the correct directly connected
> interface.
> 
> This is reproducible with the following bash snippet on 5.4-rc2:
> 
>   $ cat route-race
>   #!/bin/bash
> 
>   # Generate 50 million individual route gets to feed as batch input to `ip`
>   function ip-cmds() {
>           route_get='route get 192.168.11.142 from 192.168.180.10 iif vlan180'
>           for ((i = 0; i < 50000000; i++)); do
>                   printf '%s\n' "${route_get}"
>           done
> 
>   }
> 
>   ip-cmds | ip -d -o -batch - | grep -E 'dev \*' | uniq -c
> 
> Example output:
> 
>   $ ./route-race
>         6 unicast 192.168.11.142 from 192.168.180.10 dev * table main \    cache iif vlan180
> 
> These routers have multiple routing tables and are ingesting full BGP routing
> tables from multiple ISPs:
> 
>   $ ip route show table all | wc -l
>   3105543
> 
>   $ ip route show table main | wc -l
>   54
> 
> Please let me know what other information I can provide, thanks in advance,

I think it's working as expected. Here is my theory:

If CPU0 is executing both the route get request and forwarding packets
through the directly connected interface, then the following can happen:

<CPU0, t0> - In process context, the per-CPU dst entry cached in the
nexthop is found. It is not yet dumped to user space.

<Any CPU, t1> - Routes are added / removed, therefore invalidating the
cache by bumping 'net->ipv4.rt_genid'

<CPU0, t2> - In softirq, a packet is forwarded through the nexthop. The
cached dst entry is found to be invalid, so it is replaced by a newer
dst entry. dst_dev_put() is called on the old entry, which assigns the
blackhole netdev to 'dst->dev'. This netdev has an ifindex of 0 because
it is not registered.

<CPU0, t3> - After the softirq finishes executing, your route get
request from t0 is resumed and the old dst entry is dumped to user space
with an ifindex of 0.

I tested this on my system using your script to generate the route get
requests. I pinned it to the same CPU forwarding packets through the
nexthop. To constantly invalidate the cache I created another script
that simply adds and removes IP addresses from an interface.
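
The test setup described above could be sketched roughly as follows. This
is not the script that was actually used: the function name, the interface
name (dummy0), the address (198.51.100.1/32) and the CPU number are
placeholders, and it assumes that adding and removing an address is enough
to invalidate the cache, as stated above.

```shell
#!/bin/bash
# Hypothetical helper, not the script from the original test.
# Emits <count> pairs of "addr add" / "addr del" commands suitable as
# batch input to `ip -batch -`. Each address change is assumed to
# invalidate the cached dst entries.
addr_flap_cmds() {
        local dev=$1 addr=$2 count=$3 i=0
        while [ "$i" -lt "$count" ]; do
                printf 'addr add %s dev %s\n' "$addr" "$dev"
                printf 'addr del %s dev %s\n' "$addr" "$dev"
                i=$((i + 1))
        done
}

# Example use, as root (all values are placeholders):
#   addr_flap_cmds dummy0 198.51.100.1/32 1000000 | ip -batch - &
#   taskset -c 0 ./route-race    # pin the route gets to the forwarding CPU
```

The generator only prints commands, so the sequence can be inspected before
it is piped into `ip`. Which CPU handles the forwarded packets depends on
the NIC's IRQ affinity, so the `taskset -c` argument has to be chosen to
match.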

If I stop the packet forwarding or the script that invalidates the
cache, then I don't see any '*' answers to my route get requests.

BTW, the blackhole netdev was added in 5.3. I assume (didn't test) that
with older kernel versions you'll see 'lo' instead of '*'.
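
To check for both symptoms with one filter, the grep in the original
script could be widened along these lines. This is only a sketch: the
function name is made up, and the 'lo' case is the untested assumption
from the paragraph above. Routes to local addresses legitimately report
'dev lo', so the count is only meaningful for a forwarded destination
such as the one in the report.

```shell
#!/bin/bash
# Hypothetical filter: reads route-get answers on stdin and prints how many
# of them name '*' (ifindex 0) or 'lo' as the output device.
count_stale_answers() {
        # grep -c exits non-zero when the count is 0; ignore that status.
        grep -cE 'dev (\*|lo)( |$)' || true
}

# Example use with the generator from the original script:
#   ip-cmds | ip -d -o -batch - | count_stale_answers
```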
