All of lore.kernel.org
 help / color / mirror / Atom feed
From: Nikolay Borisov <kernel-6AxghH7DbtA@public.gmane.org>
To: Erez Shitrit <erezsh-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
Cc: shlomop-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
	Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>,
	Roland Dreier <roland-BHEL68pLQRGGvPXPguhicg@public.gmane.org>,
	"linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org"
	<linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
Subject: Re: [IPOIB] Excessive TX packet drops due to IPOIB_MAX_PATH_REC_QUEUE
Date: Mon, 1 Aug 2016 11:20:44 +0300	[thread overview]
Message-ID: <579F065C.602@kyup.com> (raw)
In-Reply-To: <CAAk-MO83mJTq=E_MC=izqq8fEmVujY=5egVmKfFjxAz4jO3hHg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>



On 08/01/2016 11:01 AM, Erez Shitrit wrote:
> Hi Nikolay,
> 
> IPoIB is a special driver because it plays in 2 "courts", in one hand
> it is a network driver and in the other hand it is IB driver, this is
> the reason for what you are seeing. (be carefull more details are
> coming ..)
> 
> After ARP reply the kernel which threats ipoib driver as network
> driver (like ethernet, and doesn't aware of the IB aspect of the ipoib
> driver)
> the kernel thinks that now after it has the layer 2 address (from ARP)
> it can send the packets to the destination, it doesn't aware of the IB
> aspect which needs the AV (by Path Record) in order to get the right
> destination, ipoib tries to do best effort and while it asks the SM
> for the PathRecord it keeps theses packets (skb's) from the kernel in
> the neigh structure, the number of packets that are kept is 3, (3 is a
> good number, right after 2 .. and for almost all of the topologies we
> will not get more than 1 or 2 drops)
> 
> Now, for your case, i think you have other problem, the connectivity
> with the SM is bad, or the destination is no longer exists.
> check that via the saquery tool (saquery PR <> <>)

Thanks a lot for explaining this!

Actually right after I posted that email further investigation revealed
that the infiniband is indeed somehow confused. So when I initiate a
connection from machine A, which is connected to machine B via
infiniband (and ipoib ipv6 connectivity) everything works as expected.
However, if I do the same sequence but instead of connecting to machine
B I connected to a container, hosted on machine B and accessible via a
veth address I see the following bogus path records:

GID: 9000:0:2800:0:bc00:7500:6e:d8a4
  complete:     no

Clearly, this is a wrong address, while the bottom part is a valid GUID
of the infiniband port of machine A, the 9000:0:2800 part isn't. Here is
how the the actual path record for machine A (from the point of view of
Machine B) looks like:

GID: fe80:0:0:0:11:7500:6e:d8a4
  complete:    yes
  DLID:     0x004f
  SL:            0
  rate:       40.0 Gb/sec


Naturally if I do a saquery -p for 9000:0:2800:0:bc00:7500:6e:d8a4 I get
nothing, while for the second address it works. Further tracing revealed
that in ipoib_start_xmit on machine B the ipoib_cb->hwaddr is set to
9000:0:2800:0:bc00:7500:6e:d8a4 which is passed as an argument to
ipoib_neigh_get and this function returns NULL. This causes
neigh_add_path to be called to add a path but results in -EINVAL. Here
are the respective debug messages:

ib0: Start path record lookup for 9000:0000:2800:0000:bc00:7500:006e:d8a4

ib0: PathRec status -22 for GID 9000:0000:2800:0000:bc00:7500:006e:d8a4
ib0: neigh free for 0002f3 9000:0000:2800:0000:bc00:7500:006e:d8a4

And this is what is casuing the packet drops, since this neighbour is
considered dead (because it doesn't exist). For me this moves the
problem on a slightly different abstraction, because now it seems the
veth pair is somehow confusing the ipoib driver.



> 
> Thanks, Erez
> 
> On Thu, Jul 28, 2016 at 2:00 PM, Nikolay Borisov <kernel-6AxghH7DbtA@public.gmane.org> wrote:
>> Hello,
>>
>> While investigating excessive (> 50%) packet drops on an ipoib
>> interface as reported by ifconfig :
>>
>> TX packets:16565 errors:1 dropped:9058 overruns:0 carrier:0
>>
>> I discovered that this is happening due to the following check
>> in ipoib_start_xmit failing:
>>
>> if (skb_queue_len(&neigh->queue) < IPOIB_MAX_PATH_REC_QUEUE) {
>>                spin_lock_irqsave(&priv->lock, flags);
>>                __skb_queue_tail(&neigh->queue, skb);
>>                spin_unlock_irqrestore(&priv->lock, flags);
>> } else {
>>                ++dev->stats.tx_dropped;
>>                dev_kfree_skb_any(skb);
>> }
>>
>> With the following stacktrace:
>>
>> [1629744.927799]  [<ffffffffa048e6a1>] ipoib_start_xmit+0x651/0x6c0 [ib_ipoib]
>> [1629744.927804]  [<ffffffff8154ecf6>] dev_hard_start_xmit+0x266/0x410
>> [1629744.927807]  [<ffffffff81571b1b>] sch_direct_xmit+0xdb/0x210
>> [1629744.927808]  [<ffffffff8154f22a>] __dev_queue_xmit+0x24a/0x580
>> [1629744.927810]  [<ffffffff8154f570>] dev_queue_xmit+0x10/0x20
>> [1629744.927813]  [<ffffffff81557cf8>] neigh_resolve_output+0x118/0x1c0
>> [1629744.927828]  [<ffffffffa0003c7e>] ip6_finish_output2+0x18e/0x490 [ipv6]
>> [1629744.927831]  [<ffffffffa03b7374>] ? ipv6_confirm+0xc4/0x130 [nf_conntrack_ipv6]
>> [1629744.927837]  [<ffffffffa00052a6>] ip6_finish_output+0xa6/0x100 [ipv6]
>> [1629744.927843]  [<ffffffffa0005344>] ip6_output+0x44/0xe0 [ipv6]
>> [1629744.927850]  [<ffffffffa0005200>] ? ip6_fragment+0x9b0/0x9b0 [ipv6]
>> [1629744.927858]  [<ffffffffa000447c>] ip6_forward+0x4fc/0x8d0 [ipv6]
>> [1629744.927867]  [<ffffffffa00142ad>] ? ip6_route_input+0xfd/0x130 [ipv6]
>> [1629744.927872]  [<ffffffffa0001b70>] ? dst_output+0x20/0x20 [ipv6]
>> [1629744.927877]  [<ffffffffa0005be7>] ip6_rcv_finish+0x57/0xa0 [ipv6]
>> [1629744.927882]  [<ffffffffa0006374>] ipv6_rcv+0x314/0x4e0 [ipv6]
>> [1629744.927887]  [<ffffffffa0005b90>] ? ip6_make_skb+0x1b0/0x1b0 [ipv6]
>> [1629744.927890]  [<ffffffff8154c66b>] __netif_receive_skb_core+0x2cb/0xa30
>> [1629744.927893]  [<ffffffff8108310c>] ? __enqueue_entity+0x6c/0x70
>> [1629744.927894]  [<ffffffff8154cde6>] __netif_receive_skb+0x16/0x70
>> [1629744.927896]  [<ffffffff8154dc63>] process_backlog+0xb3/0x160
>> [1629744.927898]  [<ffffffff8154d36c>] net_rx_action+0x1ec/0x330
>> [1629744.927900]  [<ffffffff810821e1>] ? sched_clock_cpu+0xa1/0xb0
>> [1629744.927902]  [<ffffffff81057337>] __do_softirq+0x147/0x310
>> [1629744.927907]  [<ffffffffa0003c80>] ? ip6_finish_output2+0x190/0x490 [ipv6]
>> [1629744.927909]  [<ffffffff8161618c>] do_softirq_own_stack+0x1c/0x30
>> [1629744.927910]  <EOI>  [<ffffffff810567bb>] do_softirq.part.17+0x3b/0x40
>> [1629744.927913]  [<ffffffff81056876>] __local_bh_enable_ip+0xb6/0xc0
>> [1629744.927918]  [<ffffffffa0003c91>] ip6_finish_output2+0x1a1/0x490 [ipv6]
>> [1629744.927920]  [<ffffffffa03b7374>] ? ipv6_confirm+0xc4/0x130 [nf_conntrack_ipv6]
>> [1629744.927925]  [<ffffffffa00052a6>] ip6_finish_output+0xa6/0x100 [ipv6]
>> [1629744.927930]  [<ffffffffa0005344>] ip6_output+0x44/0xe0 [ipv6]
>> [1629744.927935]  [<ffffffffa0005200>] ? ip6_fragment+0x9b0/0x9b0 [ipv6]
>> [1629744.927939]  [<ffffffffa0002e1f>] ip6_xmit+0x23f/0x4f0 [ipv6]
>> [1629744.927944]  [<ffffffffa0001b50>] ? ac6_proc_exit+0x20/0x20 [ipv6]
>> [1629744.927952]  [<ffffffffa0033ce5>] inet6_csk_xmit+0x85/0xd0 [ipv6]
>> [1629744.927955]  [<ffffffff815aa56d>] tcp_transmit_skb+0x53d/0x910
>> [1629744.927957]  [<ffffffff815aab13>] tcp_write_xmit+0x1d3/0xe90
>> [1629744.927959]  [<ffffffff815aba31>] __tcp_push_pending_frames+0x31/0xa0
>> [1629744.927961]  [<ffffffff8159a19f>] tcp_push+0xef/0x120
>> [1629744.927963]  [<ffffffff8159e219>] tcp_sendmsg+0x6c9/0xac0
>> [1629744.927965]  [<ffffffff815c84d3>] inet_sendmsg+0x73/0xb0
>> [1629744.927967]  [<ffffffff81531728>] sock_sendmsg+0x38/0x50
>> [1629744.927969]  [<ffffffff815317bb>] sock_write_iter+0x7b/0xd0
>> [1629744.927972]  [<ffffffff811988ba>] __vfs_write+0xaa/0xe0
>> [1629744.927974]  [<ffffffff81198f29>] vfs_write+0xa9/0x190
>> [1629744.927975]  [<ffffffff81198e63>] ? vfs_read+0x113/0x130
>> [1629744.927977]  [<ffffffff81199c16>] SyS_write+0x46/0xa0
>> [1629744.927979]  [<ffffffff8161465b>] entry_SYSCALL_64_fastpath+0x16/0x6e
>> [1629744.927988] ---[ end trace 08584e4165caf3df ]---
>>
>>
>> IPOIB_MAX_PATH_REC_QUEUE is set to 3. If I'm reading the code correctly
>> if there are more than 3 outstanding packets for a neighbour this would
>> cause the code to drop the packets. Is this correct? Also I tried bumping
> 
> yes.
> 
>> IPOIB_MAX_PATH_REC_QUEUE to 150 to see what will happen and this instead
> 
> it is a bad idea to move it to 150 ...
> 
>> moved the dropping to occur in ipoib_neigh_dtor:
>>
>> [1629558.306405]  [<ffffffffa04788ec>] ipoib_neigh_dtor+0x9c/0x130 [ib_ipoib]
>> [1629558.306407]  [<ffffffffa0478999>] ipoib_neigh_reclaim+0x19/0x20 [ib_ipoib]
>> [1629558.306411]  [<ffffffff810ad0fb>] rcu_process_callbacks+0x21b/0x620
>> [1629558.306413]  [<ffffffff81057337>] __do_softirq+0x147/0x310
>>
>> Since you've taken part in the development of the said code I'd like
>> to ask what's the purpose of the IPOIB_MAX_PATH_REC_QUEUE limit and why
>> do we drop packets if there are more than this many outstanding packets,
>> since having 50% packet drops is a very large amount of drops?
>>
>> Regards,
>> Nikolay
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

  parent reply	other threads:[~2016-08-01  8:20 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-07-28 11:00 [IPOIB] Excessive TX packet drops due to IPOIB_MAX_PATH_REC_QUEUE Nikolay Borisov
     [not found] ` <5799E5E6.3060104-6AxghH7DbtA@public.gmane.org>
2016-08-01  8:01   ` Erez Shitrit
     [not found]     ` <CAAk-MO83mJTq=E_MC=izqq8fEmVujY=5egVmKfFjxAz4jO3hHg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-08-01  8:20       ` Nikolay Borisov [this message]
     [not found]         ` <579F065C.602-6AxghH7DbtA@public.gmane.org>
2016-08-01  8:56           ` Erez Shitrit
     [not found]             ` <CAAk-MO9C7i0en5ZE=pufz6tMecUi23kL=5FR36JNfPzuO1G5-g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-08-01  9:46               ` Nikolay Borisov
2016-08-04 13:34             ` Slow veth performance over ipoib interface on 4.7.0 (and earlier) (Was Re: [IPOIB] Excessive TX packet drops due to IPOIB_MAX_PATH_REC_QUEUE) Nikolay Borisov
     [not found]               ` <57A34448.1040600-6AxghH7DbtA@public.gmane.org>
2016-08-04 14:08                 ` Doug Ledford

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=579F065C.602@kyup.com \
    --to=kernel-6axghh7dbta@public.gmane.org \
    --cc=erezsh-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org \
    --cc=linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
    --cc=ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org \
    --cc=roland-BHEL68pLQRGGvPXPguhicg@public.gmane.org \
    --cc=shlomop-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.