From: Stephen Hemminger
To: Henning Rogge
Subject: Re: [rtnetlink] Potential bug in Linux (rt)netlink code
Date: Fri, 12 Oct 2018 11:51:59 -0700
Message-ID: <20181012115159.7ead2f97@xeon-e3>
In-Reply-To: <4d7a11b7-1f43-5669-6f19-3c746cc88306@fkie.fraunhofer.de>
References: <4d7a11b7-1f43-5669-6f19-3c746cc88306@fkie.fraunhofer.de>
Sender: netdev-owner@vger.kernel.org

On Fri, 12 Oct 2018 09:30:40 +0200
Henning Rogge wrote:

> Hi,
>
> I am working on a self-written routing agent
> (https://github.com/OLSR/OONF) and am stuck on a problem with netlink
> that I cannot explain with a userspace error.
>
> I am using a netlink socket for setting routes
> (RTM_NEWROUTE/RTM_DELROUTE), querying the kernel for the current routes
> in the database (via an RTM_GETROUTE dump) and for getting multicast
> messages for ongoing routing changes.
>
> After a few netlink messages I get to the point where the kernel just
> does not respond to an RTM_NEWROUTE (no error, no answer, despite the
> NLM_F_ACK flag being set)... but sometimes, when the program sends
> another route command during shutdown of the routing agent (most times
> an RTM_DELROUTE), I get a single netlink packet with a "successful"
> response for both the "missing" RTM_NEWROUTE sequence number and the
> new RTM_DELROUTE sequence number.
>
> I am testing two routing agents, each of them in a systemd-nspawn based
> container connected over a bridge on the host system on a current Debian
> Testing (kernel 4.18.0-1-amd64).
>
> I am directly using the netlink sockets, without any other userspace
> library in between.
>
> I have checked the hexdumps of a couple of netlink messages (including
> the ones just before the bug happens) by hand and they seem to be okay.
>
> When I tried to add a "netlink listener" socket for further debugging
> (ip link add nlmon0 type nlmon) the problem vanished until I removed the
> listener socket again.
>
> Any ideas how to debug this problem? Unfortunately I have no short
> example program to trigger the bug... I have seen the problem only
> rarely over the years (once every couple of months), but until a few
> days ago I never managed to reproduce it.
>
> Henning Rogge

Are you reading the responses to your requests?
If you don't read the response, the socket will get flow blocked.
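
To illustrate what "reading the responses" could look like, here is a minimal sketch, not OONF's code and not a definitive implementation. It assumes a non-blocking socket and a set of sequence numbers for which NLM_F_ACK was requested. The names parse_replies, drain and pending are hypothetical. The header layout and the NLMSG_ERROR value follow linux/netlink.h, where an ACK is an NLMSG_ERROR message with error == 0.

```python
import socket
import struct

# struct nlmsghdr: u32 len, u16 type, u16 flags, u32 seq, u32 pid
NLMSG_HDR = struct.Struct("=IHHII")
NLMSG_ERROR = 2  # linux/netlink.h; also used for ACKs (error == 0)


def nlmsg_align(n):
    """Round up to the 4-byte netlink message alignment."""
    return (n + 3) & ~3


def parse_replies(buf):
    """Yield (seq, type, error) for each netlink message in one datagram.

    error is the signed value from struct nlmsgerr for NLMSG_ERROR
    messages (0 means ACK, negative means -errno), otherwise None.
    """
    off = 0
    while off + NLMSG_HDR.size <= len(buf):
        length, msg_type, _flags, seq, _pid = NLMSG_HDR.unpack_from(buf, off)
        if length < NLMSG_HDR.size or off + length > len(buf):
            break  # truncated or malformed message, stop parsing
        error = None
        if msg_type == NLMSG_ERROR and length >= NLMSG_HDR.size + 4:
            (error,) = struct.unpack_from("=i", buf, off + NLMSG_HDR.size)
        yield seq, msg_type, error
        off += nlmsg_align(length)


def drain(sock, pending):
    """Read every datagram currently queued on a non-blocking socket.

    Sequence numbers that were answered are removed from the set
    `pending`; the return value maps each answered seq to its error code.
    """
    done = {}
    while True:
        try:
            data = sock.recv(65536)
        except BlockingIOError:
            return done  # receive queue is empty for now
        if not data:
            return done
        for seq, msg_type, error in parse_replies(data):
            if msg_type == NLMSG_ERROR and error is not None:
                pending.discard(seq)
                done[seq] = error
```

The idea would be to call drain() whenever the socket becomes readable, and before sending further requests, so that ACKs and multicast notifications do not pile up in the receive queue.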