netdev.vger.kernel.org archive mirror
* [rtnetlink] Potential bug in Linux (rt)netlink code
@ 2018-10-12  7:30 Henning Rogge
  2018-10-12 18:51 ` Stephen Hemminger
  0 siblings, 1 reply; 4+ messages in thread
From: Henning Rogge @ 2018-10-12  7:30 UTC
  To: netdev

Hi,

I am working on a self-written routing agent 
(https://github.com/OLSR/OONF) and am stuck on a problem with netlink 
that I cannot explain as a userspace error.

I am using a netlink socket for setting routes 
(RTM_NEWROUTE/RTM_DELROUTE), for querying the kernel for the routes 
currently in the database (via an RTM_GETROUTE dump), and for receiving 
multicast messages about ongoing routing changes.

After a few netlink messages I get to the point where the kernel just 
does not respond to an RTM_NEWROUTE: no error and no answer, despite the 
NLM_F_ACK flag being set. Some time later (usually during shutdown of the 
routing agent), when the program sends another route command (most often 
an RTM_DELROUTE), I get a single netlink packet containing a "successful" 
response for the "missing" RTM_NEWROUTE and one for the new RTM_DELROUTE 
sequence number.
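
To make the symptom concrete, here is a minimal sketch (not OONF's code) of how one received datagram can be split into its netlink messages and how ACKs can be matched to sequence numbers. It assumes the standard `struct nlmsghdr` layout and the `NLMSG_ERROR` value from `<linux/netlink.h>`; the helper names `collect_acks`, `unanswered` and `make_ack` are made up for illustration.

```python
import struct

# struct nlmsghdr: nlmsg_len, nlmsg_type, nlmsg_flags, nlmsg_seq, nlmsg_pid
NLMSG_HDR = struct.Struct("=IHHII")
NLMSG_ERROR = 2        # value from <linux/netlink.h>
NLMSG_ALIGNTO = 4      # messages are padded to 4-byte boundaries


def nlmsg_align(length):
    """Round a message length up to the netlink alignment."""
    return (length + NLMSG_ALIGNTO - 1) & ~(NLMSG_ALIGNTO - 1)


def collect_acks(datagram):
    """Return {seq: errno} for every NLMSG_ERROR message; errno 0 is an ACK."""
    results = {}
    offset = 0
    while offset + NLMSG_HDR.size <= len(datagram):
        length, msg_type, _flags, seq, _pid = NLMSG_HDR.unpack_from(datagram, offset)
        if length < NLMSG_HDR.size or offset + length > len(datagram):
            break  # truncated or corrupt message, stop parsing
        if msg_type == NLMSG_ERROR and length >= NLMSG_HDR.size + 4:
            # The payload starts with a signed int holding a negative errno.
            (error,) = struct.unpack_from("=i", datagram, offset + NLMSG_HDR.size)
            results[seq] = -error
        offset += nlmsg_align(length)
    return results


def unanswered(pending, datagram):
    """Sequence numbers from `pending` that this datagram did not answer."""
    return set(pending) - set(collect_acks(datagram))


def make_ack(seq, error=0, orig_type=24):
    """Hypothetical helper: build an NLMSG_ERROR reply (24 is RTM_NEWROUTE)."""
    orig = NLMSG_HDR.pack(NLMSG_HDR.size, orig_type, 0, seq, 0)
    payload = struct.pack("=i", error) + orig
    return NLMSG_HDR.pack(NLMSG_HDR.size + len(payload), NLMSG_ERROR, 0, seq, 0) + payload
```

Logging the result of `unanswered()` for every received datagram would show exactly which request is still waiting and which later datagram finally carries its ACK.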

I am testing two routing agents, each of them in a systemd-nspawn based 
container connected over a bridge on the host system on a current Debian 
Testing (kernel 4.18.0-1-amd64).

I am directly using the netlink sockets, without any other userspace 
library in between.

I have checked the hexdumps of a couple of netlink messages (including 
the ones just before the bug happens) by hand and they seem to be okay.

When I tried to add a "netlink listener" socket for further debugging (ip 
link add nlmon0 type nlmon), the problem vanished until I removed the 
listener socket again.

Any ideas how to debug this problem? Unfortunately I have no short 
example program to trigger the bug. I have seen the problem only rarely 
over the years (once every couple of months), and until a few days ago I 
never managed to reproduce it.

Henning Rogge
-- 
Diplom-Informatiker Henning Rogge , Fraunhofer-Institut für
Kommunikation, Informationsverarbeitung und Ergonomie FKIE
Kommunikationssysteme (KOM)
Zanderstrasse 5, 53177 Bonn, Germany
Telefon +49 228 50212-469
mailto:henning.rogge@fkie.fraunhofer.de http://www.fkie.fraunhofer.de

* Re: [rtnetlink] Potential bug in Linux (rt)netlink code
  2018-10-12  7:30 [rtnetlink] Potential bug in Linux (rt)netlink code Henning Rogge
@ 2018-10-12 18:51 ` Stephen Hemminger
  2018-10-15  5:25   ` Henning Rogge
  0 siblings, 1 reply; 4+ messages in thread
From: Stephen Hemminger @ 2018-10-12 18:51 UTC
  To: Henning Rogge; +Cc: netdev

Are you reading the responses to your requests?  If you don't read
the response, the socket will get flow blocked.

* Re: [rtnetlink] Potential bug in Linux (rt)netlink code
  2018-10-12 18:51 ` Stephen Hemminger
@ 2018-10-15  5:25   ` Henning Rogge
  2018-10-22  5:22     ` Henning Rogge
  0 siblings, 1 reply; 4+ messages in thread
From: Henning Rogge @ 2018-10-15  5:25 UTC
  To: Stephen Hemminger; +Cc: netdev

On 12.10.2018 at 20:51, Stephen Hemminger wrote:
> Are you reading the responses to your requests?  If you don't read
> the response, the socket will get flow blocked.

Yes, I do...

All netlink sockets the program uses are constantly watched for traffic 
coming from the kernel (with an epoll()-based event loop, level-triggered 
rather than edge-triggered).
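
As a rough illustration of that kind of loop (a sketch only, not the actual OONF event loop), the following registers a non-blocking socket with a level-triggered epoll instance and reads every queued datagram once the socket becomes readable. The names `drain` and `poll_once` are made up for this example. The ENOBUFS branch reflects how netlink reports a receive-buffer overrun, which is worth logging when chasing lost replies.

```python
import errno
import select
import socket


def drain(sock, bufsize=65536):
    """Read all queued datagrams; return (datagrams, overrun_seen)."""
    datagrams = []
    overrun = False
    while True:
        try:
            data = sock.recv(bufsize)
        except BlockingIOError:
            break  # EAGAIN/EWOULDBLOCK: the receive queue is empty
        except OSError as exc:
            if exc.errno == errno.ENOBUFS:
                # Netlink signals dropped messages this way; keep reading.
                overrun = True
                continue
            raise
        datagrams.append(data)
    return datagrams, overrun


def poll_once(ep, socks, timeout=1.0):
    """Wait once on a level-triggered epoll object and drain readable sockets.

    `socks` maps file descriptors to non-blocking socket objects.
    """
    for fd, events in ep.poll(timeout):
        if events & select.EPOLLIN:
            yield fd, drain(socks[fd])
```

With level-triggered epoll a single recv() per wakeup would also work, since the descriptor stays readable while data is queued; draining simply saves wakeups and makes an overrun visible in one place.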

I even rate-limit what I send to the kernel: I send at most one 
"pagesize" worth of netlink data, then wait for the reply before sending 
more (I had the blocking problem a few years ago when experimenting with 
LOTS of routes).
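
The rate limit described above could look roughly like the sketch below. It is an assumption about the general shape, not the OONF implementation: the class name `SendWindow`, its methods, and the `overdue()` check for long-outstanding requests are all hypothetical.

```python
import collections
import mmap


class SendWindow:
    """Release at most `limit` bytes of requests, then wait for all ACKs."""

    def __init__(self, limit=mmap.PAGESIZE):
        self.limit = limit
        self.queue = collections.deque()  # (seq, message) not yet sent
        self.in_flight = {}               # seq -> time the request was sent

    def push(self, seq, message):
        self.queue.append((seq, message))

    def next_batch(self, now):
        """Return the messages to send now; empty while replies are outstanding."""
        if self.in_flight:
            return []
        batch, used = [], 0
        while self.queue:
            seq, message = self.queue[0]
            # Always allow one message, even if it alone exceeds the limit.
            if batch and used + len(message) > self.limit:
                break
            self.queue.popleft()
            batch.append(message)
            self.in_flight[seq] = now
            used += len(message)
        return batch

    def acked(self, seq):
        """Record that the kernel answered the request with this sequence number."""
        self.in_flight.pop(seq, None)

    def overdue(self, now, timeout):
        """Sequence numbers that have waited longer than `timeout` for a reply."""
        return sorted(s for s, sent in self.in_flight.items() if now - sent > timeout)
```

Calling `overdue()` from a periodic timer would flag the stuck RTM_NEWROUTE while the agent is still running, instead of only noticing it at shutdown.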

Henning Rogge

* Re: [rtnetlink] Potential bug in Linux (rt)netlink code
  2018-10-15  5:25   ` Henning Rogge
@ 2018-10-22  5:22     ` Henning Rogge
  0 siblings, 0 replies; 4+ messages in thread
From: Henning Rogge @ 2018-10-22  5:22 UTC
  To: netdev; +Cc: Stephen Hemminger

Does anyone else have an idea how to debug this problem?

Henning Rogge

