* [rtnetlink] Potential bug in Linux (rt)netlink code
From: Henning Rogge @ 2018-10-12 7:30 UTC
To: netdev
Hi,
I am working on a self-written routing agent
(https://github.com/OLSR/OONF) and am stuck on a problem with netlink
that I cannot explain with a userspace error.
I am using a netlink socket for setting routes
(RTM_NEWROUTE/RTM_DELROUTE), for querying the kernel for the current
routes in the database (via an RTM_GETROUTE dump), and for receiving
multicast messages about ongoing routing changes.
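For readers unfamiliar with the dump request mentioned above, here is a
minimal sketch of how such a message can be laid out. The struct and
function names are hypothetical and are not taken from OONF; only the
header fields and constants come from the kernel's netlink headers.

```c
#include <sys/socket.h>      /* AF_INET */
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical request layout: netlink header followed by a route header. */
struct route_dump_req {
    struct nlmsghdr nh;
    struct rtmsg rtm;
};

/* Fill in an RTM_GETROUTE dump request for one address family.
 * The caller chooses the sequence number and matches it against replies. */
static void build_route_dump(struct route_dump_req *req, uint32_t seq,
                             unsigned char family)
{
    memset(req, 0, sizeof(*req));
    req->nh.nlmsg_len = NLMSG_LENGTH(sizeof(struct rtmsg));
    req->nh.nlmsg_type = RTM_GETROUTE;
    req->nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
    req->nh.nlmsg_seq = seq;
    req->rtm.rtm_family = family;
}
```

The filled-in structure would then be passed to send() on a
NETLINK_ROUTE socket with a length of req.nh.nlmsg_len.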
After a few netlink messages I get to the point where the kernel just
does not respond to an RTM_NEWROUTE (no error, no answer, despite the
NLM_F_ACK flag being set). But sometimes, when the program sends another
route command during shutdown of the routing agent (most of the time an
RTM_DELROUTE), I get a single netlink packet with a "successful" response
for both the "missing" RTM_NEWROUTE and the new RTM_DELROUTE
sequence number.
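Since one receive buffer can carry several acknowledgements, as in the
case described above, the following sketch shows one way to walk such a
buffer and collect the acknowledged sequence numbers. An ACK is an
NLMSG_ERROR message whose error field is 0. The helper name
collect_acks is hypothetical and this is not the code used in OONF.

```c
#include <linux/netlink.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: scan one receive buffer and store the sequence
 * number of every ACK (NLMSG_ERROR with error == 0) into seqs.
 * Returns the number of ACKs stored (at most max_seqs). */
static size_t collect_acks(void *buf, size_t len, uint32_t *seqs,
                           size_t max_seqs)
{
    size_t found = 0;
    int remaining = (int)len;
    struct nlmsghdr *nh;

    for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, remaining);
         nh = NLMSG_NEXT(nh, remaining)) {
        if (nh->nlmsg_type != NLMSG_ERROR)
            continue;
        /* Skip messages too short to hold a struct nlmsgerr. */
        if (nh->nlmsg_len < NLMSG_LENGTH(sizeof(struct nlmsgerr)))
            continue;

        struct nlmsgerr *err = NLMSG_DATA(nh);
        if (err->error == 0 && found < max_seqs)
            seqs[found++] = nh->nlmsg_seq;
    }
    return found;
}
```

Logging the returned sequence numbers next to the ones that were sent
makes it easy to see which request an ACK belongs to and how late it
arrived.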
I am testing two routing agents, each of them in a systemd-nspawn based
container connected over a bridge on the host system on a current Debian
Testing (kernel 4.18.0-1-amd64).
I am directly using the netlink sockets, without any other userspace
library in between.
I have checked the hexdumps of a couple of netlink messages (including
the ones just before the bug happens) by hand and they seem to be okay.
When I tried to add a "netlink listener" socket for further debugging (ip
link add nlmon0 type nlmon) the problem vanished until I removed the
listener socket again.
Any ideas how to debug this problem? Unfortunately I have no short
example program to trigger the bug... I have seen the problem only
rarely over the years (once every couple of months), and until a few
days ago I never managed to reproduce it.
Henning Rogge
--
Diplom-Informatiker Henning Rogge, Fraunhofer-Institut für
Kommunikation, Informationsverarbeitung und Ergonomie FKIE
Kommunikationssysteme (KOM)
Zanderstrasse 5, 53177 Bonn, Germany
Telefon +49 228 50212-469
mailto:henning.rogge@fkie.fraunhofer.de http://www.fkie.fraunhofer.de
* Re: [rtnetlink] Potential bug in Linux (rt)netlink code
From: Stephen Hemminger @ 2018-10-12 18:51 UTC
To: Henning Rogge; +Cc: netdev
Are you reading the responses to your requests? If you don't read
the response, the socket will get flow blocked.
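To illustrate the point about unread responses, here is a minimal
sketch of draining everything that is queued on a socket without
blocking. The helper name drain_socket and the callback are
hypothetical. On a netlink socket, a return of -1 with errno set to
ENOBUFS indicates that the kernel had to drop messages because the
receive buffer was full.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: read every datagram currently queued on fd and
 * pass each one to handle (which may be NULL).
 * Returns the number of datagrams read, or -1 on a real error with
 * errno left set by recv(). */
static int drain_socket(int fd, void (*handle)(const void *buf, size_t len))
{
    uint32_t buf[2048];   /* 8 KiB, 4-byte aligned for netlink headers */
    int count = 0;

    for (;;) {
        ssize_t n = recv(fd, buf, sizeof(buf), MSG_DONTWAIT);
        if (n > 0) {
            if (handle)
                handle(buf, (size_t)n);
            count++;
            continue;
        }
        if (n == 0)
            return count;             /* nothing more to process */
        if (errno == EINTR)
            continue;                 /* interrupted, try again */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return count;             /* queue is empty */
        return -1;                    /* e.g. ENOBUFS on netlink */
    }
}
```

Calling such a function each time the socket is reported readable keeps
the receive queue short, so the kernel is not held up by a full buffer.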
* Re: [rtnetlink] Potential bug in Linux (rt)netlink code
From: Henning Rogge @ 2018-10-15 5:25 UTC
To: Stephen Hemminger; +Cc: netdev
On 12.10.2018 at 20:51, Stephen Hemminger wrote:
> Are you reading the responses to your requests? If you don't read
> the response, the socket will get flow blocked.
Yes, I do.
All netlink sockets the program uses are constantly watched for traffic
coming from the kernel (with an epoll()-based event loop, level-triggered
rather than edge-triggered).
I even rate-limit what I send to the kernel: I send only one "pagesize"
worth of netlink data, then wait for the reply before sending more (I
ran into the blocking problem a few years ago when experimenting with
LOTS of routes).
Henning Rogge
* Re: [rtnetlink] Potential bug in Linux (rt)netlink code
From: Henning Rogge @ 2018-10-22 5:22 UTC
To: netdev; +Cc: Stephen Hemminger
Does anyone else have an idea how to debug this problem?
Henning Rogge