* Segmentation fault when taking device for a walk
@ 2024-08-16 0:24 Richard Acayan
2024-08-16 11:53 ` James Prestwood
0 siblings, 1 reply; 8+ messages in thread
From: Richard Acayan @ 2024-08-16 0:24 UTC (permalink / raw)
To: iwd
Hi,
A segmentation fault occurs in station_start_roam() when the station is
disconnected from an access point, or in other words, when the station's
connected_bss is NULL. Usually, this is triggered by a timeout, possibly
scheduled in response to a weak signal event.
This is occurring on my Pixel 3a running postmarketOS/Alpine Linux, when
receding from an access point, on iwd 2.19. I have collected 6 coredumps
of the crash in the span of around 2 weeks and would be willing to use
GDB if more information is necessary for a patch.
Sample:
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000aaaadf2086a0 in station_start_roam (station=0xffff8776ae50) at src/station.c:2880
warning: 2880 src/station.c: No such file or directory
(gdb) bt
#0 0x0000aaaadf2086a0 in station_start_roam (station=0xffff8776ae50) at src/station.c:2880
#1 0x0000aaaadf28c544 in timeout_callback (fd=<optimized out>, events=<optimized out>,
user_data=0xffff876b2e20) at ell/timeout.c:68
#2 timeout_callback (fd=<optimized out>, events=<optimized out>, user_data=0xffff876b2e20)
at ell/timeout.c:57
#3 0x0000aaaadf28b9d0 in l_main_iterate (timeout=<optimized out>) at ell/main.c:461
#4 0x0000aaaadf28bac0 in l_main_run () at ell/main.c:508
#5 l_main_run () at ell/main.c:490
#6 0x0000aaaadf28bce4 in l_main_run_with_signal (
callback=callback@entry=0xaaaadf1f1110 <signal_handler>, user_data=user_data@entry=0x0)
at ell/main.c:630
#7 0x0000aaaadf1f0b0c in main (argc=<optimized out>, argv=<optimized out>) at src/main.c:611
(gdb) p station->connected_bss
$1 = (struct scan_bss *) 0x0
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Segmentation fault when taking device for a walk
2024-08-16 0:24 Segmentation fault when taking device for a walk Richard Acayan
@ 2024-08-16 11:53 ` James Prestwood
2024-08-19 21:59 ` Richard Acayan
0 siblings, 1 reply; 8+ messages in thread
From: James Prestwood @ 2024-08-16 11:53 UTC (permalink / raw)
To: Richard Acayan, iwd
Hi Richard,
On 8/15/24 5:24 PM, Richard Acayan wrote:
> Hi,
>
> A segmentation fault occurs in station_start_roam() when the station is
> disconnected from an access point, or in other words, when the station's
> connected_bss is NULL. Usually, this is triggered by a timeout, possibly
> scheduled in response to a weak signal event.
>
> This is occurring on my Pixel 3a running postmarketOS/Alpine Linux, when
> receding from an access point, on iwd 2.19. I have collected 6 coredumps
> of the crash in the span of around 2 weeks and would be willing to use
> GDB if more information is necessary for a patch.
>
> Sample:
>
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0 0x0000aaaadf2086a0 in station_start_roam (station=0xffff8776ae50) at src/station.c:2880
>
> warning: 2880 src/station.c: No such file or directory
> (gdb) bt
> #0 0x0000aaaadf2086a0 in station_start_roam (station=0xffff8776ae50) at src/station.c:2880
> #1 0x0000aaaadf28c544 in timeout_callback (fd=<optimized out>, events=<optimized out>,
> user_data=0xffff876b2e20) at ell/timeout.c:68
> #2 timeout_callback (fd=<optimized out>, events=<optimized out>, user_data=0xffff876b2e20)
> at ell/timeout.c:57
> #3 0x0000aaaadf28b9d0 in l_main_iterate (timeout=<optimized out>) at ell/main.c:461
> #4 0x0000aaaadf28bac0 in l_main_run () at ell/main.c:508
> #5 l_main_run () at ell/main.c:490
> #6 0x0000aaaadf28bce4 in l_main_run_with_signal (
> callback=callback@entry=0xaaaadf1f1110 <signal_handler>, user_data=user_data@entry=0x0)
> at ell/main.c:630
> #7 0x0000aaaadf1f0b0c in main (argc=<optimized out>, argv=<optimized out>) at src/main.c:611
> (gdb) p station->connected_bss
> $1 = (struct scan_bss *) 0x0
>
Its hard to say without any debug logs as well but it appears the
disconnect never cleared out the timer used for the next roam attempt. I
did fix a hang due to a disconnect coming in during a roam attempt after
2.19, but I can't really make heads or tails without debug logs to see
what happened before/after the disconnect.
Thanks,
James
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Segmentation fault when taking device for a walk
2024-08-16 11:53 ` James Prestwood
@ 2024-08-19 21:59 ` Richard Acayan
2024-08-20 15:04 ` James Prestwood
0 siblings, 1 reply; 8+ messages in thread
From: Richard Acayan @ 2024-08-19 21:59 UTC (permalink / raw)
To: James Prestwood; +Cc: iwd
On Fri, Aug 16, 2024 at 04:53:41AM -0700, James Prestwood wrote:
> Hi Richard,
>
> On 8/15/24 5:24 PM, Richard Acayan wrote:
>> Hi,
>>
>> A segmentation fault occurs in station_start_roam() when the station is
>> disconnected from an access point, or in other words, when the station's
>> connected_bss is NULL. Usually, this is triggered by a timeout, possibly
>> scheduled in response to a weak signal event.
>>
>> This is occurring on my Pixel 3a running postmarketOS/Alpine Linux, when
>> receding from an access point, on iwd 2.19. I have collected 6 coredumps
>> of the crash in the span of around 2 weeks and would be willing to use
>> GDB if more information is necessary for a patch.
>>
>> Sample:
>>
>> Program terminated with signal SIGSEGV, Segmentation fault.
>> #0 0x0000aaaadf2086a0 in station_start_roam (station=0xffff8776ae50) at src/station.c:2880
>>
>> warning: 2880 src/station.c: No such file or directory
>> (gdb) bt
>> #0 0x0000aaaadf2086a0 in station_start_roam (station=0xffff8776ae50) at src/station.c:2880
>> #1 0x0000aaaadf28c544 in timeout_callback (fd=<optimized out>, events=<optimized out>,
>> user_data=0xffff876b2e20) at ell/timeout.c:68
>> #2 timeout_callback (fd=<optimized out>, events=<optimized out>, user_data=0xffff876b2e20)
>> at ell/timeout.c:57
>> #3 0x0000aaaadf28b9d0 in l_main_iterate (timeout=<optimized out>) at ell/main.c:461
>> #4 0x0000aaaadf28bac0 in l_main_run () at ell/main.c:508
>> #5 l_main_run () at ell/main.c:490
>> #6 0x0000aaaadf28bce4 in l_main_run_with_signal (
>> callback=callback@entry=0xaaaadf1f1110 <signal_handler>, user_data=user_data@entry=0x0)
>> at ell/main.c:630
>> #7 0x0000aaaadf1f0b0c in main (argc=<optimized out>, argv=<optimized out>) at src/main.c:611
>> (gdb) p station->connected_bss
>> $1 = (struct scan_bss *) 0x0
>>
> Its hard to say without any debug logs as well but it appears the disconnect
> never cleared out the timer used for the next roam attempt. I did fix a hang
> due to a disconnect coming in during a roam attempt after 2.19, but I can't
> really make heads or tails without debug logs to see what happened
> before/after the disconnect.
It happened again with debug logs enabled. Relevant snippet (from
logread):
[Aug 17 21:22:12] daemon iwd: src/station.c:station_roam_state_clear() 5
[Aug 17 21:22:12] daemon iwd: event: state, old: connected, new: disconnecting
[Aug 17 21:22:15] daemon iwd: src/netdev.c:netdev_mlme_notify() MLME notification Del Station(20)
[Aug 17 21:22:15] daemon iwd: src/netdev.c:netdev_link_notify() event 16 on ifindex 5
[Aug 17 21:22:15] daemon iwd: src/netdev.c:netdev_mlme_notify() MLME notification Deauthenticate(39)
[Aug 17 21:22:15] daemon iwd: src/netdev.c:netdev_deauthenticate_event()
[Aug 17 21:22:15] daemon iwd: src/netdev.c:netdev_mlme_notify() MLME notification Disconnect(48)
[Aug 17 21:22:15] daemon iwd: src/netdev.c:netdev_disconnect_event()
[Aug 17 21:22:15] daemon iwd: src/station.c:station_disconnect_cb() 5, success: 1
[Aug 17 21:22:15] daemon iwd: event: state, old: disconnecting, new: disconnected
[Aug 17 21:22:15] daemon iwd: src/wiphy.c:wiphy_reg_notify() Notification of command Reg Change(36)
[Aug 17 21:22:15] daemon iwd: src/wiphy.c:wiphy_update_reg_domain() New reg domain country code for (global) is XX
[Aug 17 21:22:36] daemon iwd: src/station.c:station_roam_trigger_cb() 5
Afterwards is the segmentation fault.
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Segmentation fault when taking device for a walk
2024-08-19 21:59 ` Richard Acayan
@ 2024-08-20 15:04 ` James Prestwood
2024-08-20 16:00 ` Richard Acayan
0 siblings, 1 reply; 8+ messages in thread
From: James Prestwood @ 2024-08-20 15:04 UTC (permalink / raw)
To: Richard Acayan; +Cc: iwd
Hi Richard,
On 8/19/24 2:59 PM, Richard Acayan wrote:
> On Fri, Aug 16, 2024 at 04:53:41AM -0700, James Prestwood wrote:
>> Hi Richard,
>>
>> On 8/15/24 5:24 PM, Richard Acayan wrote:
>>> Hi,
>>>
>>> A segmentation fault occurs in station_start_roam() when the station is
>>> disconnected from an access point, or in other words, when the station's
>>> connected_bss is NULL. Usually, this is triggered by a timeout, possibly
>>> scheduled in response to a weak signal event.
>>>
>>> This is occurring on my Pixel 3a running postmarketOS/Alpine Linux, when
>>> receding from an access point, on iwd 2.19. I have collected 6 coredumps
>>> of the crash in the span of around 2 weeks and would be willing to use
>>> GDB if more information is necessary for a patch.
>>>
>>> Sample:
>>>
>>> Program terminated with signal SIGSEGV, Segmentation fault.
>>> #0 0x0000aaaadf2086a0 in station_start_roam (station=0xffff8776ae50) at src/station.c:2880
>>>
>>> warning: 2880 src/station.c: No such file or directory
>>> (gdb) bt
>>> #0 0x0000aaaadf2086a0 in station_start_roam (station=0xffff8776ae50) at src/station.c:2880
>>> #1 0x0000aaaadf28c544 in timeout_callback (fd=<optimized out>, events=<optimized out>,
>>> user_data=0xffff876b2e20) at ell/timeout.c:68
>>> #2 timeout_callback (fd=<optimized out>, events=<optimized out>, user_data=0xffff876b2e20)
>>> at ell/timeout.c:57
>>> #3 0x0000aaaadf28b9d0 in l_main_iterate (timeout=<optimized out>) at ell/main.c:461
>>> #4 0x0000aaaadf28bac0 in l_main_run () at ell/main.c:508
>>> #5 l_main_run () at ell/main.c:490
>>> #6 0x0000aaaadf28bce4 in l_main_run_with_signal (
>>> callback=callback@entry=0xaaaadf1f1110 <signal_handler>, user_data=user_data@entry=0x0)
>>> at ell/main.c:630
>>> #7 0x0000aaaadf1f0b0c in main (argc=<optimized out>, argv=<optimized out>) at src/main.c:611
>>> (gdb) p station->connected_bss
>>> $1 = (struct scan_bss *) 0x0
>>>
>> Its hard to say without any debug logs as well but it appears the disconnect
>> never cleared out the timer used for the next roam attempt. I did fix a hang
>> due to a disconnect coming in during a roam attempt after 2.19, but I can't
>> really make heads or tails without debug logs to see what happened
>> before/after the disconnect.
> It happened again with debug logs enabled. Relevant snippet (from
> logread):
>
> [Aug 17 21:22:12] daemon iwd: src/station.c:station_roam_state_clear() 5
> [Aug 17 21:22:12] daemon iwd: event: state, old: connected, new: disconnecting
> [Aug 17 21:22:15] daemon iwd: src/netdev.c:netdev_mlme_notify() MLME notification Del Station(20)
> [Aug 17 21:22:15] daemon iwd: src/netdev.c:netdev_link_notify() event 16 on ifindex 5
> [Aug 17 21:22:15] daemon iwd: src/netdev.c:netdev_mlme_notify() MLME notification Deauthenticate(39)
> [Aug 17 21:22:15] daemon iwd: src/netdev.c:netdev_deauthenticate_event()
> [Aug 17 21:22:15] daemon iwd: src/netdev.c:netdev_mlme_notify() MLME notification Disconnect(48)
> [Aug 17 21:22:15] daemon iwd: src/netdev.c:netdev_disconnect_event()
> [Aug 17 21:22:15] daemon iwd: src/station.c:station_disconnect_cb() 5, success: 1
> [Aug 17 21:22:15] daemon iwd: event: state, old: disconnecting, new: disconnected
> [Aug 17 21:22:15] daemon iwd: src/wiphy.c:wiphy_reg_notify() Notification of command Reg Change(36)
> [Aug 17 21:22:15] daemon iwd: src/wiphy.c:wiphy_update_reg_domain() New reg domain country code for (global) is XX
> [Aug 17 21:22:36] daemon iwd: src/station.c:station_roam_trigger_cb() 5
>
> Afterwards is the segmentation fault.
Do you happen to have the logs a few minutes prior. The roam timeout is
defaulted to 60 seconds, so at some point it was re-armed but the logs
don't go back that far. Its trivial to handle the segfault but I suspect
the roam timeout being rearmed is also leaking memory so we should
address that as the root cause.
Thanks,
James
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Segmentation fault when taking device for a walk
2024-08-20 15:04 ` James Prestwood
@ 2024-08-20 16:00 ` Richard Acayan
2024-08-21 14:27 ` James Prestwood
0 siblings, 1 reply; 8+ messages in thread
From: Richard Acayan @ 2024-08-20 16:00 UTC (permalink / raw)
To: James Prestwood; +Cc: iwd
On Tue, Aug 20, 2024 at 08:04:20AM -0700, James Prestwood wrote:
> Hi Richard,
>
> On 8/19/24 2:59 PM, Richard Acayan wrote:
> > On Fri, Aug 16, 2024 at 04:53:41AM -0700, James Prestwood wrote:
> > > Hi Richard,
> > >
> > > On 8/15/24 5:24 PM, Richard Acayan wrote:
> > > > Hi,
> > > >
> > > > A segmentation fault occurs in station_start_roam() when the station is
> > > > disconnected from an access point, or in other words, when the station's
> > > > connected_bss is NULL. Usually, this is triggered by a timeout, possibly
> > > > scheduled in response to a weak signal event.
> > > >
> > > > This is occurring on my Pixel 3a running postmarketOS/Alpine Linux, when
> > > > receding from an access point, on iwd 2.19. I have collected 6 coredumps
> > > > of the crash in the span of around 2 weeks and would be willing to use
> > > > GDB if more information is necessary for a patch.
> > > >
> > > > Sample:
> > > >
> > > > Program terminated with signal SIGSEGV, Segmentation fault.
> > > > #0 0x0000aaaadf2086a0 in station_start_roam (station=0xffff8776ae50) at src/station.c:2880
> > > >
> > > > warning: 2880 src/station.c: No such file or directory
> > > > (gdb) bt
> > > > #0 0x0000aaaadf2086a0 in station_start_roam (station=0xffff8776ae50) at src/station.c:2880
> > > > #1 0x0000aaaadf28c544 in timeout_callback (fd=<optimized out>, events=<optimized out>,
> > > > user_data=0xffff876b2e20) at ell/timeout.c:68
> > > > #2 timeout_callback (fd=<optimized out>, events=<optimized out>, user_data=0xffff876b2e20)
> > > > at ell/timeout.c:57
> > > > #3 0x0000aaaadf28b9d0 in l_main_iterate (timeout=<optimized out>) at ell/main.c:461
> > > > #4 0x0000aaaadf28bac0 in l_main_run () at ell/main.c:508
> > > > #5 l_main_run () at ell/main.c:490
> > > > #6 0x0000aaaadf28bce4 in l_main_run_with_signal (
> > > > callback=callback@entry=0xaaaadf1f1110 <signal_handler>, user_data=user_data@entry=0x0)
> > > > at ell/main.c:630
> > > > #7 0x0000aaaadf1f0b0c in main (argc=<optimized out>, argv=<optimized out>) at src/main.c:611
> > > > (gdb) p station->connected_bss
> > > > $1 = (struct scan_bss *) 0x0
> > > >
> > > Its hard to say without any debug logs as well but it appears the disconnect
> > > never cleared out the timer used for the next roam attempt. I did fix a hang
> > > due to a disconnect coming in during a roam attempt after 2.19, but I can't
> > > really make heads or tails without debug logs to see what happened
> > > before/after the disconnect.
> > It happened again with debug logs enabled. Relevant snippet (from
> > logread):
> >
> > [Aug 17 21:22:12] daemon iwd: src/station.c:station_roam_state_clear() 5
> > [Aug 17 21:22:12] daemon iwd: event: state, old: connected, new: disconnecting
> > [Aug 17 21:22:15] daemon iwd: src/netdev.c:netdev_mlme_notify() MLME notification Del Station(20)
> > [Aug 17 21:22:15] daemon iwd: src/netdev.c:netdev_link_notify() event 16 on ifindex 5
> > [Aug 17 21:22:15] daemon iwd: src/netdev.c:netdev_mlme_notify() MLME notification Deauthenticate(39)
> > [Aug 17 21:22:15] daemon iwd: src/netdev.c:netdev_deauthenticate_event()
> > [Aug 17 21:22:15] daemon iwd: src/netdev.c:netdev_mlme_notify() MLME notification Disconnect(48)
> > [Aug 17 21:22:15] daemon iwd: src/netdev.c:netdev_disconnect_event()
> > [Aug 17 21:22:15] daemon iwd: src/station.c:station_disconnect_cb() 5, success: 1
> > [Aug 17 21:22:15] daemon iwd: event: state, old: disconnecting, new: disconnected
> > [Aug 17 21:22:15] daemon iwd: src/wiphy.c:wiphy_reg_notify() Notification of command Reg Change(36)
> > [Aug 17 21:22:15] daemon iwd: src/wiphy.c:wiphy_update_reg_domain() New reg domain country code for (global) is XX
> > [Aug 17 21:22:36] daemon iwd: src/station.c:station_roam_trigger_cb() 5
> >
> > Afterwards is the segmentation fault.
>
> Do you happen to have the logs a few minutes prior. The roam timeout is
> defaulted to 60 seconds, so at some point it was re-armed but the logs don't
> go back that far. Its trivial to handle the segfault but I suspect the roam
> timeout being rearmed is also leaking memory so we should address that as
> the root cause.
Unfortunately not, only for the last half-minute (with personal
information). Here's a bit more that I don't need to redact:
[Aug 17 21:22:07] daemon iwd: src/station.c:station_roam_failed() 5
[Aug 17 21:22:07] daemon iwd: src/wiphy.c:wiphy_radio_work_done() Work item 279 done
[Aug 17 21:22:12] daemon iwd: src/station.c:station_dbus_disconnect()
[Aug 17 21:22:12] daemon iwd: src/station.c:station_reset_connection_state() 5
[Aug 17 21:22:12] daemon iwd: src/station.c:station_roam_state_clear() 5
[Aug 17 21:22:12] daemon iwd: event: state, old: connected, new: disconnecting
[Aug 17 21:22:15] daemon iwd: src/netdev.c:netdev_mlme_notify() MLME notification Del Station(20)
[Aug 17 21:22:15] daemon iwd: src/netdev.c:netdev_link_notify() event 16 on ifindex 5
[Aug 17 21:22:15] daemon iwd: src/netdev.c:netdev_mlme_notify() MLME notification Deauthenticate(39)
[Aug 17 21:22:15] daemon iwd: src/netdev.c:netdev_deauthenticate_event()
[Aug 17 21:22:15] daemon iwd: src/netdev.c:netdev_mlme_notify() MLME notification Disconnect(48)
[Aug 17 21:22:15] daemon iwd: src/netdev.c:netdev_disconnect_event()
[Aug 17 21:22:15] daemon iwd: src/station.c:station_disconnect_cb() 5, success: 1
[Aug 17 21:22:15] daemon iwd: event: state, old: disconnecting, new: disconnected
[Aug 17 21:22:15] daemon iwd: src/wiphy.c:wiphy_reg_notify() Notification of command Reg Change(36)
[Aug 17 21:22:15] daemon iwd: src/wiphy.c:wiphy_update_reg_domain() New reg domain country code for (global) is XX
[Aug 17 21:22:36] daemon iwd: src/station.c:station_roam_trigger_cb() 5
I can confirm, though, that there are more timeouts than expected, even
after (presumably) clearing the last timeout:
Core was generated by `/usr/libexec/iwd -d'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000aaaadf1686a0 in station_start_roam (station=0xffffb0527b30) at src/station.c:2880
warning: 2880 src/station.c: No such file or directory
(gdb) p *watch_list[12]
$1 = {fd = 12, events = 1073741825, flags = 1, callback = 0xaaaadf1ec4f0 <timeout_callback>,
destroy = 0xaaaadf1ec350 <timeout_destroy>, user_data = 0xffffb0530dd0}
(gdb) p *(struct l_timeout *) watch_list[12]->user_data
$2 = {fd = 12, callback = 0xaaaadf1687a0 <station_roam_trigger_cb>, destroy = 0x0,
user_data = 0xffffb0527b30}
(gdb) p *watch_list[13]
$3 = {fd = 13, events = 1073741825, flags = 0, callback = 0xaaaadf1ec4f0 <timeout_callback>,
destroy = 0xaaaadf1ec350 <timeout_destroy>, user_data = 0xffffb0530ef0}
(gdb) p *(struct l_timeout *) watch_list[13]->user_data
$4 = {fd = 13, callback = 0xaaaadf1687a0 <station_roam_trigger_cb>, destroy = 0x0,
user_data = 0xffffb0527b30}
(gdb) p *watch_list[14]
$5 = {fd = 14, events = 1073741825, flags = 0, callback = 0xaaaadf1ec4f0 <timeout_callback>,
destroy = 0xaaaadf1ec350 <timeout_destroy>, user_data = 0xffffb0530c50}
(gdb) p *(struct l_timeout *) watch_list[14]->user_data
$6 = {fd = 14, callback = 0xaaaadf1687a0 <station_roam_trigger_cb>, destroy = 0x0,
user_data = 0xffffb0527b30}
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Segmentation fault when taking device for a walk
2024-08-20 16:00 ` Richard Acayan
@ 2024-08-21 14:27 ` James Prestwood
2024-08-27 0:40 ` Richard Acayan
0 siblings, 1 reply; 8+ messages in thread
From: James Prestwood @ 2024-08-21 14:27 UTC (permalink / raw)
To: Richard Acayan; +Cc: iwd
[-- Attachment #1: Type: text/plain, Size: 7619 bytes --]
Hi Richard,
On 8/20/24 9:00 AM, Richard Acayan wrote:
> On Tue, Aug 20, 2024 at 08:04:20AM -0700, James Prestwood wrote:
>> Hi Richard,
>>
>> On 8/19/24 2:59 PM, Richard Acayan wrote:
>>> On Fri, Aug 16, 2024 at 04:53:41AM -0700, James Prestwood wrote:
>>>> Hi Richard,
>>>>
>>>> On 8/15/24 5:24 PM, Richard Acayan wrote:
>>>>> Hi,
>>>>>
>>>>> A segmentation fault occurs in station_start_roam() when the station is
>>>>> disconnected from an access point, or in other words, when the station's
>>>>> connected_bss is NULL. Usually, this is triggered by a timeout, possibly
>>>>> scheduled in response to a weak signal event.
>>>>>
>>>>> This is occurring on my Pixel 3a running postmarketOS/Alpine Linux, when
>>>>> receding from an access point, on iwd 2.19. I have collected 6 coredumps
>>>>> of the crash in the span of around 2 weeks and would be willing to use
>>>>> GDB if more information is necessary for a patch.
>>>>>
>>>>> Sample:
>>>>>
>>>>> Program terminated with signal SIGSEGV, Segmentation fault.
>>>>> #0 0x0000aaaadf2086a0 in station_start_roam (station=0xffff8776ae50) at src/station.c:2880
>>>>>
>>>>> warning: 2880 src/station.c: No such file or directory
>>>>> (gdb) bt
>>>>> #0 0x0000aaaadf2086a0 in station_start_roam (station=0xffff8776ae50) at src/station.c:2880
>>>>> #1 0x0000aaaadf28c544 in timeout_callback (fd=<optimized out>, events=<optimized out>,
>>>>> user_data=0xffff876b2e20) at ell/timeout.c:68
>>>>> #2 timeout_callback (fd=<optimized out>, events=<optimized out>, user_data=0xffff876b2e20)
>>>>> at ell/timeout.c:57
>>>>> #3 0x0000aaaadf28b9d0 in l_main_iterate (timeout=<optimized out>) at ell/main.c:461
>>>>> #4 0x0000aaaadf28bac0 in l_main_run () at ell/main.c:508
>>>>> #5 l_main_run () at ell/main.c:490
>>>>> #6 0x0000aaaadf28bce4 in l_main_run_with_signal (
>>>>> callback=callback@entry=0xaaaadf1f1110 <signal_handler>, user_data=user_data@entry=0x0)
>>>>> at ell/main.c:630
>>>>> #7 0x0000aaaadf1f0b0c in main (argc=<optimized out>, argv=<optimized out>) at src/main.c:611
>>>>> (gdb) p station->connected_bss
>>>>> $1 = (struct scan_bss *) 0x0
>>>>>
>>>> Its hard to say without any debug logs as well but it appears the disconnect
>>>> never cleared out the timer used for the next roam attempt. I did fix a hang
>>>> due to a disconnect coming in during a roam attempt after 2.19, but I can't
>>>> really make heads or tails without debug logs to see what happened
>>>> before/after the disconnect.
>>> It happened again with debug logs enabled. Relevant snippet (from
>>> logread):
>>>
>>> [Aug 17 21:22:12] daemon iwd: src/station.c:station_roam_state_clear() 5
>>> [Aug 17 21:22:12] daemon iwd: event: state, old: connected, new: disconnecting
>>> [Aug 17 21:22:15] daemon iwd: src/netdev.c:netdev_mlme_notify() MLME notification Del Station(20)
>>> [Aug 17 21:22:15] daemon iwd: src/netdev.c:netdev_link_notify() event 16 on ifindex 5
>>> [Aug 17 21:22:15] daemon iwd: src/netdev.c:netdev_mlme_notify() MLME notification Deauthenticate(39)
>>> [Aug 17 21:22:15] daemon iwd: src/netdev.c:netdev_deauthenticate_event()
>>> [Aug 17 21:22:15] daemon iwd: src/netdev.c:netdev_mlme_notify() MLME notification Disconnect(48)
>>> [Aug 17 21:22:15] daemon iwd: src/netdev.c:netdev_disconnect_event()
>>> [Aug 17 21:22:15] daemon iwd: src/station.c:station_disconnect_cb() 5, success: 1
>>> [Aug 17 21:22:15] daemon iwd: event: state, old: disconnecting, new: disconnected
>>> [Aug 17 21:22:15] daemon iwd: src/wiphy.c:wiphy_reg_notify() Notification of command Reg Change(36)
>>> [Aug 17 21:22:15] daemon iwd: src/wiphy.c:wiphy_update_reg_domain() New reg domain country code for (global) is XX
>>> [Aug 17 21:22:36] daemon iwd: src/station.c:station_roam_trigger_cb() 5
>>>
>>> Afterwards is the segmentation fault.
>> Do you happen to have the logs a few minutes prior. The roam timeout is
>> defaulted to 60 seconds, so at some point it was re-armed but the logs don't
>> go back that far. Its trivial to handle the segfault but I suspect the roam
>> timeout being rearmed is also leaking memory so we should address that as
>> the root cause.
> Unfortunately not, only for the last half-minute (with personal
> information). Here's a bit more that I don't need to redact:
>
> [Aug 17 21:22:07] daemon iwd: src/station.c:station_roam_failed() 5
> [Aug 17 21:22:07] daemon iwd: src/wiphy.c:wiphy_radio_work_done() Work item 279 done
> [Aug 17 21:22:12] daemon iwd: src/station.c:station_dbus_disconnect()
> [Aug 17 21:22:12] daemon iwd: src/station.c:station_reset_connection_state() 5
> [Aug 17 21:22:12] daemon iwd: src/station.c:station_roam_state_clear() 5
> [Aug 17 21:22:12] daemon iwd: event: state, old: connected, new: disconnecting
> [Aug 17 21:22:15] daemon iwd: src/netdev.c:netdev_mlme_notify() MLME notification Del Station(20)
> [Aug 17 21:22:15] daemon iwd: src/netdev.c:netdev_link_notify() event 16 on ifindex 5
> [Aug 17 21:22:15] daemon iwd: src/netdev.c:netdev_mlme_notify() MLME notification Deauthenticate(39)
> [Aug 17 21:22:15] daemon iwd: src/netdev.c:netdev_deauthenticate_event()
> [Aug 17 21:22:15] daemon iwd: src/netdev.c:netdev_mlme_notify() MLME notification Disconnect(48)
> [Aug 17 21:22:15] daemon iwd: src/netdev.c:netdev_disconnect_event()
> [Aug 17 21:22:15] daemon iwd: src/station.c:station_disconnect_cb() 5, success: 1
> [Aug 17 21:22:15] daemon iwd: event: state, old: disconnecting, new: disconnected
> [Aug 17 21:22:15] daemon iwd: src/wiphy.c:wiphy_reg_notify() Notification of command Reg Change(36)
> [Aug 17 21:22:15] daemon iwd: src/wiphy.c:wiphy_update_reg_domain() New reg domain country code for (global) is XX
> [Aug 17 21:22:36] daemon iwd: src/station.c:station_roam_trigger_cb() 5
>
> I can confirm, though, that there are more timeouts than expected, even
> after (presumably) clearing the last timeout:
>
> Core was generated by `/usr/libexec/iwd -d'.
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0 0x0000aaaadf1686a0 in station_start_roam (station=0xffffb0527b30) at src/station.c:2880
>
> warning: 2880 src/station.c: No such file or directory
> (gdb) p *watch_list[12]
> $1 = {fd = 12, events = 1073741825, flags = 1, callback = 0xaaaadf1ec4f0 <timeout_callback>,
> destroy = 0xaaaadf1ec350 <timeout_destroy>, user_data = 0xffffb0530dd0}
> (gdb) p *(struct l_timeout *) watch_list[12]->user_data
> $2 = {fd = 12, callback = 0xaaaadf1687a0 <station_roam_trigger_cb>, destroy = 0x0,
> user_data = 0xffffb0527b30}
> (gdb) p *watch_list[13]
> $3 = {fd = 13, events = 1073741825, flags = 0, callback = 0xaaaadf1ec4f0 <timeout_callback>,
> destroy = 0xaaaadf1ec350 <timeout_destroy>, user_data = 0xffffb0530ef0}
> (gdb) p *(struct l_timeout *) watch_list[13]->user_data
> $4 = {fd = 13, callback = 0xaaaadf1687a0 <station_roam_trigger_cb>, destroy = 0x0,
> user_data = 0xffffb0527b30}
> (gdb) p *watch_list[14]
> $5 = {fd = 14, events = 1073741825, flags = 0, callback = 0xaaaadf1ec4f0 <timeout_callback>,
> destroy = 0xaaaadf1ec350 <timeout_destroy>, user_data = 0xffffb0530c50}
> (gdb) p *(struct l_timeout *) watch_list[14]->user_data
> $6 = {fd = 14, callback = 0xaaaadf1687a0 <station_roam_trigger_cb>, destroy = 0x0,
> user_data = 0xffffb0527b30}
Could you try the attached patch. Assuming this fixes it you should also
see some WARNING debug logs, which would be nice to provide to us as
this will tell us where this timer is being overwritten.
Also do you happen to know the wifi chipset in the Pixel 3a? I cant seem
to find that anywhere online.
Thanks,
James
[-- Attachment #2: 0001-station-check-for-roam-timeout-before-rearming.patch --]
[-- Type: text/x-patch, Size: 1152 bytes --]
From 1366e47de0700c9235da03c58f22a2a423451a6e Mon Sep 17 00:00:00 2001
From: James Prestwood <prestwoj@gmail.com>
Date: Tue, 20 Aug 2024 07:59:36 -0700
Subject: [PATCH] station: check for roam timeout before rearming
---
src/station.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/station.c b/src/station.c
index 509d919c..e90a64f7 100644
--- a/src/station.c
+++ b/src/station.c
@@ -2114,7 +2114,7 @@ static void station_roamed(struct station *station)
* Schedule another roaming attempt in case the signal continues to
* remain low. A subsequent high signal notification will cancel it.
*/
- if (station->signal_low)
+ if (station->signal_low && L_WARN_ON(!station->roam_trigger_timeout))
station_roam_timeout_rearm(station, roam_retry_interval);
if (station->netconfig)
@@ -2153,7 +2153,7 @@ static void station_roam_retry(struct station *station)
station->roam_scan_full = false;
station->ap_directed_roaming = false;
- if (station->signal_low)
+ if (station->signal_low && L_WARN_ON(!station->roam_trigger_timeout))
station_roam_timeout_rearm(station, roam_retry_interval);
}
--
2.34.1
^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: Segmentation fault when taking device for a walk
2024-08-21 14:27 ` James Prestwood
@ 2024-08-27 0:40 ` Richard Acayan
2024-08-27 11:46 ` James Prestwood
0 siblings, 1 reply; 8+ messages in thread
From: Richard Acayan @ 2024-08-27 0:40 UTC (permalink / raw)
To: James Prestwood; +Cc: iwd
On Wed, Aug 21, 2024 at 07:27:41AM -0700, James Prestwood wrote:
> Could you try the attached patch. Assuming this fixes it you should also see
> some WARNING debug logs, which would be nice to provide to us as this will
> tell us where this timer is being overwritten.
Yeah, I see some:
[Aug 24 16:51:10] daemon iwd: src/station.c:station_roam_failed() 5
[Aug 24 16:51:10] daemon iwd: WARNING: src/station.c:station_roam_retry() condition !station->roam_trigger_timeout failed
[Aug 24 16:52:54] daemon iwd: src/station.c:station_roam_failed() 5
[Aug 24 16:52:54] daemon iwd: WARNING: src/station.c:station_roam_retry() condition !station->roam_trigger_timeout failed
[Aug 24 16:53:09] daemon iwd: src/station.c:station_roam_failed() 5
[Aug 24 16:53:09] daemon iwd: WARNING: src/station.c:station_roam_retry() condition !station->roam_trigger_timeout failed
[Aug 24 16:53:22] daemon iwd: src/station.c:station_roam_failed() 5
[Aug 24 16:53:22] daemon iwd: WARNING: src/station.c:station_roam_retry() condition !station->roam_trigger_timeout failed
[Aug 24 16:53:32] daemon iwd: src/station.c:station_roam_failed() 5
[Aug 24 16:53:32] daemon iwd: WARNING: src/station.c:station_roam_retry() condition !station->roam_trigger_timeout failed
[Aug 24 16:54:33] daemon iwd: src/station.c:station_roam_failed() 5
[Aug 24 16:54:33] daemon iwd: WARNING: src/station.c:station_roam_retry() condition !station->roam_trigger_timeout failed
> Also do you happen to know the wifi chipset in the Pixel 3a? I cant seem to
> find that anywhere online.
The Pixel 3a is based on the SDM670 SoC, and has the WCN3990 WiFi chip.
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Segmentation fault when taking device for a walk
2024-08-27 0:40 ` Richard Acayan
@ 2024-08-27 11:46 ` James Prestwood
0 siblings, 0 replies; 8+ messages in thread
From: James Prestwood @ 2024-08-27 11:46 UTC (permalink / raw)
To: Richard Acayan; +Cc: iwd
Hi Richard,
On 8/26/24 5:40 PM, Richard Acayan wrote:
> On Wed, Aug 21, 2024 at 07:27:41AM -0700, James Prestwood wrote:
>> Could you try the attached patch. Assuming this fixes it you should also see
>> some WARNING debug logs, which would be nice to provide to us as this will
>> tell us where this timer is being overwritten.
> Yeah, I see some:
>
> [Aug 24 16:51:10] daemon iwd: src/station.c:station_roam_failed() 5
> [Aug 24 16:51:10] daemon iwd: WARNING: src/station.c:station_roam_retry() condition !station->roam_trigger_timeout failed
> [Aug 24 16:52:54] daemon iwd: src/station.c:station_roam_failed() 5
> [Aug 24 16:52:54] daemon iwd: WARNING: src/station.c:station_roam_retry() condition !station->roam_trigger_timeout failed
> [Aug 24 16:53:09] daemon iwd: src/station.c:station_roam_failed() 5
> [Aug 24 16:53:09] daemon iwd: WARNING: src/station.c:station_roam_retry() condition !station->roam_trigger_timeout failed
> [Aug 24 16:53:22] daemon iwd: src/station.c:station_roam_failed() 5
> [Aug 24 16:53:22] daemon iwd: WARNING: src/station.c:station_roam_retry() condition !station->roam_trigger_timeout failed
> [Aug 24 16:53:32] daemon iwd: src/station.c:station_roam_failed() 5
> [Aug 24 16:53:32] daemon iwd: WARNING: src/station.c:station_roam_retry() condition !station->roam_trigger_timeout failed
> [Aug 24 16:54:33] daemon iwd: src/station.c:station_roam_failed() 5
> [Aug 24 16:54:33] daemon iwd: WARNING: src/station.c:station_roam_retry() condition !station->roam_trigger_timeout failed
Are there any other messages in there in between those warnings and
failed roam prints?
>
>> Also do you happen to know the wifi chipset in the Pixel 3a? I cant seem to
>> find that anywhere online.
> The Pixel 3a is based on the SDM670 SoC, and has the WCN3990 WiFi chip.
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2024-08-27 11:46 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-08-16 0:24 Segmentation fault when taking device for a walk Richard Acayan
2024-08-16 11:53 ` James Prestwood
2024-08-19 21:59 ` Richard Acayan
2024-08-20 15:04 ` James Prestwood
2024-08-20 16:00 ` Richard Acayan
2024-08-21 14:27 ` James Prestwood
2024-08-27 0:40 ` Richard Acayan
2024-08-27 11:46 ` James Prestwood
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox