Hey Maxime, I added linux-can in CC: I appreciate your comments. I will try to provide you with proper answers. I'll first fresh up my memory, and look into the code. I hope to come back on this before the weekend. Kind regards, Kurt --- Original message --- > Date: Tue, 19 Jul 2016 15:15:24 +0200 > From: Maxime Jayat > To: kurt@vandijck-laurijssen.be > Subject: J1939 Memory corruption > > Hello Kurt, > I am a user of your J1939 implementation for linux. Previously I was > using it with a linux 2.6.32 with success. Recently we moved to a 4.1 > kernel, so I am currently testing the new J1939 implementation (the > 'direct' one). I merged the linux j1939d-v4.1 branch into my tree and > it works correctly. Not having to patch and use iproute2 is greatly > appreciated. > However, I am able to reliably reproduce a kernel panic, or at least a > kref warning which leads to a random kernel crash some time later. > I have attached a C program reproducing the problem, but here is a > quick description: > - I bind a J1939 socket to source address 0x80 with name > 0x7777777777777777 > - I send an address claim for this source address > - Another "external" ECU sends an address claim for 0x80 with name > 0x1111111111111111 (which has higher priority, so "wins" the source > address). > - I re-bind my J1939 socket to source address 0x81. > This leads to: > ------------[ cut here ]------------ > WARNING: CPU: 0 PID: 861 at include/linux/kref.h:47 > j1939_name_local_get+0x128/0x13c() > Modules linked in: > CPU: 0 PID: 861 Comm: j1939 Not tainted 4.1.24-linux4sam_5.3 #64 > Hardware name: Atmel SAMA5 > [] (unwind_backtrace) from [] > (show_stack+0x20/0x24) > [] (show_stack) from [] (dump_stack+0x24/0x28) > [] (dump_stack) from [] > (warn_slowpath_common+0x94/0xc0) > [] (warn_slowpath_common) from [] > (warn_slowpath_null+0x2c/0x34) > [] (warn_slowpath_null) from [] > (j1939_name_local_get+0x128/0x13c) > [] (j1939_name_local_get) from [] > (j1939sk_bind+0x208/0x360) > [] (j1939sk_bind) from [] (SyS_bind+0x6c/0x9c) > [] (SyS_bind) from [] (ret_fast_syscall+0x0/0x60) > ---[ end trace 93231ae522a8b287 ]--- > FYI, candump during this: > (029.642505) can0 18EEFF80  [8] 77 77 77 77 77 77 77 77 > (001.001629) can0 18EEFF80  [8] 11 11 11 11 11 11 11 11 > (001.234503) can0 18EEFF81  [8] 77 77 77 77 77 77 77 77 > I tried to understand to problem and arrived to the following (and > incomplete) conclusions: > It seems to be a mismatch between the ecu list in priv->ecus and the > refcounting of the ECUs. > When the external address claim is received, __j1939_ecu_unregister is > called by j1939_process_address_claim for my ECU and it is removed from > the priv->ecus list but there are still some references to it, so it is > not freed. > Then, when re-binding the socket, j1939_name_local_put is called which > calls __j1939_ecu_get_register which does not find the name in the > priv->ecus list, so it reallocates a new ECU and insert it in the > priv->ecus list. However j1939_name_local_put immediately drops the ECU > reference, so the ECU is freed, but it is still in the priv->ecus list! > So the list contains a reference to a freed structure. The previous > warning appears when kref tries to take a reference while the count is > already at zero. At this point the kernel panics, either when closing > the socket, or randomly a few seconds later because the memory is > corrupted. > I have trouble following where all the krefs to the ecus are taken and > if it all makes sense everywhere. __j1939_ecu_get_register can either > allocate a new ECU and hold a reference (kref_init initializes at 1), > or return an existing ECU without touching the reference count. How is > it possible to know if the caller must call put_j1939_ecu when done > with the ecu? > By the way, __j1939_ecu_get_register has a unused "create_if_necessary" > parameter. > Although I probably could fix these issues with some time, I would > appreciate any help you can give me. > Thanks in advance, > -- > Maxime Jayat > #include > #include > #include > #include > #include > #include > #include > #include > #include > #include > /* linux can headers copied into currect directory */ > #include "can.h" > #include "raw.h" > #include "j1939.h" > > > static int open_j1939_socket(const char *iface, uint64_t name, uint8_t sa) > { > int ret, sock; > int value; > struct sockaddr_can saddr = { > .can_family = AF_CAN, > .can_addr.j1939 = { > .name = name, > .addr = sa, > .pgn = 0x0ee00, > }, > }; > > sock = ret = socket(PF_CAN, SOCK_DGRAM | SOCK_CLOEXEC, CAN_J1939); > if (ret < 0) { > goto fail; > } > > ret = setsockopt(sock, SOL_SOCKET, SO_BINDTODEVICE, > iface, strnlen(iface, 40)); > if (ret < 0) { > goto fail; > } > > ret = bind(sock, (void *)&saddr, sizeof(saddr)); > if (ret < 0) { > goto fail; > } > return sock; > > fail: > close(sock); > return -1; > } > > static int open_raw_socket(const char *iface) > { > int s; > struct sockaddr_can addr; > struct ifreq ifr; > s = socket(PF_CAN, SOCK_RAW, CAN_RAW); > > strcpy(ifr.ifr_name, iface); > if (ioctl(s, SIOCGIFINDEX, &ifr) < 0) { > perror("SIOCGIFINDEX"); > return 1; > } > addr.can_family = AF_CAN; > addr.can_ifindex = ifr.ifr_ifindex; > > if (bind(s, (struct sockaddr *)&addr, sizeof(addr)) < 0) { > perror("bind"); > return 1; > } > > return s; > } > > > int main(void) > { > int s1939, sraw; > uint64_t name = 0x7777777777777777ll; > uint8_t buf[8]; > > s1939 = open_j1939_socket("can0", name, /* src address */ 0x80); > sraw = open_raw_socket("can0"); > > memcpy(buf, &name, 8); > /* Send address claim for src address 0x80 */ > send(s1939, buf, 8, 0); > sleep(1); > { > /* Simulate an ecu claiming src address 0x80 with smaller NAME */ > struct can_frame frame = { > .can_id = 0x18eeff80 | CAN_EFF_FLAG, > .can_dlc = 8, > .data = { 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11 } > }; > write(sraw, &frame, sizeof(frame)); > } > sleep(1); > /* Rebind to src address 0x81 */ > struct sockaddr_can saddr = { > .can_family = AF_CAN, > .can_addr.j1939 = { > .name = name, > .addr = 0x81, > .pgn = 0x0ee00, > }, > }; > bind(s1939, (void *)&saddr, sizeof(saddr)); > /* Memory corrupted */ > > close(s1939); > close(sraw); > }