Distributed Replicated Block Device (DRBD) development
 help / color / mirror / Atom feed
* [Drbd-dev] Bug(s) with Linux v5.4.46
@ 2020-07-27  3:55 Sarah Newman
  2020-07-27  7:16 ` Lars Ellenberg
  0 siblings, 1 reply; 4+ messages in thread
From: Sarah Newman @ 2020-07-27  3:55 UTC (permalink / raw)
  To: philipp.reisner, lars.ellenberg; +Cc: drbd-dev

Hi,

We got the following kernel log messages:

Jul 25 03:38:24 hail kernel: ata1.00: error: { IDNF }
Jul 25 03:38:24 hail kernel: ata1.00: configured for UDMA/133
Jul 25 03:38:24 hail kernel: ata1: EH complete
Jul 25 03:38:24 hail kernel: drbd resource37: peer( Primary -> Unknown ) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown )
Jul 25 03:38:24 hail kernel: drbd resource37: ack_receiver terminated
Jul 25 03:38:24 hail kernel: drbd resource37: Terminating drbd_a_resource
Jul 25 03:38:26 hail kernel: drbd resource37: ASSERTION FAILED: connection->current_epoch->list not empty
Jul 25 03:38:26 hail kernel: drbd resource37: Connection closed
Jul 25 03:38:26 hail kernel: drbd resource37: conn( Disconnecting -> StandAlone )
Jul 25 03:38:26 hail kernel: drbd resource37: receiver terminated
Jul 25 03:38:26 hail kernel: drbd resource37: Terminating drbd_r_resource
Jul 25 03:38:26 hail kernel: block drbd37: disk( UpToDate -> Failed )
Jul 25 03:38:26 hail kernel: block drbd37: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Jul 25 03:38:26 hail kernel: block drbd37: disk( Failed -> Diskless )
Jul 25 03:38:26 hail kernel: general protection fault: 0000 [#1] SMP NOPTI
Jul 25 03:38:26 hail kernel: CPU: 0 PID: 18526 Comm: drbdsetup-84 Not tainted 5.4.46-1_prgmr.el7.x86_64 #1
Jul 25 03:38:26 hail kernel: Hardware name: HPE ProLiant DL325 Gen10/ProLiant DL325 Gen10, BIOS A41 03/09/2020
Jul 25 03:38:26 hail kernel: RIP: e030:kobject_uevent_env+0x1d/0x660
Jul 25 03:38:26 hail kernel: Code: 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 41 57 89 f0 41 56 41 55 41 54 55 53 48 83 ec 38 4c 8b 34 c5 00 e8 17 82 83 
fe 01 75 04 <80> 4f 3c 08 48 89 f8 eb 0d 48 8b 40 18 48 85 c0 0f 84 19 01 00 00
Jul 25 03:38:26 hail kernel: RSP: e02b:ffffc900757a7a10 EFLAGS: 00010246
Jul 25 03:38:26 hail kernel: RAX: 0000000000000001 RBX: fdfdfdfdfdfe023d RCX: ffff8880606f9870
Jul 25 03:38:26 hail kernel: RDX: 0000000000000000 RSI: 0000000000000001 RDI: fdfdfdfdfdfe023d
Jul 25 03:38:26 hail kernel: RBP: ffff8880606f9870 R08: 0000000000000040 R09: ffffffffc01ae500
Jul 25 03:38:26 hail kernel: R10: ffffc900757a7aa8 R11: ffffffffc01f0b58 R12: ffff8880606f9800
Jul 25 03:38:26 hail kernel: R13: ffff88800fb5dc00 R14: ffffffff824055e5 R15: ffff88800fb5dc48
Jul 25 03:38:26 hail kernel: FS:  00007fbbc98e4740(0000) GS:ffff888188a00000(0000) knlGS:0000000000000000
Jul 25 03:38:26 hail kernel: CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 25 03:38:26 hail kernel: CR2: 00007f978b44c000 CR3: 0000000007926000 CR4: 0000000000040660
Jul 25 03:38:26 hail kernel: Call Trace:
Jul 25 03:38:26 hail kernel: ? error_exit+0x5/0x20
Jul 25 03:38:26 hail kernel: blk_integrity_del+0x1a/0x2b
Jul 25 03:38:26 hail kernel: del_gendisk+0x27/0x2f0
Jul 25 03:38:26 hail kernel: drbd_delete_device+0xcc/0x100 [drbd]
Jul 25 03:38:26 hail kernel: adm_del_minor+0xc5/0xe0 [drbd]
Jul 25 03:38:26 hail kernel: drbd_adm_down+0x13f/0x1f0 [drbd]
Jul 25 03:38:26 hail kernel: genl_family_rcv_msg+0x1d2/0x410
Jul 25 03:38:26 hail kernel: genl_rcv_msg+0x47/0x90
Jul 25 03:38:26 hail kernel: ? __kmalloc_node_track_caller+0x217/0x2e0
Jul 25 03:38:26 hail kernel: ? genl_family_rcv_msg+0x410/0x410
Jul 25 03:38:26 hail kernel: netlink_rcv_skb+0x49/0x110
Jul 25 03:38:26 hail kernel: genl_rcv+0x24/0x40
Jul 25 03:38:26 hail kernel: netlink_unicast+0x191/0x220
Jul 25 03:38:26 hail kernel: netlink_sendmsg+0x21d/0x3f0
Jul 25 03:38:26 hail kernel: sock_sendmsg+0x5b/0x60
Jul 25 03:38:26 hail kernel: sock_write_iter+0x97/0x100
Jul 25 03:38:26 hail kernel: new_sync_write+0x12d/0x1d0
Jul 25 03:38:26 hail kernel: vfs_write+0xa5/0x1a0
Jul 25 03:38:26 hail kernel: ksys_write+0x59/0xd0
Jul 25 03:38:26 hail kernel: do_syscall_64+0x5b/0x1a0
Jul 25 03:38:26 hail kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 25 03:38:26 hail kernel: RIP: 0033:0x7fbbc93f1a00
Jul 25 03:38:26 hail kernel: Code: 73 01 c3 48 8b 0d 70 74 2d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d bd d5 2d 00 00 75 10 b8 01 00 
00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 ee cb 01 00 48 89 04 24
Jul 25 03:38:26 hail kernel: RSP: 002b:00007ffe63ab8ce8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
Jul 25 03:38:26 hail kernel: RAX: ffffffffffffffda RBX: 0000000001bb8070 RCX: 00007fbbc93f1a00
Jul 25 03:38:26 hail kernel: RDX: 0000000000000030 RSI: 0000000001bb8080 RDI: 0000000000000003
Jul 25 03:38:26 hail kernel: RBP: 0000000000000030 R08: 00007ffe63ab8d54 R09: 0000000000000000
Jul 25 03:38:26 hail kernel: R10: 000000000000006b R11: 0000000000000246 R12: 0000000001bb8080
Jul 25 03:38:26 hail kernel: R13: 0000000000000030 R14: 0000000000000003 R15: 0000000000414378
Jul 25 03:38:26 hail kernel: Modules linked in: 8021q garp mrp bonding xt_physdev ip6table_filter ip6_tables xt_conntrack nf_conntrack nf_defrag_ipv6 
nf_defrag_ipv4 ib_isert iscsi_target_mod ib_srpt target_core_mod ib_srp scsi_transport_srp rpcrdma sunrpc rdma_ucm ib_iser rdma_cm iw_cm ib_umad 
libiscsi ib_ipoib scsi_transport_iscsi ib_cm mlx4_ib ib_uverbs ib_core crct10dif_pclmul crc32_pclmul hpilo ghash_clmulni_intel pcspkr cdc_eem usbnet 
mii hpwdt ccp sp5100_tco k10temp ipmi_si i2c_piix4 wmi ipmi_devintf ipmi_msghandler tun acpi_power_meter sch_ingress sch_htb raid456 async_raid6_recov 
async_memcpy async_pq async_xor async_tx xor raid6_pq raid10 fuse ext2 ebt_mark ebt_ip ebt_ip6 ebt_arp ebtable_nat ebtable_filter ebtables drbd 
lru_cache libcrc32c cls_u32 cls_fw br_netfilter bridge stp llc xenfs xen_privcmd act_police mlx4_en raid1 crc32c_intel serio_raw drm_vram_helper ttm 
igb ptp pps_core dca mlx4_core xen_pciback xen_netback xen_blkback xen_gntalloc xen_gntdev xen_evtchn
Jul 25 03:38:26 hail kernel: ---[ end trace 664811a432d5d82a ]---
Jul 25 03:38:26 hail kernel: RIP: e030:kobject_uevent_env+0x1d/0x660
Jul 25 03:38:26 hail kernel: Code: 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 41 57 89 f0 41 56 41 55 41 54 55 53 48 83 ec 38 4c 8b 34 c5 00 e8 17 82 83 
fe 01 75 04 <80> 4f 3c 08 48 89 f8 eb 0d 48 8b 40 18 48 85 c0 0f 84 19 01 00 00
Jul 25 03:38:26 hail kernel: RSP: e02b:ffffc900757a7a10 EFLAGS: 00010246
Jul 25 03:38:26 hail kernel: RAX: 0000000000000001 RBX: fdfdfdfdfdfe023d RCX: ffff8880606f9870
Jul 25 03:38:26 hail kernel: RDX: 0000000000000000 RSI: 0000000000000001 RDI: fdfdfdfdfdfe023d
Jul 25 03:38:26 hail kernel: RBP: ffff8880606f9870 R08: 0000000000000040 R09: ffffffffc01ae500
Jul 25 03:38:26 hail kernel: R10: ffffc900757a7aa8 R11: ffffffffc01f0b58 R12: ffff8880606f9800
Jul 25 03:38:26 hail kernel: R13: ffff88800fb5dc00 R14: ffffffff824055e5 R15: ffff88800fb5dc48
Jul 25 03:38:26 hail kernel: FS:  00007fbbc98e4740(0000) GS:ffff888188a00000(0000) knlGS:0000000000000000
Jul 25 03:38:26 hail kernel: CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 25 03:38:26 hail kernel: CR2: 00007f978b44c000 CR3: 0000000007926000 CR4: 0000000000040660

Reading this:

kobject_uevent_env+0x1d resolves to:

(gdb) list *(kobject_uevent_env+0x1d)
Line 474 of "lib/kobject_uevent.c" starts at address 0xffffffff81a0f70d <kobject_uevent_env+29> and ends at 0xffffffff81a0f711 <kobject_uevent_env+33>.

Which is:
         if (action == KOBJ_REMOVE)
                 kobj->state_remove_uevent_sent = 1
0xffffffff81a0f70d <kobject_uevent_env+29>:  orb    $0x8,0x3c(%rdi)

kobj as $rdi is 0xfdfdfdfdfdfe023d.

The pointer in question is passed in from:

(gdb) list *(blk_integrity_del+0x1a)
Line 441 of "block/blk-integrity.c" starts at address 0xffffffff814d5d5a <blk_integrity_del+26> and ends at 0xffffffff814d5d62 <blk_integrity_del+34>.

Line's 440 and 441:
      kobject_uevent(&disk->integrity_kobj, KOBJ_REMOVE);
      kobject_del(&disk->integrity_kobj);

What is the offset of integrity_kobj?

(gdb) p &((struct gendisk *)0)->integrity_kobj
$1 = (struct kobject *) 0x440

The pointer we see above, 0xfdfdfdfdfdfe023d, - 0x440 is 0xfdfdfdfdfdfdfdfd.

blk_integrity_del is called at the beginning of del_gendisk near the end of drbd_delete_device. This is the code block from drbd_delete_device around 
there:

	kref_put(&device->kref, drbd_destroy_device);
	del_gendisk(device->vdisk);
	synchronize_rcu();
	kref_put(&device->kref, drbd_destroy_device);
}

In drbd_destroy_device, there is the line:

memset(device, 0xfd, sizeof(*device));

So I think that drbd_destroy_device must have run before del_gendisk, and therefore the reference count for device->kref is unbalanced.

I do not know if this is related to the error message:

  ASSERTION FAILED: connection->current_epoch->list not empty

or not.

There were no error messages reported on the peer.

FYI, when we've run in debug mode we've seen some ODEBUG errors about freeing active objects around the time that DRBD resources were released. One 
was a work_struct and the other was a timer_list. I do not know if either of those are related.

The system in question is still up and running in an error state; is there any more information you want from it?

--Sarah

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [Drbd-dev] Bug(s) with Linux v5.4.46
  2020-07-27  3:55 [Drbd-dev] Bug(s) with Linux v5.4.46 Sarah Newman
@ 2020-07-27  7:16 ` Lars Ellenberg
  2020-07-27 19:54   ` Sarah Newman
  2020-07-31  6:27   ` Sarah Newman
  0 siblings, 2 replies; 4+ messages in thread
From: Lars Ellenberg @ 2020-07-27  7:16 UTC (permalink / raw)
  To: Sarah Newman; +Cc: drbd-dev

On Sun, Jul 26, 2020 at 08:55:10PM -0700, Sarah Newman wrote:

> 	kref_put(&device->kref, drbd_destroy_device);

At this point we are "sure" to still hold at least one
additional reference on device.

> 	del_gendisk(device->vdisk);
> 	synchronize_rcu();

which we put here:

> 	kref_put(&device->kref, drbd_destroy_device);


But what you present here shows that in your case that is not true.

There is nothing DRBD specific new in the mentioned kernel version.

> In drbd_destroy_device, there is the line:
> 
> memset(device, 0xfd, sizeof(*device));
> 
> So I think that drbd_destroy_device must have run before del_gendisk,
> and therefore the reference count for device->kref is unbalanced.

Looks like it.

> I do not know if this is related to the error message:
> 
>  ASSERTION FAILED: connection->current_epoch->list not empty
> 
> or not.
> 
> There were no error messages reported on the peer.
> 
> FYI, when we've run in debug mode we've seen some ODEBUG errors about
> freeing active objects around the time that DRBD resources were released.
> One was a work_struct and the other was a timer_list. I do not know if
> either of those are related.

You want to show them? Maybe they help in understanding what is going on here.


> The system in question is still up and running in an error state; is
> there any more information you want from it?

No.

But: is this "easily" reproducible? If so: how?

    Lars


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [Drbd-dev] Bug(s) with Linux v5.4.46
  2020-07-27  7:16 ` Lars Ellenberg
@ 2020-07-27 19:54   ` Sarah Newman
  2020-07-31  6:27   ` Sarah Newman
  1 sibling, 0 replies; 4+ messages in thread
From: Sarah Newman @ 2020-07-27 19:54 UTC (permalink / raw)
  To: drbd-dev

On 7/27/20 12:16 AM, Lars Ellenberg wrote:
> On Sun, Jul 26, 2020 at 08:55:10PM -0700, Sarah Newman wrote:
> 
>> 	kref_put(&device->kref, drbd_destroy_device);
> 
> At this point we are "sure" to still hold at least one
> additional reference on device.
> 
>> 	del_gendisk(device->vdisk);
>> 	synchronize_rcu();
> 
> which we put here:
> 
>> 	kref_put(&device->kref, drbd_destroy_device);
> 
> 
> But what you present here shows that in your case that is not true.
> 
> There is nothing DRBD specific new in the mentioned kernel version.

The last version we heavily tested was the v4.9 series.

> 
>> In drbd_destroy_device, there is the line:
>>
>> memset(device, 0xfd, sizeof(*device));
>>
>> So I think that drbd_destroy_device must have run before del_gendisk,
>> and therefore the reference count for device->kref is unbalanced.
> 
> Looks like it.
> 
>> I do not know if this is related to the error message:
>>
>>   ASSERTION FAILED: connection->current_epoch->list not empty
>>
>> or not.
>>
>> There were no error messages reported on the peer.
>>
>> FYI, when we've run in debug mode we've seen some ODEBUG errors about
>> freeing active objects around the time that DRBD resources were released.
>> One was a work_struct and the other was a timer_list. I do not know if
>> either of those are related.
> 
> You want to show them? Maybe they help in understanding what is going on here.

OK - as I said I'm not sure it's DRBD:

Jun 28 13:53:32 hail kernel: kobject: 'integrity' (0000000077623090): kobject_release, parent 0000000000000000 (delayed 4000)
Jun 28 13:53:32 hail kernel: kobject: '147:3' (0000000039ce23c2): kobject_release, parent 0000000000000000 (delayed 1000)
Jun 28 13:53:32 hail kernel: kobject: 'holders' (000000006657b8fa): kobject_release, parent 000000005efea813 (delayed 2000)
Jun 28 13:53:32 hail kernel: kobject: 'slaves' (000000005ee8b4cb): kobject_release, parent 000000005efea813 (delayed 3000)
Jun 28 13:53:32 hail kernel: drbd resource3: Terminating drbd_w_resource
Jun 28 13:53:33 hail kernel: kobject: 'queues' (00000000e530d1f8): kobject_release, parent 0000000000000000 (delayed 1000)
Jun 28 13:53:33 hail kernel: kobject: 'loadtest168-emu' (00000000c67cf6e9): kobject_release, parent 0000000000000000 (delayed 3000)
Jun 28 13:53:33 hail kernel: kobject: 'vif-541-0' (00000000fbe87798): kobject_release, parent 0000000000000000 (delayed 4000)
Jun 28 13:53:33 hail kernel: kobject: 'console-540-0' (000000001f752215): kobject_release, parent 0000000000000000 (delayed 2000)
Jun 28 13:53:33 hail kernel: xenbr1: port 337(loadtest168) entered disabled state
Jun 28 13:53:33 hail kernel: kobject: '1419' (00000000544ef6fb): kobject_release, parent 0000000000000000 (delayed 3000)
Jun 28 13:53:33 hail kernel: kobject: '1420' (000000001c6b6754): kobject_release, parent 0000000000000000 (delayed 3000)
Jun 28 13:53:33 hail kernel: xenbr1: port 337(loadtest168) entered disabled state
Jun 28 13:53:33 hail kernel: device loadtest168 left promiscuous mode
Jun 28 13:53:33 hail kernel: xenbr1: port 337(loadtest168) entered disabled state
Jun 28 13:53:33 hail kernel: kobject: 'rx-0' (0000000004009ca6): kobject_release, parent 000000007083fc4d (delayed 4000)
Jun 28 13:53:33 hail kernel: kobject: 'tx-0' (00000000129ffeb7): kobject_release, parent 000000007083fc4d (delayed 2000)
Jun 28 13:53:33 hail kernel: kobject: 'net' (00000000aa2c79e9): kobject_release, parent 0000000000000000 (delayed 4000)
Jun 28 13:53:33 hail kernel: kobject: 'brport' (0000000023912758): kobject_release, parent 0000000000000000 (delayed 3000)
Jun 28 13:53:33 hail kernel: kobject: '1411' (000000000f8a047a): kobject_release, parent 0000000000000000 (delayed 1000)
Jun 28 13:53:33 hail kernel: kobject: '1423' (00000000ab9afdc9): kobject_release, parent 0000000000000000 (delayed 1000)
Jun 28 13:53:33 hail kernel: kobject: '1424' (00000000769b6aea): kobject_release, parent 0000000000000000 (delayed 3000)
Jun 28 13:53:33 hail kernel: kobject: '1421' (00000000ecdfbadd): kobject_release, parent 0000000000000000 (delayed 4000)
Jun 28 13:53:34 hail kernel: kobject: 'queue' (00000000e1259dd1): kobject_release, parent 0000000000000000 (delayed 3000)
Jun 28 13:53:35 hail kernel: kobject: 'queue' (00000000e8c5e220): kobject_release, parent 0000000000000000 (delayed 4000)
Jun 28 13:53:35 hail kernel: kobject: 'drbd3' (000000005efea813): kobject_release, parent 0000000000000000 (delayed 1000)
Jun 28 13:53:36 hail kernel: kobject: 'queue' (0000000056b0406d): kobject_release, parent 0000000000000000 (delayed 1000)

Jun 28 13:53:36 hail kernel: ------------[ cut here ]------------
Jun 28 13:53:36 hail kernel: ODEBUG: free active (active state 0) object type: work_struct hint: kobject_delayed_cleanup+0x0/0x80
Jun 28 13:53:36 hail kernel: WARNING: CPU: 0 PID: 20709 at lib/debugobjects.c:488 debug_print_object+0x99/0x110
Jun 28 13:53:36 hail kernel: Modules linked in: einj 8021q garp mrp bonding xt_physdev ip6table_filter ip6_tables xt_conntrack nf_conntrack 
nf_defrag_ipv6 nf_defrag_ipv4 crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcspkr cdc_eem usbnet mii hpilo hpwdt ccp tun sch_ingress sch_htb 
raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor k10temp raid6_pq raid10 sp5100_tco fuse i2c_piix4 ext2 ebt_mark ebt_ip wmi 
ebt_ip6 ebt_arp ebtable_nat ebtable_filter ipmi_si ebtables ipmi_devintf ipmi_msghandler acpi_power_meter drbd lru_cache libcrc32c cls_u32 cls_fw 
br_netfilter bridge stp llc xenfs xen_privcmd act_police mlx4_en raid1 crc32c_intel serio_raw drm_vram_helper ttm igb ptp pps_core dca mlx4_core 
xen_pciback xen_netback xen_blkback xen_gntalloc xen_gntdev xen_evtchn
Jun 28 13:53:36 hail kernel: CPU: 0 PID: 20709 Comm: kworker/0:1 Tainted: G      D W         5.4.46-1_prgmr_debug.el7.x86_64 #1
Jun 28 13:53:36 hail kernel: Hardware name: HPE ProLiant DL325 Gen10/ProLiant DL325 Gen10, BIOS A41 03/09/2020
Jun 28 13:53:36 hail kernel: Workqueue: events kobject_delayed_cleanup
Jun 28 13:53:36 hail kernel: RIP: e030:debug_print_object+0x99/0x110
Jun 28 13:53:36 hail kernel: Code: ff ff ff ff 48 89 f7 48 29 c7 48 89 f8 48 83 ff 30 77 5c 48 8b 14 dd a0 86 a8 82 4c 89 e6 48 c7 c7 98 23 5e 82 e8 
3b 5b ab ff <0f> 0b 83 05 12 95 5f 01 01 48 83 c4 18 5b 5d 41 5c c3 83 05 02 95
Jun 28 13:53:36 hail kernel: RSP: e02b:ffffc90043227d10 EFLAGS: 00010282
Jun 28 13:53:36 hail kernel: RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000006
Jun 28 13:53:36 hail kernel: RDX: 0000000000000007 RSI: 0000000000000000 RDI: ffff888188a17c60
Jun 28 13:53:36 hail kernel: RBP: ffffffff8289aa80 R08: 0000000000000001 R09: 0000000000000001
Jun 28 13:53:36 hail kernel: R10: 0000000000000000 R11: ffff888043a50000 R12: ffffffff825ae78d
Jun 28 13:53:36 hail kernel: R13: ffffffff8289aa80 R14: ffff8881236d6cd0 R15: 0000000000000000
Jun 28 13:53:36 hail kernel: FS:  00007fa5706a3740(0000) GS:ffff888188a00000(0000) knlGS:0000000000000000
Jun 28 13:53:36 hail kernel: CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 28 13:53:36 hail kernel: CR2: 00007f66c00010b8 CR3: 0000000047014000 CR4: 0000000000040660
Jun 28 13:53:36 hail kernel: Call Trace:
Jun 28 13:53:36 hail kernel: debug_check_no_obj_freed+0x1be/0x1ec
Jun 28 13:53:36 hail kernel: slab_free_freelist_hook+0xe9/0x190
Jun 28 13:53:36 hail kernel: ? device_release+0x2a/0x80
Jun 28 13:53:36 hail kernel: kfree+0xca/0x2f0
Jun 28 13:53:36 hail kernel: device_release+0x2a/0x80
Jun 28 13:53:36 hail kernel: kobject_delayed_cleanup+0x3d/0x80
Jun 28 13:53:36 hail kernel: process_one_work+0x26d/0x620
Jun 28 13:53:36 hail kernel: worker_thread+0x3c/0x390
Jun 28 13:53:36 hail kernel: ? process_one_work+0x620/0x620
Jun 28 13:53:36 hail kernel: kthread+0x121/0x140
Jun 28 13:53:36 hail kernel: ? kthread_create_worker_on_cpu+0x70/0x70
Jun 28 13:53:36 hail kernel: ret_from_fork+0x27/0x50
Jun 28 13:53:36 hail kernel: irq event stamp: 0
Jun 28 13:53:36 hail kernel: hardirqs last  enabled at (0): [<0000000000000000>] 0x0
Jun 28 13:53:36 hail kernel: hardirqs last disabled at (0): [<ffffffff81110255>] copy_process+0xa95/0x1e40
Jun 28 13:53:36 hail kernel: softirqs last  enabled at (0): [<ffffffff81110255>] copy_process+0xa95/0x1e40
Jun 28 13:53:36 hail kernel: softirqs last disabled at (0): [<0000000000000000>] 0x0
Jun 28 13:53:36 hail kernel: ---[ end trace 2130fa3cc27515d6 ]---


Jun 28 17:29:13 hail kernel: block drbd12: disk( UpToDate -> Failed )
Jun 28 17:29:13 hail kernel: block drbd12: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Jun 28 17:29:13 hail kernel: block drbd12: disk( Failed -> Diskless )
Jun 28 17:29:13 hail kernel: kobject: 'integrity' (0000000091c8ac2a): kobject_release, parent 0000000000000000 (delayed 4000)
Jun 28 17:29:13 hail kernel: kobject: '147:12' (0000000083c5876c): kobject_release, parent 0000000000000000 (delayed 2000)
Jun 28 17:29:13 hail kernel: kobject: 'holders' (000000008c009d1d): kobject_release, parent 000000009f3618c3 (delayed 1000)
Jun 28 17:29:13 hail kernel: kobject: 'slaves' (000000005356be07): kobject_release, parent 000000009f3618c3 (delayed 2000)
Jun 28 17:29:13 hail kernel: drbd resource12: Terminating drbd_w_resource
Jun 28 17:29:15 hail kernel: kobject: 'drbd12' (000000009f3618c3): kobject_release, parent 0000000000000000 (delayed 1000)
Jun 28 17:29:16 hail kernel: kobject: 'queue' (00000000dcc18a74): kobject_release, parent 0000000000000000 (delayed 4000)
Jun 28 17:29:16 hail kernel: ------------[ cut here ]------------
Jun 28 17:29:16 hail kernel: ODEBUG: free active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x20
Jun 28 17:29:16 hail kernel: WARNING: CPU: 0 PID: 14242 at lib/debugobjects.c:488 debug_print_object+0x99/0x110
Jun 28 17:29:16 hail kernel: Modules linked in: einj 8021q garp mrp bonding xt_physdev ip6table_filter ip6_tables xt_conntrack nf_conntrack 
nf_defrag_ipv6 nf_defrag_ipv4 crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcspkr cdc_eem usbnet mii hpilo hpwdt ccp tun sch_ingress sch_htb 
raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor k10temp raid6_pq raid10 sp5100_tco fuse i2c_piix4 ext2 ebt_mark ebt_ip wmi 
ebt_ip6 ebt_arp ebtable_nat ebtable_filter ipmi_si ebtables ipmi_devintf ipmi_msghandler acpi_power_meter drbd lru_cache libcrc32c cls_u32 cls_fw 
br_netfilter bridge stp llc xenfs xen_privcmd act_police mlx4_en raid1 crc32c_intel serio_raw drm_vram_helper ttm igb ptp pps_core dca mlx4_core 
xen_pciback xen_netback xen_blkback xen_gntalloc xen_gntdev xen_evtchn
Jun 28 17:29:16 hail kernel: CPU: 0 PID: 14242 Comm: kworker/0:0 Tainted: G      D W         5.4.46-1_prgmr_debug.el7.x86_64 #1
Jun 28 17:29:16 hail kernel: Hardware name: HPE ProLiant DL325 Gen10/ProLiant DL325 Gen10, BIOS A41 03/09/2020
Jun 28 17:29:16 hail kernel: Workqueue: events kobject_delayed_cleanup
Jun 28 17:29:16 hail kernel: RIP: e030:debug_print_object+0x99/0x110
Jun 28 17:29:16 hail kernel: Code: ff ff ff ff 48 89 f7 48 29 c7 48 89 f8 48 83 ff 30 77 5c 48 8b 14 dd a0 86 a8 82 4c 89 e6 48 c7 c7 98 23 5e 82 e8 
3b 5b ab ff <0f> 0b 83 05 12 95 5f 01 01 48 83 c4 18 5b 5d 41 5c c3 83 05 02 95
Jun 28 17:29:16 hail kernel: RSP: e02b:ffffc90042753d10 EFLAGS: 00010282
Jun 28 17:29:16 hail kernel: RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000006
Jun 28 17:29:16 hail kernel: RDX: 0000000000000007 RSI: 0000000000000000 RDI: ffff888188a17c60
Jun 28 17:29:16 hail kernel: RBP: ffffffff82958620 R08: 0000000000000001 R09: 0000000000000001
Jun 28 17:29:16 hail kernel: R10: 0000000000000000 R11: ffff888152e43000 R12: ffffffff825ae78d
Jun 28 17:29:16 hail kernel: R13: ffffffff82958620 R14: ffff888107ff38e8 R15: 0000000000000000
Jun 28 17:29:16 hail kernel: FS:  00007f5498c538c0(0000) GS:ffff888188a00000(0000) knlGS:0000000000000000
Jun 28 17:29:16 hail kernel: CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 28 17:29:16 hail kernel: CR2: 00007f5498c5f000 CR3: 000000015452c000 CR4: 0000000000040660
Jun 28 17:29:16 hail kernel: Call Trace:
Jun 28 17:29:16 hail kernel: debug_check_no_obj_freed+0x1be/0x1ec
Jun 28 17:29:16 hail kernel: slab_free_freelist_hook+0xe9/0x190
Jun 28 17:29:16 hail kernel: ? device_release+0x2a/0x80
Jun 28 17:29:16 hail kernel: kfree+0xca/0x2f0
Jun 28 17:29:16 hail kernel: device_release+0x2a/0x80
Jun 28 17:29:16 hail kernel: kobject_delayed_cleanup+0x3d/0x80
Jun 28 17:29:16 hail kernel: process_one_work+0x26d/0x620
Jun 28 17:29:16 hail kernel: worker_thread+0x3c/0x390
Jun 28 17:29:16 hail kernel: ? process_one_work+0x620/0x620
Jun 28 17:29:16 hail kernel: kthread+0x121/0x140
Jun 28 17:29:16 hail kernel: ? kthread_create_worker_on_cpu+0x70/0x70
Jun 28 17:29:16 hail kernel: ret_from_fork+0x27/0x50
Jun 28 17:29:16 hail kernel: irq event stamp: 0
Jun 28 17:29:16 hail kernel: hardirqs last  enabled at (0): [<0000000000000000>] 0x0
Jun 28 17:29:16 hail kernel: hardirqs last disabled at (0): [<ffffffff81110255>] copy_process+0xa95/0x1e40
Jun 28 17:29:16 hail kernel: softirqs last  enabled at (0): [<ffffffff81110255>] copy_process+0xa95/0x1e40
Jun 28 17:29:16 hail kernel: softirqs last disabled at (0): [<0000000000000000>] 0x0
Jun 28 17:29:16 hail kernel: ---[ end trace 2130fa3cc27515da ]---


> 
> But: is this "easily" reproducible? If so: how?

I haven't figured out how to easily reproduce it yet.

FYI, it looks like I inadvertently left out some of the logs showing what happened with the connection:

---
Jul 25 03:35:36 hail kernel: drbd resource37: Starting worker thread (from drbdsetup-84 [16066])
Jul 25 03:35:36 hail kernel: block drbd37: disk( Diskless -> Attaching )
Jul 25 03:35:36 hail kernel: drbd resource37: Method to ensure write ordering: flush
Jul 25 03:35:36 hail kernel: block drbd37: max BIO size = 1048576
Jul 25 03:35:36 hail kernel: block drbd37: drbd_bm_resize called with capacity == 4194304
Jul 25 03:35:36 hail kernel: block drbd37: resync bitmap: bits=524288 words=8192 pages=16
Jul 25 03:35:36 hail kernel: block drbd37: size = 2048 MB (2097152 KB)
Jul 25 03:35:36 hail kernel: block drbd37: recounting of set bits took additional 0 jiffies
Jul 25 03:35:36 hail kernel: block drbd37: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Jul 25 03:35:36 hail kernel: block drbd37: disk( Attaching -> UpToDate )
Jul 25 03:35:36 hail kernel: block drbd37: attached to UUIDs D314471E6BD376E8:0000000000000000:F76A13AB9FF79972:F76913AB9FF79973
Jul 25 03:35:36 hail kernel: drbd resource37: conn( StandAlone -> Unconnected )
Jul 25 03:35:36 hail kernel: drbd resource37: Starting receiver thread (from drbd_w_resource [16068])
Jul 25 03:35:36 hail kernel: drbd resource37: receiver (re)started
Jul 25 03:35:36 hail kernel: drbd resource37: conn( Unconnected -> WFConnection )
Jul 25 03:35:37 hail kernel: block drbd37: role( Secondary -> Primary )
Jul 25 03:35:37 hail kernel: block drbd37: new current UUID 7BAD3285364ECE5B:D314471E6BD376E8:F76A13AB9FF79972:F76913AB9FF79973
Jul 25 03:35:37 hail kernel: drbd resource37: Handshake successful: Agreed network protocol version 101
Jul 25 03:35:37 hail kernel: drbd resource37: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Jul 25 03:35:37 hail kernel: drbd resource37: Peer authenticated using 16 bytes HMAC
Jul 25 03:35:37 hail kernel: drbd resource37: conn( WFConnection -> WFReportParams )
Jul 25 03:35:37 hail kernel: drbd resource37: Starting ack_recv thread (from drbd_r_resource [16082])
Jul 25 03:35:37 hail kernel: block drbd37: drbd_sync_handshake:
Jul 25 03:35:37 hail kernel: block drbd37: self 7BAD3285364ECE5B:D314471E6BD376E8:F76A13AB9FF79972:F76913AB9FF79973 bits:0 flags:0
Jul 25 03:35:37 hail kernel: block drbd37: peer 5AB69B021591F980:D314471E6BD376E9:F76A13AB9FF79973:F76913AB9FF79973 bits:0 flags:0
Jul 25 03:35:37 hail kernel: block drbd37: uuid_compare()=100 by rule 90
Jul 25 03:35:37 hail kernel: block drbd37: helper command: /bin/true initial-split-brain minor-37
Jul 25 03:35:37 hail kernel: block drbd37: helper command: /bin/true initial-split-brain minor-37 exit code 0 (0x0)
Jul 25 03:35:37 hail kernel: block drbd37: Split-Brain detected, 1 primaries, automatically solved. Sync from this node
Jul 25 03:35:37 hail kernel: block drbd37: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> Consistent )
Jul 25 03:35:37 hail kernel: block drbd37: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 21(1), total 21; compression: 100.0%
Jul 25 03:35:37 hail kernel: block drbd37: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 21(1), total 21; compression: 100.0%
Jul 25 03:35:37 hail kernel: block drbd37: helper command: /bin/true before-resync-source minor-37
Jul 25 03:35:37 hail kernel: block drbd37: helper command: /bin/true before-resync-source minor-37 exit code 0 (0x0)
Jul 25 03:35:37 hail kernel: block drbd37: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )
Jul 25 03:35:37 hail kernel: block drbd37: Began resync as SyncSource (will sync 0 KB [0 bits set]).
Jul 25 03:35:37 hail kernel: block drbd37: updated sync UUID 7BAD3285364ECE5B:D315471E6BD376E8:D314471E6BD376E8:F76A13AB9FF79972
Jul 25 03:35:37 hail kernel: block drbd37: Resync done (total 1 sec; paused 0 sec; 0 K/sec)
Jul 25 03:35:37 hail kernel: block drbd37: updated UUIDs 7BAD3285364ECE5B:0000000000000000:D315471E6BD376E8:D314471E6BD376E8
Jul 25 03:35:37 hail kernel: block drbd37: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
Jul 25 03:36:41 hail kernel: drbd resource37: peer( Secondary -> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> DUnknown )
Jul 25 03:36:41 hail kernel: drbd resource37: ack_receiver terminated
Jul 25 03:36:41 hail kernel: drbd resource37: Terminating drbd_a_resource
Jul 25 03:36:41 hail kernel: block drbd37: new current UUID B4D3BB7D18811429:7BAD3285364ECE5B:D315471E6BD376E8:D314471E6BD376E8
Jul 25 03:36:42 hail kernel: drbd resource37: conn( TearDown -> Disconnecting )
Jul 25 03:36:42 hail kernel: drbd resource37: Connection closed
Jul 25 03:36:42 hail kernel: drbd resource37: conn( Disconnecting -> StandAlone )
Jul 25 03:36:42 hail kernel: drbd resource37: receiver terminated
Jul 25 03:36:42 hail kernel: drbd resource37: Terminating drbd_r_resource
Jul 25 03:36:46 hail kernel: drbd resource37: conn( StandAlone -> Unconnected )
Jul 25 03:36:46 hail kernel: drbd resource37: Starting receiver thread (from drbd_w_resource [16068])
Jul 25 03:36:46 hail kernel: drbd resource37: receiver (re)started
Jul 25 03:36:46 hail kernel: drbd resource37: conn( Unconnected -> WFConnection )
Jul 25 03:36:47 hail kernel: drbd resource37: Handshake successful: Agreed network protocol version 101
Jul 25 03:36:47 hail kernel: drbd resource37: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Jul 25 03:36:47 hail kernel: drbd resource37: Peer authenticated using 16 bytes HMAC
Jul 25 03:36:47 hail kernel: drbd resource37: conn( WFConnection -> WFReportParams )
Jul 25 03:36:47 hail kernel: drbd resource37: Starting ack_recv thread (from drbd_r_resource [17644])
Jul 25 03:36:54 hail kernel: block drbd37: drbd_sync_handshake:
Jul 25 03:36:54 hail kernel: block drbd37: self B4D3BB7D18811429:7BAD3285364ECE5B:D315471E6BD376E8:D314471E6BD376E8 bits:98 flags:0
Jul 25 03:36:54 hail kernel: block drbd37: peer 7BAD3285364ECE5A:0000000000000000:D315471E6BD376E8:D314471E6BD376E8 bits:0 flags:0
Jul 25 03:36:54 hail kernel: block drbd37: uuid_compare()=1 by rule 70
Jul 25 03:36:54 hail kernel: block drbd37: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> Consistent )
Jul 25 03:36:54 hail kernel: block drbd37: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 34(1), total 34; compression: 100.0%
Jul 25 03:36:54 hail kernel: block drbd37: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 34(1), total 34; compression: 100.0%
Jul 25 03:36:54 hail kernel: block drbd37: helper command: /bin/true before-resync-source minor-37
Jul 25 03:36:54 hail kernel: block drbd37: helper command: /bin/true before-resync-source minor-37 exit code 0 (0x0)
Jul 25 03:36:54 hail kernel: block drbd37: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )
Jul 25 03:36:54 hail kernel: block drbd37: Began resync as SyncSource (will sync 496 KB [124 bits set]).
Jul 25 03:36:54 hail kernel: block drbd37: updated sync UUID B4D3BB7D18811429:7BAE3285364ECE5B:7BAD3285364ECE5B:D315471E6BD376E8
Jul 25 03:36:54 hail kernel: block drbd37: Resync done (total 1 sec; paused 0 sec; 496 K/sec)
Jul 25 03:36:54 hail kernel: block drbd37: updated UUIDs B4D3BB7D18811429:0000000000000000:7BAE3285364ECE5B:7BAD3285364ECE5B
Jul 25 03:36:54 hail kernel: block drbd37: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
Jul 25 03:36:58 hail kernel: block drbd37: peer( Secondary -> Primary )
Jul 25 03:37:10 hail kernel: block drbd37: role( Primary -> Secondary )
Jul 25 03:37:10 hail kernel: block drbd37: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Jul 25 03:37:11 hail kernel: drbd resource37: peer( Primary -> Unknown ) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown )
Jul 25 03:37:11 hail kernel: drbd resource37: ack_receiver terminated
Jul 25 03:37:11 hail kernel: drbd resource37: Terminating drbd_a_resource
Jul 25 03:37:11 hail kernel: drbd resource37: ASSERTION FAILED: connection->current_epoch->list not empty
Jul 25 03:37:11 hail kernel: drbd resource37: Connection closed
Jul 25 03:37:11 hail kernel: drbd resource37: conn( Disconnecting -> StandAlone )
Jul 25 03:37:11 hail kernel: drbd resource37: receiver terminated
Jul 25 03:37:11 hail kernel: drbd resource37: Terminating drbd_r_resource
Jul 25 03:37:11 hail kernel: drbd resource37: conn( StandAlone -> Unconnected )
Jul 25 03:37:11 hail kernel: drbd resource37: Starting receiver thread (from drbd_w_resource [16068])
Jul 25 03:37:11 hail kernel: drbd resource37: receiver (re)started
Jul 25 03:37:11 hail kernel: drbd resource37: conn( Unconnected -> WFConnection )
Jul 25 03:37:12 hail kernel: drbd resource37: Handshake successful: Agreed network protocol version 101
Jul 25 03:37:12 hail kernel: drbd resource37: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Jul 25 03:37:12 hail kernel: drbd resource37: Peer authenticated using 16 bytes HMAC
Jul 25 03:37:12 hail kernel: drbd resource37: conn( WFConnection -> WFReportParams )
Jul 25 03:37:12 hail kernel: drbd resource37: Starting ack_recv thread (from drbd_r_resource [18313])
Jul 25 03:37:12 hail kernel: block drbd37: drbd_sync_handshake:
Jul 25 03:37:12 hail kernel: block drbd37: self B4D3BB7D18811428:0000000000000000:7BAE3285364ECE5B:7BAD3285364ECE5B bits:0 flags:0
Jul 25 03:37:12 hail kernel: block drbd37: peer D70375DB919ACEDB:B4D3BB7D18811429:7BAE3285364ECE5A:7BAD3285364ECE5B bits:400 flags:0
Jul 25 03:37:12 hail kernel: block drbd37: uuid_compare()=-1 by rule 50
Jul 25 03:37:12 hail kernel: block drbd37: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) disk( UpToDate -> Outdated ) pdsk( DUnknown 
-> UpToDate )
Jul 25 03:37:12 hail kernel: block drbd37: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 37(1), total 37; compression: 100.0%
Jul 25 03:37:12 hail kernel: block drbd37: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 37(1), total 37; compression: 100.0%
Jul 25 03:37:12 hail kernel: block drbd37: conn( WFBitMapT -> WFSyncUUID )
Jul 25 03:37:12 hail kernel: block drbd37: updated sync uuid B4D4BB7D18811428:0000000000000000:7BAE3285364ECE5B:7BAD3285364ECE5B
Jul 25 03:37:12 hail kernel: block drbd37: helper command: /bin/true before-resync-target minor-37
Jul 25 03:37:12 hail kernel: block drbd37: helper command: /bin/true before-resync-target minor-37 exit code 0 (0x0)
Jul 25 03:37:12 hail kernel: block drbd37: conn( WFSyncUUID -> SyncTarget ) disk( Outdated -> Inconsistent )
Jul 25 03:37:12 hail kernel: block drbd37: Began resync as SyncTarget (will sync 1600 KB [400 bits set]).
Jul 25 03:37:12 hail kernel: block drbd37: Resync done (total 1 sec; paused 0 sec; 1600 K/sec)
Jul 25 03:37:12 hail kernel: block drbd37: updated UUIDs D70375DB919ACEDA:0000000000000000:B4D4BB7D18811428:B4D3BB7D18811429
Jul 25 03:37:12 hail kernel: block drbd37: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
Jul 25 03:37:17 hail kernel: block drbd37: helper command: /bin/true after-resync-target minor-37
Jul 25 03:37:17 hail kernel: block drbd37: helper command: /bin/true after-resync-target minor-37 exit code 0 (0x0)
Jul 25 03:38:24 hail kernel: drbd resource37: peer( Primary -> Unknown ) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown )
Jul 25 03:38:24 hail kernel: drbd resource37: ack_receiver terminated
Jul 25 03:38:24 hail kernel: drbd resource37: Terminating drbd_a_resource
Jul 25 03:38:26 hail kernel: drbd resource37: ASSERTION FAILED: connection->current_epoch->list not empty
Jul 25 03:38:26 hail kernel: drbd resource37: Connection closed
Jul 25 03:38:26 hail kernel: drbd resource37: conn( Disconnecting -> StandAlone )
Jul 25 03:38:26 hail kernel: drbd resource37: receiver terminated
Jul 25 03:38:26 hail kernel: drbd resource37: Terminating drbd_r_resource
Jul 25 03:38:26 hail kernel: block drbd37: disk( UpToDate -> Failed )
Jul 25 03:38:26 hail kernel: block drbd37: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
Jul 25 03:38:26 hail kernel: block drbd37: disk( Failed -> Diskless )
Jul 25 03:38:26 hail kernel: general protection fault: 0000 [#1] SMP NOPTI
---

It looks like we saw "role( Primary -> Secondary )' around 7,500 times on this system before seeing this problem; there are about 16,000 such messages 
on the peer node.

Switching roles isn't something we can test under our current debug kernel because the the kernel is slow enough that the live migration scripts time 
out. Unless you already have a suggestion, I'll have to see if there's a subset of options which would still help here.

--Sarah

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [Drbd-dev] Bug(s) with Linux v5.4.46
  2020-07-27  7:16 ` Lars Ellenberg
  2020-07-27 19:54   ` Sarah Newman
@ 2020-07-31  6:27   ` Sarah Newman
  1 sibling, 0 replies; 4+ messages in thread
From: Sarah Newman @ 2020-07-31  6:27 UTC (permalink / raw)
  To: drbd-dev

On 7/27/20 12:16 AM, Lars Ellenberg wrote:
> On Sun, Jul 26, 2020 at 08:55:10PM -0700, Sarah Newman wrote:
> 
>> 	kref_put(&device->kref, drbd_destroy_device);
> 
> At this point we are "sure" to still hold at least one
> additional reference on device.
> 
>> 	del_gendisk(device->vdisk);
>> 	synchronize_rcu();
> 
> which we put here:
> 
>> 	kref_put(&device->kref, drbd_destroy_device);
> 
> 
> But what you present here shows that in your case that is not true.
> 

Is there a missing kref_get (and put on failure) around calling

"queue_work(connection->ack_sender, &peer_req->peer_device->send_acks_work);"

from handle_write_conflicts ?

I don't understand at all how drbd_adm_get_initial_state is supposed to work... for that reason, that's the only other place I potentially see an 
issue via code inspection.

Thanks, Sarah

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2020-07-31  6:27 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2020-07-27  3:55 [Drbd-dev] Bug(s) with Linux v5.4.46 Sarah Newman
2020-07-27  7:16 ` Lars Ellenberg
2020-07-27 19:54   ` Sarah Newman
2020-07-31  6:27   ` Sarah Newman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox