All of lore.kernel.org
 help / color / mirror / Atom feed
* hidp_get_raw_report source of 5 second delay in device removal
@ 2013-02-20 10:28 Karl Relton
  2013-02-20 15:44 ` David Herrmann
  0 siblings, 1 reply; 6+ messages in thread
From: Karl Relton @ 2013-02-20 10:28 UTC (permalink / raw)
  To: linux-bluetooth

Dear friends

Any thoughts on my two patches:
- [PATCH] hidp: Make hidp_get_raw_report abort if the session is
terminating
- RFC [PATCH] hci: Fix race between hidp_session and hci code

originally sent 29th Jan and resent 18th Feb?

There has been no reply to this list ...

The first patch is critical, the second makes everything neat and tidy.

The following kernel log, using linux 3.8-rc6 and DEBUG turned on in the
bluetooth & hidp modules illustrate the problem if neither of these
patches are used. I added a couple of extra printks in the
hidp_session() routine to show its progress. I also
tweaked hidp_session to avoid a possible kernel OOPS caused by this same
problem (which is avoided if the first patch was applied).
Lines with *** are my annotations to describe what is happening.

*** Kernel resuming - does usb reset as part of resume from suspend sequence ...
Feb 20 09:31:15 laptop2 kernel: [  334.204082] usb 3-2: reset full-speed USB device number 2 using uhci_hcd
Feb 20 09:31:15 laptop2 kernel: [  334.375693] btusb 3-2:1.0: no reset_resume for driver btusb?
Feb 20 09:31:15 laptop2 kernel: [  334.375697] btusb 3-2:1.1: no reset_resume for driver btusb?
Feb 20 09:31:15 laptop2 kernel: [  334.375710] hci_unregister_dev: f6119000 name hci0 bus 1
Feb 20 09:31:15 laptop2 kernel: [  334.375712] hci_dev_do_close: hci0 f6119000
Feb 20 09:31:15 laptop2 kernel: [  334.375716] hci_req_cancel: hci0 err 0x13
Feb 20 09:31:15 laptop2 kernel: [  334.375719] hci_conn_hash_flush: hdev hci0
Feb 20 09:31:15 laptop2 kernel: [  334.375722] l2cap_disconn_cfm: hcon e000bc00 reason 22
Feb 20 09:31:15 laptop2 kernel: [  334.375725] l2cap_conn_del: hcon e000bc00 conn f3294800, err 103
Feb 20 09:31:15 laptop2 kernel: [  334.375727] l2cap_chan_hold: chan e000a400 orig refcnt 3
Feb 20 09:31:15 laptop2 kernel: [  334.375730] l2cap_chan_del: chan e000a400, conn f3294800, err 103
Feb 20 09:31:15 laptop2 kernel: [  334.375732] l2cap_chan_put: chan e000a400 orig refcnt 4
Feb 20 09:31:15 laptop2 kernel: [  334.375734] hci_conn_put: hcon e000bc00 orig refcnt 2
Feb 20 09:31:15 laptop2 kernel: [  334.375742] l2cap_chan_put: chan e000a400 orig refcnt 3
Feb 20 09:31:15 laptop2 kernel: [  334.375744] l2cap_chan_hold: chan e0009400 orig refcnt 3
Feb 20 09:31:15 laptop2 kernel: [  334.375746] l2cap_chan_del: chan e0009400, conn f3294800, err 103
Feb 20 09:31:15 laptop2 kernel: [  334.375748] l2cap_chan_put: chan e0009400 orig refcnt 4
Feb 20 09:31:15 laptop2 kernel: [  334.375750] hci_conn_put: hcon e000bc00 orig refcnt 1
Feb 20 09:31:15 laptop2 kernel: [  334.375754] l2cap_chan_put: chan e0009400 orig refcnt 3
Feb 20 09:31:15 laptop2 kernel: [  334.375756] hci_chan_del: hci0 hcon e000bc00 chan e00d7dc0
*** the l2cap_disonn_cfm/l2cap_chan_del above triggers the hidp_session
*** thread, so hidp_session can now fall out its main loop.
*** hidp_session then does hid_destroy_device(), which through code in the power_supply
*** part of the kernel invokes a battery status lookup which requires a call to hidp_get_raw_report
Feb 20 09:31:15 laptop2 kernel: [  334.375765] hidp_session Out of main loop
Feb 20 09:31:15 laptop2 kernel: [  334.375767] hidp_session About to do hid_destroy_device
Feb 20 09:31:15 laptop2 kernel: [  334.375857] hidp_get_raw_report: Entering hidp_get_raw_report
Feb 20 09:31:15 laptop2 kernel: [  334.375860] __hidp_send_ctrl_message: session f3294f00 data e0115d36 size 1
*** hidp_get_raw_report sends a ctrl message and then waits. Meanwhile the main kernel gets
*** on with running the hci driver code ...
Feb 20 09:31:15 laptop2 kernel: [  334.376725] hci_conn_put: hcon e000bc00 orig refcnt 0
Feb 20 09:31:15 laptop2 kernel: [  334.376729] hci_conn_del: hci0 hcon e000bc00 handle 46
Feb 20 09:31:15 laptop2 kernel: [  334.376732] hci_chan_list_flush: hcon e000bc00
Feb 20 09:31:15 laptop2 kernel: [  334.376832] hci_dev_put: hci0 orig refcnt 12
Feb 20 09:31:15 laptop2 kernel: [  334.376835] hci_sock_dev_event: hdev hci0 event 4
Feb 20 09:31:15 laptop2 kernel: [  334.376839] hci_send_to_sock: hdev   (null) len 8
Feb 20 09:31:15 laptop2 kernel: [  334.376850] hci_dev_put: hci0 orig refcnt 11
Feb 20 09:31:15 laptop2 kernel: [  334.376854] hci_send_to_control: len 6
Feb 20 09:31:15 laptop2 kernel: [  334.376856] hci_sock_dev_event: hdev hci0 event 2
Feb 20 09:31:15 laptop2 kernel: [  334.376859] hci_send_to_sock: hdev   (null) len 8
Feb 20 09:31:15 laptop2 kernel: [  334.376864] hci_dev_put: hci0 orig refcnt 10
Feb 20 09:31:15 laptop2 kernel: [  334.376910] hci_del_sysfs: f6119000 name hci0 bus 1
Feb 20 09:31:15 laptop2 kernel: [  334.376994] hci_dev_put: hci0 orig refcnt 5
*** the 'hci' device is the sysfs tree has now just been removed - yet the various devices
*** dependent on it have yet to be removed (input, event, hidraw, hci_conn)

*** Note the 5 SECOND gap ... thats the timeout value in hidp_get_raw_report!
*** having timed out, the hidp_session thread gets underway again
*** the various devices are now removed (input, event, hidraw, hci_conn)
*** and life can go on
Feb 20 09:31:15 laptop2 kernel: [  339.372148] hidp_get_raw_report timed out!
Feb 20 09:31:15 laptop2 kernel: [  339.372153] power_supply hid-60:c5:47:1b:07:b8-battery: driver failed to report `capacity' property: -5
Feb 20 09:31:15 laptop2 kernel: [  339.376347] PM: resume of devices complete after 5491.694 msecs
Feb 20 09:31:15 laptop2 kernel: [  339.376362] hidp_session Done hid_destry_device
Feb 20 09:31:15 laptop2 kernel: [  339.376366] hci_conn_del_sysfs: conn e000bc00
Feb 20 09:31:15 laptop2 kernel: [  339.376392] hci_dev_put: hci0 orig refcnt 1
*** a new hci device is registered, and a new hci_conn, and sub-tree of input devices
*** will be added, which is normal and to be expected.
*** NOTE though whilst the new registration occurs here in this log, I have seen it
*** occur much earlier in other logs (different versions of kernel), leading to
*** various problems when a new hci_conn is added whilst the old one has yet to be deleted.
Feb 20 09:31:15 laptop2 kernel: [  339.376424] hci_register_dev: f3367000 name hci0 bus 1
Feb 20 09:31:15 laptop2 kernel: [  339.376475] hci_add_sysfs: f3367000 name hci0 bus 1
Feb 20 09:31:15 laptop2 kernel: [  339.376561] hci_sock_dev_event: hdev hci0 event 1
Feb 20 09:31:15 laptop2 kernel: [  339.376564] hci_send_to_sock: hdev   (null) len 8
Feb 20 09:31:15 laptop2 kernel: [  339.376568] hci_dev_hold: hci0 orig refcnt 5
Feb 20 09:31:15 laptop2 kernel: [  339.378210] bt_sock_poll: sock ef627500, sk f30afe00
Feb 20 09:31:15 laptop2 kernel: [  339.384283] bt_sock_poll: sock f2d95380, sk f675a600
Feb 20 09:31:15 laptop2 kernel: [  339.384291] bt_sock_poll: sock f5be0c00, sk f673f600
Feb 20 09:31:15 laptop2 kernel: [  339.384295] bt_sock_poll: sock f5be0180, sk f600ac00
Feb 20 09:31:15 laptop2 kernel: [  339.384298] bt_sock_poll: sock f5b30180, sk f66c4400
Feb 20 09:31:15 laptop2 kernel: [  339.384301] bt_sock_poll: sock f5b30300, sk f6691e00
Feb 20 09:31:15 laptop2 kernel: [  339.384303] bt_sock_poll: sock f5b30480, sk f673e000
Feb 20 09:31:15 laptop2 kernel: [  339.384306] bt_sock_poll: sock f5b30600, sk f673e600
Feb 20 09:31:15 laptop2 kernel: [  339.384309] bt_sock_poll: sock f5b30780, sk f675b800
Feb 20 09:31:15 laptop2 kernel: [  339.384311] bt_sock_poll: sock f5b30900, sk f675bc00
Feb 20 09:31:15 laptop2 kernel: [  339.384314] bt_sock_poll: sock f5b30a80, sk f675a800
Feb 20 09:31:15 laptop2 kernel: [  339.384320] bt_sock_poll: sock f2d94180, sk f66c4800
Feb 20 09:31:15 laptop2 kernel: [  339.384323] bt_sock_poll: sock f5b30000, sk f66c5200
Feb 20 09:31:15 laptop2 kernel: [  339.384325] bt_sock_poll: sock df8a8300, sk e968c800
Feb 20 09:31:15 laptop2 kernel: [  339.384328] bt_sock_poll: sock df8a8180, sk e968d600
Feb 20 09:31:15 laptop2 kernel: [  339.384357] hci_sock_recvmsg: sock f5be0780, sk f673ea00
Feb 20 09:31:15 laptop2 kernel: [  339.384574] l2cap_sock_release: sock df8a8300, sk e968c800
Feb 20 09:31:15 laptop2 kernel: [  339.384578] l2cap_sock_shutdown: sock df8a8300, sk e968c800
Feb 20 09:31:15 laptop2 kernel: [  339.384582] l2cap_chan_close: chan e000a400 state BT_CLOSED sk e968c800
Feb 20 09:31:15 laptop2 kernel: [  339.384586] l2cap_sock_kill: sk e968c800 state BT_CLOSED
Feb 20 09:31:15 laptop2 kernel: [  339.384589] l2cap_chan_put: chan e000a400 orig refcnt 2
Feb 20 09:31:15 laptop2 kernel: [  339.384592] l2cap_sock_destruct: sk e968c800
Feb 20 09:31:15 laptop2 kernel: [  339.384594] l2cap_chan_put: chan e000a400 orig refcnt 1
Feb 20 09:31:15 laptop2 kernel: [  339.384597] l2cap_chan_destroy: chan e000a400
Feb 20 09:31:15 laptop2 kernel: [  339.384656] l2cap_sock_release: sock df8a8180, sk e968d600
Feb 20 09:31:15 laptop2 kernel: [  339.384659] l2cap_sock_shutdown: sock df8a8180, sk e968d600
Feb 20 09:31:15 laptop2 kernel: [  339.384663] l2cap_chan_close: chan e0009400 state BT_CLOSED sk e968d600
Feb 20 09:31:15 laptop2 kernel: [  339.384666] l2cap_sock_kill: sk e968d600 state BT_CLOSED
Feb 20 09:31:15 laptop2 kernel: [  339.384668] l2cap_chan_put: chan e0009400 orig refcnt 2
Feb 20 09:31:15 laptop2 kernel: [  339.384670] l2cap_sock_destruct: sk e968d600
Feb 20 09:31:15 laptop2 kernel: [  339.384673] l2cap_chan_put: chan e0009400 orig refcnt 1
Feb 20 09:31:15 laptop2 kernel: [  339.384675] l2cap_chan_destroy: chan e0009400
Feb 20 09:31:15 laptop2 kernel: [  339.384707] bt_sock_poll: sock f2d95380, sk f675a600
Feb 20 09:31:15 laptop2 kernel: [  339.384710] bt_sock_poll: sock f5be0c00, sk f673f600
Feb 20 09:31:15 laptop2 kernel: [  339.384713] bt_sock_poll: sock f5be0180, sk f600ac00
Feb 20 09:31:15 laptop2 kernel: [  339.384715] bt_sock_poll: sock f5b30180, sk f66c4400
Feb 20 09:31:15 laptop2 kernel: [  339.384718] bt_sock_poll: sock f5b30300, sk f6691e00
Feb 20 09:31:15 laptop2 kernel: [  339.384720] bt_sock_poll: sock f5b30480, sk f673e000
Feb 20 09:31:15 laptop2 kernel: [  339.384722] bt_sock_poll: sock f5b30600, sk f673e600
Feb 20 09:31:15 laptop2 kernel: [  339.384725] bt_sock_poll: sock f5b30780, sk f675b800
Feb 20 09:31:15 laptop2 kernel: [  339.384727] bt_sock_poll: sock f5b30900, sk f675bc00
Feb 20 09:31:15 laptop2 kernel: [  339.384729] bt_sock_poll: sock f5b30a80, sk f675a800
Feb 20 09:31:15 laptop2 kernel: [  339.384733] bt_sock_poll: sock f2d94180, sk f66c4800
Feb 20 09:31:15 laptop2 kernel: [  339.384735] bt_sock_poll: sock f5b30000, sk f66c5200
Feb 20 09:31:15 laptop2 kernel: [  339.384743] hci_sock_recvmsg: sock f5be0780, sk f673ea00
Feb 20 09:31:15 laptop2 kernel: [  339.384771] hci_sock_release: sock f5be0600 sk f673e800
Feb 20 09:31:15 laptop2 kernel: [  339.384829] hci_sock_create: sock df4b0000
Feb 20 09:31:15 laptop2 kernel: [  339.384838] hci_sock_ioctl: cmd 800448d2 arg b885cb00
Feb 20 09:31:15 laptop2 kernel: [  339.384843] hci_sock_release: sock df4b0000 sk f1b82600
Feb 20 09:31:15 laptop2 kernel: [  339.385167] bt_accept_dequeue: parent f675a800
Feb 20 09:31:15 laptop2 kernel: [  339.385191] l2cap_sock_release: sock f5b30900, sk f675bc00
Feb 20 09:31:15 laptop2 kernel: [  339.385194] l2cap_sock_shutdown: sock f5b30900, sk f675bc00
Feb 20 09:31:15 laptop2 kernel: [  339.385197] l2cap_chan_close: chan f303b800 state BT_LISTEN sk f675bc00
Feb 20 09:31:15 laptop2 kernel: [  339.385200] l2cap_sock_cleanup_listen: parent f675bc00
Feb 20 09:31:15 laptop2 kernel: [  339.385202] bt_accept_dequeue: parent f675bc00
Feb 20 09:31:15 laptop2 kernel: [  339.385205] l2cap_sock_kill: sk f675bc00 state BT_CLOSED
Feb 20 09:31:15 laptop2 kernel: [  339.385208] l2cap_chan_put: chan f303b800 orig refcnt 2
Feb 20 09:31:15 laptop2 kernel: [  339.385210] l2cap_sock_destruct: sk f675bc00
Feb 20 09:31:15 laptop2 kernel: [  339.385213] l2cap_chan_put: chan f303b800 orig refcnt 1
Feb 20 09:31:15 laptop2 kernel: [  339.385215] l2cap_chan_destroy: chan f303b800
Feb 20 09:31:15 laptop2 kernel: [  339.385343] l2cap_sock_release: sock f5b30780, sk f675b800
Feb 20 09:31:15 laptop2 kernel: [  339.385346] l2cap_sock_shutdown: sock f5b30780, sk f675b800
Feb 20 09:31:15 laptop2 kernel: [  339.385350] l2cap_chan_close: chan f303b400 state BT_LISTEN sk f675b800
Feb 20 09:31:15 laptop2 kernel: [  339.385352] l2cap_sock_cleanup_listen: parent f675b800
Feb 20 09:31:15 laptop2 kernel: [  339.385354] bt_accept_dequeue: parent f675b800
Feb 20 09:31:15 laptop2 kernel: [  339.385357] l2cap_sock_kill: sk f675b800 state BT_CLOSED
Feb 20 09:31:15 laptop2 kernel: [  339.385360] l2cap_chan_put: chan f303b400 orig refcnt 2
Feb 20 09:31:15 laptop2 kernel: [  339.385362] l2cap_sock_destruct: sk f675b800
Feb 20 09:31:15 laptop2 kernel: [  339.385364] l2cap_chan_put: chan f303b400 orig refcnt 1
Feb 20 09:31:15 laptop2 kernel: [  339.385367] l2cap_chan_destroy: chan f303b400
Feb 20 09:31:15 laptop2 kernel: [  339.385386] bt_accept_dequeue: parent f673e600
Feb 20 09:31:15 laptop2 kernel: [  339.385397] bt_accept_dequeue: parent f6691e00
Feb 20 09:31:15 laptop2 kernel: [  339.385408] bt_accept_dequeue: parent f673e000
Feb 20 09:31:15 laptop2 kernel: [  339.394097] hci_rfkill_set_block: f3367000 name hci0 blocked 0
Feb 20 09:31:15 laptop2 kernel: [  339.394103] hci_power_on: hci0
Feb 20 09:31:15 laptop2 kernel: [  339.394106] hci_dev_get: 0
Feb 20 09:31:15 laptop2 kernel: [  339.394109] hci_dev_hold: hci0 orig refcnt 6
Feb 20 09:31:15 laptop2 kernel: [  339.394112] hci_dev_open: hci0 f3367000
Feb 20 09:31:15 laptop2 kernel: [  339.394136] __hci_request: hci0 start
Feb 20 09:31:15 laptop2 kernel: [  339.394139] hci_init_req: hci0 0
Feb 20 09:31:15 laptop2 kernel: [  339.394142] hci_reset_req: hci0 0
Feb 20 09:31:15 laptop2 kernel: [  339.394145] hci_send_cmd: hci0 opcode 0x0c03 plen 0
Feb 20 09:31:15 laptop2 kernel: [  339.394147] hci_send_cmd: skb len 3
Feb 20 09:31:15 laptop2 kernel: [  339.394155] hci_send_cmd: hci0 opcode 0x1003 plen 0
Feb 20 09:31:15 laptop2 kernel: [  339.394157] hci_send_cmd: skb len 3
Feb 20 09:31:15 laptop2 kernel: [  339.394160] hci_send_cmd: hci0 opcode 0x1001 plen 0
Feb 20 09:31:15 laptop2 kernel: [  339.394162] hci_send_cmd: skb len 3
Feb 20 09:31:15 laptop2 kernel: [  339.394168] hci_cmd_work: hci0 cmd_cnt 1 cmd queued 3
Feb 20 09:31:15 laptop2 kernel: [  339.394172] hci_send_frame: hci0 type 1 len 3
Feb 20 09:31:15 laptop2 kernel: [  339.395843] hci_rx_work: hci0
Feb 20 09:31:15 laptop2 kernel: [  339.395850] hci_rx_work: hci0 Event packet
Feb 20 09:31:15 laptop2 kernel: [  339.395854] hci_cmd_status_evt: hci0 opcode 0x0000
Feb 20 09:31:15 laptop2 kernel: [  339.376989] Restarting tasks ... done.

The result of the above is the following from udevadm monitor (i.e. the kernel udev events):
KERNEL - the kernel uevent
KERNEL[339.393586] remove   /devices/pci0000:00/0000:00:1a.0/usb3/3-2/3-2:1.0/bluetooth/hci0/rfkill0 (rfkill)
KERNEL[339.393605] remove   /devices/pci0000:00/0000:00:1a.0/usb3/3-2/3-2:1.0/bluetooth/hci0 (bluetooth)
KERNEL[339.393627] change   /devices/pci0000:00/0000:00:02.0/drm/card0 (drm)
KERNEL[339.393645] remove   /hci0/hci0:46/input10/event10 (input)
KERNEL[339.393673] remove   /hci0/hci0:46/input10 (input)
KERNEL[339.393692] remove   /hci0/hci0:46/0005:05AC:0256.0001/hidraw/hidraw0 (hidraw)
KERNEL[339.393711] remove   /hci0/hci0:46/0005:05AC:0256.0001 (hid)
KERNEL[339.393727] remove   /hci0/hci0:46 (bluetooth)
KERNEL[339.393743] add      /devices/pci0000:00/0000:00:1a.0/usb3/3-2/3-2:1.0/bluetooth/hci0 (bluetooth)
KERNEL[339.393761] add      /devices/pci0000:00/0000:00:1a.0/usb3/3-2/3-2:1.0/bluetooth/hci0/rfkill2 (rfkill)
KERNEL[345.114360] add      /devices/pci0000:00/0000:00:1a.0/usb3/3-2/3-2:1.0/bluetooth/hci0/hci0:46 (bluetooth)
KERNEL[345.384580] add      /devices/pci0000:00/0000:00:1a.0/usb3/3-2/3-2:1.0/bluetooth/hci0/hci0:46/0005:05AC:0256.0002 (hid)
KERNEL[345.424395] add      /devices/pci0000:00/0000:00:1a.0/usb3/3-2/3-2:1.0/bluetooth/hci0/hci0:46/input11 (input)
KERNEL[345.425001] add      /devices/pci0000:00/0000:00:1a.0/usb3/3-2/3-2:1.0/bluetooth/hci0/hci0:46/input11/event10 (input)
KERNEL[345.425312] add      /devices/pci0000:00/0000:00:1a.0/usb3/3-2/3-2:1.0/bluetooth/hci0/hci0:46/0005:05AC:0256.0002/hidraw/hidraw0 (hidraw)

Note the lines starting /hci0/hci0:46 ...

These lines are incorrect - the sysfs path is truncated, caused by device removal in the wrong order.
This can play havoc with userspace tools (like Xorg input device driver) that depends on these events.

The two patches posted resolve this. The first by avoiding the 5 second
timeout penalty (a trival fix). The second by delaying hci removal until
hci_conn references to it have all gone.

The bluetooth kernel code in its current form is problematic - hence all
kind of bug reports by users against their distributions.

Regards
Karl


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: hidp_get_raw_report source of 5 second delay in device removal
  2013-02-20 10:28 hidp_get_raw_report source of 5 second delay in device removal Karl Relton
@ 2013-02-20 15:44 ` David Herrmann
  2013-02-20 17:17   ` Karl Relton
  0 siblings, 1 reply; 6+ messages in thread
From: David Herrmann @ 2013-02-20 15:44 UTC (permalink / raw)
  To: Karl Relton; +Cc: linux-bluetooth

Hi Karl

On Wed, Feb 20, 2013 at 11:28 AM, Karl Relton
<karllinuxtest.relton@ntlworld.com> wrote:
> Dear friends
>
> Any thoughts on my two patches:
> - [PATCH] hidp: Make hidp_get_raw_report abort if the session is
> terminating
> - RFC [PATCH] hci: Fix race between hidp_session and hci code
>
> originally sent 29th Jan and resent 18th Feb?
>
> There has been no reply to this list ...
>
> The first patch is critical, the second makes everything neat and tidy.

I'm confused. I think the second one is critical, not the first one,
right? At least according to the list above. Anyway, both bugs are
known and need to be fixed. I reviewed both of your patches and the
second one needs major rework as I think it's the wrong fix.

Some minor comments below:

> The following kernel log, using linux 3.8-rc6 and DEBUG turned on in the
> bluetooth & hidp modules illustrate the problem if neither of these
> patches are used. I added a couple of extra printks in the
> hidp_session() routine to show its progress. I also
> tweaked hidp_session to avoid a possible kernel OOPS caused by this same
> problem (which is avoided if the first patch was applied).
> Lines with *** are my annotations to describe what is happening.
>
> *** Kernel resuming - does usb reset as part of resume from suspend sequence ...
> Feb 20 09:31:15 laptop2 kernel: [  334.204082] usb 3-2: reset full-speed USB device number 2 using uhci_hcd
> Feb 20 09:31:15 laptop2 kernel: [  334.375693] btusb 3-2:1.0: no reset_resume for driver btusb?
> Feb 20 09:31:15 laptop2 kernel: [  334.375697] btusb 3-2:1.1: no reset_resume for driver btusb?
> Feb 20 09:31:15 laptop2 kernel: [  334.375710] hci_unregister_dev: f6119000 name hci0 bus 1
> Feb 20 09:31:15 laptop2 kernel: [  334.375712] hci_dev_do_close: hci0 f6119000
> Feb 20 09:31:15 laptop2 kernel: [  334.375716] hci_req_cancel: hci0 err 0x13
> Feb 20 09:31:15 laptop2 kernel: [  334.375719] hci_conn_hash_flush: hdev hci0
> Feb 20 09:31:15 laptop2 kernel: [  334.375722] l2cap_disconn_cfm: hcon e000bc00 reason 22
> Feb 20 09:31:15 laptop2 kernel: [  334.375725] l2cap_conn_del: hcon e000bc00 conn f3294800, err 103
> Feb 20 09:31:15 laptop2 kernel: [  334.375727] l2cap_chan_hold: chan e000a400 orig refcnt 3
> Feb 20 09:31:15 laptop2 kernel: [  334.375730] l2cap_chan_del: chan e000a400, conn f3294800, err 103
> Feb 20 09:31:15 laptop2 kernel: [  334.375732] l2cap_chan_put: chan e000a400 orig refcnt 4
> Feb 20 09:31:15 laptop2 kernel: [  334.375734] hci_conn_put: hcon e000bc00 orig refcnt 2
> Feb 20 09:31:15 laptop2 kernel: [  334.375742] l2cap_chan_put: chan e000a400 orig refcnt 3
> Feb 20 09:31:15 laptop2 kernel: [  334.375744] l2cap_chan_hold: chan e0009400 orig refcnt 3
> Feb 20 09:31:15 laptop2 kernel: [  334.375746] l2cap_chan_del: chan e0009400, conn f3294800, err 103
> Feb 20 09:31:15 laptop2 kernel: [  334.375748] l2cap_chan_put: chan e0009400 orig refcnt 4
> Feb 20 09:31:15 laptop2 kernel: [  334.375750] hci_conn_put: hcon e000bc00 orig refcnt 1
> Feb 20 09:31:15 laptop2 kernel: [  334.375754] l2cap_chan_put: chan e0009400 orig refcnt 3
> Feb 20 09:31:15 laptop2 kernel: [  334.375756] hci_chan_del: hci0 hcon e000bc00 chan e00d7dc0
> *** the l2cap_disonn_cfm/l2cap_chan_del above triggers the hidp_session
> *** thread, so hidp_session can now fall out its main loop.
> *** hidp_session then does hid_destroy_device(), which through code in the power_supply
> *** part of the kernel invokes a battery status lookup which requires a call to hidp_get_raw_report

Yeah, same procedure applies to device-unplug. The calltrace is
totally valid and understandable. Your "atomic_inc(&hidp->terminate)"
patch fixes this problem properly.

> Feb 20 09:31:15 laptop2 kernel: [  334.375765] hidp_session Out of main loop
> Feb 20 09:31:15 laptop2 kernel: [  334.375767] hidp_session About to do hid_destroy_device
> Feb 20 09:31:15 laptop2 kernel: [  334.375857] hidp_get_raw_report: Entering hidp_get_raw_report
> Feb 20 09:31:15 laptop2 kernel: [  334.375860] __hidp_send_ctrl_message: session f3294f00 data e0115d36 size 1
> *** hidp_get_raw_report sends a ctrl message and then waits. Meanwhile the main kernel gets
> *** on with running the hci driver code ...
> Feb 20 09:31:15 laptop2 kernel: [  334.376725] hci_conn_put: hcon e000bc00 orig refcnt 0
> Feb 20 09:31:15 laptop2 kernel: [  334.376729] hci_conn_del: hci0 hcon e000bc00 handle 46
> Feb 20 09:31:15 laptop2 kernel: [  334.376732] hci_chan_list_flush: hcon e000bc00
> Feb 20 09:31:15 laptop2 kernel: [  334.376832] hci_dev_put: hci0 orig refcnt 12
> Feb 20 09:31:15 laptop2 kernel: [  334.376835] hci_sock_dev_event: hdev hci0 event 4
> Feb 20 09:31:15 laptop2 kernel: [  334.376839] hci_send_to_sock: hdev   (null) len 8
> Feb 20 09:31:15 laptop2 kernel: [  334.376850] hci_dev_put: hci0 orig refcnt 11
> Feb 20 09:31:15 laptop2 kernel: [  334.376854] hci_send_to_control: len 6
> Feb 20 09:31:15 laptop2 kernel: [  334.376856] hci_sock_dev_event: hdev hci0 event 2
> Feb 20 09:31:15 laptop2 kernel: [  334.376859] hci_send_to_sock: hdev   (null) len 8
> Feb 20 09:31:15 laptop2 kernel: [  334.376864] hci_dev_put: hci0 orig refcnt 10
> Feb 20 09:31:15 laptop2 kernel: [  334.376910] hci_del_sysfs: f6119000 name hci0 bus 1
> Feb 20 09:31:15 laptop2 kernel: [  334.376994] hci_dev_put: hci0 orig refcnt 5
> *** the 'hci' device is the sysfs tree has now just been removed - yet the various devices
> *** dependent on it have yet to be removed (input, event, hidraw, hci_conn)


This is a bug, indeed. We need to notify all child devices first so
they can get deleted. However, delaying the HCI removal is _wrong_!
The HCI device must be deleted here.
But we must simply forward the delete-request to the childs before
doing the deletion.

> *** Note the 5 SECOND gap ... thats the timeout value in hidp_get_raw_report!
> *** having timed out, the hidp_session thread gets underway again
> *** the various devices are now removed (input, event, hidraw, hci_conn)
> *** and life can go on
> Feb 20 09:31:15 laptop2 kernel: [  339.372148] hidp_get_raw_report timed out!
> Feb 20 09:31:15 laptop2 kernel: [  339.372153] power_supply hid-60:c5:47:1b:07:b8-battery: driver failed to report `capacity' property: -5
> Feb 20 09:31:15 laptop2 kernel: [  339.376347] PM: resume of devices complete after 5491.694 msecs
> Feb 20 09:31:15 laptop2 kernel: [  339.376362] hidp_session Done hid_destry_device
> Feb 20 09:31:15 laptop2 kernel: [  339.376366] hci_conn_del_sysfs: conn e000bc00
> Feb 20 09:31:15 laptop2 kernel: [  339.376392] hci_dev_put: hci0 orig refcnt 1
> *** a new hci device is registered, and a new hci_conn, and sub-tree of input devices
> *** will be added, which is normal and to be expected.
> *** NOTE though whilst the new registration occurs here in this log, I have seen it
> *** occur much earlier in other logs (different versions of kernel), leading to
> *** various problems when a new hci_conn is added whilst the old one has yet to be deleted.

No! There is no problem with adding a new hci_conn while another one
is still to be removed. This totally normal and _must_ be supported.
As long as both hci_conn objects have different IDs this is correctly
understood by user-space, too.

We fixed the ID issue two years ago and this is no longer a problem
here. If you see a bug where a hci_conn with an equal ID is
registered, please report it. This would be a bug then.

> Feb 20 09:31:15 laptop2 kernel: [  339.376424] hci_register_dev: f3367000 name hci0 bus 1
> Feb 20 09:31:15 laptop2 kernel: [  339.376475] hci_add_sysfs: f3367000 name hci0 bus 1
> Feb 20 09:31:15 laptop2 kernel: [  339.376561] hci_sock_dev_event: hdev hci0 event 1
> Feb 20 09:31:15 laptop2 kernel: [  339.376564] hci_send_to_sock: hdev   (null) len 8
> Feb 20 09:31:15 laptop2 kernel: [  339.376568] hci_dev_hold: hci0 orig refcnt 5
> Feb 20 09:31:15 laptop2 kernel: [  339.378210] bt_sock_poll: sock ef627500, sk f30afe00
> Feb 20 09:31:15 laptop2 kernel: [  339.384283] bt_sock_poll: sock f2d95380, sk f675a600
> Feb 20 09:31:15 laptop2 kernel: [  339.384291] bt_sock_poll: sock f5be0c00, sk f673f600
> Feb 20 09:31:15 laptop2 kernel: [  339.384295] bt_sock_poll: sock f5be0180, sk f600ac00
> Feb 20 09:31:15 laptop2 kernel: [  339.384298] bt_sock_poll: sock f5b30180, sk f66c4400
> Feb 20 09:31:15 laptop2 kernel: [  339.384301] bt_sock_poll: sock f5b30300, sk f6691e00
> Feb 20 09:31:15 laptop2 kernel: [  339.384303] bt_sock_poll: sock f5b30480, sk f673e000
> Feb 20 09:31:15 laptop2 kernel: [  339.384306] bt_sock_poll: sock f5b30600, sk f673e600
> Feb 20 09:31:15 laptop2 kernel: [  339.384309] bt_sock_poll: sock f5b30780, sk f675b800
> Feb 20 09:31:15 laptop2 kernel: [  339.384311] bt_sock_poll: sock f5b30900, sk f675bc00
> Feb 20 09:31:15 laptop2 kernel: [  339.384314] bt_sock_poll: sock f5b30a80, sk f675a800
> Feb 20 09:31:15 laptop2 kernel: [  339.384320] bt_sock_poll: sock f2d94180, sk f66c4800
> Feb 20 09:31:15 laptop2 kernel: [  339.384323] bt_sock_poll: sock f5b30000, sk f66c5200
> Feb 20 09:31:15 laptop2 kernel: [  339.384325] bt_sock_poll: sock df8a8300, sk e968c800
> Feb 20 09:31:15 laptop2 kernel: [  339.384328] bt_sock_poll: sock df8a8180, sk e968d600
> Feb 20 09:31:15 laptop2 kernel: [  339.384357] hci_sock_recvmsg: sock f5be0780, sk f673ea00
> Feb 20 09:31:15 laptop2 kernel: [  339.384574] l2cap_sock_release: sock df8a8300, sk e968c800
> Feb 20 09:31:15 laptop2 kernel: [  339.384578] l2cap_sock_shutdown: sock df8a8300, sk e968c800
> Feb 20 09:31:15 laptop2 kernel: [  339.384582] l2cap_chan_close: chan e000a400 state BT_CLOSED sk e968c800
> Feb 20 09:31:15 laptop2 kernel: [  339.384586] l2cap_sock_kill: sk e968c800 state BT_CLOSED
> Feb 20 09:31:15 laptop2 kernel: [  339.384589] l2cap_chan_put: chan e000a400 orig refcnt 2
> Feb 20 09:31:15 laptop2 kernel: [  339.384592] l2cap_sock_destruct: sk e968c800
> Feb 20 09:31:15 laptop2 kernel: [  339.384594] l2cap_chan_put: chan e000a400 orig refcnt 1
> Feb 20 09:31:15 laptop2 kernel: [  339.384597] l2cap_chan_destroy: chan e000a400
> Feb 20 09:31:15 laptop2 kernel: [  339.384656] l2cap_sock_release: sock df8a8180, sk e968d600
> Feb 20 09:31:15 laptop2 kernel: [  339.384659] l2cap_sock_shutdown: sock df8a8180, sk e968d600
> Feb 20 09:31:15 laptop2 kernel: [  339.384663] l2cap_chan_close: chan e0009400 state BT_CLOSED sk e968d600
> Feb 20 09:31:15 laptop2 kernel: [  339.384666] l2cap_sock_kill: sk e968d600 state BT_CLOSED
> Feb 20 09:31:15 laptop2 kernel: [  339.384668] l2cap_chan_put: chan e0009400 orig refcnt 2
> Feb 20 09:31:15 laptop2 kernel: [  339.384670] l2cap_sock_destruct: sk e968d600
> Feb 20 09:31:15 laptop2 kernel: [  339.384673] l2cap_chan_put: chan e0009400 orig refcnt 1
> Feb 20 09:31:15 laptop2 kernel: [  339.384675] l2cap_chan_destroy: chan e0009400
> Feb 20 09:31:15 laptop2 kernel: [  339.384707] bt_sock_poll: sock f2d95380, sk f675a600
> Feb 20 09:31:15 laptop2 kernel: [  339.384710] bt_sock_poll: sock f5be0c00, sk f673f600
> Feb 20 09:31:15 laptop2 kernel: [  339.384713] bt_sock_poll: sock f5be0180, sk f600ac00
> Feb 20 09:31:15 laptop2 kernel: [  339.384715] bt_sock_poll: sock f5b30180, sk f66c4400
> Feb 20 09:31:15 laptop2 kernel: [  339.384718] bt_sock_poll: sock f5b30300, sk f6691e00
> Feb 20 09:31:15 laptop2 kernel: [  339.384720] bt_sock_poll: sock f5b30480, sk f673e000
> Feb 20 09:31:15 laptop2 kernel: [  339.384722] bt_sock_poll: sock f5b30600, sk f673e600
> Feb 20 09:31:15 laptop2 kernel: [  339.384725] bt_sock_poll: sock f5b30780, sk f675b800
> Feb 20 09:31:15 laptop2 kernel: [  339.384727] bt_sock_poll: sock f5b30900, sk f675bc00
> Feb 20 09:31:15 laptop2 kernel: [  339.384729] bt_sock_poll: sock f5b30a80, sk f675a800
> Feb 20 09:31:15 laptop2 kernel: [  339.384733] bt_sock_poll: sock f2d94180, sk f66c4800
> Feb 20 09:31:15 laptop2 kernel: [  339.384735] bt_sock_poll: sock f5b30000, sk f66c5200
> Feb 20 09:31:15 laptop2 kernel: [  339.384743] hci_sock_recvmsg: sock f5be0780, sk f673ea00
> Feb 20 09:31:15 laptop2 kernel: [  339.384771] hci_sock_release: sock f5be0600 sk f673e800
> Feb 20 09:31:15 laptop2 kernel: [  339.384829] hci_sock_create: sock df4b0000
> Feb 20 09:31:15 laptop2 kernel: [  339.384838] hci_sock_ioctl: cmd 800448d2 arg b885cb00
> Feb 20 09:31:15 laptop2 kernel: [  339.384843] hci_sock_release: sock df4b0000 sk f1b82600
> Feb 20 09:31:15 laptop2 kernel: [  339.385167] bt_accept_dequeue: parent f675a800
> Feb 20 09:31:15 laptop2 kernel: [  339.385191] l2cap_sock_release: sock f5b30900, sk f675bc00
> Feb 20 09:31:15 laptop2 kernel: [  339.385194] l2cap_sock_shutdown: sock f5b30900, sk f675bc00
> Feb 20 09:31:15 laptop2 kernel: [  339.385197] l2cap_chan_close: chan f303b800 state BT_LISTEN sk f675bc00
> Feb 20 09:31:15 laptop2 kernel: [  339.385200] l2cap_sock_cleanup_listen: parent f675bc00
> Feb 20 09:31:15 laptop2 kernel: [  339.385202] bt_accept_dequeue: parent f675bc00
> Feb 20 09:31:15 laptop2 kernel: [  339.385205] l2cap_sock_kill: sk f675bc00 state BT_CLOSED
> Feb 20 09:31:15 laptop2 kernel: [  339.385208] l2cap_chan_put: chan f303b800 orig refcnt 2
> Feb 20 09:31:15 laptop2 kernel: [  339.385210] l2cap_sock_destruct: sk f675bc00
> Feb 20 09:31:15 laptop2 kernel: [  339.385213] l2cap_chan_put: chan f303b800 orig refcnt 1
> Feb 20 09:31:15 laptop2 kernel: [  339.385215] l2cap_chan_destroy: chan f303b800
> Feb 20 09:31:15 laptop2 kernel: [  339.385343] l2cap_sock_release: sock f5b30780, sk f675b800
> Feb 20 09:31:15 laptop2 kernel: [  339.385346] l2cap_sock_shutdown: sock f5b30780, sk f675b800
> Feb 20 09:31:15 laptop2 kernel: [  339.385350] l2cap_chan_close: chan f303b400 state BT_LISTEN sk f675b800
> Feb 20 09:31:15 laptop2 kernel: [  339.385352] l2cap_sock_cleanup_listen: parent f675b800
> Feb 20 09:31:15 laptop2 kernel: [  339.385354] bt_accept_dequeue: parent f675b800
> Feb 20 09:31:15 laptop2 kernel: [  339.385357] l2cap_sock_kill: sk f675b800 state BT_CLOSED
> Feb 20 09:31:15 laptop2 kernel: [  339.385360] l2cap_chan_put: chan f303b400 orig refcnt 2
> Feb 20 09:31:15 laptop2 kernel: [  339.385362] l2cap_sock_destruct: sk f675b800
> Feb 20 09:31:15 laptop2 kernel: [  339.385364] l2cap_chan_put: chan f303b400 orig refcnt 1
> Feb 20 09:31:15 laptop2 kernel: [  339.385367] l2cap_chan_destroy: chan f303b400
> Feb 20 09:31:15 laptop2 kernel: [  339.385386] bt_accept_dequeue: parent f673e600
> Feb 20 09:31:15 laptop2 kernel: [  339.385397] bt_accept_dequeue: parent f6691e00
> Feb 20 09:31:15 laptop2 kernel: [  339.385408] bt_accept_dequeue: parent f673e000
> Feb 20 09:31:15 laptop2 kernel: [  339.394097] hci_rfkill_set_block: f3367000 name hci0 blocked 0
> Feb 20 09:31:15 laptop2 kernel: [  339.394103] hci_power_on: hci0
> Feb 20 09:31:15 laptop2 kernel: [  339.394106] hci_dev_get: 0
> Feb 20 09:31:15 laptop2 kernel: [  339.394109] hci_dev_hold: hci0 orig refcnt 6
> Feb 20 09:31:15 laptop2 kernel: [  339.394112] hci_dev_open: hci0 f3367000
> Feb 20 09:31:15 laptop2 kernel: [  339.394136] __hci_request: hci0 start
> Feb 20 09:31:15 laptop2 kernel: [  339.394139] hci_init_req: hci0 0
> Feb 20 09:31:15 laptop2 kernel: [  339.394142] hci_reset_req: hci0 0
> Feb 20 09:31:15 laptop2 kernel: [  339.394145] hci_send_cmd: hci0 opcode 0x0c03 plen 0
> Feb 20 09:31:15 laptop2 kernel: [  339.394147] hci_send_cmd: skb len 3
> Feb 20 09:31:15 laptop2 kernel: [  339.394155] hci_send_cmd: hci0 opcode 0x1003 plen 0
> Feb 20 09:31:15 laptop2 kernel: [  339.394157] hci_send_cmd: skb len 3
> Feb 20 09:31:15 laptop2 kernel: [  339.394160] hci_send_cmd: hci0 opcode 0x1001 plen 0
> Feb 20 09:31:15 laptop2 kernel: [  339.394162] hci_send_cmd: skb len 3
> Feb 20 09:31:15 laptop2 kernel: [  339.394168] hci_cmd_work: hci0 cmd_cnt 1 cmd queued 3
> Feb 20 09:31:15 laptop2 kernel: [  339.394172] hci_send_frame: hci0 type 1 len 3
> Feb 20 09:31:15 laptop2 kernel: [  339.395843] hci_rx_work: hci0
> Feb 20 09:31:15 laptop2 kernel: [  339.395850] hci_rx_work: hci0 Event packet
> Feb 20 09:31:15 laptop2 kernel: [  339.395854] hci_cmd_status_evt: hci0 opcode 0x0000
> Feb 20 09:31:15 laptop2 kernel: [  339.376989] Restarting tasks ... done.
>
> The result of the above is the following from udevadm monitor (i.e. the kernel udev events):
> KERNEL - the kernel uevent
> KERNEL[339.393586] remove   /devices/pci0000:00/0000:00:1a.0/usb3/3-2/3-2:1.0/bluetooth/hci0/rfkill0 (rfkill)
> KERNEL[339.393605] remove   /devices/pci0000:00/0000:00:1a.0/usb3/3-2/3-2:1.0/bluetooth/hci0 (bluetooth)
> KERNEL[339.393627] change   /devices/pci0000:00/0000:00:02.0/drm/card0 (drm)
> KERNEL[339.393645] remove   /hci0/hci0:46/input10/event10 (input)
> KERNEL[339.393673] remove   /hci0/hci0:46/input10 (input)
> KERNEL[339.393692] remove   /hci0/hci0:46/0005:05AC:0256.0001/hidraw/hidraw0 (hidraw)
> KERNEL[339.393711] remove   /hci0/hci0:46/0005:05AC:0256.0001 (hid)
> KERNEL[339.393727] remove   /hci0/hci0:46 (bluetooth)
> KERNEL[339.393743] add      /devices/pci0000:00/0000:00:1a.0/usb3/3-2/3-2:1.0/bluetooth/hci0 (bluetooth)
> KERNEL[339.393761] add      /devices/pci0000:00/0000:00:1a.0/usb3/3-2/3-2:1.0/bluetooth/hci0/rfkill2 (rfkill)
> KERNEL[345.114360] add      /devices/pci0000:00/0000:00:1a.0/usb3/3-2/3-2:1.0/bluetooth/hci0/hci0:46 (bluetooth)
> KERNEL[345.384580] add      /devices/pci0000:00/0000:00:1a.0/usb3/3-2/3-2:1.0/bluetooth/hci0/hci0:46/0005:05AC:0256.0002 (hid)
> KERNEL[345.424395] add      /devices/pci0000:00/0000:00:1a.0/usb3/3-2/3-2:1.0/bluetooth/hci0/hci0:46/input11 (input)
> KERNEL[345.425001] add      /devices/pci0000:00/0000:00:1a.0/usb3/3-2/3-2:1.0/bluetooth/hci0/hci0:46/input11/event10 (input)
> KERNEL[345.425312] add      /devices/pci0000:00/0000:00:1a.0/usb3/3-2/3-2:1.0/bluetooth/hci0/hci0:46/0005:05AC:0256.0002/hidraw/hidraw0 (hidraw)
>
> Note the lines starting /hci0/hci0:46 ...
>
> These lines are incorrect - the sysfs path is truncated, caused by device removal in the wrong order.
> This can play havoc with userspace tools (like Xorg input device driver) that depends on these events.
>
> The two patches posted resolve this. The first by avoiding the 5 second
> timeout penalty (a trival fix). The second by delaying hci removal until
> hci_conn references to it have all gone.
>
> The bluetooth kernel code in its current form is problematic - hence all
> kind of bug reports by users against their distributions.

I am currently reworking the HIDP layer. The whole module is _very_
buggy. I will base my rework on your patches so feel free to resend
them. If you don't resend them, I will fix it myself, but this will
mean they get not applied right away as I will probably include them
in my rework which might take longer.

Also, please CC the bluetooth kernel maintainers when sending patches!
This includes Marcel Holtmann, Johan Hedberg and Gustavo Padovan, I
think.

Thanks
David

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: hidp_get_raw_report source of 5 second delay in device removal
  2013-02-20 15:44 ` David Herrmann
@ 2013-02-20 17:17   ` Karl Relton
  2013-02-20 19:22     ` David Herrmann
  0 siblings, 1 reply; 6+ messages in thread
From: Karl Relton @ 2013-02-20 17:17 UTC (permalink / raw)
  To: David Herrmann; +Cc: linux-bluetooth

On Wed, 2013-02-20 at 16:44 +0100, David Herrmann wrote:
> >
> > The first patch is critical, the second makes everything neat and tidy.
> 
> I'm confused. I think the second one is critical, not the first one,
> right? At least according to the list above. Anyway, both bugs are
> known and need to be fixed. I reviewed both of your patches and the
> second one needs major rework as I think it's the wrong fix.
> 

Well its semantics and one's point of view. The 5 second delay is a big
hit ... and can have other side effects such as making hidp_session race
against the sock structure eventually being orphaned (which leads to a
kernel OOPS). Anyway - at least we are both on the same page now.

> Some minor comments below:
> 
> > Feb 20 09:31:15 laptop2 kernel: [  334.375756] hci_chan_del: hci0 hcon e000bc00 chan e00d7dc0
> > *** the l2cap_disonn_cfm/l2cap_chan_del above triggers the hidp_session
> > *** thread, so hidp_session can now fall out its main loop.
> > *** hidp_session then does hid_destroy_device(), which through code in the power_supply
> > *** part of the kernel invokes a battery status lookup which requires a call to hidp_get_raw_report
> 
> Yeah, same procedure applies to device-unplug. The calltrace is
> totally valid and understandable. Your "atomic_inc(&hidp->terminate)"
> patch fixes this problem properly.
> 

Yep - I'll resend with your comment there taken on board.

> > Feb 20 09:31:15 laptop2 kernel: [  334.375765] hidp_session Out of main loop
> > Feb 20 09:31:15 laptop2 kernel: [  334.375767] hidp_session About to do hid_destroy_device
> > Feb 20 09:31:15 laptop2 kernel: [  334.375857] hidp_get_raw_report: Entering hidp_get_raw_report
> > Feb 20 09:31:15 laptop2 kernel: [  334.375860] __hidp_send_ctrl_message: session f3294f00 data e0115d36 size 1
> > *** hidp_get_raw_report sends a ctrl message and then waits. Meanwhile the main kernel gets
> > *** on with running the hci driver code ...
> > Feb 20 09:31:15 laptop2 kernel: [  334.376725] hci_conn_put: hcon e000bc00 orig refcnt 0
> > Feb 20 09:31:15 laptop2 kernel: [  334.376729] hci_conn_del: hci0 hcon e000bc00 handle 46
> > Feb 20 09:31:15 laptop2 kernel: [  334.376732] hci_chan_list_flush: hcon e000bc00
> > Feb 20 09:31:15 laptop2 kernel: [  334.376832] hci_dev_put: hci0 orig refcnt 12
> > Feb 20 09:31:15 laptop2 kernel: [  334.376835] hci_sock_dev_event: hdev hci0 event 4
> > Feb 20 09:31:15 laptop2 kernel: [  334.376839] hci_send_to_sock: hdev   (null) len 8
> > Feb 20 09:31:15 laptop2 kernel: [  334.376850] hci_dev_put: hci0 orig refcnt 11
> > Feb 20 09:31:15 laptop2 kernel: [  334.376854] hci_send_to_control: len 6
> > Feb 20 09:31:15 laptop2 kernel: [  334.376856] hci_sock_dev_event: hdev hci0 event 2
> > Feb 20 09:31:15 laptop2 kernel: [  334.376859] hci_send_to_sock: hdev   (null) len 8
> > Feb 20 09:31:15 laptop2 kernel: [  334.376864] hci_dev_put: hci0 orig refcnt 10
> > Feb 20 09:31:15 laptop2 kernel: [  334.376910] hci_del_sysfs: f6119000 name hci0 bus 1
> > Feb 20 09:31:15 laptop2 kernel: [  334.376994] hci_dev_put: hci0 orig refcnt 5
> > *** the 'hci' device is the sysfs tree has now just been removed - yet the various devices
> > *** dependent on it have yet to be removed (input, event, hidraw, hci_conn)
> 
> 
> This is a bug, indeed. We need to notify all child devices first so
> they can get deleted. However, delaying the HCI removal is _wrong_!
> The HCI device must be deleted here.
> But we must simply forward the delete-request to the childs before
> doing the deletion.
> 

Why 'HCI device must be deleted here'? I guess there is a reason, but it
is not obvious to me.

My suggested fix simply follows the same logic as hci_conn_put_device,
which only does sysfs removal once all users of the HCI_CONN device have
finished.


> > *** Note the 5 SECOND gap ... thats the timeout value in hidp_get_raw_report!
> > *** having timed out, the hidp_session thread gets underway again
> > *** the various devices are now removed (input, event, hidraw, hci_conn)
> > *** and life can go on
> > Feb 20 09:31:15 laptop2 kernel: [  339.372148] hidp_get_raw_report timed out!
> > Feb 20 09:31:15 laptop2 kernel: [  339.372153] power_supply hid-60:c5:47:1b:07:b8-battery: driver failed to report `capacity' property: -5
> > Feb 20 09:31:15 laptop2 kernel: [  339.376347] PM: resume of devices complete after 5491.694 msecs
> > Feb 20 09:31:15 laptop2 kernel: [  339.376362] hidp_session Done hid_destry_device
> > Feb 20 09:31:15 laptop2 kernel: [  339.376366] hci_conn_del_sysfs: conn e000bc00
> > Feb 20 09:31:15 laptop2 kernel: [  339.376392] hci_dev_put: hci0 orig refcnt 1
> > *** a new hci device is registered, and a new hci_conn, and sub-tree of input devices
> > *** will be added, which is normal and to be expected.
> > *** NOTE though whilst the new registration occurs here in this log, I have seen it
> > *** occur much earlier in other logs (different versions of kernel), leading to
> > *** various problems when a new hci_conn is added whilst the old one has yet to be deleted.
> 
> No! There is no problem with adding a new hci_conn while another one
> is still to be removed. This totally normal and _must_ be supported.
> As long as both hci_conn objects have different IDs this is correctly
> understood by user-space, too.
> 
> We fixed the ID issue two years ago and this is no longer a problem
> here. If you see a bug where a hci_conn with an equal ID is
> registered, please report it. This would be a bug then.
> 

Fair enough - I have seen a new hci_conn with an equal ID, but I have
been playing with different kernels between 3.2 and 3.8. So unless I see
it again with 3.8 or later I'll trust that particular issue is indeed
resolved.

> > The bluetooth kernel code in its current form is problematic - hence all
> > kind of bug reports by users against their distributions.
> 
> I am currently reworking the HIDP layer. The whole module is _very_
> buggy. I will base my rework on your patches so feel free to resend
> them. If you don't resend them, I will fix it myself, but this will
> mean they get not applied right away as I will probably include them
> in my rework which might take longer.
> 
> Also, please CC the bluetooth kernel maintainers when sending patches!
> This includes Marcel Holtmann, Johan Hedberg and Gustavo Padovan, I
> think.
> 
Thankyou David. I'll do the minor alter to the first patch. I'll take up
discussion on the 2nd in that email thread later.

Karl



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: hidp_get_raw_report source of 5 second delay in device removal
  2013-02-20 17:17   ` Karl Relton
@ 2013-02-20 19:22     ` David Herrmann
  2013-02-20 19:43       ` Karl Relton
  0 siblings, 1 reply; 6+ messages in thread
From: David Herrmann @ 2013-02-20 19:22 UTC (permalink / raw)
  To: Karl Relton; +Cc: linux-bluetooth

Hi Karl

On Wed, Feb 20, 2013 at 6:17 PM, Karl Relton
<karllinuxtest.relton@ntlworld.com> wrote:
>
> Why 'HCI device must be deleted here'? I guess there is a reason, but it
> is not obvious to me.

Ok let me first explain the device-lifetime concept of the kernel:

A device is an object with at least 4 functions on it:
 - get_device()
     Increases the reference-count of the device object. This
     guarantees that the memory-object of the device itself
     is not free()'ed as long as you hold that reference. Nothing
     else is guaranteed about the object!
 - put_device()
     Decreases the reference-count of the device object. You _must_
     have called get_device() before! If the reference-count drops
     to 0, there are no more users and the object is freed.
 - device_add()
     Puts 'life' into the object. This adds the object to the global
     tree of _living_ objects. You must hold a reference to the
     device when calling this.
     User-space is informed at this time about the object.
 - device_del()
     'Kills' an object. The object is not deleted but killed. That
     is, it is no longer linked into the tree of living objects, but
     the object still exists.
     User-space is informed at this time that the object is gone.

All other calls like "register_device()", "unregister_device()", ...
are based on these four calls. There is one more,
"device_initialize()", but this is basically "get_device()" and should
be used during device initialization _instead_ of the first call to
"get_device()".

So how are these calls supposed to be used?
When you create a device, you call "device_initialize()", set the
device up, initialize everything and keep it around for your own use.
When the device wakes up and you want to put life into it, you call
"device_add()". This is equivalent to saying: "this device is now
ready for operation and the underlying hardware is fully functional".
You now operate on the device, pass it around and do a lot of
get_device() and put_device().
Now, when the underlying hardware shuts down, the device shouldn't be
used anymore. So what do you do? You call "device_del()". This
notifies all users of the device, that it is dead. Any operation on it
that requires a _living_ device, will then _probably_ fail with
-ENODEV. However, any operation that _doesn't_ require a living device
will still work. This is, because "device_del()" doesn't destroy
devices, it just marks them as dead.
After you called "device_del()", a device is normally supposed to
eventually fade away and get destroyed. This is, because you normally
_never_ revive an object (you create a new object instead). So with a
call to "device_del()" you should also inform all its users that the
device is gone. Bus systems do this by unloading the driver of the
device, other systems do this by notifying the users via some other
way. However, every user that holds the device must somehow be
informed, that it's dead and they should release it.
After all users released it (while maybe asynchronously waiting for
user-space or hardware reactions), the device gets deleted when the
last user calls put_device().

As you might notice, there is always _one_ piece of (maybe virtual)
hardware that controls the device. When the hardware shows up, you
call device_add(), when the hardware goes away, you call device_del().
You cannot control when hardware goes away, so it does *never* *ever*
makes sense to delay a device_del() call for an unknown amount of
time. The hardware is _gone_, so why should the device stay? Really,
try to find any _logical_ reason why the device should stay?

So what does that mean for hci_conn? It basically means
hci_conn_hold_device() and hci_conn_put_device() are bullsh**. Use
hci_conn_hold() and hci_conn_put() instead.
Additionally, we need some way to inform HIDP that the used hci_conn
is dead _before_ calling device_del() on hci_conn. How do we do that?
One idea would be to link each hidp object into hci_conn with a
callback that we call. HIDP can then delete its child devices and
continue.

But more importantly, never ever call "device_del()" from within the
_release_ callback of the device itself! If you do that, you didn't
understand the idea behind the _get, _add, _del, _put functions.
Besides, you are _required_ to hold a reference to a device when
calling "device_del()". If you do that in the "release" callback, you
violate this rule.

> My suggested fix simply follows the same logic as hci_conn_put_device,
> which only does sysfs removal once all users of the HCI_CONN device have
> finished.

Yeah, hci_conn is the culprit here. It's broken. Don't copy its behavior.

>
>> > *** Note the 5 SECOND gap ... thats the timeout value in hidp_get_raw_report!
>> > *** having timed out, the hidp_session thread gets underway again
>> > *** the various devices are now removed (input, event, hidraw, hci_conn)
>> > *** and life can go on
>> > Feb 20 09:31:15 laptop2 kernel: [  339.372148] hidp_get_raw_report timed out!
>> > Feb 20 09:31:15 laptop2 kernel: [  339.372153] power_supply hid-60:c5:47:1b:07:b8-battery: driver failed to report `capacity' property: -5
>> > Feb 20 09:31:15 laptop2 kernel: [  339.376347] PM: resume of devices complete after 5491.694 msecs
>> > Feb 20 09:31:15 laptop2 kernel: [  339.376362] hidp_session Done hid_destry_device
>> > Feb 20 09:31:15 laptop2 kernel: [  339.376366] hci_conn_del_sysfs: conn e000bc00
>> > Feb 20 09:31:15 laptop2 kernel: [  339.376392] hci_dev_put: hci0 orig refcnt 1
>> > *** a new hci device is registered, and a new hci_conn, and sub-tree of input devices
>> > *** will be added, which is normal and to be expected.
>> > *** NOTE though whilst the new registration occurs here in this log, I have seen it
>> > *** occur much earlier in other logs (different versions of kernel), leading to
>> > *** various problems when a new hci_conn is added whilst the old one has yet to be deleted.
>>
>> No! There is no problem with adding a new hci_conn while another one
>> is still to be removed. This totally normal and _must_ be supported.
>> As long as both hci_conn objects have different IDs this is correctly
>> understood by user-space, too.
>>
>> We fixed the ID issue two years ago and this is no longer a problem
>> here. If you see a bug where a hci_conn with an equal ID is
>> registered, please report it. This would be a bug then.
>>
>
> Fair enough - I have seen a new hci_conn with an equal ID, but I have
> been playing with different kernels between 3.2 and 3.8. So unless I see
> it again with 3.8 or later I'll trust that particular issue is indeed
> resolved.

It might not be solved, yet. But it's definitely some other bug then.

I hope this cleared things up a bit. It's not your code, so you are
not to blame. And honestly, I don't know who to blame. But IMO this
needs to be fixed asap.

Btw., the _ref, _add, _del, _unref cycle is something that is used in
many other places. Unfortunately, many people get it wrong..

I will try to fix the mess this week.
Cheers
David

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: hidp_get_raw_report source of 5 second delay in device removal
  2013-02-20 19:22     ` David Herrmann
@ 2013-02-20 19:43       ` Karl Relton
  2013-02-20 20:15         ` David Herrmann
  0 siblings, 1 reply; 6+ messages in thread
From: Karl Relton @ 2013-02-20 19:43 UTC (permalink / raw)
  To: David Herrmann; +Cc: linux-bluetooth

On Wed, 2013-02-20 at 20:22 +0100, David Herrmann wrote:
> Hi Karl
> 
> On Wed, Feb 20, 2013 at 6:17 PM, Karl Relton
> <karllinuxtest.relton@ntlworld.com> wrote:
> >
> > Why 'HCI device must be deleted here'? I guess there is a reason, but it
> > is not obvious to me.
> 
> Ok let me first explain the device-lifetime concept of the kernel:
>  ... <much excellent explanation> ...

Excellent - thankyou, that puts me properly in the picture and overcomes
my lack of knowledge. Thankyou again.

> 
> So what does that mean for hci_conn? It basically means
> hci_conn_hold_device() and hci_conn_put_device() are bullsh**. Use
> hci_conn_hold() and hci_conn_put() instead.

Fair enough.

> Additionally, we need some way to inform HIDP that the used hci_conn
> is dead _before_ calling device_del() on hci_conn. How do we do that?
> One idea would be to link each hidp object into hci_conn with a
> callback that we call. HIDP can then delete its child devices and
> continue.
> 

OK - understood.

Now over in the '[PATCH] hci: Fix race between hidp_session and hci
code' thread you said:

> So if stuff breaks because we remove the device from sysfs, then fix
> the stuff that breaks! In this case this means removing all
> sub-protocols from sysfs exactly at the same point when we remove the
> hci-dev from sysfs!
> 
Right!

The point is
hci_unregister_dev calls
  hci_dev_do_close calls
    hci_conn_hash_flush calls
      hci_proto_disconn_cfm which triggers (via a 'wakeup' on a sock
waitlist)
        hidp_session running in another thread (a kthread) - which 
                     removes the sub-protocols

... and then hci_unregister_dev, after the above calls, will go on to
call  hci_del_sysfs to remove the 'hci' device

So the existing code does indeed trigger the removal of the
sub-protocols 'theoretically' before the HCI device removal as you
suggest it should.


BUT because hidp_session is running in another thread there will ALWAYS
be a race. The whole 5 second hidp_get_raw_report thing shows up the
race big time, but even without that there is still a race ... and so
sometimes hidp_session will lose, leading to 'bad' udev events (and thus
potential problems for userspace programs that were depending on them).


So ... if the HCI dev must be removed at its current point in the flow
of hci code, then a method of delaying the hci code flow until
sub-protocols (in this case hidp) are truly done is required (compared
to the current 'trigger and forget' scenario). It is not obvious to me
how to do that.


> But more importantly, never ever call "device_del()" from within the
> _release_ callback of the device itself! If you do that, you didn't
> understand the idea behind the _get, _add, _del, _put functions.
> Besides, you are _required_ to hold a reference to a device when
> calling "device_del()". If you do that in the "release" callback, you
> violate this rule.
> 

I wasn't planning on doing that - which is why I re-introduced the hci
local refcnt. Never mind.


> I hope this cleared things up a bit. It's not your code, so you are
> not to blame. And honestly, I don't know who to blame. But IMO this
> needs to be fixed asap.
> 
> Btw., the _ref, _add, _del, _unref cycle is something that is used in
> many other places. Unfortunately, many people get it wrong..
> 
> I will try to fix the mess this week.


Great - sounds like the proper fix is beyond my ability and probably
requires more radical surgery. There are lots of user reports against
distributions that I think stem from this issue. For example Google
Chrome OS hacked their Xorg input driver to try and cope with the
truncated sysfs events that result - when the real issue is the hci/hidp
driver interaction. I'm glad therefore you are on the case.

Many thanks
Karl


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: hidp_get_raw_report source of 5 second delay in device removal
  2013-02-20 19:43       ` Karl Relton
@ 2013-02-20 20:15         ` David Herrmann
  0 siblings, 0 replies; 6+ messages in thread
From: David Herrmann @ 2013-02-20 20:15 UTC (permalink / raw)
  To: Karl Relton; +Cc: linux-bluetooth

Hi Karl

On Wed, Feb 20, 2013 at 8:43 PM, Karl Relton
<karllinuxtest.relton@ntlworld.com> wrote:
[...]
> Right!
>
> The point is
> hci_unregister_dev calls
>   hci_dev_do_close calls
>     hci_conn_hash_flush calls
>       hci_proto_disconn_cfm which triggers (via a 'wakeup' on a sock
> waitlist)
>         hidp_session running in another thread (a kthread) - which
>                      removes the sub-protocols
>
> ... and then hci_unregister_dev, after the above calls, will go on to
> call  hci_del_sysfs to remove the 'hci' device
>
> So the existing code does indeed trigger the removal of the
> sub-protocols 'theoretically' before the HCI device removal as you
> suggest it should.

The way to fix that is to add "hidp_remove" or something similar,
which locks the HIDP device, then removes the input-device/hid-device,
unlocks the HIDP device and returns. This then needs to be called
during hci_conn_del().

We currently remove input-devices/hid-devices in the HIDP session
itself. We can keep that if we make "hidp_remove" signal "terminate++"
and then wait for the thread to stop. But I think the other way is
better as it allows runtime-removal even though the sockets are still
alive. For instance, we could make hidp_delconn() synchronous with
this.

>
> BUT because hidp_session is running in another thread there will ALWAYS
> be a race. The whole 5 second hidp_get_raw_report thing shows up the
> race big time, but even without that there is still a race ... and so
> sometimes hidp_session will lose, leading to 'bad' udev events (and thus
> potential problems for userspace programs that were depending on them).

My bad. The 5s gap does indeed increase the possibility of that race
so it's definitely the more important fix.

>
> So ... if the HCI dev must be removed at its current point in the flow
> of hci code, then a method of delaying the hci code flow until
> sub-protocols (in this case hidp) are truly done is required (compared
> to the current 'trigger and forget' scenario). It is not obvious to me
> how to do that.
>
>> But more importantly, never ever call "device_del()" from within the
>> _release_ callback of the device itself! If you do that, you didn't
>> understand the idea behind the _get, _add, _del, _put functions.
>> Besides, you are _required_ to hold a reference to a device when
>> calling "device_del()". If you do that in the "release" callback, you
>> violate this rule.
>>
>
> I wasn't planning on doing that - which is why I re-introduced the hci
> local refcnt. Never mind.

hci_dev_hold() and hci_dev_put() are the _only_ place where we call
get_device() and put_device(), so if you introduce a refcnt in these
functions, you simply extend the device-refcnt.
This isn't obvious from the name of the functions so I'm sorry if that
was rude. It wasn't supposed to be rude, sorry.

>
>> I hope this cleared things up a bit. It's not your code, so you are
>> not to blame. And honestly, I don't know who to blame. But IMO this
>> needs to be fixed asap.
>>
>> Btw., the _ref, _add, _del, _unref cycle is something that is used in
>> many other places. Unfortunately, many people get it wrong..
>>
>> I will try to fix the mess this week.
>
>
> Great - sounds like the proper fix is beyond my ability and probably
> requires more radical surgery. There are lots of user reports against
> distributions that I think stem from this issue. For example Google
> Chrome OS hacked their Xorg input driver to try and cope with the
> truncated sysfs events that result - when the real issue is the hci/hidp
> driver interaction. I'm glad therefore you are on the case.

I know. This problem gets reported since at least 4 years here and
no-one fixed it properly. I'm not working on the Bluetooth layer,
either, but the bug bothers me as my HID drivers don't work over
Bluetooth.

Regards
David

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2013-02-20 20:15 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-02-20 10:28 hidp_get_raw_report source of 5 second delay in device removal Karl Relton
2013-02-20 15:44 ` David Herrmann
2013-02-20 17:17   ` Karl Relton
2013-02-20 19:22     ` David Herrmann
2013-02-20 19:43       ` Karl Relton
2013-02-20 20:15         ` David Herrmann

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.