* [PATCH 1/1] diag: fix userspace read-write freeze during high logging rates
@ 2013-05-20 8:17 Kasidit Yusuf
0 siblings, 0 replies; only message in thread
From: Kasidit Yusuf @ 2013-05-20 8:17 UTC (permalink / raw)
To: linux-arm-msm; +Cc: Kasidit Yusuf
At very high diagchar logging rates (a HSPA+ ftp download with many
instantaneous diag log response messages enabled) on quad-core phones,
it was observed that after about 2 to 20 minutes (depends on logging rate)
- the userspace read to /dev/diag would block, causing a freeze effect
to the userspace program - and further attempts to log would randomly
freeze at an earlier stage then a phone restart was required to do any
meaningful logging. You can test this with "diag_mdlog", use a diag.cfg
with high logging rate log_response_messages like 0x4222, 0x4179 and other
fast logs.
The patched kernel was tested on a "Google Nexus 4 with Android 4.2.2"
(mako) phone - and had no more freezes even after 13 Hours of logging diag
for over-night ftp tests, an older sister-kernel (essentially the same -
but with many more pr_debug() to identify the causes) was also tested
sucessfully on two other "Google Nexus 4" phones, one of them logged
successfully through about 10 hours of fast ftp downloads - no freezes too.
This fix consists of 4 main parts:
1. There were cases that diag_device_write() completed the "enque" and sets
the data_ready flag during an ongoing diagchar_read() which would then
clear the flag on completion - resulting in a lost data_ready notification
and diagchar_read() would block by waiting for the lost wakeup.
The extra flag (array) "data_ready_while_busy" was introduced to
handle that situation.
2. It was observed that zero length buffers in
diag_hsic_read_complete_callback() was causing problems too
- sending zero length entries to the hsic_buf_tbl queue could cause it to
never be properly dequeued and never "diagmem_free()". This is because
diagfwd.c's diag_device_write() would successfully "enqueue" it but then
diagchar_core.c's diagchar_read() would omit it because
"if (driver->hsic_buf_tbl[i].length > 0)".
So it won't be "dequeued" and also diagmem_free would not be called on it.
This is fixed by not allowing zero-length entries into diag_device_write()
- and then calling diagmem_free() on it instead.
3. Consecutive writes (try 10+ conf writes before reading) to diagchar
from userspace would block, although normal steps and ioctl were done.
This was because the diagchar_write() was blocked in
wait_event_interruptible() but the diag_hsic_write_complete_callback()
didn't wake it up. I simply added a wake_up_interruptible() in that
callback and now the consecutive writes complete sucessfully.
4. At very high logging rates (on the Nexus 4 quad-core phone), it was
observed that "count_hsic_write_pool" faced read/write race conditions
(to increment and decrement) and became incorrect - even negative - and
"diag_read_hsic_work_fn" would use it to "determine the current number of
available buffers for writing after reading from the HSIC has completed"
- so the code was modified to determine the number of entries available by
iterating through "hsic_buf_tbl" instead.
Also, added a "driver->num_clients--" to diagchar_close() to update
num_clints on close() from userspace.
Signed-off-by: Kasidit Yusuf <kasidit@azenqos.com>
---
drivers/char/diag/diagchar.h | 1 +
drivers/char/diag/diagchar_core.c | 21 ++++++++++-
drivers/char/diag/diagfwd.c | 16 ++++++++
drivers/char/diag/diagfwd_hsic.c | 79 ++++++++++++++++++++++++++++++++-------
4 files changed, 103 insertions(+), 14 deletions(-)
diff --git a/drivers/char/diag/diagchar.h b/drivers/char/diag/diagchar.h
index 8d87cf9..4281912 100644
--- a/drivers/char/diag/diagchar.h
+++ b/drivers/char/diag/diagchar.h
@@ -142,6 +142,7 @@ struct diagchar_dev {
wait_queue_head_t wait_q;
struct diag_client_map *client_map;
int *data_ready;
+ int *data_ready_while_busy;
int num_clients;
int polling_reg_flag;
struct diag_write_device *buf_tbl;
diff --git a/drivers/char/diag/diagchar_core.c b/drivers/char/diag/diagchar_core.c
index 2c572ed..9132fcb 100644
--- a/drivers/char/diag/diagchar_core.c
+++ b/drivers/char/diag/diagchar_core.c
@@ -182,6 +182,13 @@ static int diagchar_open(struct inode *inode, struct file *file)
goto fail;
else
driver->data_ready = temp;
+ temp = krealloc(driver->data_ready_while_busy
+ , (driver->num_clients) * sizeof(int),
+ GFP_KERNEL);
+ if (!temp)
+ goto fail;
+ else
+ driver->data_ready_while_busy = temp;
diag_add_client(i, file);
} else {
mutex_unlock(&driver->diagchar_mutex);
@@ -221,6 +228,8 @@ static int diagchar_close(struct inode *inode, struct file *file)
struct diagchar_priv *diagpriv_data = file->private_data;
pr_debug("diag: process exit %s\n", current->comm);
+ driver->num_clients--;
+
if (!(file->private_data)) {
pr_alert("diag: Invalid file pointer");
return -ENOMEM;
@@ -804,6 +813,7 @@ static int diagchar_read(struct file *file, char __user *buf, size_t count,
wait_event_interruptible(driver->wait_q,
driver->data_ready[index]);
+ driver->data_ready_while_busy[index] = 0;
mutex_lock(&driver->diagchar_mutex);
if ((driver->data_ready[index] & USER_SPACE_LOG_TYPE) && (driver->
@@ -981,7 +991,16 @@ drop_hsic:
/* copy number of data fields */
COPY_USER_SPACE_OR_EXIT(buf+4, num_data, 4);
ret -= 4;
- driver->data_ready[index] ^= USER_SPACE_LOG_TYPE;
+
+ /* At very high logging rates, it was observed that
+ * sometimes diag_device_write() completed the "enqueue"
+ * and sets the data_ready flag during an ongoing read
+ * in the lines above - so we should not clear the flag
+ * in these cases
+ */
+ if (driver->data_ready_while_busy[index] == 0)
+ driver->data_ready[index] ^= USER_SPACE_LOG_TYPE;
+
if (driver->ch)
queue_work(driver->diag_wq,
&(driver->diag_read_smd_work));
diff --git a/drivers/char/diag/diagfwd.c b/drivers/char/diag/diagfwd.c
index 4419309..7f12a3f 100644
--- a/drivers/char/diag/diagfwd.c
+++ b/drivers/char/diag/diagfwd.c
@@ -332,6 +332,12 @@ int diag_device_write(void *buf, int proc_num, struct diag_request *write_ptr)
driver->logging_process_id)
break;
if (i < driver->num_clients) {
+ /* Set a "data_ready_while_busy" flag to
+ * prevent diagchar_read() from clearing new
+ * data_ready sets that came during its ongoing read.
+ */
+ driver->data_ready_while_busy[i] |= USER_SPACE_LOG_TYPE;
+
driver->data_ready[i] |= USER_SPACE_LOG_TYPE;
pr_debug("diag: wake up logging process\n");
wake_up_interruptible(&driver->wait_q);
@@ -2056,6 +2062,14 @@ void diagfwd_init(void)
, GFP_KERNEL)) == NULL)
goto err;
kmemleak_not_leak(driver->data_ready);
+ if (driver->data_ready_while_busy == NULL) {
+ driver->data_ready_while_busy = kzalloc(driver->num_clients
+ * sizeof(int)
+ , GFP_KERNEL);
+ if (driver->data_ready_while_busy == NULL)
+ goto err;
+ }
+ kmemleak_not_leak(driver->data_ready_while_busy);
if (driver->table == NULL &&
(driver->table = kzalloc(diag_max_reg*
sizeof(struct diag_master_table),
@@ -2166,6 +2180,7 @@ err:
kfree(driver->client_map);
kfree(driver->buf_tbl);
kfree(driver->data_ready);
+ kfree(driver->data_ready_while_busy);
kfree(driver->table);
kfree(driver->pkt_buf);
kfree(driver->write_ptr_1);
@@ -2217,6 +2232,7 @@ void diagfwd_exit(void)
kfree(driver->client_map);
kfree(driver->buf_tbl);
kfree(driver->data_ready);
+ kfree(driver->data_ready_while_busy);
kfree(driver->table);
kfree(driver->pkt_buf);
kfree(driver->write_ptr_1);
diff --git a/drivers/char/diag/diagfwd_hsic.c b/drivers/char/diag/diagfwd_hsic.c
index bd8c52b..94d8b9d 100644
--- a/drivers/char/diag/diagfwd_hsic.c
+++ b/drivers/char/diag/diagfwd_hsic.c
@@ -45,6 +45,7 @@ static void diag_read_hsic_work_fn(struct work_struct *work)
int num_reads_submitted = 0;
int err = 0;
int write_ptrs_available;
+ int i;
if (!driver->hsic_ch) {
pr_err("DIAG in %s: driver->hsic_ch == 0\n", __func__);
@@ -55,12 +56,30 @@ static void diag_read_hsic_work_fn(struct work_struct *work)
* Determine the current number of available buffers for writing after
* reading from the HSIC has completed.
*/
- if (driver->logging_mode == MEMORY_DEVICE_MODE)
- write_ptrs_available = driver->poolsize_hsic_write -
- driver->num_hsic_buf_tbl_entries;
- else
+ if (driver->logging_mode == MEMORY_DEVICE_MODE) {
+ /*
+ * At very high logging rates on multi-core phones,
+ * it was observed that "count_hsic_write_pool" faced read/write
+ * race conditions (to increment and decrement) and became
+ * incorrect - so don't use it to determine the number of
+ * entries available - iterate through "hsic_buf_tbl" instead.
+ */
+ write_ptrs_available = 0;
+ i = (driver->poolsize_hsic_write) - 1;
+ for (; i >= 0; i--)
+ if (driver->hsic_buf_tbl[i].length == 0)
+ write_ptrs_available++;
+ } else {
write_ptrs_available = driver->poolsize_hsic_write -
driver->count_hsic_write_pool;
+ }
+
+ if (driver->logging_mode == MEMORY_DEVICE_MODE &&
+ (write_ptrs_available <= 0 ||
+ write_ptrs_available > driver->poolsize_hsic_write)) {
+ pr_debug("diag: warning - probe write_ptrs_available: %d"
+ , write_ptrs_available);
+ }
/*
* Queue up a read on the HSIC for all available buffers in the
@@ -138,17 +157,50 @@ static void diag_hsic_read_complete_callback(void *ctxt, char *buf,
if (!buf) {
pr_err("diag: Out of diagmem for HSIC\n");
} else {
- /*
- * Send data in buf to be written on the
- * appropriate device, e.g. USB MDM channel
+
+ /* Contradicting with the original comment above to send
+ * zero length entries, sending 0 length entries to the
+ * hsic_buf_tbl queue could cause it to never be
+ * properly dequeued and never "diagmem_free()".
+ *
+ * This is because diagfwd.c's diag_device_write() would
+ * successfully "enqueue" it but then diagchar_core.c's
+ * diagchar_read() would omit it -
+ * because "if (driver->hsic_buf_tbl[i].length > 0)"...
+ * So it won't be "dequeued" and also diagmem_free would
+ * not be called on it.
+ *
+ * Also, other diag_device_write() calls with non-0
+ * length entries would also overwrite this 0-length
+ * entry because of the
+ * "if (driver->hsic_buf_tbl[i].length == 0)" check.
+ *
+ * Threrefore, if the actual_size is > 0, send it,
+ * but if the size is 0 then call diagmem_free
+ * on it and don't send it.
*/
- driver->write_len_mdm = actual_size;
- err = diag_device_write((void *)buf, HSIC_DATA, NULL);
- /* If an error, return buffer to the pool */
- if (err) {
+
+ if (actual_size > 0) {
+ /*
+ * Send data in buf to be written on the
+ * appropriate device, e.g. USB MDM channel
+ */
+ driver->write_len_mdm = actual_size;
+ err = diag_device_write((void *)buf,
+ HSIC_DATA, NULL);
+ /* If an error, return buffer to the pool */
+ if (err) {
+ diagmem_free(driver, buf,
+ POOL_TYPE_HSIC);
+ pr_err("diag: In %s, error calling diag_device_write, err: %d\n",
+ __func__, err);
+ }
+ } else {
+ /* if we reach here this means actual_size is 0
+ * then free and omit it.
+ */
+ err = 0;
diagmem_free(driver, buf, POOL_TYPE_HSIC);
- pr_err("diag: In %s, error calling diag_device_write, err: %d\n",
- __func__, err);
}
}
} else {
@@ -179,6 +231,7 @@ static void diag_hsic_write_complete_callback(void *ctxt, char *buf,
{
/* The write of the data to the HSIC bridge is complete */
driver->in_busy_hsic_write = 0;
+ wake_up_interruptible(&driver->wait_q);
if (!driver->hsic_ch) {
pr_err("DIAG in %s: driver->hsic_ch == 0\n", __func__);
--
1.8.1.5
^ permalink raw reply related [flat|nested] only message in thread
only message in thread, other threads:[~2013-05-20 8:18 UTC | newest]
Thread overview: (only message) (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-05-20 8:17 [PATCH 1/1] diag: fix userspace read-write freeze during high logging rates Kasidit Yusuf
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).