* [PATCH v2 0/2] scsi: ufs: core: interrupt handling optimisations
@ 2025-07-25 14:16 André Draszik
2025-07-25 14:16 ` [PATCH v2 1/2] scsi: ufs: core: complete polled requests also from interrupt context André Draszik
2025-07-25 14:16 ` [PATCH v2 2/2] scsi: ufs: core: move some irq handling back to hardirq (with time limit) André Draszik
0 siblings, 2 replies; 9+ messages in thread
From: André Draszik @ 2025-07-25 14:16 UTC (permalink / raw)
To: Alim Akhtar, Avri Altman, Bart Van Assche, James E.J. Bottomley,
Martin K. Petersen, Neil Armstrong
Cc: Peter Griffin, Tudor Ambarus, Will McVicker,
Manivannan Sadhasivam, kernel-team, linux-arm-msm,
linux-samsung-soc, linux-scsi, linux-kernel, André Draszik,
stable
UFS performance for < v4 controllers has reduced quite a bit in 6.16.
This series addresses this regression and brings numbers more or less
back to the previous level.
See patch 2 for some benchmark (fio) results.
Signed-off-by: André Draszik <andre.draszik@linaro.org>
---
Changes in v2:
- patch 1: new patch as suggested by Bart during v1 review
- patch 2: update commit message and some inline & kerneldoc comments
- patch 2: add missing jiffies.h include
- Link to v1: https://lore.kernel.org/r/20250724-ufshcd-hardirq-v1-1-6398a52f8f02@linaro.org
---
André Draszik (2):
scsi: ufs: core: complete polled requests also from interrupt context
scsi: ufs: core: move some irq handling back to hardirq (with time limit)
drivers/ufs/core/ufshcd.c | 211 +++++++++++++++++++++++++++++++++-------------
1 file changed, 152 insertions(+), 59 deletions(-)
---
base-commit: 58ba80c4740212c29a1cf9b48f588e60a7612209
change-id: 20250723-ufshcd-hardirq-c7326f642e56
Best regards,
--
André Draszik <andre.draszik@linaro.org>
^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH v2 1/2] scsi: ufs: core: complete polled requests also from interrupt context
2025-07-25 14:16 [PATCH v2 0/2] scsi: ufs: core: interrupt handling optimisations André Draszik
@ 2025-07-25 14:16 ` André Draszik
2025-07-25 14:16 ` [PATCH v2 2/2] scsi: ufs: core: move some irq handling back to hardirq (with time limit) André Draszik
1 sibling, 0 replies; 9+ messages in thread
From: André Draszik @ 2025-07-25 14:16 UTC (permalink / raw)
To: Alim Akhtar, Avri Altman, Bart Van Assche, James E.J. Bottomley,
Martin K. Petersen, Neil Armstrong
Cc: Peter Griffin, Tudor Ambarus, Will McVicker,
Manivannan Sadhasivam, kernel-team, linux-arm-msm,
linux-samsung-soc, linux-scsi, linux-kernel, André Draszik
When commit ee8c88cab4af ("scsi: ufs: core: Fix the polling
implementation") was added, the block layer did not support completing
polled requests from interrupt context.
This has changed since (presumably with commit b99182c501c3 ("bio: add
pcpu caching for non-polling bio_put")) and it is possible to complete
polled requests also from interrupt context.
Therefore, this commit here changes the code to also complete polled
requests from interrupt context, mostly reverting above referenced
commit as it is not necessary anymore. We do keep the fix to make
ufshcd_poll() return a positive number.
The change has been verified using:
fio --name=randread --rw=randread --ioengine=pvsync2 --hipri=1 \
--direct=1 --bs=4k --numjobs=8 --size=32m --runtime=30 \
--time_based --end_fsync=1 --group_reporting --filename=/foo
which appears to have completed with no errors with this commit.
Suggested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: André Draszik <andre.draszik@linaro.org>
---
drivers/ufs/core/ufshcd.c | 26 +-------------------------
1 file changed, 1 insertion(+), 25 deletions(-)
diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
index 13f7e0469141619cfc5e180aa730171ff01b8fb1..d8e2eabacd3efbf07458e81cc4d15ba7f05d3913 100644
--- a/drivers/ufs/core/ufshcd.c
+++ b/drivers/ufs/core/ufshcd.c
@@ -5613,26 +5613,6 @@ static void __ufshcd_transfer_req_compl(struct ufs_hba *hba,
ufshcd_compl_one_cqe(hba, tag, NULL);
}
-/* Any value that is not an existing queue number is fine for this constant. */
-enum {
- UFSHCD_POLL_FROM_INTERRUPT_CONTEXT = -1
-};
-
-static void ufshcd_clear_polled(struct ufs_hba *hba,
- unsigned long *completed_reqs)
-{
- int tag;
-
- for_each_set_bit(tag, completed_reqs, hba->nutrs) {
- struct scsi_cmnd *cmd = hba->lrb[tag].cmd;
-
- if (!cmd)
- continue;
- if (scsi_cmd_to_rq(cmd)->cmd_flags & REQ_POLLED)
- __clear_bit(tag, completed_reqs);
- }
-}
-
/*
* Return: > 0 if one or more commands have been completed or 0 if no
* requests have been completed.
@@ -5656,10 +5636,6 @@ static int ufshcd_poll(struct Scsi_Host *shost, unsigned int queue_num)
WARN_ONCE(completed_reqs & ~hba->outstanding_reqs,
"completed: %#lx; outstanding: %#lx\n", completed_reqs,
hba->outstanding_reqs);
- if (queue_num == UFSHCD_POLL_FROM_INTERRUPT_CONTEXT) {
- /* Do not complete polled requests from interrupt context. */
- ufshcd_clear_polled(hba, &completed_reqs);
- }
hba->outstanding_reqs &= ~completed_reqs;
spin_unlock_irqrestore(&hba->outstanding_lock, flags);
@@ -5747,7 +5723,7 @@ static irqreturn_t ufshcd_transfer_req_compl(struct ufs_hba *hba)
* Ignore the ufshcd_poll() return value and return IRQ_HANDLED since we
* do not want polling to trigger spurious interrupt complaints.
*/
- ufshcd_poll(hba->host, UFSHCD_POLL_FROM_INTERRUPT_CONTEXT);
+ ufshcd_poll(hba->host, 0);
return IRQ_HANDLED;
}
--
2.50.1.487.gc89ff58d15-goog
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH v2 2/2] scsi: ufs: core: move some irq handling back to hardirq (with time limit)
2025-07-25 14:16 [PATCH v2 0/2] scsi: ufs: core: interrupt handling optimisations André Draszik
2025-07-25 14:16 ` [PATCH v2 1/2] scsi: ufs: core: complete polled requests also from interrupt context André Draszik
@ 2025-07-25 14:16 ` André Draszik
2025-07-25 21:08 ` Bart Van Assche
2025-07-28 14:43 ` Neil Armstrong
1 sibling, 2 replies; 9+ messages in thread
From: André Draszik @ 2025-07-25 14:16 UTC (permalink / raw)
To: Alim Akhtar, Avri Altman, Bart Van Assche, James E.J. Bottomley,
Martin K. Petersen, Neil Armstrong
Cc: Peter Griffin, Tudor Ambarus, Will McVicker,
Manivannan Sadhasivam, kernel-team, linux-arm-msm,
linux-samsung-soc, linux-scsi, linux-kernel, André Draszik,
stable
Commit 3c7ac40d7322 ("scsi: ufs: core: Delegate the interrupt service
routine to a threaded IRQ handler") introduced a massive performance
drop for various work loads on UFSHC versions < 4 due to the extra
latency introduced by moving all of the IRQ handling into a threaded
handler. See below for a summary.
To resolve this performance drop, move IRQ handling back into hardirq
context, but apply a time limit which, once expired, will cause the
remainder of the work to be deferred to the threaded handler.
Above commit is trying to avoid unduly delay of other subsystem
interrupts while the UFS events are being handled. By limiting the
amount of time spent in hardirq context, we can still ensure that.
The time limit itself was chosen because I have generally seen
interrupt handling to have been completed within 20 usecs, with the
occasional spikes of a couple 100 usecs.
This commits brings UFS performance roughly back to original
performance, and should still avoid other subsystem's starvation thanks
to dealing with these spikes.
fio results for 4k block size on Pixel 6, all values being the average
of 5 runs each:
read / 1 job original after this commit
min IOPS 4,653.60 2,704.40 3,902.80
max IOPS 6,151.80 4,847.60 6,103.40
avg IOPS 5,488.82 4,226.61 5,314.89
cpu % usr 1.85 1.72 1.97
cpu % sys 32.46 28.88 33.29
bw MB/s 21.46 16.50 20.76
read / 8 jobs original after this commit
min IOPS 18,207.80 11,323.00 17,911.80
max IOPS 25,535.80 14,477.40 24,373.60
avg IOPS 22,529.93 13,325.59 21,868.85
cpu % usr 1.70 1.41 1.67
cpu % sys 27.89 21.85 27.23
bw MB/s 88.10 52.10 84.48
write / 1 job original after this commit
min IOPS 6,524.20 3,136.00 5,988.40
max IOPS 7,303.60 5,144.40 7,232.40
avg IOPS 7,169.80 4,608.29 7,014.66
cpu % usr 2.29 2.34 2.23
cpu % sys 41.91 39.34 42.48
bw MB/s 28.02 18.00 27.42
write / 8 jobs original after this commit
min IOPS 12,685.40 13,783.00 12,622.40
max IOPS 30,814.20 22,122.00 29,636.00
avg IOPS 21,539.04 18,552.63 21,134.65
cpu % usr 2.08 1.61 2.07
cpu % sys 30.86 23.88 30.64
bw MB/s 84.18 72.54 82.62
Closes: https://lore.kernel.org/all/1e06161bf49a3a88c4ea2e7a406815be56114c4f.camel@linaro.org
Fixes: 3c7ac40d7322 ("scsi: ufs: core: Delegate the interrupt service routine to a threaded IRQ handler")
Cc: stable@vger.kernel.org
Signed-off-by: André Draszik <andre.draszik@linaro.org>
---
v2:
* update some inline & kerneldoc comments
* mention 4k block size and 5 runs were used in fio runs
* add missing jiffies.h include
---
drivers/ufs/core/ufshcd.c | 191 +++++++++++++++++++++++++++++++++++++---------
1 file changed, 154 insertions(+), 37 deletions(-)
diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
index d8e2eabacd3efbf07458e81cc4d15ba7f05d3913..404a4e075a21e73d22ae6bb89f77f69aebb7cd6a 100644
--- a/drivers/ufs/core/ufshcd.c
+++ b/drivers/ufs/core/ufshcd.c
@@ -19,6 +19,7 @@
#include <linux/clk.h>
#include <linux/delay.h>
#include <linux/interrupt.h>
+#include <linux/jiffies.h>
#include <linux/module.h>
#include <linux/pm_opp.h>
#include <linux/regulator/consumer.h>
@@ -111,6 +112,9 @@ enum {
/* bMaxNumOfRTT is equal to two after device manufacturing */
#define DEFAULT_MAX_NUM_RTT 2
+/* Time limit in usecs for hardirq context */
+#define HARDIRQ_TIMELIMIT 20
+
/* UFSHC 4.0 compliant HC support this mode. */
static bool use_mcq_mode = true;
@@ -5603,26 +5607,56 @@ void ufshcd_compl_one_cqe(struct ufs_hba *hba, int task_tag,
* __ufshcd_transfer_req_compl - handle SCSI and query command completion
* @hba: per adapter instance
* @completed_reqs: bitmask that indicates which requests to complete
+ * @time_limit: time limit in jiffies to not exceed executing command completion
+ *
+ * This completes the individual requests as per @completed_reqs with an
+ * optional time limit. If a time limit is given and it expired before all
+ * requests were handled, the return value will indicate which requests have not
+ * been handled.
+ *
+ * Return: Bitmask that indicates which requests have not been completed due to
+ * time limit expiry.
*/
-static void __ufshcd_transfer_req_compl(struct ufs_hba *hba,
- unsigned long completed_reqs)
+static unsigned long __ufshcd_transfer_req_compl(struct ufs_hba *hba,
+ unsigned long completed_reqs,
+ unsigned long time_limit)
{
int tag;
- for_each_set_bit(tag, &completed_reqs, hba->nutrs)
+ for_each_set_bit(tag, &completed_reqs, hba->nutrs) {
ufshcd_compl_one_cqe(hba, tag, NULL);
+ __clear_bit(tag, &completed_reqs);
+ if (time_limit && time_after_eq(jiffies, time_limit))
+ break;
+ }
+
+ /* any bits still set represent unhandled requests due to timeout */
+ return completed_reqs;
}
-/*
- * Return: > 0 if one or more commands have been completed or 0 if no
- * requests have been completed.
+/**
+ * ufshcd_poll_impl - handle SCSI and query command completion helper
+ * @shost: Scsi_Host instance
+ * @queue_num: The h/w queue number, or UFSHCD_POLL_FROM_INTERRUPT_CONTEXT when
+ * invoked from the interrupt handler
+ * @time_limit: time limit in jiffies to not exceed executing command completion
+ * @__pending: Pointer to store any still pending requests in case of time limit
+ * expiry
+ *
+ * This handles completed commands with an optional time limit. If a time limit
+ * is given and it expires, @__pending will be set to the requests that could
+ * not be completed in time and are still pending.
+ *
+ * Return: true if one or more commands have been completed, false otherwise.
*/
-static int ufshcd_poll(struct Scsi_Host *shost, unsigned int queue_num)
+static int ufshcd_poll_impl(struct Scsi_Host *shost, unsigned int queue_num,
+ unsigned long time_limit, unsigned long *__pending)
{
struct ufs_hba *hba = shost_priv(shost);
unsigned long completed_reqs, flags;
u32 tr_doorbell;
struct ufs_hw_queue *hwq;
+ unsigned long pending = 0;
if (hba->mcq_enabled) {
hwq = &hba->uhq[queue_num];
@@ -5636,15 +5670,34 @@ static int ufshcd_poll(struct Scsi_Host *shost, unsigned int queue_num)
WARN_ONCE(completed_reqs & ~hba->outstanding_reqs,
"completed: %#lx; outstanding: %#lx\n", completed_reqs,
hba->outstanding_reqs);
- hba->outstanding_reqs &= ~completed_reqs;
+
+ if (completed_reqs) {
+ pending = __ufshcd_transfer_req_compl(hba, completed_reqs,
+ time_limit);
+ completed_reqs &= ~pending;
+ hba->outstanding_reqs &= ~completed_reqs;
+ }
+
spin_unlock_irqrestore(&hba->outstanding_lock, flags);
- if (completed_reqs)
- __ufshcd_transfer_req_compl(hba, completed_reqs);
+ if (__pending)
+ *__pending = pending;
return completed_reqs != 0;
}
+/*
+ * ufshcd_poll - SCSI interface of blk_poll to poll for IO completions
+ * @shost: Scsi_Host instance
+ * @queue_num: The h/w queue number
+ *
+ * Return: true if one or more commands have been completed, false otherwise.
+ */
+static int ufshcd_poll(struct Scsi_Host *shost, unsigned int queue_num)
+{
+ return ufshcd_poll_impl(shost, queue_num, 0, NULL);
+}
+
/**
* ufshcd_mcq_compl_pending_transfer - MCQ mode function. It is
* invoked from the error handler context or ufshcd_host_reset_and_restore()
@@ -5698,13 +5751,19 @@ static void ufshcd_mcq_compl_pending_transfer(struct ufs_hba *hba,
/**
* ufshcd_transfer_req_compl - handle SCSI and query command completion
* @hba: per adapter instance
+ * @time_limit: time limit in jiffies to not exceed executing command completion
*
* Return:
- * IRQ_HANDLED - If interrupt is valid
- * IRQ_NONE - If invalid interrupt
+ * IRQ_HANDLED - If interrupt is valid
+ * IRQ_WAKE_THREAD - If further interrupt processing should be delegated to the
+ * thread
+ * IRQ_NONE - If invalid interrupt
*/
-static irqreturn_t ufshcd_transfer_req_compl(struct ufs_hba *hba)
+static irqreturn_t ufshcd_transfer_req_compl(struct ufs_hba *hba,
+ unsigned long time_limit)
{
+ unsigned long pending;
+
/* Resetting interrupt aggregation counters first and reading the
* DOOR_BELL afterward allows us to handle all the completed requests.
* In order to prevent other interrupts starvation the DB is read once
@@ -5720,12 +5779,18 @@ static irqreturn_t ufshcd_transfer_req_compl(struct ufs_hba *hba)
return IRQ_HANDLED;
/*
- * Ignore the ufshcd_poll() return value and return IRQ_HANDLED since we
- * do not want polling to trigger spurious interrupt complaints.
+ * Ignore the ufshcd_poll() return value and return IRQ_HANDLED or
+ * IRQ_WAKE_THREAD since we do not want polling to trigger spurious
+ * interrupt complaints.
*/
- ufshcd_poll(hba->host, 0);
+ ufshcd_poll_impl(hba->host, 0, time_limit, &pending);
- return IRQ_HANDLED;
+ /*
+ * If a time limit was set, some request completions might not have been
+ * handled yet and will need to be dealt with in the threaded interrupt
+ * handler.
+ */
+ return pending ? IRQ_WAKE_THREAD : IRQ_HANDLED;
}
int __ufshcd_write_ee_control(struct ufs_hba *hba, u32 ee_ctrl_mask)
@@ -6286,7 +6351,7 @@ static void ufshcd_complete_requests(struct ufs_hba *hba, bool force_compl)
if (hba->mcq_enabled)
ufshcd_mcq_compl_pending_transfer(hba, force_compl);
else
- ufshcd_transfer_req_compl(hba);
+ ufshcd_transfer_req_compl(hba, 0);
ufshcd_tmc_handler(hba);
}
@@ -6988,12 +7053,16 @@ static irqreturn_t ufshcd_handle_mcq_cq_events(struct ufs_hba *hba)
* ufshcd_sl_intr - Interrupt service routine
* @hba: per adapter instance
* @intr_status: contains interrupts generated by the controller
+ * @time_limit: time limit in jiffies to not exceed executing command completion
*
* Return:
- * IRQ_HANDLED - If interrupt is valid
- * IRQ_NONE - If invalid interrupt
+ * IRQ_HANDLED - If interrupt is valid
+ * IRQ_WAKE_THREAD - If further interrupt processing should be delegated to the
+ * thread
+ * IRQ_NONE - If invalid interrupt
*/
-static irqreturn_t ufshcd_sl_intr(struct ufs_hba *hba, u32 intr_status)
+static irqreturn_t ufshcd_sl_intr(struct ufs_hba *hba, u32 intr_status,
+ unsigned long time_limit)
{
irqreturn_t retval = IRQ_NONE;
@@ -7007,7 +7076,7 @@ static irqreturn_t ufshcd_sl_intr(struct ufs_hba *hba, u32 intr_status)
retval |= ufshcd_tmc_handler(hba);
if (intr_status & UTP_TRANSFER_REQ_COMPL)
- retval |= ufshcd_transfer_req_compl(hba);
+ retval |= ufshcd_transfer_req_compl(hba, time_limit);
if (intr_status & MCQ_CQ_EVENT_STATUS)
retval |= ufshcd_handle_mcq_cq_events(hba);
@@ -7016,15 +7085,25 @@ static irqreturn_t ufshcd_sl_intr(struct ufs_hba *hba, u32 intr_status)
}
/**
- * ufshcd_threaded_intr - Threaded interrupt service routine
+ * ufshcd_intr_helper - hardirq and threaded interrupt service routine
* @irq: irq number
* @__hba: pointer to adapter instance
+ * @time_limit: time limit in jiffies to not exceed during execution
+ *
+ * Interrupts are initially served from hardirq context with a time limit, but
+ * if there is more work to be done than can be completed before the limit
+ * expires, remaining work is delegated to the IRQ thread. This helper does the
+ * bulk of the work in either case - if @time_limit is set, it is being run from
+ * hardirq context, otherwise from the threaded interrupt handler.
*
* Return:
- * IRQ_HANDLED - If interrupt is valid
- * IRQ_NONE - If invalid interrupt
+ * IRQ_HANDLED - If interrupt was fully handled
+ * IRQ_WAKE_THREAD - If further interrupt processing should be delegated to the
+ * thread
+ * IRQ_NONE - If invalid interrupt
*/
-static irqreturn_t ufshcd_threaded_intr(int irq, void *__hba)
+static irqreturn_t ufshcd_intr_helper(int irq, void *__hba,
+ unsigned long time_limit)
{
u32 last_intr_status, intr_status, enabled_intr_status = 0;
irqreturn_t retval = IRQ_NONE;
@@ -7038,15 +7117,22 @@ static irqreturn_t ufshcd_threaded_intr(int irq, void *__hba)
* if the reqs get finished 1 by 1 after the interrupt status is
* read, make sure we handle them by checking the interrupt status
* again in a loop until we process all of the reqs before returning.
+ * This is done until the time limit is exceeded, at which point further
+ * processing is delegated to the threaded handler.
*/
- while (intr_status && retries--) {
+ while (intr_status && !(retval & IRQ_WAKE_THREAD) && retries--) {
enabled_intr_status =
intr_status & ufshcd_readl(hba, REG_INTERRUPT_ENABLE);
ufshcd_writel(hba, intr_status, REG_INTERRUPT_STATUS);
if (enabled_intr_status)
- retval |= ufshcd_sl_intr(hba, enabled_intr_status);
+ retval |= ufshcd_sl_intr(hba, enabled_intr_status,
+ time_limit);
intr_status = ufshcd_readl(hba, REG_INTERRUPT_STATUS);
+
+ if (intr_status && time_limit && time_after_eq(jiffies,
+ time_limit))
+ retval |= IRQ_WAKE_THREAD;
}
if (enabled_intr_status && retval == IRQ_NONE &&
@@ -7063,6 +7149,20 @@ static irqreturn_t ufshcd_threaded_intr(int irq, void *__hba)
return retval;
}
+/**
+ * ufshcd_threaded_intr - Threaded interrupt service routine
+ * @irq: irq number
+ * @__hba: pointer to adapter instance
+ *
+ * Return:
+ * IRQ_HANDLED - If interrupt was fully handled
+ * IRQ_NONE - If invalid interrupt
+ */
+static irqreturn_t ufshcd_threaded_intr(int irq, void *__hba)
+{
+ return ufshcd_intr_helper(irq, __hba, 0);
+}
+
/**
* ufshcd_intr - Main interrupt service routine
* @irq: irq number
@@ -7070,20 +7170,37 @@ static irqreturn_t ufshcd_threaded_intr(int irq, void *__hba)
*
* Return:
* IRQ_HANDLED - If interrupt is valid
- * IRQ_WAKE_THREAD - If handling is moved to threaded handled
+ * IRQ_WAKE_THREAD - If handling is moved to threaded handler
* IRQ_NONE - If invalid interrupt
*/
static irqreturn_t ufshcd_intr(int irq, void *__hba)
{
struct ufs_hba *hba = __hba;
+ unsigned long time_limit = jiffies +
+ usecs_to_jiffies(HARDIRQ_TIMELIMIT);
- /* Move interrupt handling to thread when MCQ & ESI are not enabled */
- if (!hba->mcq_enabled || !hba->mcq_esi_enabled)
- return IRQ_WAKE_THREAD;
+ /*
+ * Directly handle interrupts when MCQ & ESI are enabled since MCQ
+ * ESI handlers do the hard job.
+ */
+ if (hba->mcq_enabled && hba->mcq_esi_enabled)
+ return ufshcd_sl_intr(hba,
+ ufshcd_readl(hba, REG_INTERRUPT_STATUS) &
+ ufshcd_readl(hba, REG_INTERRUPT_ENABLE),
+ 0);
- /* Directly handle interrupts since MCQ ESI handlers does the hard job */
- return ufshcd_sl_intr(hba, ufshcd_readl(hba, REG_INTERRUPT_STATUS) &
- ufshcd_readl(hba, REG_INTERRUPT_ENABLE));
+ /*
+ * Otherwise handle interrupt in hardirq context until the time limit
+ * expires, at which point the remaining work will be completed in
+ * interrupt thread context.
+ */
+ if (!time_limit)
+ /*
+ * To deal with jiffies wrapping, we just add one so that other
+ * code can reliably detect if a time limit was requested.
+ */
+ time_limit++;
+ return ufshcd_intr_helper(irq, __hba, time_limit);
}
static int ufshcd_clear_tm_cmd(struct ufs_hba *hba, int tag)
@@ -7516,7 +7633,7 @@ static int ufshcd_eh_device_reset_handler(struct scsi_cmnd *cmd)
__func__, pos);
}
}
- __ufshcd_transfer_req_compl(hba, pending_reqs & ~not_cleared_mask);
+ __ufshcd_transfer_req_compl(hba, pending_reqs & ~not_cleared_mask, 0);
out:
hba->req_abort_count = 0;
@@ -7672,7 +7789,7 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
dev_err(hba->dev,
"%s: cmd was completed, but without a notifying intr, tag = %d",
__func__, tag);
- __ufshcd_transfer_req_compl(hba, 1UL << tag);
+ __ufshcd_transfer_req_compl(hba, 1UL << tag, 0);
goto release;
}
--
2.50.1.487.gc89ff58d15-goog
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH v2 2/2] scsi: ufs: core: move some irq handling back to hardirq (with time limit)
2025-07-25 14:16 ` [PATCH v2 2/2] scsi: ufs: core: move some irq handling back to hardirq (with time limit) André Draszik
@ 2025-07-25 21:08 ` Bart Van Assche
2025-07-28 14:43 ` Neil Armstrong
1 sibling, 0 replies; 9+ messages in thread
From: Bart Van Assche @ 2025-07-25 21:08 UTC (permalink / raw)
To: André Draszik, Alim Akhtar, Avri Altman,
James E.J. Bottomley, Martin K. Petersen, Neil Armstrong
Cc: Peter Griffin, Tudor Ambarus, Will McVicker,
Manivannan Sadhasivam, kernel-team, linux-arm-msm,
linux-samsung-soc, linux-scsi, linux-kernel, stable
On 7/25/25 7:16 AM, André Draszik wrote:
> - for_each_set_bit(tag, &completed_reqs, hba->nutrs)
> + for_each_set_bit(tag, &completed_reqs, hba->nutrs) {
> ufshcd_compl_one_cqe(hba, tag, NULL);
> + __clear_bit(tag, &completed_reqs);
> + if (time_limit && time_after_eq(jiffies, time_limit))
> + break;
> + }
Has it been considered to use time_is_before_eq_jiffies(time_limit)
instead of open-coding it?
> @@ -5636,15 +5670,34 @@ static int ufshcd_poll(struct Scsi_Host *shost, unsigned int queue_num)
> WARN_ONCE(completed_reqs & ~hba->outstanding_reqs,
> "completed: %#lx; outstanding: %#lx\n", completed_reqs,
> hba->outstanding_reqs);
> - hba->outstanding_reqs &= ~completed_reqs;
> +
> + if (completed_reqs) {
> + pending = __ufshcd_transfer_req_compl(hba, completed_reqs,
> + time_limit);
> + completed_reqs &= ~pending;
> + hba->outstanding_reqs &= ~completed_reqs;
> + }
> +
> spin_unlock_irqrestore(&hba->outstanding_lock, flags);
>
> - if (completed_reqs)
> - __ufshcd_transfer_req_compl(hba, completed_reqs);
This change moves the __ufshcd_transfer_req_compl() call from outside to
inside the critical section. I expect this to impact performance
negatively because it makes it significantly more likely that the
command submission code will have to wait while the completion code is
holding hba->outstanding_lock. Can this be avoided, e.g. by limiting the
number of commands that are completed instead of the time spent in
interrupt context? usecs_to_jiffies(HARDIRQ_TIMELIMIT) will round up
the time limit anyway from 20 microseconds to 1/HZ (one millisecond?).
Thanks,
Bart.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v2 2/2] scsi: ufs: core: move some irq handling back to hardirq (with time limit)
2025-07-25 14:16 ` [PATCH v2 2/2] scsi: ufs: core: move some irq handling back to hardirq (with time limit) André Draszik
2025-07-25 21:08 ` Bart Van Assche
@ 2025-07-28 14:43 ` Neil Armstrong
2025-07-28 14:49 ` André Draszik
1 sibling, 1 reply; 9+ messages in thread
From: Neil Armstrong @ 2025-07-28 14:43 UTC (permalink / raw)
To: André Draszik, Alim Akhtar, Avri Altman, Bart Van Assche,
James E.J. Bottomley, Martin K. Petersen
Cc: Peter Griffin, Tudor Ambarus, Will McVicker,
Manivannan Sadhasivam, kernel-team, linux-arm-msm,
linux-samsung-soc, linux-scsi, linux-kernel, stable
On 25/07/2025 16:16, André Draszik wrote:
> Commit 3c7ac40d7322 ("scsi: ufs: core: Delegate the interrupt service
> routine to a threaded IRQ handler") introduced a massive performance
> drop for various work loads on UFSHC versions < 4 due to the extra
> latency introduced by moving all of the IRQ handling into a threaded
> handler. See below for a summary.
>
> To resolve this performance drop, move IRQ handling back into hardirq
> context, but apply a time limit which, once expired, will cause the
> remainder of the work to be deferred to the threaded handler.
>
> Above commit is trying to avoid unduly delay of other subsystem
> interrupts while the UFS events are being handled. By limiting the
> amount of time spent in hardirq context, we can still ensure that.
>
> The time limit itself was chosen because I have generally seen
> interrupt handling to have been completed within 20 usecs, with the
> occasional spikes of a couple 100 usecs.
>
> This commits brings UFS performance roughly back to original
> performance, and should still avoid other subsystem's starvation thanks
> to dealing with these spikes.
>
> fio results for 4k block size on Pixel 6, all values being the average
> of 5 runs each:
> read / 1 job original after this commit
> min IOPS 4,653.60 2,704.40 3,902.80
> max IOPS 6,151.80 4,847.60 6,103.40
> avg IOPS 5,488.82 4,226.61 5,314.89
> cpu % usr 1.85 1.72 1.97
> cpu % sys 32.46 28.88 33.29
> bw MB/s 21.46 16.50 20.76
>
> read / 8 jobs original after this commit
> min IOPS 18,207.80 11,323.00 17,911.80
> max IOPS 25,535.80 14,477.40 24,373.60
> avg IOPS 22,529.93 13,325.59 21,868.85
> cpu % usr 1.70 1.41 1.67
> cpu % sys 27.89 21.85 27.23
> bw MB/s 88.10 52.10 84.48
>
> write / 1 job original after this commit
> min IOPS 6,524.20 3,136.00 5,988.40
> max IOPS 7,303.60 5,144.40 7,232.40
> avg IOPS 7,169.80 4,608.29 7,014.66
> cpu % usr 2.29 2.34 2.23
> cpu % sys 41.91 39.34 42.48
> bw MB/s 28.02 18.00 27.42
>
> write / 8 jobs original after this commit
> min IOPS 12,685.40 13,783.00 12,622.40
> max IOPS 30,814.20 22,122.00 29,636.00
> avg IOPS 21,539.04 18,552.63 21,134.65
> cpu % usr 2.08 1.61 2.07
> cpu % sys 30.86 23.88 30.64
> bw MB/s 84.18 72.54 82.62
Thanks for this updated change, I'm running the exact same run on SM8650 to check the impact,
and I'll report something comparable.
Thanks,
Neil
>
> Closes: https://lore.kernel.org/all/1e06161bf49a3a88c4ea2e7a406815be56114c4f.camel@linaro.org
> Fixes: 3c7ac40d7322 ("scsi: ufs: core: Delegate the interrupt service routine to a threaded IRQ handler")
> Cc: stable@vger.kernel.org
> Signed-off-by: André Draszik <andre.draszik@linaro.org>
>
> ---
> v2:
> * update some inline & kerneldoc comments
> * mention 4k block size and 5 runs were used in fio runs
> * add missing jiffies.h include
> ---
> drivers/ufs/core/ufshcd.c | 191 +++++++++++++++++++++++++++++++++++++---------
> 1 file changed, 154 insertions(+), 37 deletions(-)
>
> diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
> index d8e2eabacd3efbf07458e81cc4d15ba7f05d3913..404a4e075a21e73d22ae6bb89f77f69aebb7cd6a 100644
> --- a/drivers/ufs/core/ufshcd.c
> +++ b/drivers/ufs/core/ufshcd.c
> @@ -19,6 +19,7 @@
> #include <linux/clk.h>
> #include <linux/delay.h>
> #include <linux/interrupt.h>
> +#include <linux/jiffies.h>
> #include <linux/module.h>
> #include <linux/pm_opp.h>
> #include <linux/regulator/consumer.h>
> @@ -111,6 +112,9 @@ enum {
> /* bMaxNumOfRTT is equal to two after device manufacturing */
> #define DEFAULT_MAX_NUM_RTT 2
>
> +/* Time limit in usecs for hardirq context */
> +#define HARDIRQ_TIMELIMIT 20
> +
> /* UFSHC 4.0 compliant HC support this mode. */
> static bool use_mcq_mode = true;
>
> @@ -5603,26 +5607,56 @@ void ufshcd_compl_one_cqe(struct ufs_hba *hba, int task_tag,
> * __ufshcd_transfer_req_compl - handle SCSI and query command completion
> * @hba: per adapter instance
> * @completed_reqs: bitmask that indicates which requests to complete
> + * @time_limit: time limit in jiffies to not exceed executing command completion
> + *
> + * This completes the individual requests as per @completed_reqs with an
> + * optional time limit. If a time limit is given and it expired before all
> + * requests were handled, the return value will indicate which requests have not
> + * been handled.
> + *
> + * Return: Bitmask that indicates which requests have not been completed due to
> + * time limit expiry.
> */
> -static void __ufshcd_transfer_req_compl(struct ufs_hba *hba,
> - unsigned long completed_reqs)
> +static unsigned long __ufshcd_transfer_req_compl(struct ufs_hba *hba,
> + unsigned long completed_reqs,
> + unsigned long time_limit)
> {
> int tag;
>
> - for_each_set_bit(tag, &completed_reqs, hba->nutrs)
> + for_each_set_bit(tag, &completed_reqs, hba->nutrs) {
> ufshcd_compl_one_cqe(hba, tag, NULL);
> + __clear_bit(tag, &completed_reqs);
> + if (time_limit && time_after_eq(jiffies, time_limit))
> + break;
> + }
> +
> + /* any bits still set represent unhandled requests due to timeout */
> + return completed_reqs;
> }
>
> -/*
> - * Return: > 0 if one or more commands have been completed or 0 if no
> - * requests have been completed.
> +/**
> + * ufshcd_poll_impl - handle SCSI and query command completion helper
> + * @shost: Scsi_Host instance
> + * @queue_num: The h/w queue number, or UFSHCD_POLL_FROM_INTERRUPT_CONTEXT when
> + * invoked from the interrupt handler
> + * @time_limit: time limit in jiffies to not exceed executing command completion
> + * @__pending: Pointer to store any still pending requests in case of time limit
> + * expiry
> + *
> + * This handles completed commands with an optional time limit. If a time limit
> + * is given and it expires, @__pending will be set to the requests that could
> + * not be completed in time and are still pending.
> + *
> + * Return: true if one or more commands have been completed, false otherwise.
> */
> -static int ufshcd_poll(struct Scsi_Host *shost, unsigned int queue_num)
> +static int ufshcd_poll_impl(struct Scsi_Host *shost, unsigned int queue_num,
> + unsigned long time_limit, unsigned long *__pending)
> {
> struct ufs_hba *hba = shost_priv(shost);
> unsigned long completed_reqs, flags;
> u32 tr_doorbell;
> struct ufs_hw_queue *hwq;
> + unsigned long pending = 0;
>
> if (hba->mcq_enabled) {
> hwq = &hba->uhq[queue_num];
> @@ -5636,15 +5670,34 @@ static int ufshcd_poll(struct Scsi_Host *shost, unsigned int queue_num)
> WARN_ONCE(completed_reqs & ~hba->outstanding_reqs,
> "completed: %#lx; outstanding: %#lx\n", completed_reqs,
> hba->outstanding_reqs);
> - hba->outstanding_reqs &= ~completed_reqs;
> +
> + if (completed_reqs) {
> + pending = __ufshcd_transfer_req_compl(hba, completed_reqs,
> + time_limit);
> + completed_reqs &= ~pending;
> + hba->outstanding_reqs &= ~completed_reqs;
> + }
> +
> spin_unlock_irqrestore(&hba->outstanding_lock, flags);
>
> - if (completed_reqs)
> - __ufshcd_transfer_req_compl(hba, completed_reqs);
> + if (__pending)
> + *__pending = pending;
>
> return completed_reqs != 0;
> }
>
> +/*
> + * ufshcd_poll - SCSI interface of blk_poll to poll for IO completions
> + * @shost: Scsi_Host instance
> + * @queue_num: The h/w queue number
> + *
> + * Return: true if one or more commands have been completed, false otherwise.
> + */
> +static int ufshcd_poll(struct Scsi_Host *shost, unsigned int queue_num)
> +{
> + return ufshcd_poll_impl(shost, queue_num, 0, NULL);
> +}
> +
> /**
> * ufshcd_mcq_compl_pending_transfer - MCQ mode function. It is
> * invoked from the error handler context or ufshcd_host_reset_and_restore()
> @@ -5698,13 +5751,19 @@ static void ufshcd_mcq_compl_pending_transfer(struct ufs_hba *hba,
> /**
> * ufshcd_transfer_req_compl - handle SCSI and query command completion
> * @hba: per adapter instance
> + * @time_limit: time limit in jiffies to not exceed executing command completion
> *
> * Return:
> - * IRQ_HANDLED - If interrupt is valid
> - * IRQ_NONE - If invalid interrupt
> + * IRQ_HANDLED - If interrupt is valid
> + * IRQ_WAKE_THREAD - If further interrupt processing should be delegated to the
> + * thread
> + * IRQ_NONE - If invalid interrupt
> */
> -static irqreturn_t ufshcd_transfer_req_compl(struct ufs_hba *hba)
> +static irqreturn_t ufshcd_transfer_req_compl(struct ufs_hba *hba,
> + unsigned long time_limit)
> {
> + unsigned long pending;
> +
> /* Resetting interrupt aggregation counters first and reading the
> * DOOR_BELL afterward allows us to handle all the completed requests.
> * In order to prevent other interrupts starvation the DB is read once
> @@ -5720,12 +5779,18 @@ static irqreturn_t ufshcd_transfer_req_compl(struct ufs_hba *hba)
> return IRQ_HANDLED;
>
> /*
> - * Ignore the ufshcd_poll() return value and return IRQ_HANDLED since we
> - * do not want polling to trigger spurious interrupt complaints.
> + * Ignore the ufshcd_poll() return value and return IRQ_HANDLED or
> + * IRQ_WAKE_THREAD since we do not want polling to trigger spurious
> + * interrupt complaints.
> */
> - ufshcd_poll(hba->host, 0);
> + ufshcd_poll_impl(hba->host, 0, time_limit, &pending);
>
> - return IRQ_HANDLED;
> + /*
> + * If a time limit was set, some request completions might not have been
> + * handled yet and will need to be dealt with in the threaded interrupt
> + * handler.
> + */
> + return pending ? IRQ_WAKE_THREAD : IRQ_HANDLED;
> }
>
> int __ufshcd_write_ee_control(struct ufs_hba *hba, u32 ee_ctrl_mask)
> @@ -6286,7 +6351,7 @@ static void ufshcd_complete_requests(struct ufs_hba *hba, bool force_compl)
> if (hba->mcq_enabled)
> ufshcd_mcq_compl_pending_transfer(hba, force_compl);
> else
> - ufshcd_transfer_req_compl(hba);
> + ufshcd_transfer_req_compl(hba, 0);
>
> ufshcd_tmc_handler(hba);
> }
> @@ -6988,12 +7053,16 @@ static irqreturn_t ufshcd_handle_mcq_cq_events(struct ufs_hba *hba)
> * ufshcd_sl_intr - Interrupt service routine
> * @hba: per adapter instance
> * @intr_status: contains interrupts generated by the controller
> + * @time_limit: time limit in jiffies to not exceed executing command completion
> *
> * Return:
> - * IRQ_HANDLED - If interrupt is valid
> - * IRQ_NONE - If invalid interrupt
> + * IRQ_HANDLED - If interrupt is valid
> + * IRQ_WAKE_THREAD - If further interrupt processing should be delegated to the
> + * thread
> + * IRQ_NONE - If invalid interrupt
> */
> -static irqreturn_t ufshcd_sl_intr(struct ufs_hba *hba, u32 intr_status)
> +static irqreturn_t ufshcd_sl_intr(struct ufs_hba *hba, u32 intr_status,
> + unsigned long time_limit)
> {
> irqreturn_t retval = IRQ_NONE;
>
> @@ -7007,7 +7076,7 @@ static irqreturn_t ufshcd_sl_intr(struct ufs_hba *hba, u32 intr_status)
> retval |= ufshcd_tmc_handler(hba);
>
> if (intr_status & UTP_TRANSFER_REQ_COMPL)
> - retval |= ufshcd_transfer_req_compl(hba);
> + retval |= ufshcd_transfer_req_compl(hba, time_limit);
>
> if (intr_status & MCQ_CQ_EVENT_STATUS)
> retval |= ufshcd_handle_mcq_cq_events(hba);
> @@ -7016,15 +7085,25 @@ static irqreturn_t ufshcd_sl_intr(struct ufs_hba *hba, u32 intr_status)
> }
>
> /**
> - * ufshcd_threaded_intr - Threaded interrupt service routine
> + * ufshcd_intr_helper - hardirq and threaded interrupt service routine
> * @irq: irq number
> * @__hba: pointer to adapter instance
> + * @time_limit: time limit in jiffies to not exceed during execution
> + *
> + * Interrupts are initially served from hardirq context with a time limit, but
> + * if there is more work to be done than can be completed before the limit
> + * expires, remaining work is delegated to the IRQ thread. This helper does the
> + * bulk of the work in either case - if @time_limit is set, it is being run from
> + * hardirq context, otherwise from the threaded interrupt handler.
> *
> * Return:
> - * IRQ_HANDLED - If interrupt is valid
> - * IRQ_NONE - If invalid interrupt
> + * IRQ_HANDLED - If interrupt was fully handled
> + * IRQ_WAKE_THREAD - If further interrupt processing should be delegated to the
> + * thread
> + * IRQ_NONE - If invalid interrupt
> */
> -static irqreturn_t ufshcd_threaded_intr(int irq, void *__hba)
> +static irqreturn_t ufshcd_intr_helper(int irq, void *__hba,
> + unsigned long time_limit)
> {
> u32 last_intr_status, intr_status, enabled_intr_status = 0;
> irqreturn_t retval = IRQ_NONE;
> @@ -7038,15 +7117,22 @@ static irqreturn_t ufshcd_threaded_intr(int irq, void *__hba)
> * if the reqs get finished 1 by 1 after the interrupt status is
> * read, make sure we handle them by checking the interrupt status
> * again in a loop until we process all of the reqs before returning.
> + * This is done until the time limit is exceeded, at which point further
> + * processing is delegated to the threaded handler.
> */
> - while (intr_status && retries--) {
> + while (intr_status && !(retval & IRQ_WAKE_THREAD) && retries--) {
> enabled_intr_status =
> intr_status & ufshcd_readl(hba, REG_INTERRUPT_ENABLE);
> ufshcd_writel(hba, intr_status, REG_INTERRUPT_STATUS);
> if (enabled_intr_status)
> - retval |= ufshcd_sl_intr(hba, enabled_intr_status);
> + retval |= ufshcd_sl_intr(hba, enabled_intr_status,
> + time_limit);
>
> intr_status = ufshcd_readl(hba, REG_INTERRUPT_STATUS);
> +
> + if (intr_status && time_limit && time_after_eq(jiffies,
> + time_limit))
> + retval |= IRQ_WAKE_THREAD;
> }
>
> if (enabled_intr_status && retval == IRQ_NONE &&
> @@ -7063,6 +7149,20 @@ static irqreturn_t ufshcd_threaded_intr(int irq, void *__hba)
> return retval;
> }
>
> +/**
> + * ufshcd_threaded_intr - Threaded interrupt service routine
> + * @irq: irq number
> + * @__hba: pointer to adapter instance
> + *
> + * Return:
> + * IRQ_HANDLED - If interrupt was fully handled
> + * IRQ_NONE - If invalid interrupt
> + */
> +static irqreturn_t ufshcd_threaded_intr(int irq, void *__hba)
> +{
> + return ufshcd_intr_helper(irq, __hba, 0);
> +}
> +
> /**
> * ufshcd_intr - Main interrupt service routine
> * @irq: irq number
> @@ -7070,20 +7170,37 @@ static irqreturn_t ufshcd_threaded_intr(int irq, void *__hba)
> *
> * Return:
> * IRQ_HANDLED - If interrupt is valid
> - * IRQ_WAKE_THREAD - If handling is moved to threaded handled
> + * IRQ_WAKE_THREAD - If handling is moved to threaded handler
> * IRQ_NONE - If invalid interrupt
> */
> static irqreturn_t ufshcd_intr(int irq, void *__hba)
> {
> struct ufs_hba *hba = __hba;
> + unsigned long time_limit = jiffies +
> + usecs_to_jiffies(HARDIRQ_TIMELIMIT);
>
> - /* Move interrupt handling to thread when MCQ & ESI are not enabled */
> - if (!hba->mcq_enabled || !hba->mcq_esi_enabled)
> - return IRQ_WAKE_THREAD;
> + /*
> + * Directly handle interrupts when MCQ & ESI are enabled since MCQ
> + * ESI handlers do the hard job.
> + */
> + if (hba->mcq_enabled && hba->mcq_esi_enabled)
> + return ufshcd_sl_intr(hba,
> + ufshcd_readl(hba, REG_INTERRUPT_STATUS) &
> + ufshcd_readl(hba, REG_INTERRUPT_ENABLE),
> + 0);
>
> - /* Directly handle interrupts since MCQ ESI handlers does the hard job */
> - return ufshcd_sl_intr(hba, ufshcd_readl(hba, REG_INTERRUPT_STATUS) &
> - ufshcd_readl(hba, REG_INTERRUPT_ENABLE));
> + /*
> + * Otherwise handle interrupt in hardirq context until the time limit
> + * expires, at which point the remaining work will be completed in
> + * interrupt thread context.
> + */
> + if (!time_limit)
> + /*
> + * To deal with jiffies wrapping, we just add one so that other
> + * code can reliably detect if a time limit was requested.
> + */
> + time_limit++;
> + return ufshcd_intr_helper(irq, __hba, time_limit);
> }
>
> static int ufshcd_clear_tm_cmd(struct ufs_hba *hba, int tag)
> @@ -7516,7 +7633,7 @@ static int ufshcd_eh_device_reset_handler(struct scsi_cmnd *cmd)
> __func__, pos);
> }
> }
> - __ufshcd_transfer_req_compl(hba, pending_reqs & ~not_cleared_mask);
> + __ufshcd_transfer_req_compl(hba, pending_reqs & ~not_cleared_mask, 0);
>
> out:
> hba->req_abort_count = 0;
> @@ -7672,7 +7789,7 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
> dev_err(hba->dev,
> "%s: cmd was completed, but without a notifying intr, tag = %d",
> __func__, tag);
> - __ufshcd_transfer_req_compl(hba, 1UL << tag);
> + __ufshcd_transfer_req_compl(hba, 1UL << tag, 0);
> goto release;
> }
>
>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v2 2/2] scsi: ufs: core: move some irq handling back to hardirq (with time limit)
2025-07-28 14:43 ` Neil Armstrong
@ 2025-07-28 14:49 ` André Draszik
2025-07-28 15:19 ` Bart Van Assche
0 siblings, 1 reply; 9+ messages in thread
From: André Draszik @ 2025-07-28 14:49 UTC (permalink / raw)
To: Neil Armstrong, Alim Akhtar, Avri Altman, Bart Van Assche,
James E.J. Bottomley, Martin K. Petersen
Cc: Peter Griffin, Tudor Ambarus, Will McVicker,
Manivannan Sadhasivam, kernel-team, linux-arm-msm,
linux-samsung-soc, linux-scsi, linux-kernel, stable
On Mon, 2025-07-28 at 16:43 +0200, Neil Armstrong wrote:
> On 25/07/2025 16:16, André Draszik wrote:
> > Commit 3c7ac40d7322 ("scsi: ufs: core: Delegate the interrupt service
> > routine to a threaded IRQ handler") introduced a massive performance
> > drop for various work loads on UFSHC versions < 4 due to the extra
> > latency introduced by moving all of the IRQ handling into a threaded
> > handler. See below for a summary.
> >
> > To resolve this performance drop, move IRQ handling back into hardirq
> > context, but apply a time limit which, once expired, will cause the
> > remainder of the work to be deferred to the threaded handler.
> >
> > Above commit is trying to avoid unduly delay of other subsystem
> > interrupts while the UFS events are being handled. By limiting the
> > amount of time spent in hardirq context, we can still ensure that.
> >
> > The time limit itself was chosen because I have generally seen
> > interrupt handling to have been completed within 20 usecs, with the
> > occasional spikes of a couple 100 usecs.
> >
> > This commits brings UFS performance roughly back to original
> > performance, and should still avoid other subsystem's starvation thanks
> > to dealing with these spikes.
> >
> > fio results for 4k block size on Pixel 6, all values being the average
> > of 5 runs each:
> > read / 1 job original after this commit
> > min IOPS 4,653.60 2,704.40 3,902.80
> > max IOPS 6,151.80 4,847.60 6,103.40
> > avg IOPS 5,488.82 4,226.61 5,314.89
> > cpu % usr 1.85 1.72 1.97
> > cpu % sys 32.46 28.88 33.29
> > bw MB/s 21.46 16.50 20.76
> >
> > read / 8 jobs original after this commit
> > min IOPS 18,207.80 11,323.00 17,911.80
> > max IOPS 25,535.80 14,477.40 24,373.60
> > avg IOPS 22,529.93 13,325.59 21,868.85
> > cpu % usr 1.70 1.41 1.67
> > cpu % sys 27.89 21.85 27.23
> > bw MB/s 88.10 52.10 84.48
> >
> > write / 1 job original after this commit
> > min IOPS 6,524.20 3,136.00 5,988.40
> > max IOPS 7,303.60 5,144.40 7,232.40
> > avg IOPS 7,169.80 4,608.29 7,014.66
> > cpu % usr 2.29 2.34 2.23
> > cpu % sys 41.91 39.34 42.48
> > bw MB/s 28.02 18.00 27.42
> >
> > write / 8 jobs original after this commit
> > min IOPS 12,685.40 13,783.00 12,622.40
> > max IOPS 30,814.20 22,122.00 29,636.00
> > avg IOPS 21,539.04 18,552.63 21,134.65
> > cpu % usr 2.08 1.61 2.07
> > cpu % sys 30.86 23.88 30.64
> > bw MB/s 84.18 72.54 82.62
>
> Thanks for this updated change, I'm running the exact same run on SM8650 to check the impact,
> and I'll report something comparable.
Btw, my complete command was (should probably have added that
to the commit message in the first place):
for rw in read write ; do
echo "rw: ${rw}"
for jobs in 1 8 ; do
echo "jobs: ${jobs}"
for it in $(seq 1 5) ; do
fio --name=rand${rw} --rw=rand${rw} \
--ioengine=libaio --direct=1 \
--bs=4k --numjobs=${jobs} --size=32m \
--runtime=30 --time_based --end_fsync=1 \
--group_reporting --filename=/foo \
| grep -E '(iops|sys=|READ:|WRITE:)'
sleep 5
done
done
done
Cheers,
Andre'
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v2 2/2] scsi: ufs: core: move some irq handling back to hardirq (with time limit)
2025-07-28 14:49 ` André Draszik
@ 2025-07-28 15:19 ` Bart Van Assche
2025-07-28 16:55 ` Neil Armstrong
0 siblings, 1 reply; 9+ messages in thread
From: Bart Van Assche @ 2025-07-28 15:19 UTC (permalink / raw)
To: André Draszik, Neil Armstrong, Alim Akhtar, Avri Altman,
James E.J. Bottomley, Martin K. Petersen
Cc: Peter Griffin, Tudor Ambarus, Will McVicker,
Manivannan Sadhasivam, kernel-team, linux-arm-msm,
linux-samsung-soc, linux-scsi, linux-kernel, stable
On 7/28/25 7:49 AM, André Draszik wrote:
> Btw, my complete command was (should probably have added that
> to the commit message in the first place):
>
> for rw in read write ; do
> echo "rw: ${rw}"
> for jobs in 1 8 ; do
> echo "jobs: ${jobs}"
> for it in $(seq 1 5) ; do
> fio --name=rand${rw} --rw=rand${rw} \
> --ioengine=libaio --direct=1 \
> --bs=4k --numjobs=${jobs} --size=32m \
> --runtime=30 --time_based --end_fsync=1 \
> --group_reporting --filename=/foo \
> | grep -E '(iops|sys=|READ:|WRITE:)'
> sleep 5
> done
> done
> done
Please run performance tests in recovery mode against a block
device (/dev/block/sd...) instead of running performance tests on
top of a filesystem. One possible approach for retrieving the block
device name is as follows:
adb shell readlink /dev/block/by-name/userdata
There may be other approaches for retrieving the name of the block
device associated with /data. Additionally, tuning for maximum
performance is useful because it eliminates impact from the process
scheduler on block device performance measurement. An extract from a
scrip that I use myself to measure block device performance on Pixel
devices is available below.
Best regards,
Bart.
optimize() {
local clkgate_enable c d devfreq disable_cpuidle governor nomerges
iostats
local target_freq ufs_irq_path
if [ "$1" = performance ]; then
clkgate_enable=0
devfreq=max
disable_cpuidle=1
governor=performance
# Enable I/O statistics because the performance impact is low and
# because fio reports the I/O statistics.
iostats=1
# Disable merging to make tests follow the fio arguments.
nomerges=2
target_freq=cpuinfo_max_freq
persist_logs=false
else
clkgate_enable=1
devfreq=min
disable_cpuidle=0
governor=sched_pixel
iostats=1
nomerges=0
target_freq=cpuinfo_min_freq
persist_logs=true
fi
for c in $(adb shell "echo /sys/devices/system/cpu/cpu[0-9]*"); do
for d in $(adb shell "echo $c/cpuidle/state[1-9]*"); do
adb shell "if [ -e $d ]; then echo $disable_cpuidle > $d/disable; fi"
done
adb shell "cat $c/cpufreq/cpuinfo_max_freq > $c/cpufreq/scaling_max_freq;
cat $c/cpufreq/${target_freq} >
$c/cpufreq/scaling_min_freq;
echo ${governor} > $c/cpufreq/scaling_governor; true" \
2>/dev/null
done
if [ "$(adb shell grep -c ufshcd /proc/interrupts)" = 1 ]; then
# No MCQ or MCQ disabled. Make the fastest CPU core process UFS
# interrupts.
# shellcheck disable=SC2016
ufs_irq_path=$(adb shell 'a=$(echo /proc/irq/*/ufshcd); echo ${a%/ufshcd}')
adb shell "echo ${fastest_cpucore} > ${ufs_irq_path}/smp_affinity_list;
true"
else
# MCQ is enabled. Distribute the completion interrupts over the
# available CPU cores.
local i=0
local irqs
irqs=$(adb shell "sed -n 's/:.*GIC.*ufshcd.*//p' /proc/interrupts")
for irq in $irqs; do
adb shell "echo $i > /proc/irq/$irq/smp_affinity_list; true"
i=$((i+1))
done
fi
for d in $(adb shell echo /sys/class/devfreq/*); do
case "$d" in
*gpu0)
continue
;;
esac
local min_freq
min_freq=$(adb shell "cat $d/available_frequencies |
tr ' ' '\n' |
sort -n |
case $devfreq in
min) head -n1;;
max) tail -n1;;
esac")
adb shell "echo $min_freq > $d/min_freq"
# shellcheck disable=SC2086
if [ "$devfreq" = "max" ]; then
echo "$(basename $d)/min_freq: $(adb shell cat $d/min_freq) <>
$min_freq"
fi
done
for d in $(adb shell echo /sys/devices/platform/*.ufs); do
adb shell "echo $clkgate_enable > $d/clkgate_enable"
done
adb shell setprop logd.logpersistd.enable ${persist_logs}
adb shell "for b in /sys/class/block/{sd[a-z],dm*}; do
if [ -e \$b ]; then
[ -e \$b/queue/iostats ] && echo ${iostats} >\$b/queue/iostats;
[ -e \$b/queue/nomerges ] && echo ${nomerges} >\$b/queue/nomerges;
[ -e \$b/queue/rq_affinity ] && echo 2 >\$b/queue/rq_affinity;
[ -e \$b/queue/scheduler ] && echo ${iosched} >\$b/queue/scheduler;
fi
done; true"
adb shell "grep -q '^[^[:blank:]]* /sys/kernel/debug' /proc/mounts
|| mount -t debugfs none /sys/kernel/debug"
}
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v2 2/2] scsi: ufs: core: move some irq handling back to hardirq (with time limit)
2025-07-28 15:19 ` Bart Van Assche
@ 2025-07-28 16:55 ` Neil Armstrong
2025-07-29 11:20 ` André Draszik
0 siblings, 1 reply; 9+ messages in thread
From: Neil Armstrong @ 2025-07-28 16:55 UTC (permalink / raw)
To: Bart Van Assche, André Draszik, Alim Akhtar, Avri Altman,
James E.J. Bottomley, Martin K. Petersen
Cc: Peter Griffin, Tudor Ambarus, Will McVicker,
Manivannan Sadhasivam, kernel-team, linux-arm-msm,
linux-samsung-soc, linux-scsi, linux-kernel, stable
Hi,
On 28/07/2025 17:19, Bart Van Assche wrote:
> On 7/28/25 7:49 AM, André Draszik wrote:
>> Btw, my complete command was (should probably have added that
>> to the commit message in the first place):
>>
>> for rw in read write ; do
>> echo "rw: ${rw}"
>> for jobs in 1 8 ; do
>> echo "jobs: ${jobs}"
>> for it in $(seq 1 5) ; do
>> fio --name=rand${rw} --rw=rand${rw} \
>> --ioengine=libaio --direct=1 \
>> --bs=4k --numjobs=${jobs} --size=32m \
>> --runtime=30 --time_based --end_fsync=1 \
>> --group_reporting --filename=/foo \
>> | grep -E '(iops|sys=|READ:|WRITE:)'
>> sleep 5
>> done
>> done
>> done
>
> Please run performance tests in recovery mode against a block
> device (/dev/block/sd...) instead of running performance tests on
> top of a filesystem. One possible approach for retrieving the block
> device name is as follows:
>
> adb shell readlink /dev/block/by-name/userdata
>
> There may be other approaches for retrieving the name of the block
> device associated with /data. Additionally, tuning for maximum
> performance is useful because it eliminates impact from the process
> scheduler on block device performance measurement. An extract from a
> scrip that I use myself to measure block device performance on Pixel
> devices is available below.
Of course, I did all that and ran on the SM8650 QRD & HDK boards, one has
an UFS 3.1 device and the other an UFS 4.0 device.
Here's the raw data:
Board: sm8650-qrd
read / 1 job
v6.15 v6.16 v6.16 + this commit
min IOPS 3,996.00 5,921.60 3,424.80
max IOPS 4,772.80 6,491.20 4,541.20
avg IOPS 4,526.25 6,295.31 4,320.58
cpu % usr 4.62 2.96 4.50
cpu % sys 21.45 17.88 21.62
bw MB/s 18.54 25.78 17.64
read / 8 job
v6.15 v6.16 v6.16 + this commit
min IOPS 51,867.60 51,575.40 45,257.00
max IOPS 67,513.60 64,456.40 56,336.00
avg IOPS 64,314.80 62,136.76 52,505.72
cpu % usr 3.98 3.72 3.52
cpu % sys 16.70 17.16 18.74
bw MB/s 263.60 254.40 215.00
write / 1 job
v6.15 v6.16 v6.16 + this commit
min IOPS 5,654.80 8,060.00 5,730.80
max IOPS 6,720.40 8,852.00 6,981.20
avg IOPS 6,576.91 8,579.81 6,726.51
cpu % usr 7.48 3.79 8.49
cpu % sys 41.09 23.27 34.86
bw MB/s 26.96 35.16 27.52
write / 8 job
v6.15 v6.16 v6.16 + this commit
min IOPS 84,687.80 95,043.40 74,799.60
max IOPS 107,620.80 113,572.00 96,377.20
avg IOPS 97,910.86 105,927.38 87,239.07
cpu % usr 5.43 4.38 3.72
cpu % sys 21.73 20.29 30.97
bw MB/s 400.80 433.80 357.40
Board: sm8650-hdk
read / 1 job
v6.15 v6.16 v6.16 + this commit
min IOPS 4,867.20 5,596.80 4,242.80
max IOPS 5,211.60 5,970.00 4,548.80
avg IOPS 5,126.12 5,847.93 4,370.14
cpu % usr 3.83 2.81 2.62
cpu % sys 18.29 13.44 16.89
bw MB/s 20.98 17.88 23.96
read / 8 job
v6.15 v6.16 v6.16 + this commit
min IOPS 47,583.80 46,831.60 47,671.20
max IOPS 58,913.20 59,442.80 56,282.80
avg IOPS 53,609.04 44,396.88 53,621.46
cpu % usr 3.57 3.06 3.11
cpu % sys 15.23 19.31 15.90
bw MB/s 219.40 219.60 210.80
write / 1 job
v6.15 v6.16 v6.16 + this commit
min IOPS 6,529.42 8,367.20 6,492.80
max IOPS 7,856.92 9,244.40 7,184.80
avg IOPS 7,676.21 8,991.67 6,904.67
cpu % usr 10.17 7.98 3.68
cpu % sys 37.55 34.41 23.07
bw MB/s 31.44 28.28 36.84
write / 8 job
v6.15 v6.16 v6.16 + this commit
min IOPS 86,304.60 94,288.80 78,433.60
max IOPS 105,670.80 110,373.60 96,330.80
avg IOPS 97,418.81 103,789.76 88,468.27
cpu % usr 4.98 3.27 3.67
cpu % sys 21.45 30.85 20.08
bw MB/s 399.00 362.40 425.00
Assisted analysis gives:
IOPS (Input/Output Operations Per Second):
The v6.16 kernel shows a slight increase in average IOPS compared to v6.15 (43245.69 vs. 42144.88).
The v6.16+fix kernel significantly reduces average IOPS, dropping to 36946.17.
Bandwidth (MB/s):
The v6.16 kernel shows an increase in average bandwidth compared to v6.15 (180.72 MB/s vs. 172.59 MB/s).
The v6.16 with this commit significantly reduces average bandwidth, dropping to 151.32 MB/s.
Detailed Analysis:
Impact of v6.16 Kernel:
The v6.16 kernel introduces a minor improvement in IO performance compared to v6.15.
Both average IOPS and average bandwidth saw a small increase. This suggests that the v6.16
kernel might have introduced some optimizations that slightly improved overall IO performance.
Impact of the Fix:
The potential introduced appears to have a negative impact on both IOPS and bandwidth.
Both metrics show a substantial decrease compared to both v6.15 and v6.16.
This indicates that the fix might be detrimental to IO performance.
The threaded IRQ change did increase IOPS and Bandwidth, and stopped starving interrupts.
This change gives worse numbers than before the threaded IRQ.
Neil
>
> Best regards,
>
> Bart.
>
>
> optimize() {
> local clkgate_enable c d devfreq disable_cpuidle governor nomerges iostats
> local target_freq ufs_irq_path
>
> if [ "$1" = performance ]; then
> clkgate_enable=0
> devfreq=max
> disable_cpuidle=1
> governor=performance
> # Enable I/O statistics because the performance impact is low and
> # because fio reports the I/O statistics.
> iostats=1
> # Disable merging to make tests follow the fio arguments.
> nomerges=2
> target_freq=cpuinfo_max_freq
> persist_logs=false
> else
> clkgate_enable=1
> devfreq=min
> disable_cpuidle=0
> governor=sched_pixel
> iostats=1
> nomerges=0
> target_freq=cpuinfo_min_freq
> persist_logs=true
> fi
>
> for c in $(adb shell "echo /sys/devices/system/cpu/cpu[0-9]*"); do
> for d in $(adb shell "echo $c/cpuidle/state[1-9]*"); do
> adb shell "if [ -e $d ]; then echo $disable_cpuidle > $d/disable; fi"
> done
> adb shell "cat $c/cpufreq/cpuinfo_max_freq > $c/cpufreq/scaling_max_freq;
> cat $c/cpufreq/${target_freq} > $c/cpufreq/scaling_min_freq;
> echo ${governor} > $c/cpufreq/scaling_governor; true" \
> 2>/dev/null
> done
>
> if [ "$(adb shell grep -c ufshcd /proc/interrupts)" = 1 ]; then
> # No MCQ or MCQ disabled. Make the fastest CPU core process UFS
> # interrupts.
> # shellcheck disable=SC2016
> ufs_irq_path=$(adb shell 'a=$(echo /proc/irq/*/ufshcd); echo ${a%/ufshcd}')
> adb shell "echo ${fastest_cpucore} > ${ufs_irq_path}/smp_affinity_list; true"
> else
> # MCQ is enabled. Distribute the completion interrupts over the
> # available CPU cores.
> local i=0
> local irqs
> irqs=$(adb shell "sed -n 's/:.*GIC.*ufshcd.*//p' /proc/interrupts")
> for irq in $irqs; do
> adb shell "echo $i > /proc/irq/$irq/smp_affinity_list; true"
> i=$((i+1))
> done
> fi
>
> for d in $(adb shell echo /sys/class/devfreq/*); do
> case "$d" in
> *gpu0)
> continue
> ;;
> esac
> local min_freq
> min_freq=$(adb shell "cat $d/available_frequencies |
> tr ' ' '\n' |
> sort -n |
> case $devfreq in
> min) head -n1;;
> max) tail -n1;;
> esac")
> adb shell "echo $min_freq > $d/min_freq"
> # shellcheck disable=SC2086
> if [ "$devfreq" = "max" ]; then
> echo "$(basename $d)/min_freq: $(adb shell cat $d/min_freq) <> $min_freq"
> fi
> done
>
> for d in $(adb shell echo /sys/devices/platform/*.ufs); do
> adb shell "echo $clkgate_enable > $d/clkgate_enable"
> done
>
> adb shell setprop logd.logpersistd.enable ${persist_logs}
>
> adb shell "for b in /sys/class/block/{sd[a-z],dm*}; do
> if [ -e \$b ]; then
> [ -e \$b/queue/iostats ] && echo ${iostats} >\$b/queue/iostats;
> [ -e \$b/queue/nomerges ] && echo ${nomerges} >\$b/queue/nomerges;
> [ -e \$b/queue/rq_affinity ] && echo 2 >\$b/queue/rq_affinity;
> [ -e \$b/queue/scheduler ] && echo ${iosched} >\$b/queue/scheduler;
> fi
> done; true"
>
> adb shell "grep -q '^[^[:blank:]]* /sys/kernel/debug' /proc/mounts || mount -t debugfs none /sys/kernel/debug"
> }
>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v2 2/2] scsi: ufs: core: move some irq handling back to hardirq (with time limit)
2025-07-28 16:55 ` Neil Armstrong
@ 2025-07-29 11:20 ` André Draszik
0 siblings, 0 replies; 9+ messages in thread
From: André Draszik @ 2025-07-29 11:20 UTC (permalink / raw)
To: Neil Armstrong, Bart Van Assche, Alim Akhtar, Avri Altman,
James E.J. Bottomley, Martin K. Petersen
Cc: Peter Griffin, Tudor Ambarus, Will McVicker,
Manivannan Sadhasivam, kernel-team, linux-arm-msm,
linux-samsung-soc, linux-scsi, linux-kernel, stable
On Mon, 2025-07-28 at 18:55 +0200, Neil Armstrong wrote:
> Hi,
>
> On 28/07/2025 17:19, Bart Van Assche wrote:
> > On 7/28/25 7:49 AM, André Draszik wrote:
> > > Btw, my complete command was (should probably have added that
> > > to the commit message in the first place):
> > >
> > > for rw in read write ; do
> > > echo "rw: ${rw}"
> > > for jobs in 1 8 ; do
> > > echo "jobs: ${jobs}"
> > > for it in $(seq 1 5) ; do
> > > fio --name=rand${rw} --rw=rand${rw} \
> > > --ioengine=libaio --direct=1 \
> > > --bs=4k --numjobs=${jobs} --size=32m \
> > > --runtime=30 --time_based --end_fsync=1 \
> > > --group_reporting --filename=/foo \
> > > | grep -E '(iops|sys=|READ:|WRITE:)'
> > > sleep 5
> > > done
> > > done
> > > done
> >
> > Please run performance tests in recovery mode against a block
> > device (/dev/block/sd...) instead of running performance tests on
> > top of a filesystem. One possible approach for retrieving the block
> > device name is as follows:
> >
> > adb shell readlink /dev/block/by-name/userdata
> >
> > There may be other approaches for retrieving the name of the block
> > device associated with /data. Additionally, tuning for maximum
> > performance is useful because it eliminates impact from the process
> > scheduler on block device performance measurement. An extract from a
> > scrip that I use myself to measure block device performance on Pixel
> > devices is available below.
>
> Of course, I did all that and ran on the SM8650 QRD & HDK boards, one has
> an UFS 3.1 device and the other an UFS 4.0 device.
>
> Here's the raw data:
>
> Board: sm8650-qrd
> read / 1 job
> v6.15 v6.16 v6.16 + this commit
> min IOPS 3,996.00 5,921.60 3,424.80
> max IOPS 4,772.80 6,491.20 4,541.20
> avg IOPS 4,526.25 6,295.31 4,320.58
> cpu % usr 4.62 2.96 4.50
> cpu % sys 21.45 17.88 21.62
> bw MB/s 18.54 25.78 17.64
>
> read / 8 job
> v6.15 v6.16 v6.16 + this commit
> min IOPS 51,867.60 51,575.40 45,257.00
> max IOPS 67,513.60 64,456.40 56,336.00
> avg IOPS 64,314.80 62,136.76 52,505.72
> cpu % usr 3.98 3.72 3.52
> cpu % sys 16.70 17.16 18.74
> bw MB/s 263.60 254.40 215.00
>
> write / 1 job
> v6.15 v6.16 v6.16 + this commit
> min IOPS 5,654.80 8,060.00 5,730.80
> max IOPS 6,720.40 8,852.00 6,981.20
> avg IOPS 6,576.91 8,579.81 6,726.51
> cpu % usr 7.48 3.79 8.49
> cpu % sys 41.09 23.27 34.86
> bw MB/s 26.96 35.16 27.52
>
> write / 8 job
> v6.15 v6.16 v6.16 + this commit
> min IOPS 84,687.80 95,043.40 74,799.60
> max IOPS 107,620.80 113,572.00 96,377.20
> avg IOPS 97,910.86 105,927.38 87,239.07
> cpu % usr 5.43 4.38 3.72
> cpu % sys 21.73 20.29 30.97
> bw MB/s 400.80 433.80 357.40
>
> Board: sm8650-hdk
> read / 1 job
> v6.15 v6.16 v6.16 + this commit
> min IOPS 4,867.20 5,596.80 4,242.80
> max IOPS 5,211.60 5,970.00 4,548.80
> avg IOPS 5,126.12 5,847.93 4,370.14
> cpu % usr 3.83 2.81 2.62
> cpu % sys 18.29 13.44 16.89
> bw MB/s 20.98 17.88 23.96
>
> read / 8 job
> v6.15 v6.16 v6.16 + this commit
> min IOPS 47,583.80 46,831.60 47,671.20
> max IOPS 58,913.20 59,442.80 56,282.80
> avg IOPS 53,609.04 44,396.88 53,621.46
> cpu % usr 3.57 3.06 3.11
> cpu % sys 15.23 19.31 15.90
> bw MB/s 219.40 219.60 210.80
>
> write / 1 job
> v6.15 v6.16 v6.16 + this commit
> min IOPS 6,529.42 8,367.20 6,492.80
> max IOPS 7,856.92 9,244.40 7,184.80
> avg IOPS 7,676.21 8,991.67 6,904.67
> cpu % usr 10.17 7.98 3.68
> cpu % sys 37.55 34.41 23.07
> bw MB/s 31.44 28.28 36.84
>
> write / 8 job
> v6.15 v6.16 v6.16 + this commit
> min IOPS 86,304.60 94,288.80 78,433.60
> max IOPS 105,670.80 110,373.60 96,330.80
> avg IOPS 97,418.81 103,789.76 88,468.27
> cpu % usr 4.98 3.27 3.67
> cpu % sys 21.45 30.85 20.08
> bw MB/s 399.00 362.40 425.00
>
> Assisted analysis gives:
>
> IOPS (Input/Output Operations Per Second):
> The v6.16 kernel shows a slight increase in average IOPS compared to v6.15 (43245.69 vs. 42144.88).
> The v6.16+fix kernel significantly reduces average IOPS, dropping to 36946.17.
>
> Bandwidth (MB/s):
> The v6.16 kernel shows an increase in average bandwidth compared to v6.15 (180.72 MB/s vs. 172.59 MB/s).
> The v6.16 with this commit significantly reduces average bandwidth, dropping to 151.32 MB/s.
>
> Detailed Analysis:
> Impact of v6.16 Kernel:
> The v6.16 kernel introduces a minor improvement in IO performance compared to v6.15.
> Both average IOPS and average bandwidth saw a small increase. This suggests that the v6.16
> kernel might have introduced some optimizations that slightly improved overall IO performance.
>
> Impact of the Fix:
> The potential introduced appears to have a negative impact on both IOPS and bandwidth.
> Both metrics show a substantial decrease compared to both v6.15 and v6.16.
> This indicates that the fix might be detrimental to IO performance.
>
> The threaded IRQ change did increase IOPS and Bandwidth, and stopped starving interrupts.
> This change gives worse numbers than before the threaded IRQ.
Thanks Neil for your numbers.
So there must be more to it... I was interested in overall performance
originally, and using block layer access and disabling all kernel debug
options, the absolute numbers of course change for me, but the general
trend is still the same:
fio results for 4k block size on Pixel 6, all values being the average
of 5 runs each:
read / 1 job original after this commit
min IOPS 7,741.60 6,500.00 (-16.04%) 9,175.60 ( 18.52%)
max IOPS 11,548.80 8,217.60 (-28.84%) 11,476.80 (- 0.62%)
avg IOPS 10,356.69 7,143.21 (-31.03%) 11,098.65 ( 7.16%)
cpu % usr 4.31 4.93 ( 14.33%) 3.34 (-22.63%)
cpu % sys 22.12 25.08 ( 13.40%) 18.14 (-17.97%)
bw MB/s 40.46 27.92 (-30.99%) 43.34 ( 7.12%)
read / 8 jobs original after this commit
min IOPS 53,834.00 49,145.20 (- 8.71%) 52,202.40 (- 3.03%)
max IOPS 62,489.20 55,378.00 (-11.38%) 61,207.20 (- 2.05%)
avg IOPS 60,733.97 52,305.85 (-13.88%) 58,617.45 (- 3.48%)
cpu % usr 5.59 4.24 (-24.22%) 5.31 (- 4.94%)
cpu % sys 28.32 21.56 (-23.85%) 27.44 (- 3.08%)
bw MB/s 237.40 204.40 (-13.90%) 228.80 (- 3.62%)
write / 1 job original after this commit
min IOPS 11,438.00 10,173.60 (-11.05%) 16,418.40 ( 43.54%)
max IOPS 19,752.00 13,366.80 (-32.33%) 19,666.00 (- 0.44%)
avg IOPS 18,329.57 11,656.83 (-36.40%) 18,685.83 ( 1.94%)
cpu % usr 6.46 6.30 (- 2.60%) 5.73 (-11.29%)
cpu % sys 33.74 31.34 (- 7.11%) 30.83 (- 8.63%)
bw MB/s 71.60 45.52 (-36.42%) 73.72 ( 2.96%)
write / 8 jobs original after this commit
min IOPS 68,824.20 59,397.20 (-13.70%) 60,699.60 (-11.80%)
max IOPS 89,382.00 68,318.60 (-23.57%) 90,247.00 ( 0.97%)
avg IOPS 79,589.76 65,048.45 (-18.27%) 75,100.38 (- 5.64%)
cpu % usr 6.23 5.76 (- 7.48%) 5.49 (-11.89%)
cpu % sys 31.76 27.99 (-11.85%) 28.29 (-10.93%)
bw MB/s 311.00 253.80 (-18.39%) 293.20 (- 5.72%)
'Original' is next-20250708 with the culprit commit reverted, 'after'
is unmodified linux-next.
While in the meantime I did change 'this commit' to use ktime rather than
jiffies for time limit calculation, we can still see the huge effects of
the two changes.
Bandwidth and avg IOPS are down between 13^ and 36% with the culprit commit.
Cheers,
Andre'
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2025-07-29 11:20 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-25 14:16 [PATCH v2 0/2] scsi: ufs: core: interrupt handling optimisations André Draszik
2025-07-25 14:16 ` [PATCH v2 1/2] scsi: ufs: core: complete polled requests also from interrupt context André Draszik
2025-07-25 14:16 ` [PATCH v2 2/2] scsi: ufs: core: move some irq handling back to hardirq (with time limit) André Draszik
2025-07-25 21:08 ` Bart Van Assche
2025-07-28 14:43 ` Neil Armstrong
2025-07-28 14:49 ` André Draszik
2025-07-28 15:19 ` Bart Van Assche
2025-07-28 16:55 ` Neil Armstrong
2025-07-29 11:20 ` André Draszik
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).