* [PATCH net 0/6] pds_core: updates and fixes
@ 2025-04-07 22:51 Shannon Nelson
2025-04-07 22:51 ` [PATCH net 1/6] pds_core: Prevent possible adminq overflow/stuck condition Shannon Nelson
` (5 more replies)
0 siblings, 6 replies; 17+ messages in thread
From: Shannon Nelson @ 2025-04-07 22:51 UTC
To: andrew+netdev, brett.creeley, davem, edumazet, kuba, pabeni,
michal.swiatkowski, linux-kernel, netdev
Cc: Shannon Nelson
This patchset has fixes for issues seen in recent internal testing
of error conditions and stress handling.
Note that the first patch in this series is a leftover from an
earlier patchset that was abandoned:
Link: https://lore.kernel.org/netdev/20250129004337.36898-2-shannon.nelson@amd.com/
Brett Creeley (3):
pds_core: Prevent possible adminq overflow/stuck condition
pds_core: handle unsupported PDS_CORE_CMD_FW_CONTROL result
pds_core: Remove unnecessary check in pds_client_adminq_cmd()
Shannon Nelson (3):
pds_core: remove extra name description
pds_core: smaller adminq poll starting interval
pds_core: make wait_context part of q_info
drivers/net/ethernet/amd/pds_core/adminq.c | 27 +++++++--------------
drivers/net/ethernet/amd/pds_core/auxbus.c | 3 ---
drivers/net/ethernet/amd/pds_core/core.c | 5 +---
drivers/net/ethernet/amd/pds_core/core.h | 9 +++++--
drivers/net/ethernet/amd/pds_core/devlink.c | 4 +--
include/linux/pds/pds_adminq.h | 3 +--
6 files changed, 19 insertions(+), 32 deletions(-)
--
2.17.1
* [PATCH net 1/6] pds_core: Prevent possible adminq overflow/stuck condition
2025-04-07 22:51 [PATCH net 0/6] pds_core: updates and fixes Shannon Nelson
@ 2025-04-07 22:51 ` Shannon Nelson
2025-04-09 9:37 ` Simon Horman
2025-04-07 22:51 ` [PATCH net 2/6] pds_core: remove extra name description Shannon Nelson
` (4 subsequent siblings)
5 siblings, 1 reply; 17+ messages in thread
From: Shannon Nelson @ 2025-04-07 22:51 UTC
To: andrew+netdev, brett.creeley, davem, edumazet, kuba, pabeni,
michal.swiatkowski, linux-kernel, netdev
Cc: Shannon Nelson
From: Brett Creeley <brett.creeley@amd.com>
The pds_core's adminq is protected by the adminq_lock, which prevents
more than one command from being posted onto it at any one time. This
means client drivers cannot simultaneously post adminq commands.
However, the completions happen in a different context, which means
multiple adminq commands can be posted sequentially and all be waiting
on completion.
On the FW side, the backing adminq request queue is only 16 entries
long and the retry mechanism and/or overflow/stuck prevention is
lacking. This can cause the adminq to get stuck, so commands are no
longer processed and completions are no longer sent by the FW.
As an initial fix, prevent more than 16 outstanding adminq commands so
there's no way to cause the adminq to get stuck. This works
because the backing adminq request queue will never have more than 16
pending adminq commands, so it will never overflow. This is done by
reducing the adminq depth to 16.
Fixes: 792d36ccc163 ("pds_core: Clean up init/uninit flows to be more readable")
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
---
drivers/net/ethernet/amd/pds_core/core.c | 5 +----
drivers/net/ethernet/amd/pds_core/core.h | 2 +-
2 files changed, 2 insertions(+), 5 deletions(-)
diff --git a/drivers/net/ethernet/amd/pds_core/core.c b/drivers/net/ethernet/amd/pds_core/core.c
index 1eb0d92786f7..55163457f12b 100644
--- a/drivers/net/ethernet/amd/pds_core/core.c
+++ b/drivers/net/ethernet/amd/pds_core/core.c
@@ -325,10 +325,7 @@ static int pdsc_core_init(struct pdsc *pdsc)
size_t sz;
int err;
- /* Scale the descriptor ring length based on number of CPUs and VFs */
- numdescs = max_t(int, PDSC_ADMINQ_MIN_LENGTH, num_online_cpus());
- numdescs += 2 * pci_sriov_get_totalvfs(pdsc->pdev);
- numdescs = roundup_pow_of_two(numdescs);
+ numdescs = PDSC_ADMINQ_MAX_LENGTH;
err = pdsc_qcq_alloc(pdsc, PDS_CORE_QTYPE_ADMINQ, 0, "adminq",
PDS_CORE_QCQ_F_CORE | PDS_CORE_QCQ_F_INTR,
numdescs,
diff --git a/drivers/net/ethernet/amd/pds_core/core.h b/drivers/net/ethernet/amd/pds_core/core.h
index 0bf320c43083..199473112c29 100644
--- a/drivers/net/ethernet/amd/pds_core/core.h
+++ b/drivers/net/ethernet/amd/pds_core/core.h
@@ -16,7 +16,7 @@
#define PDSC_WATCHDOG_SECS 5
#define PDSC_QUEUE_NAME_MAX_SZ 16
-#define PDSC_ADMINQ_MIN_LENGTH 16 /* must be a power of two */
+#define PDSC_ADMINQ_MAX_LENGTH 16 /* must be a power of two */
#define PDSC_NOTIFYQ_LENGTH 64 /* must be a power of two */
#define PDSC_TEARDOWN_RECOVERY false
#define PDSC_TEARDOWN_REMOVING true
--
2.17.1
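Editor's sketch for patch 1/6: a minimal userspace model, not the driver
code, of how capping the ring depth bounds the number of commands that
can be outstanding at once. The names ring, ring_post(), ring_complete()
and ring_post_many() are hypothetical; only the depth of 16 and the
power-of-two requirement come from the patch. Keeping one slot unused to
tell "full" from "empty" is an assumption of this model.

```c
#include <assert.h>
#include <stdbool.h>

#define RING_DEPTH 16u	/* mirrors PDSC_ADMINQ_MAX_LENGTH in the patch */

static_assert((RING_DEPTH & (RING_DEPTH - 1)) == 0,
	      "ring depth must be a power of two");

/* Hypothetical model of a descriptor ring; not the driver's queue struct. */
struct ring {
	unsigned int head;	/* next slot to post into */
	unsigned int tail;	/* oldest slot still awaiting completion */
};

/* Number of commands posted but not yet completed. */
static unsigned int ring_outstanding(const struct ring *r)
{
	return (r->head - r->tail) & (RING_DEPTH - 1);
}

/* Post one command; returns false when the ring is full. */
static bool ring_post(struct ring *r)
{
	/* One slot stays unused so that head == tail always means "empty". */
	if (ring_outstanding(r) == RING_DEPTH - 1)
		return false;
	r->head = (r->head + 1) & (RING_DEPTH - 1);
	return true;
}

/* Complete the oldest outstanding command; returns false when idle. */
static bool ring_complete(struct ring *r)
{
	if (r->head == r->tail)
		return false;
	r->tail = (r->tail + 1) & (RING_DEPTH - 1);
	return true;
}

/* Try to post n commands back to back; returns how many were accepted. */
static unsigned int ring_post_many(struct ring *r, unsigned int n)
{
	unsigned int accepted = 0;

	while (n-- && ring_post(r))
		accepted++;
	return accepted;
}
```

However many posts are attempted, the number outstanding in this model
never exceeds the ring depth, which is the property the patch relies on
to keep the 16-entry FW request queue from overflowing.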
* [PATCH net 2/6] pds_core: remove extra name description
2025-04-07 22:51 [PATCH net 0/6] pds_core: updates and fixes Shannon Nelson
2025-04-07 22:51 ` [PATCH net 1/6] pds_core: Prevent possible adminq overflow/stuck condition Shannon Nelson
@ 2025-04-07 22:51 ` Shannon Nelson
2025-04-09 9:41 ` Simon Horman
2025-04-07 22:51 ` [PATCH net 3/6] pds_core: handle unsupported PDS_CORE_CMD_FW_CONTROL result Shannon Nelson
` (3 subsequent siblings)
5 siblings, 1 reply; 17+ messages in thread
From: Shannon Nelson @ 2025-04-07 22:51 UTC
To: andrew+netdev, brett.creeley, davem, edumazet, kuba, pabeni,
michal.swiatkowski, linux-kernel, netdev
Cc: Shannon Nelson
Fix the kernel-doc complaint
include/linux/pds/pds_adminq.h:481: warning: Excess struct member 'name' description in 'pds_core_lif_getattr_comp'
Fixes: 45d76f492938 ("pds_core: set up device and adminq")
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
---
include/linux/pds/pds_adminq.h | 1 -
1 file changed, 1 deletion(-)
diff --git a/include/linux/pds/pds_adminq.h b/include/linux/pds/pds_adminq.h
index ddd111f04ca0..339156113fa5 100644
--- a/include/linux/pds/pds_adminq.h
+++ b/include/linux/pds/pds_adminq.h
@@ -463,7 +463,6 @@ struct pds_core_lif_getattr_cmd {
* @rsvd: Word boundary padding
* @comp_index: Index in the descriptor ring for which this is the completion
* @state: LIF state (enum pds_core_lif_state)
- * @name: LIF name string, 0 terminated
* @features: Features (enum pds_core_hw_features)
* @rsvd2: Word boundary padding
* @color: Color bit
--
2.17.1
* [PATCH net 3/6] pds_core: handle unsupported PDS_CORE_CMD_FW_CONTROL result
2025-04-07 22:51 [PATCH net 0/6] pds_core: updates and fixes Shannon Nelson
2025-04-07 22:51 ` [PATCH net 1/6] pds_core: Prevent possible adminq overflow/stuck condition Shannon Nelson
2025-04-07 22:51 ` [PATCH net 2/6] pds_core: remove extra name description Shannon Nelson
@ 2025-04-07 22:51 ` Shannon Nelson
2025-04-09 16:34 ` Simon Horman
2025-04-07 22:51 ` [PATCH net 4/6] pds_core: Remove unnecessary check in pds_client_adminq_cmd() Shannon Nelson
` (2 subsequent siblings)
5 siblings, 1 reply; 17+ messages in thread
From: Shannon Nelson @ 2025-04-07 22:51 UTC
To: andrew+netdev, brett.creeley, davem, edumazet, kuba, pabeni,
michal.swiatkowski, linux-kernel, netdev
Cc: Shannon Nelson
From: Brett Creeley <brett.creeley@amd.com>
If the FW doesn't support the PDS_CORE_CMD_FW_CONTROL command
the driver might at the least print garbage and at the worst
crash when the user runs the "devlink dev info" devlink command.
This happens because the stack variable fw_list is not 0
initialized which results in fw_list.num_fw_slots being a
garbage value from the stack. Then the driver tries to access
fw_list.fw_names[i] with i >= ARRAY_SIZE and runs off the end
of the array.
Fix this by initializing the fw_list and adding an ARRAY_SIZE
limiter to the loop, and by not failing completely if the
devcmd fails because other useful information is printed via
devlink dev info even if the devcmd fails.
Fixes: 45d76f492938 ("pds_core: set up device and adminq")
Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
---
drivers/net/ethernet/amd/pds_core/devlink.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/amd/pds_core/devlink.c b/drivers/net/ethernet/amd/pds_core/devlink.c
index c5c787df61a4..d8dc39da4161 100644
--- a/drivers/net/ethernet/amd/pds_core/devlink.c
+++ b/drivers/net/ethernet/amd/pds_core/devlink.c
@@ -105,7 +105,7 @@ int pdsc_dl_info_get(struct devlink *dl, struct devlink_info_req *req,
.fw_control.opcode = PDS_CORE_CMD_FW_CONTROL,
.fw_control.oper = PDS_CORE_FW_GET_LIST,
};
- struct pds_core_fw_list_info fw_list;
+ struct pds_core_fw_list_info fw_list = {};
struct pdsc *pdsc = devlink_priv(dl);
union pds_core_dev_comp comp;
char buf[32];
@@ -118,8 +118,6 @@ int pdsc_dl_info_get(struct devlink *dl, struct devlink_info_req *req,
if (!err)
memcpy_fromio(&fw_list, pdsc->cmd_regs->data, sizeof(fw_list));
mutex_unlock(&pdsc->devcmd_lock);
- if (err && err != -EIO)
- return err;
listlen = min(fw_list.num_fw_slots, ARRAY_SIZE(fw_list.fw_names));
for (i = 0; i < listlen; i++) {
--
2.17.1
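Editor's sketch for patch 3/6: a small userspace model, not the driver
code, of the two safeguards discussed in this patch: zero-initializing
the result struct so a failed query leaves a slot count of 0, and
clamping the loop bound to the array size. The struct layout,
MAX_FW_SLOTS, query_fw_list() and names_visited() are hypothetical
stand-ins. The sketch uses "= {0}" rather than the "= {}" form in the
diff because the empty-brace initializer is not standard C before C23.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))
#define MAX_FW_SLOTS 4	/* hypothetical; not the real firmware slot count */

/* Hypothetical stand-in for the firmware list structure. */
struct fw_list_info {
	unsigned char num_fw_slots;
	char fw_names[MAX_FW_SLOTS][16];
};

/*
 * Hypothetical device query. A negative reported_slots models firmware
 * that does not support the command: the output is left untouched.
 */
static int query_fw_list(struct fw_list_info *out, int reported_slots)
{
	int i;

	if (reported_slots < 0)
		return -1;

	out->num_fw_slots = (unsigned char)reported_slots;
	for (i = 0; i < MAX_FW_SLOTS && i < reported_slots; i++)
		snprintf(out->fw_names[i], sizeof(out->fw_names[i]), "fw%d", i);
	return 0;
}

/* Returns how many firmware names the reporting loop would visit. */
static size_t names_visited(int reported_slots)
{
	struct fw_list_info fw_list = {0};	/* failed query => 0 slots */
	size_t listlen, i, visited = 0;

	/* A failure is tolerated here; other info could still be reported. */
	(void)query_fw_list(&fw_list, reported_slots);

	/* Clamp the device-supplied count to the array size. */
	listlen = fw_list.num_fw_slots;
	if (listlen > ARRAY_SIZE(fw_list.fw_names))
		listlen = ARRAY_SIZE(fw_list.fw_names);

	for (i = 0; i < listlen; i++) {
		assert(i < ARRAY_SIZE(fw_list.fw_names));
		if (fw_list.fw_names[i][0] != '\0')
			visited++;
	}
	return visited;
}
```

In this model an unsupported command visits no names, and an oversized
slot count from the device is limited to the array length.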
* [PATCH net 4/6] pds_core: Remove unnecessary check in pds_client_adminq_cmd()
2025-04-07 22:51 [PATCH net 0/6] pds_core: updates and fixes Shannon Nelson
` (2 preceding siblings ...)
2025-04-07 22:51 ` [PATCH net 3/6] pds_core: handle unsupported PDS_CORE_CMD_FW_CONTROL result Shannon Nelson
@ 2025-04-07 22:51 ` Shannon Nelson
2025-04-09 17:07 ` Simon Horman
2025-04-07 22:51 ` [PATCH net 5/6] pds_core: smaller adminq poll starting interval Shannon Nelson
2025-04-07 22:51 ` [PATCH net 6/6] pds_core: make wait_context part of q_info Shannon Nelson
5 siblings, 1 reply; 17+ messages in thread
From: Shannon Nelson @ 2025-04-07 22:51 UTC
To: andrew+netdev, brett.creeley, davem, edumazet, kuba, pabeni,
michal.swiatkowski, linux-kernel, netdev
Cc: Shannon Nelson
From: Brett Creeley <brett.creeley@amd.com>
When the pds_core driver was first created there were some race
conditions around using the adminq, especially for client drivers.
To reduce the possibility of a race condition there's a check
against pf->state in pds_client_adminq_cmd(). This is problematic
for a couple of reasons:
1. The PDSC_S_INITING_DRIVER bit is set during probe, but not
cleared until after everything in probe is complete, which
includes creating the auxiliary devices. For pds_fwctl this
means it can't make any adminq commands until after pds_core's
probe is complete even though the adminq is fully up by the
time pds_fwctl's auxiliary device is created.
2. The race conditions around using the adminq have been fixed
and this path is already protected against client drivers
calling pds_client_adminq_cmd() if the adminq isn't ready,
i.e. see pdsc_adminq_post() -> pdsc_adminq_inc_if_up().
Fix this by removing the pf->state check in pds_client_adminq_cmd()
because invalid accesses to pds_core's adminq are already handled by
pdsc_adminq_post()->pdsc_adminq_inc_if_up().
Fixes: 10659034c622 ("pds_core: add the aux client API")
Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
---
drivers/net/ethernet/amd/pds_core/auxbus.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/drivers/net/ethernet/amd/pds_core/auxbus.c b/drivers/net/ethernet/amd/pds_core/auxbus.c
index eeb72b1809ea..c9aac27883a3 100644
--- a/drivers/net/ethernet/amd/pds_core/auxbus.c
+++ b/drivers/net/ethernet/amd/pds_core/auxbus.c
@@ -107,9 +107,6 @@ int pds_client_adminq_cmd(struct pds_auxiliary_dev *padev,
dev_dbg(pf->dev, "%s: %s opcode %d\n",
__func__, dev_name(&padev->aux_dev.dev), req->opcode);
- if (pf->state)
- return -ENXIO;
-
/* Wrap the client's request */
cmd.client_request.opcode = PDS_AQ_CMD_CLIENT_CMD;
cmd.client_request.client_id = cpu_to_le16(padev->client_id);
--
2.17.1
* [PATCH net 5/6] pds_core: smaller adminq poll starting interval
2025-04-07 22:51 [PATCH net 0/6] pds_core: updates and fixes Shannon Nelson
` (3 preceding siblings ...)
2025-04-07 22:51 ` [PATCH net 4/6] pds_core: Remove unnecessary check in pds_client_adminq_cmd() Shannon Nelson
@ 2025-04-07 22:51 ` Shannon Nelson
2025-04-09 16:50 ` Simon Horman
2025-04-07 22:51 ` [PATCH net 6/6] pds_core: make wait_context part of q_info Shannon Nelson
5 siblings, 1 reply; 17+ messages in thread
From: Shannon Nelson @ 2025-04-07 22:51 UTC
To: andrew+netdev, brett.creeley, davem, edumazet, kuba, pabeni,
michal.swiatkowski, linux-kernel, netdev
Cc: Shannon Nelson
Shorten the adminq poll starting interval in order to speed
up the transaction response time.
Fixes: 01ba61b55b20 ("pds_core: Add adminq processing and commands")
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
---
drivers/net/ethernet/amd/pds_core/adminq.c | 4 ++--
include/linux/pds/pds_adminq.h | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/amd/pds_core/adminq.c b/drivers/net/ethernet/amd/pds_core/adminq.c
index c83a0a80d533..2e840112efea 100644
--- a/drivers/net/ethernet/amd/pds_core/adminq.c
+++ b/drivers/net/ethernet/amd/pds_core/adminq.c
@@ -235,7 +235,7 @@ int pdsc_adminq_post(struct pdsc *pdsc,
.wait_completion =
COMPLETION_INITIALIZER_ONSTACK(wc.wait_completion),
};
- unsigned long poll_interval = 1;
+ unsigned long poll_interval = 200;
unsigned long poll_jiffies;
unsigned long time_limit;
unsigned long time_start;
@@ -261,7 +261,7 @@ int pdsc_adminq_post(struct pdsc *pdsc,
time_limit = time_start + HZ * pdsc->devcmd_timeout;
do {
/* Timeslice the actual wait to catch IO errors etc early */
- poll_jiffies = msecs_to_jiffies(poll_interval);
+ poll_jiffies = usecs_to_jiffies(poll_interval);
remaining = wait_for_completion_timeout(&wc.wait_completion,
poll_jiffies);
if (remaining)
diff --git a/include/linux/pds/pds_adminq.h b/include/linux/pds/pds_adminq.h
index 339156113fa5..40ff0ec2b879 100644
--- a/include/linux/pds/pds_adminq.h
+++ b/include/linux/pds/pds_adminq.h
@@ -4,7 +4,7 @@
#ifndef _PDS_CORE_ADMINQ_H_
#define _PDS_CORE_ADMINQ_H_
-#define PDSC_ADMINQ_MAX_POLL_INTERVAL 256
+#define PDSC_ADMINQ_MAX_POLL_INTERVAL 256000 /* usecs */
enum pds_core_adminq_flags {
PDS_AQ_FLAG_FASTPOLL = BIT(1), /* completion poll at 1ms */
--
2.17.1
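Editor's sketch for patch 5/6: a userspace illustration of how a poll
interval that starts at 200 usecs could grow toward the 256000 usec cap.
The two constants come from the diff; the doubling step is an assumption
of this sketch, since the backoff code itself is not part of the patch,
and next_poll_interval() and steps_until_cap() are hypothetical helpers.

```c
#include <assert.h>

#define START_POLL_USECS 200u		/* new starting interval in the diff */
#define MAX_POLL_USECS   256000u	/* PDSC_ADMINQ_MAX_POLL_INTERVAL, usecs */

static_assert(START_POLL_USECS <= MAX_POLL_USECS,
	      "starting interval must not exceed the cap");

/* Assumed backoff step: double the interval, never exceeding the cap. */
static unsigned int next_poll_interval(unsigned int cur)
{
	if (cur < MAX_POLL_USECS)
		cur <<= 1;
	return cur > MAX_POLL_USECS ? MAX_POLL_USECS : cur;
}

/* Number of backoff steps needed to reach the cap from the start value. */
static unsigned int steps_until_cap(void)
{
	unsigned int interval = START_POLL_USECS;
	unsigned int steps = 0;

	while (interval < MAX_POLL_USECS) {
		interval = next_poll_interval(interval);
		steps++;
	}
	return steps;
}
```

Under the doubling assumption the sequence runs 200, 400, 800, and so on
up to 204800, and is then clamped to 256000.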
* [PATCH net 6/6] pds_core: make wait_context part of q_info
2025-04-07 22:51 [PATCH net 0/6] pds_core: updates and fixes Shannon Nelson
` (4 preceding siblings ...)
2025-04-07 22:51 ` [PATCH net 5/6] pds_core: smaller adminq poll starting interval Shannon Nelson
@ 2025-04-07 22:51 ` Shannon Nelson
2025-04-09 17:06 ` Simon Horman
5 siblings, 1 reply; 17+ messages in thread
From: Shannon Nelson @ 2025-04-07 22:51 UTC
To: andrew+netdev, brett.creeley, davem, edumazet, kuba, pabeni,
michal.swiatkowski, linux-kernel, netdev
Cc: Shannon Nelson
Make the wait_context a full part of the q_info struct rather
than a stack variable that goes away after pdsc_adminq_post()
is done so that the context is still available after the wait
loop has given up.
There was a case where a slow development firmware caused
the adminq request to time out, but then later the FW finally
finished the request and sent the interrupt. The handler tried
to complete_all() the completion context that had been created
on the stack in pdsc_adminq_post() but no longer existed.
This caused bad pointer usage, kernel crashes, and much wailing
and gnashing of teeth.
Fixes: 01ba61b55b20 ("pds_core: Add adminq processing and commands")
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
---
drivers/net/ethernet/amd/pds_core/adminq.c | 23 +++++++---------------
drivers/net/ethernet/amd/pds_core/core.h | 7 ++++++-
2 files changed, 13 insertions(+), 17 deletions(-)
diff --git a/drivers/net/ethernet/amd/pds_core/adminq.c b/drivers/net/ethernet/amd/pds_core/adminq.c
index 2e840112efea..86a6371e5821 100644
--- a/drivers/net/ethernet/amd/pds_core/adminq.c
+++ b/drivers/net/ethernet/amd/pds_core/adminq.c
@@ -5,11 +5,6 @@
#include "core.h"
-struct pdsc_wait_context {
- struct pdsc_qcq *qcq;
- struct completion wait_completion;
-};
-
static int pdsc_process_notifyq(struct pdsc_qcq *qcq)
{
union pds_core_notifyq_comp *comp;
@@ -112,7 +107,7 @@ void pdsc_process_adminq(struct pdsc_qcq *qcq)
/* Copy out the completion data */
memcpy(q_info->dest, comp, sizeof(*comp));
- complete_all(&q_info->wc->wait_completion);
+ complete_all(&q_info->wc.wait_completion);
if (cq->tail_idx == cq->num_descs - 1)
cq->done_color = !cq->done_color;
@@ -162,8 +157,7 @@ irqreturn_t pdsc_adminq_isr(int irq, void *data)
static int __pdsc_adminq_post(struct pdsc *pdsc,
struct pdsc_qcq *qcq,
union pds_core_adminq_cmd *cmd,
- union pds_core_adminq_comp *comp,
- struct pdsc_wait_context *wc)
+ union pds_core_adminq_comp *comp)
{
struct pdsc_queue *q = &qcq->q;
struct pdsc_q_info *q_info;
@@ -205,7 +199,6 @@ static int __pdsc_adminq_post(struct pdsc *pdsc,
/* Post the request */
index = q->head_idx;
q_info = &q->info[index];
- q_info->wc = wc;
q_info->dest = comp;
memcpy(q_info->desc, cmd, sizeof(*cmd));
@@ -231,11 +224,8 @@ int pdsc_adminq_post(struct pdsc *pdsc,
union pds_core_adminq_comp *comp,
bool fast_poll)
{
- struct pdsc_wait_context wc = {
- .wait_completion =
- COMPLETION_INITIALIZER_ONSTACK(wc.wait_completion),
- };
unsigned long poll_interval = 200;
+ struct pdsc_wait_context *wc;
unsigned long poll_jiffies;
unsigned long time_limit;
unsigned long time_start;
@@ -250,19 +240,20 @@ int pdsc_adminq_post(struct pdsc *pdsc,
return -ENXIO;
}
- wc.qcq = &pdsc->adminqcq;
- index = __pdsc_adminq_post(pdsc, &pdsc->adminqcq, cmd, comp, &wc);
+ index = __pdsc_adminq_post(pdsc, &pdsc->adminqcq, cmd, comp);
if (index < 0) {
err = index;
goto err_out;
}
+ wc = &pdsc->adminqcq.q.info[index].wc;
+ wc->wait_completion = COMPLETION_INITIALIZER_ONSTACK(wc->wait_completion);
time_start = jiffies;
time_limit = time_start + HZ * pdsc->devcmd_timeout;
do {
/* Timeslice the actual wait to catch IO errors etc early */
poll_jiffies = usecs_to_jiffies(poll_interval);
- remaining = wait_for_completion_timeout(&wc.wait_completion,
+ remaining = wait_for_completion_timeout(&wc->wait_completion,
poll_jiffies);
if (remaining)
break;
diff --git a/drivers/net/ethernet/amd/pds_core/core.h b/drivers/net/ethernet/amd/pds_core/core.h
index 199473112c29..84fd814d7904 100644
--- a/drivers/net/ethernet/amd/pds_core/core.h
+++ b/drivers/net/ethernet/amd/pds_core/core.h
@@ -88,6 +88,11 @@ struct pdsc_buf_info {
u32 len;
};
+struct pdsc_wait_context {
+ struct pdsc_qcq *qcq;
+ struct completion wait_completion;
+};
+
struct pdsc_q_info {
union {
void *desc;
@@ -96,7 +101,7 @@ struct pdsc_q_info {
unsigned int bytes;
unsigned int nbufs;
struct pdsc_buf_info bufs[PDS_CORE_MAX_FRAGS];
- struct pdsc_wait_context *wc;
+ struct pdsc_wait_context wc;
void *dest;
};
--
2.17.1
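Editor's sketch for patch 6/6: a userspace model, not the driver code, of
why embedding the wait context in the per-slot queue info avoids the
dangling pointer described in the commit message. A plain bool stands in
for the kernel's struct completion, and all names here (queue, q_info,
post_cmd(), handle_completion(), post_and_time_out()) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define QDEPTH 16u

/* Stand-in for the wait context; a bool replaces struct completion. */
struct wait_context {
	bool done;
};

/* The context is embedded, so it lives as long as the queue itself. */
struct q_info {
	struct wait_context wc;
};

struct queue {
	struct q_info info[QDEPTH];
};

/* Poster: re-arm the slot's embedded context and return a pointer to it. */
static struct wait_context *post_cmd(struct queue *q, unsigned int index)
{
	assert(index < QDEPTH);
	q->info[index].wc.done = false;
	return &q->info[index].wc;
}

/* Completion handler: may run long after the poster has given up. */
static void handle_completion(struct queue *q, unsigned int index)
{
	assert(index < QDEPTH);
	q->info[index].wc.done = true;	/* target memory is still valid */
}

/*
 * Models a poster that times out before the completion arrives.
 * Returns whether the completion had been seen when it gave up.
 */
static bool post_and_time_out(struct queue *q, unsigned int index)
{
	struct wait_context *wc = post_cmd(q, index);

	return wc->done;	/* false: the "firmware" has not answered */
}
```

Because the handler writes into storage owned by the queue, a completion
that arrives after the poster has returned touches valid memory in this
model instead of a stack frame that no longer exists.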
* Re: [PATCH net 1/6] pds_core: Prevent possible adminq overflow/stuck condition
2025-04-07 22:51 ` [PATCH net 1/6] pds_core: Prevent possible adminq overflow/stuck condition Shannon Nelson
@ 2025-04-09 9:37 ` Simon Horman
2025-04-09 23:32 ` Nelson, Shannon
0 siblings, 1 reply; 17+ messages in thread
From: Simon Horman @ 2025-04-09 9:37 UTC
To: Shannon Nelson
Cc: andrew+netdev, brett.creeley, davem, edumazet, kuba, pabeni,
michal.swiatkowski, linux-kernel, netdev
On Mon, Apr 07, 2025 at 03:51:08PM -0700, Shannon Nelson wrote:
> From: Brett Creeley <brett.creeley@amd.com>
>
> The pds_core's adminq is protected by the adminq_lock, which prevents
> more than one command from being posted onto it at any one time. This
> means client drivers cannot simultaneously post adminq commands.
> However, the completions happen in a different context, which means
> multiple adminq commands can be posted sequentially and all be waiting
> on completion.
>
> On the FW side, the backing adminq request queue is only 16 entries
> long and the retry mechanism and/or overflow/stuck prevention is
> lacking. This can cause the adminq to get stuck, so commands are no
> longer processed and completions are no longer sent by the FW.
>
> As an initial fix, prevent more than 16 outstanding adminq commands so
> there's no way to cause the adminq to get stuck. This works
> because the backing adminq request queue will never have more than 16
> pending adminq commands, so it will never overflow. This is done by
> reducing the adminq depth to 16.
>
> Fixes: 792d36ccc163 ("pds_core: Clean up init/uninit flows to be more readable")
Hi Brett and Shannon,
I see that the cited commit added the lines that are being updated
to pdsc_core_init(). But it seems to me that it did so by moving
them from pdsc_setup(). So I wonder if it is actually the commit
that added the code to pdsc_setup() that is being fixed.
If so, perhaps:
Fixes: 45d76f492938 ("pds_core: set up device and adminq")
> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
> Signed-off-by: Brett Creeley <brett.creeley@amd.com>
> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
...
* Re: [PATCH net 2/6] pds_core: remove extra name description
2025-04-07 22:51 ` [PATCH net 2/6] pds_core: remove extra name description Shannon Nelson
@ 2025-04-09 9:41 ` Simon Horman
0 siblings, 0 replies; 17+ messages in thread
From: Simon Horman @ 2025-04-09 9:41 UTC
To: Shannon Nelson
Cc: andrew+netdev, brett.creeley, davem, edumazet, kuba, pabeni,
michal.swiatkowski, linux-kernel, netdev
On Mon, Apr 07, 2025 at 03:51:09PM -0700, Shannon Nelson wrote:
> Fix the kernel-doc complaint
> include/linux/pds/pds_adminq.h:481: warning: Excess struct member 'name' description in 'pds_core_lif_getattr_comp'
>
> Fixes: 45d76f492938 ("pds_core: set up device and adminq")
> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
FWIW, I'm of two minds about this a) having a Fixes tag and b) being for
net. But I agree that it is a good change. And not worth nit-pick energy.
Reviewed-by: Simon Horman <horms@kernel.org>
* Re: [PATCH net 3/6] pds_core: handle unsupported PDS_CORE_CMD_FW_CONTROL result
2025-04-07 22:51 ` [PATCH net 3/6] pds_core: handle unsupported PDS_CORE_CMD_FW_CONTROL result Shannon Nelson
@ 2025-04-09 16:34 ` Simon Horman
2025-04-09 23:35 ` Nelson, Shannon
0 siblings, 1 reply; 17+ messages in thread
From: Simon Horman @ 2025-04-09 16:34 UTC
To: Shannon Nelson
Cc: andrew+netdev, brett.creeley, davem, edumazet, kuba, pabeni,
michal.swiatkowski, linux-kernel, netdev
On Mon, Apr 07, 2025 at 03:51:10PM -0700, Shannon Nelson wrote:
> From: Brett Creeley <brett.creeley@amd.com>
>
> If the FW doesn't support the PDS_CORE_CMD_FW_CONTROL command
> the driver might at the least print garbage and at the worst
> crash when the user runs the "devlink dev info" devlink command.
>
> This happens because the stack variable fw_list is not 0
> initialized which results in fw_list.num_fw_slots being a
> garbage value from the stack. Then the driver tries to access
> fw_list.fw_names[i] with i >= ARRAY_SIZE and runs off the end
> of the array.
>
> Fix this by initializing the fw_list and adding an ARRAY_SIZE
> limiter to the loop, and by not failing completely if the
> devcmd fails because other useful information is printed via
> devlink dev info even if the devcmd fails.
Hi Brett, and Shannon,
It looks like the ARRAY_SIZE limiter on the loop exists since
commit 8c817eb26230 ("pds_core: limit loop over fw name list").
And, if so, I think the patch description should be reworked a bit.
>
> Fixes: 45d76f492938 ("pds_core: set up device and adminq")
> Signed-off-by: Brett Creeley <brett.creeley@amd.com>
> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
...
* Re: [PATCH net 5/6] pds_core: smaller adminq poll starting interval
2025-04-07 22:51 ` [PATCH net 5/6] pds_core: smaller adminq poll starting interval Shannon Nelson
@ 2025-04-09 16:50 ` Simon Horman
2025-04-09 23:28 ` Nelson, Shannon
0 siblings, 1 reply; 17+ messages in thread
From: Simon Horman @ 2025-04-09 16:50 UTC
To: Shannon Nelson
Cc: andrew+netdev, brett.creeley, davem, edumazet, kuba, pabeni,
michal.swiatkowski, linux-kernel, netdev
On Mon, Apr 07, 2025 at 03:51:12PM -0700, Shannon Nelson wrote:
> Shorten the adminq poll starting interval in order to speed
> up the transaction response time.
Hi Shannon,
I think this warrants some further explanation as to why this is a bug fix.
>
> Fixes: 01ba61b55b20 ("pds_core: Add adminq processing and commands")
> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
...
* Re: [PATCH net 6/6] pds_core: make wait_context part of q_info
2025-04-07 22:51 ` [PATCH net 6/6] pds_core: make wait_context part of q_info Shannon Nelson
@ 2025-04-09 17:06 ` Simon Horman
0 siblings, 0 replies; 17+ messages in thread
From: Simon Horman @ 2025-04-09 17:06 UTC
To: Shannon Nelson
Cc: andrew+netdev, brett.creeley, davem, edumazet, kuba, pabeni,
michal.swiatkowski, linux-kernel, netdev
On Mon, Apr 07, 2025 at 03:51:13PM -0700, Shannon Nelson wrote:
> Make the wait_context a full part of the q_info struct rather
> than a stack variable that goes away after pdsc_adminq_post()
> is done so that the context is still available after the wait
> loop has given up.
>
> There was a case where a slow development firmware caused
> the adminq request to time out, but then later the FW finally
> finished the request and sent the interrupt. The handler tried
> to complete_all() the completion context that had been created
> on the stack in pdsc_adminq_post() but no longer existed.
> This caused bad pointer usage, kernel crashes, and much wailing
> and gnashing of teeth.
>
> Fixes: 01ba61b55b20 ("pds_core: Add adminq processing and commands")
> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Reviewed-by: Simon Horman <horms@kernel.org>
* Re: [PATCH net 4/6] pds_core: Remove unnecessary check in pds_client_adminq_cmd()
2025-04-07 22:51 ` [PATCH net 4/6] pds_core: Remove unnecessary check in pds_client_adminq_cmd() Shannon Nelson
@ 2025-04-09 17:07 ` Simon Horman
0 siblings, 0 replies; 17+ messages in thread
From: Simon Horman @ 2025-04-09 17:07 UTC
To: Shannon Nelson
Cc: andrew+netdev, brett.creeley, davem, edumazet, kuba, pabeni,
michal.swiatkowski, linux-kernel, netdev
On Mon, Apr 07, 2025 at 03:51:11PM -0700, Shannon Nelson wrote:
> From: Brett Creeley <brett.creeley@amd.com>
>
> When the pds_core driver was first created there were some race
> conditions around using the adminq, especially for client drivers.
> To reduce the possibility of a race condition there's a check
> against pf->state in pds_client_adminq_cmd(). This is problematic
> for a couple of reasons:
>
> 1. The PDSC_S_INITING_DRIVER bit is set during probe, but not
> cleared until after everything in probe is complete, which
> includes creating the auxiliary devices. For pds_fwctl this
> means it can't make any adminq commands until after pds_core's
> probe is complete even though the adminq is fully up by the
> time pds_fwctl's auxiliary device is created.
>
> 2. The race conditions around using the adminq have been fixed
> and this path is already protected against client drivers
> calling pds_client_adminq_cmd() if the adminq isn't ready,
> i.e. see pdsc_adminq_post() -> pdsc_adminq_inc_if_up().
>
> Fix this by removing the pf->state check in pds_client_adminq_cmd()
> because invalid accesses to pds_core's adminq are already handled by
> pdsc_adminq_post()->pdsc_adminq_inc_if_up().
>
> Fixes: 10659034c622 ("pds_core: add the aux client API")
I'm assuming that backporting this patch that far only
makes sense if other fixes have been backported too,
and that their Fixes tags should enable that to happen.
If so, this seems fine to me.
> Signed-off-by: Brett Creeley <brett.creeley@amd.com>
> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Reviewed-by: Simon Horman <horms@kernel.org>
* Re: [PATCH net 5/6] pds_core: smaller adminq poll starting interval
2025-04-09 16:50 ` Simon Horman
@ 2025-04-09 23:28 ` Nelson, Shannon
0 siblings, 0 replies; 17+ messages in thread
From: Nelson, Shannon @ 2025-04-09 23:28 UTC
To: Simon Horman
Cc: andrew+netdev, brett.creeley, davem, edumazet, kuba, pabeni,
michal.swiatkowski, linux-kernel, netdev
On 4/9/2025 9:50 AM, Simon Horman wrote:
>
> On Mon, Apr 07, 2025 at 03:51:12PM -0700, Shannon Nelson wrote:
>> Shorten the adminq poll starting interval in order to speed
>> up the transaction response time.
>
> Hi Shannon,
>
> I think this warrants some further explanation as to why this is a bug fix.
I suppose this does look more like an error handling performance
enhancement rather than a bug fix. I can pull this out and re-submit
for net-next.
Thanks,
sln
>
>>
>> Fixes: 01ba61b55b20 ("pds_core: Add adminq processing and commands")
>> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
>
> ...
* Re: [PATCH net 1/6] pds_core: Prevent possible adminq overflow/stuck condition
2025-04-09 9:37 ` Simon Horman
@ 2025-04-09 23:32 ` Nelson, Shannon
2025-04-11 18:59 ` Simon Horman
0 siblings, 1 reply; 17+ messages in thread
From: Nelson, Shannon @ 2025-04-09 23:32 UTC
To: Simon Horman
Cc: andrew+netdev, brett.creeley, davem, edumazet, kuba, pabeni,
michal.swiatkowski, linux-kernel, netdev
On 4/9/2025 2:37 AM, Simon Horman wrote:
>
> On Mon, Apr 07, 2025 at 03:51:08PM -0700, Shannon Nelson wrote:
>> From: Brett Creeley <brett.creeley@amd.com>
>>
>> The pds_core's adminq is protected by the adminq_lock, which prevents
>> more than one command from being posted onto it at any one time. This
>> means client drivers cannot simultaneously post adminq commands.
>> However, the completions happen in a different context, which means
>> multiple adminq commands can be posted sequentially and all be waiting
>> on completion.
>>
>> On the FW side, the backing adminq request queue is only 16 entries
>> long and the retry mechanism and/or overflow/stuck prevention is
>> lacking. This can cause the adminq to get stuck, so commands are no
>> longer processed and completions are no longer sent by the FW.
>>
>> As an initial fix, prevent more than 16 outstanding adminq commands so
>> there's no way to cause the adminq to get stuck. This works
>> because the backing adminq request queue will never have more than 16
>> pending adminq commands, so it will never overflow. This is done by
>> reducing the adminq depth to 16.
>>
>> Fixes: 792d36ccc163 ("pds_core: Clean up init/uninit flows to be more readable")
>
> Hi Brett and Shannon,
>
> I see that the cited commit added the lines that are being updated
> to pdsc_core_init(). But it seems to me that it did so by moving
> them from pdsc_setup(). So I wonder if it is actually the commit
> that added the code to pdsc_setup() that is being fixed.
>
> If so, perhaps:
>
> Fixes: 45d76f492938 ("pds_core: set up device and adminq")
Perhaps... is it better to call out the older commit even tho' lines
have moved around and this possibly won't apply?
sln
>
>> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
>> Signed-off-by: Brett Creeley <brett.creeley@amd.com>
>> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
>
> ...
* Re: [PATCH net 3/6] pds_core: handle unsupported PDS_CORE_CMD_FW_CONTROL result
2025-04-09 16:34 ` Simon Horman
@ 2025-04-09 23:35 ` Nelson, Shannon
0 siblings, 0 replies; 17+ messages in thread
From: Nelson, Shannon @ 2025-04-09 23:35 UTC (permalink / raw)
To: Simon Horman
Cc: andrew+netdev, brett.creeley, davem, edumazet, kuba, pabeni,
michal.swiatkowski, linux-kernel, netdev
On 4/9/2025 9:34 AM, Simon Horman wrote:
>
> On Mon, Apr 07, 2025 at 03:51:10PM -0700, Shannon Nelson wrote:
>> From: Brett Creeley <brett.creeley@amd.com>
>>
>> If the FW doesn't support the PDS_CORE_CMD_FW_CONTROL command
>> the driver might at the least print garbage and at the worst
>> crash when the user runs the "devlink dev info" devlink command.
>>
>> This happens because the stack variable fw_list is not
>> zero-initialized, which results in fw_list.num_fw_slots being a
>> garbage value from the stack. Then the driver tries to access
>> fw_list.fw_names[i] with i >= ARRAY_SIZE and runs off the end
>> of the array.
>>
>> Fix this by initializing the fw_list and adding an ARRAY_SIZE
>> limiter to the loop, and by not failing completely if the
>> devcmd fails because other useful information is printed via
>> devlink dev info even if the devcmd fails.
>
> Hi Brett, and Shannon,
>
> It looks like the ARRAY_SIZE limiter on the loop exists since
> commit 8c817eb26230 ("pds_core: limit loop over fw name list").
> And, if so, I think the patch description should be reworked a bit.
Yes, you're right... that's what I get for pushing patches out while
Brett is on vacation. I'll trim that up.
sln
>
>>
>> Fixes: 45d76f492938 ("pds_core: set up device and adminq")
>> Signed-off-by: Brett Creeley <brett.creeley@amd.com>
>> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
>
> ...
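
Editorial sketch, not code from the patch: the failure mode described in the quoted commit message is an uninitialized stack structure whose slot count is trusted after a command fails. The user-space model below shows the two guards discussed in this message: zeroing the structure and clamping the loop bound to the array size. As Simon notes, the loop limiter is already in the driver; it appears here only to show both guards together. The structure layout and every sketch_* name are assumptions for illustration, not the driver's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical layout; the real fw_list structure is not reproduced here. */
struct sketch_fw_list {
	unsigned char num_fw_slots;
	char fw_names[4][16];
};

/* Stand-in for a devcmd the FW does not support: leaves *list untouched. */
static int sketch_unsupported_cmd(struct sketch_fw_list *list)
{
	(void)list;
	return -1;
}

/* Stand-in for a FW that reports more slots than the array can hold. */
static int sketch_overreporting_cmd(struct sketch_fw_list *list)
{
	list->num_fw_slots = 200;
	return 0;
}

/* Returns how many fw_names entries are safe to read after the command. */
static size_t sketch_count_slots(int (*cmd)(struct sketch_fw_list *))
{
	struct sketch_fw_list fw_list;
	size_t n;

	memset(&fw_list, 0, sizeof(fw_list));	/* no stack garbage on failure */
	(void)cmd(&fw_list);	/* a failure is tolerated, not propagated */

	n = fw_list.num_fw_slots;
	if (n > SKETCH_ARRAY_SIZE(fw_list.fw_names))
		n = SKETCH_ARRAY_SIZE(fw_list.fw_names);
	assert(n <= SKETCH_ARRAY_SIZE(fw_list.fw_names));
	return n;
}
```

With the zeroed structure, a rejected command yields zero slots instead of a garbage count, and an over-large count from the FW is limited to the four entries that exist in this model.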
* Re: [PATCH net 1/6] pds_core: Prevent possible adminq overflow/stuck condition
2025-04-09 23:32 ` Nelson, Shannon
@ 2025-04-11 18:59 ` Simon Horman
0 siblings, 0 replies; 17+ messages in thread
From: Simon Horman @ 2025-04-11 18:59 UTC (permalink / raw)
To: Nelson, Shannon
Cc: andrew+netdev, brett.creeley, davem, edumazet, kuba, pabeni,
michal.swiatkowski, linux-kernel, netdev
On Wed, Apr 09, 2025 at 04:32:26PM -0700, Nelson, Shannon wrote:
> On 4/9/2025 2:37 AM, Simon Horman wrote:
> >
> > On Mon, Apr 07, 2025 at 03:51:08PM -0700, Shannon Nelson wrote:
> > > From: Brett Creeley <brett.creeley@amd.com>
> > >
> > > The pds_core's adminq is protected by the adminq_lock, which prevents
> > > more than one command from being posted onto it at any one time. This
> > > means the client drivers cannot simultaneously post adminq commands.
> > > However, the completions happen in a different context, which means
> > > multiple adminq commands can be posted sequentially and all be waiting
> > > on completion.
> > >
> > > On the FW side, the backing adminq request queue is only 16 entries
> > > long and the retry mechanism and/or overflow/stuck prevention is
> > > lacking. This can cause the adminq to get stuck, so commands are no
> > > longer processed and completions are no longer sent by the FW.
> > >
> > > As an initial fix, prevent more than 16 outstanding adminq commands so
> > > there's no way to cause the adminq to get stuck. This works
> > > because the backing adminq request queue will never have more than 16
> > > pending adminq commands, so it will never overflow. This is done by
> > > reducing the adminq depth to 16.
> > >
> > > Fixes: 792d36ccc163 ("pds_core: Clean up init/uninit flows to be more readable")
> >
> > Hi Brett and Shannon,
> >
> > I see that the cited commit added the lines that are being updated
> > to pdsc_core_init(). But it seems to me that it did so by moving
> > them from pdsc_setup(). So I wonder if it is actually the commit
> > that added the code to pdsc_setup() that is being fixed.
> >
> > If so, perhaps:
> >
> > Fixes: 45d76f492938 ("pds_core: set up device and adminq")
>
> Perhaps... is it better to call out the older commit even tho' lines have
> moved around and this possibly won't apply?
Hi Shannon,
Sorry for not answering earlier, somehow I missed your email.
I see your point. But I think that it's best to cite the root cause.
end of thread, other threads:[~2025-04-11 18:59 UTC | newest]
Thread overview: 17+ messages
2025-04-07 22:51 [PATCH net 0/6] pds_core: updates and fixes Shannon Nelson
2025-04-07 22:51 ` [PATCH net 1/6] pds_core: Prevent possible adminq overflow/stuck condition Shannon Nelson
2025-04-09 9:37 ` Simon Horman
2025-04-09 23:32 ` Nelson, Shannon
2025-04-11 18:59 ` Simon Horman
2025-04-07 22:51 ` [PATCH net 2/6] pds_core: remove extra name description Shannon Nelson
2025-04-09 9:41 ` Simon Horman
2025-04-07 22:51 ` [PATCH net 3/6] pds_core: handle unsupported PDS_CORE_CMD_FW_CONTROL result Shannon Nelson
2025-04-09 16:34 ` Simon Horman
2025-04-09 23:35 ` Nelson, Shannon
2025-04-07 22:51 ` [PATCH net 4/6] pds_core: Remove unnecessary check in pds_client_adminq_cmd() Shannon Nelson
2025-04-09 17:07 ` Simon Horman
2025-04-07 22:51 ` [PATCH net 5/6] pds_core: smaller adminq poll starting interval Shannon Nelson
2025-04-09 16:50 ` Simon Horman
2025-04-09 23:28 ` Nelson, Shannon
2025-04-07 22:51 ` [PATCH net 6/6] pds_core: make wait_context part of q_info Shannon Nelson
2025-04-09 17:06 ` Simon Horman