[PATCH net-next 0/3] mlxsw: Improvements

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH net-next 0/3] mlxsw: Improvements
@ 2024-07-01 16:41 Petr Machata
  2024-07-01 16:41 ` [PATCH net-next 1/3] mlxsw: Warn about invalid accesses to array fields Petr Machata
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Petr Machata @ 2024-07-01 16:41 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev
  Cc: Petr Machata, Ido Schimmel, mlxsw

This patchset contains assortments of improvements to the mlxsw driver.
Please see individual patches for details.

Ido Schimmel (2):
  mlxsw: core_thermal: Report valid current state during cooling device
    registration
  mlxsw: pci: Lock configuration space of upstream bridge during reset

Petr Machata (1):
  mlxsw: Warn about invalid accesses to array fields

 .../ethernet/mellanox/mlxsw/core_thermal.c    | 50 ++++++++++---------
 drivers/net/ethernet/mellanox/mlxsw/item.h    |  2 +
 drivers/net/ethernet/mellanox/mlxsw/pci.c     |  6 +++
 3 files changed, 34 insertions(+), 24 deletions(-)

-- 
2.45.0


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH net-next 1/3] mlxsw: Warn about invalid accesses to array fields
  2024-07-01 16:41 [PATCH net-next 0/3] mlxsw: Improvements Petr Machata
@ 2024-07-01 16:41 ` Petr Machata
  2024-07-02  7:08   ` Przemek Kitszel
  2024-07-01 16:41 ` [PATCH net-next 2/3] mlxsw: core_thermal: Report valid current state during cooling device registration Petr Machata
  2024-07-01 16:41 ` [PATCH net-next 3/3] mlxsw: pci: Lock configuration space of upstream bridge during reset Petr Machata
  2 siblings, 1 reply; 13+ messages in thread
From: Petr Machata @ 2024-07-01 16:41 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev
  Cc: Petr Machata, Ido Schimmel, mlxsw

A forgotten or buggy variable initialization can cause out-of-bounds access
to a register or other item array field. For an overflow, such access would
mangle adjacent parts of the register payload. For an underflow, due to all
variables being unsigned, the access would likely trample unrelated memory.
Since neither is correct, replace these accesses with accesses at the index
of 0, and warn about the issue.

Suggested-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlxsw/item.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/item.h b/drivers/net/ethernet/mellanox/mlxsw/item.h
index cfafbeb42586..9f7133735760 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/item.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/item.h
@@ -218,6 +218,8 @@ __mlxsw_item_bit_array_offset(const struct mlxsw_item *item,
 	}
 
 	max_index = (item->size.bytes << 3) / item->element_size - 1;
+	if (WARN_ON(index > max_index))
+		index = 0;
 	be_index = max_index - index;
 	offset = be_index * item->element_size >> 3;
 	in_byte_index  = index % (BITS_PER_BYTE / item->element_size);
-- 
2.45.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH net-next 1/3] mlxsw: Warn about invalid accesses to array fields
  2024-07-01 16:41 ` [PATCH net-next 1/3] mlxsw: Warn about invalid accesses to array fields Petr Machata
@ 2024-07-02  7:08   ` Przemek Kitszel
  2024-07-03 12:40     ` Ido Schimmel
  0 siblings, 1 reply; 13+ messages in thread
From: Przemek Kitszel @ 2024-07-02  7:08 UTC (permalink / raw)
  To: Petr Machata, Ido Schimmel
  Cc: mlxsw, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev

On 7/1/24 18:41, Petr Machata wrote:
> A forgotten or buggy variable initialization can cause out-of-bounds access
> to a register or other item array field. For an overflow, such access would
> mangle adjacent parts of the register payload. For an underflow, due to all
> variables being unsigned, the access would likely trample unrelated memory.
> Since neither is correct, replace these accesses with accesses at the index
> of 0, and warn about the issue.

That is not correct either, but indeed better.

> 
> Suggested-by: Ido Schimmel <idosch@nvidia.com>
> Signed-off-by: Petr Machata <petrm@nvidia.com>
> Reviewed-by: Ido Schimmel <idosch@nvidia.com>
> ---
>   drivers/net/ethernet/mellanox/mlxsw/item.h | 2 ++
>   1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlxsw/item.h b/drivers/net/ethernet/mellanox/mlxsw/item.h
> index cfafbeb42586..9f7133735760 100644
> --- a/drivers/net/ethernet/mellanox/mlxsw/item.h
> +++ b/drivers/net/ethernet/mellanox/mlxsw/item.h
> @@ -218,6 +218,8 @@ __mlxsw_item_bit_array_offset(const struct mlxsw_item *item,
>   	}
>   
>   	max_index = (item->size.bytes << 3) / item->element_size - 1;
> +	if (WARN_ON(index > max_index))
> +		index = 0;

you have BUG*() calls just above those lines :(
anyway, WARN_ON_ONCE(), and perhaps you need to print some additional
data to finally fix this?

>   	be_index = max_index - index;
>   	offset = be_index * item->element_size >> 3;
>   	in_byte_index  = index % (BITS_PER_BYTE / item->element_size);


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net-next 1/3] mlxsw: Warn about invalid accesses to array fields
  2024-07-02  7:08   ` Przemek Kitszel
@ 2024-07-03 12:40     ` Ido Schimmel
  2024-07-08  9:45       ` Petr Machata
  0 siblings, 1 reply; 13+ messages in thread
From: Ido Schimmel @ 2024-07-03 12:40 UTC (permalink / raw)
  To: Przemek Kitszel
  Cc: Petr Machata, mlxsw, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, netdev

On Tue, Jul 02, 2024 at 09:08:17AM +0200, Przemek Kitszel wrote:
> On 7/1/24 18:41, Petr Machata wrote:
> > A forgotten or buggy variable initialization can cause out-of-bounds access
> > to a register or other item array field. For an overflow, such access would
> > mangle adjacent parts of the register payload. For an underflow, due to all
> > variables being unsigned, the access would likely trample unrelated memory.
> > Since neither is correct, replace these accesses with accesses at the index
> > of 0, and warn about the issue.
> 
> That is not correct either, but indeed better.
> 
> > 
> > Suggested-by: Ido Schimmel <idosch@nvidia.com>
> > Signed-off-by: Petr Machata <petrm@nvidia.com>
> > Reviewed-by: Ido Schimmel <idosch@nvidia.com>
> > ---
> >   drivers/net/ethernet/mellanox/mlxsw/item.h | 2 ++
> >   1 file changed, 2 insertions(+)
> > 
> > diff --git a/drivers/net/ethernet/mellanox/mlxsw/item.h b/drivers/net/ethernet/mellanox/mlxsw/item.h
> > index cfafbeb42586..9f7133735760 100644
> > --- a/drivers/net/ethernet/mellanox/mlxsw/item.h
> > +++ b/drivers/net/ethernet/mellanox/mlxsw/item.h
> > @@ -218,6 +218,8 @@ __mlxsw_item_bit_array_offset(const struct mlxsw_item *item,
> >   	}
> >   	max_index = (item->size.bytes << 3) / item->element_size - 1;
> > +	if (WARN_ON(index > max_index))
> > +		index = 0;
> 
> you have BUG*() calls just above those lines :(
> anyway, WARN_ON_ONCE(), and perhaps you need to print some additional
> data to finally fix this?

The trace should be enough, but more info can be added:

diff --git a/drivers/net/ethernet/mellanox/mlxsw/item.h b/drivers/net/ethernet/mellanox/mlxsw/item.h
index 9f7133735760..a619a0736bd1 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/item.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/item.h
@@ -218,7 +218,9 @@ __mlxsw_item_bit_array_offset(const struct mlxsw_item *item,
        }
 
        max_index = (item->size.bytes << 3) / item->element_size - 1;
-       if (WARN_ON(index > max_index))
+       if (WARN_ONCE(index > max_index,
+                     "name=%s,index=%u,max_index=%u\n", item->name, index,
+                     max_index))
                index = 0;
        be_index = max_index - index;
        offset = be_index * item->element_size >> 3;

Will leave it to Petr to decide what he wants to include there.

> 
> >   	be_index = max_index - index;
> >   	offset = be_index * item->element_size >> 3;
> >   	in_byte_index  = index % (BITS_PER_BYTE / item->element_size);
> 

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH net-next 1/3] mlxsw: Warn about invalid accesses to array fields
  2024-07-03 12:40     ` Ido Schimmel
@ 2024-07-08  9:45       ` Petr Machata
  0 siblings, 0 replies; 13+ messages in thread
From: Petr Machata @ 2024-07-08  9:45 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Przemek Kitszel, Petr Machata, mlxsw, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, netdev


Ido Schimmel <idosch@nvidia.com> writes:

> On Tue, Jul 02, 2024 at 09:08:17AM +0200, Przemek Kitszel wrote:
>> On 7/1/24 18:41, Petr Machata wrote:
>> > 
>> > Suggested-by: Ido Schimmel <idosch@nvidia.com>
>> > Signed-off-by: Petr Machata <petrm@nvidia.com>
>> > Reviewed-by: Ido Schimmel <idosch@nvidia.com>
>> > ---
>> >   drivers/net/ethernet/mellanox/mlxsw/item.h | 2 ++
>> >   1 file changed, 2 insertions(+)
>> > 
>> > diff --git a/drivers/net/ethernet/mellanox/mlxsw/item.h b/drivers/net/ethernet/mellanox/mlxsw/item.h
>> > index cfafbeb42586..9f7133735760 100644
>> > --- a/drivers/net/ethernet/mellanox/mlxsw/item.h
>> > +++ b/drivers/net/ethernet/mellanox/mlxsw/item.h
>> > @@ -218,6 +218,8 @@ __mlxsw_item_bit_array_offset(const struct mlxsw_item *item,
>> >   	}
>> >   	max_index = (item->size.bytes << 3) / item->element_size - 1;
>> > +	if (WARN_ON(index > max_index))
>> > +		index = 0;
>> 
>> you have BUG*() calls just above those lines :(
>> anyway, WARN_ON_ONCE(), and perhaps you need to print some additional
>> data to finally fix this?
>
> The trace should be enough, but more info can be added:
>
> diff --git a/drivers/net/ethernet/mellanox/mlxsw/item.h b/drivers/net/ethernet/mellanox/mlxsw/item.h
> index 9f7133735760..a619a0736bd1 100644
> --- a/drivers/net/ethernet/mellanox/mlxsw/item.h
> +++ b/drivers/net/ethernet/mellanox/mlxsw/item.h
> @@ -218,7 +218,9 @@ __mlxsw_item_bit_array_offset(const struct mlxsw_item *item,
>         }
>  
>         max_index = (item->size.bytes << 3) / item->element_size - 1;
> -       if (WARN_ON(index > max_index))
> +       if (WARN_ONCE(index > max_index,
> +                     "name=%s,index=%u,max_index=%u\n", item->name, index,
> +                     max_index))
>                 index = 0;
>         be_index = max_index - index;
>         offset = be_index * item->element_size >> 3;
>
> Will leave it to Petr to decide what he wants to include there.

Thanks, I'll send a v2.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH net-next 2/3] mlxsw: core_thermal: Report valid current state during cooling device registration
  2024-07-01 16:41 [PATCH net-next 0/3] mlxsw: Improvements Petr Machata
  2024-07-01 16:41 ` [PATCH net-next 1/3] mlxsw: Warn about invalid accesses to array fields Petr Machata
@ 2024-07-01 16:41 ` Petr Machata
  2024-07-02  7:27   ` Przemek Kitszel
  2024-07-09 16:06   ` Rafael J. Wysocki
  2024-07-01 16:41 ` [PATCH net-next 3/3] mlxsw: pci: Lock configuration space of upstream bridge during reset Petr Machata
  2 siblings, 2 replies; 13+ messages in thread
From: Petr Machata @ 2024-07-01 16:41 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev
  Cc: Petr Machata, Ido Schimmel, mlxsw, linux-pm, Vadim Pasternak

From: Ido Schimmel <idosch@nvidia.com>

Commit 31a0fa0019b0 ("thermal/debugfs: Pass cooling device state to
thermal_debug_cdev_add()") changed the thermal core to read the current
state of the cooling device as part of the cooling device's
registration. This is incompatible with the current implementation of
the cooling device operations in mlxsw, leading to initialization
failure with errors such as:

mlxsw_spectrum 0000:01:00.0: Failed to register cooling device
mlxsw_spectrum 0000:01:00.0: cannot register bus device

The reason for the failure is that when the get current state operation
is invoked the driver tries to derive the index of the cooling device by
walking a per thermal zone array and looking for the matching cooling
device pointer. However, the pointer is returned from the registration
function and therefore only set in the array after the registration.

The issue was later fixed by commit 1af89dedc8a5 ("thermal: core: Do not
fail cdev registration because of invalid initial state") by not failing
the registration of the cooling device if it cannot report a valid
current state during registration, although drivers are responsible for
ensuring that this will not happen.

Therefore, make sure the driver is able to report a valid current state
for the cooling device during registration by passing to the
registration function a per cooling device private data that already has
the cooling device index populated.

Cc: linux-pm@vger.kernel.org
Reviewed-by: Vadim Pasternak <vadimp@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
---
 .../ethernet/mellanox/mlxsw/core_thermal.c    | 50 ++++++++++---------
 1 file changed, 26 insertions(+), 24 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c b/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
index 5c511e1a8efa..eee3e37983ca 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
@@ -100,6 +100,12 @@ static const struct mlxsw_cooling_states default_cooling_states[] = {
 
 struct mlxsw_thermal;
 
+struct mlxsw_thermal_cooling_device {
+	struct mlxsw_thermal *thermal;
+	struct thermal_cooling_device *cdev;
+	unsigned int idx;
+};
+
 struct mlxsw_thermal_module {
 	struct mlxsw_thermal *parent;
 	struct thermal_zone_device *tzdev;
@@ -123,7 +129,7 @@ struct mlxsw_thermal {
 	const struct mlxsw_bus_info *bus_info;
 	struct thermal_zone_device *tzdev;
 	int polling_delay;
-	struct thermal_cooling_device *cdevs[MLXSW_MFCR_PWMS_MAX];
+	struct mlxsw_thermal_cooling_device cdevs[MLXSW_MFCR_PWMS_MAX];
 	struct thermal_trip trips[MLXSW_THERMAL_NUM_TRIPS];
 	struct mlxsw_cooling_states cooling_states[MLXSW_THERMAL_NUM_TRIPS];
 	struct mlxsw_thermal_area line_cards[];
@@ -147,7 +153,7 @@ static int mlxsw_get_cooling_device_idx(struct mlxsw_thermal *thermal,
 	int i;
 
 	for (i = 0; i < MLXSW_MFCR_PWMS_MAX; i++)
-		if (thermal->cdevs[i] == cdev)
+		if (thermal->cdevs[i].cdev == cdev)
 			return i;
 
 	/* Allow mlxsw thermal zone binding to an external cooling device */
@@ -352,17 +358,14 @@ static int mlxsw_thermal_get_cur_state(struct thermal_cooling_device *cdev,
 				       unsigned long *p_state)
 
 {
-	struct mlxsw_thermal *thermal = cdev->devdata;
+	struct mlxsw_thermal_cooling_device *mlxsw_cdev = cdev->devdata;
+	struct mlxsw_thermal *thermal = mlxsw_cdev->thermal;
 	struct device *dev = thermal->bus_info->dev;
 	char mfsc_pl[MLXSW_REG_MFSC_LEN];
-	int err, idx;
 	u8 duty;
+	int err;
 
-	idx = mlxsw_get_cooling_device_idx(thermal, cdev);
-	if (idx < 0)
-		return idx;
-
-	mlxsw_reg_mfsc_pack(mfsc_pl, idx, 0);
+	mlxsw_reg_mfsc_pack(mfsc_pl, mlxsw_cdev->idx, 0);
 	err = mlxsw_reg_query(thermal->core, MLXSW_REG(mfsc), mfsc_pl);
 	if (err) {
 		dev_err(dev, "Failed to query PWM duty\n");
@@ -378,22 +381,19 @@ static int mlxsw_thermal_set_cur_state(struct thermal_cooling_device *cdev,
 				       unsigned long state)
 
 {
-	struct mlxsw_thermal *thermal = cdev->devdata;
+	struct mlxsw_thermal_cooling_device *mlxsw_cdev = cdev->devdata;
+	struct mlxsw_thermal *thermal = mlxsw_cdev->thermal;
 	struct device *dev = thermal->bus_info->dev;
 	char mfsc_pl[MLXSW_REG_MFSC_LEN];
-	int idx;
 	int err;
 
 	if (state > MLXSW_THERMAL_MAX_STATE)
 		return -EINVAL;
 
-	idx = mlxsw_get_cooling_device_idx(thermal, cdev);
-	if (idx < 0)
-		return idx;
-
 	/* Normalize the state to the valid speed range. */
 	state = max_t(unsigned long, MLXSW_THERMAL_MIN_STATE, state);
-	mlxsw_reg_mfsc_pack(mfsc_pl, idx, mlxsw_state_to_duty(state));
+	mlxsw_reg_mfsc_pack(mfsc_pl, mlxsw_cdev->idx,
+			    mlxsw_state_to_duty(state));
 	err = mlxsw_reg_write(thermal->core, MLXSW_REG(mfsc), mfsc_pl);
 	if (err) {
 		dev_err(dev, "Failed to write PWM duty\n");
@@ -753,17 +753,21 @@ int mlxsw_thermal_init(struct mlxsw_core *core,
 	}
 	for (i = 0; i < MLXSW_MFCR_PWMS_MAX; i++) {
 		if (pwm_active & BIT(i)) {
+			struct mlxsw_thermal_cooling_device *mlxsw_cdev;
 			struct thermal_cooling_device *cdev;
 
+			mlxsw_cdev = &thermal->cdevs[i];
+			mlxsw_cdev->thermal = thermal;
+			mlxsw_cdev->idx = i;
 			cdev = thermal_cooling_device_register("mlxsw_fan",
-							       thermal,
+							       mlxsw_cdev,
 							       &mlxsw_cooling_ops);
 			if (IS_ERR(cdev)) {
 				err = PTR_ERR(cdev);
 				dev_err(dev, "Failed to register cooling device\n");
 				goto err_thermal_cooling_device_register;
 			}
-			thermal->cdevs[i] = cdev;
+			mlxsw_cdev->cdev = cdev;
 		}
 	}
 
@@ -824,8 +828,8 @@ int mlxsw_thermal_init(struct mlxsw_core *core,
 err_thermal_zone_device_register:
 err_thermal_cooling_device_register:
 	for (i = 0; i < MLXSW_MFCR_PWMS_MAX; i++)
-		if (thermal->cdevs[i])
-			thermal_cooling_device_unregister(thermal->cdevs[i]);
+		if (thermal->cdevs[i].cdev)
+			thermal_cooling_device_unregister(thermal->cdevs[i].cdev);
 err_reg_write:
 err_reg_query:
 	kfree(thermal);
@@ -848,10 +852,8 @@ void mlxsw_thermal_fini(struct mlxsw_thermal *thermal)
 	}
 
 	for (i = 0; i < MLXSW_MFCR_PWMS_MAX; i++) {
-		if (thermal->cdevs[i]) {
-			thermal_cooling_device_unregister(thermal->cdevs[i]);
-			thermal->cdevs[i] = NULL;
-		}
+		if (thermal->cdevs[i].cdev)
+			thermal_cooling_device_unregister(thermal->cdevs[i].cdev);
 	}
 
 	kfree(thermal);
-- 
2.45.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH net-next 2/3] mlxsw: core_thermal: Report valid current state during cooling device registration
  2024-07-01 16:41 ` [PATCH net-next 2/3] mlxsw: core_thermal: Report valid current state during cooling device registration Petr Machata
@ 2024-07-02  7:27   ` Przemek Kitszel
  2024-07-09 16:06   ` Rafael J. Wysocki
  1 sibling, 0 replies; 13+ messages in thread
From: Przemek Kitszel @ 2024-07-02  7:27 UTC (permalink / raw)
  To: Petr Machata
  Cc: Ido Schimmel, mlxsw, linux-pm, Vadim Pasternak, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, netdev

On 7/1/24 18:41, Petr Machata wrote:
> From: Ido Schimmel <idosch@nvidia.com>
> 
> Commit 31a0fa0019b0 ("thermal/debugfs: Pass cooling device state to
> thermal_debug_cdev_add()") changed the thermal core to read the current
> state of the cooling device as part of the cooling device's
> registration. This is incompatible with the current implementation of
> the cooling device operations in mlxsw, leading to initialization
> failure with errors such as:
> 
> mlxsw_spectrum 0000:01:00.0: Failed to register cooling device
> mlxsw_spectrum 0000:01:00.0: cannot register bus device
> 
> The reason for the failure is that when the get current state operation
> is invoked the driver tries to derive the index of the cooling device by
> walking a per thermal zone array and looking for the matching cooling
> device pointer. However, the pointer is returned from the registration
> function and therefore only set in the array after the registration.
> 
> The issue was later fixed by commit 1af89dedc8a5 ("thermal: core: Do not
> fail cdev registration because of invalid initial state") by not failing
> the registration of the cooling device if it cannot report a valid
> current state during registration, although drivers are responsible for
> ensuring that this will not happen.
> 
> Therefore, make sure the driver is able to report a valid current state
> for the cooling device during registration by passing to the
> registration function a per cooling device private data that already has
> the cooling device index populated.
> 
> Cc: linux-pm@vger.kernel.org
> Reviewed-by: Vadim Pasternak <vadimp@nvidia.com>
> Signed-off-by: Ido Schimmel <idosch@nvidia.com>
> Signed-off-by: Petr Machata <petrm@nvidia.com>
> ---
>   .../ethernet/mellanox/mlxsw/core_thermal.c    | 50 ++++++++++---------
>   1 file changed, 26 insertions(+), 24 deletions(-)
> 

just two nitpicks

> @@ -824,8 +828,8 @@ int mlxsw_thermal_init(struct mlxsw_core *core,
>   err_thermal_zone_device_register:
>   err_thermal_cooling_device_register:
>   	for (i = 0; i < MLXSW_MFCR_PWMS_MAX; i++)
> -		if (thermal->cdevs[i])
> -			thermal_cooling_device_unregister(thermal->cdevs[i]);
> +		if (thermal->cdevs[i].cdev)

this check is done by thermal_cooling_device_unregister()

> +			thermal_cooling_device_unregister(thermal->cdevs[i].cdev);
>   err_reg_write:
>   err_reg_query:
>   	kfree(thermal);
> @@ -848,10 +852,8 @@ void mlxsw_thermal_fini(struct mlxsw_thermal *thermal)
>   	}
>   
>   	for (i = 0; i < MLXSW_MFCR_PWMS_MAX; i++) {
> -		if (thermal->cdevs[i]) {
> -			thermal_cooling_device_unregister(thermal->cdevs[i]);
> -			thermal->cdevs[i] = NULL;
> -		}
> +		if (thermal->cdevs[i].cdev)

ditto

> +			thermal_cooling_device_unregister(thermal->cdevs[i].cdev);
>   	}
>   
>   	kfree(thermal);


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net-next 2/3] mlxsw: core_thermal: Report valid current state during cooling device registration
  2024-07-01 16:41 ` [PATCH net-next 2/3] mlxsw: core_thermal: Report valid current state during cooling device registration Petr Machata
  2024-07-02  7:27   ` Przemek Kitszel
@ 2024-07-09 16:06   ` Rafael J. Wysocki
  1 sibling, 0 replies; 13+ messages in thread
From: Rafael J. Wysocki @ 2024-07-09 16:06 UTC (permalink / raw)
  To: Petr Machata
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev, Ido Schimmel, mlxsw, linux-pm, Vadim Pasternak

On Mon, Jul 1, 2024 at 6:45 PM Petr Machata <petrm@nvidia.com> wrote:
>
> From: Ido Schimmel <idosch@nvidia.com>
>
> Commit 31a0fa0019b0 ("thermal/debugfs: Pass cooling device state to
> thermal_debug_cdev_add()") changed the thermal core to read the current
> state of the cooling device as part of the cooling device's
> registration. This is incompatible with the current implementation of
> the cooling device operations in mlxsw, leading to initialization
> failure with errors such as:
>
> mlxsw_spectrum 0000:01:00.0: Failed to register cooling device
> mlxsw_spectrum 0000:01:00.0: cannot register bus device
>
> The reason for the failure is that when the get current state operation
> is invoked the driver tries to derive the index of the cooling device by
> walking a per thermal zone array and looking for the matching cooling
> device pointer. However, the pointer is returned from the registration
> function and therefore only set in the array after the registration.
>
> The issue was later fixed by commit 1af89dedc8a5 ("thermal: core: Do not
> fail cdev registration because of invalid initial state") by not failing
> the registration of the cooling device if it cannot report a valid
> current state during registration, although drivers are responsible for
> ensuring that this will not happen.
>
> Therefore, make sure the driver is able to report a valid current state
> for the cooling device during registration by passing to the
> registration function a per cooling device private data that already has
> the cooling device index populated.
>
> Cc: linux-pm@vger.kernel.org
> Reviewed-by: Vadim Pasternak <vadimp@nvidia.com>
> Signed-off-by: Ido Schimmel <idosch@nvidia.com>
> Signed-off-by: Petr Machata <petrm@nvidia.com>

Acked-by: Rafael J. Wysocki <rafael@kernel.org>

> ---
>  .../ethernet/mellanox/mlxsw/core_thermal.c    | 50 ++++++++++---------
>  1 file changed, 26 insertions(+), 24 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c b/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
> index 5c511e1a8efa..eee3e37983ca 100644
> --- a/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
> +++ b/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
> @@ -100,6 +100,12 @@ static const struct mlxsw_cooling_states default_cooling_states[] = {
>
>  struct mlxsw_thermal;
>
> +struct mlxsw_thermal_cooling_device {
> +       struct mlxsw_thermal *thermal;
> +       struct thermal_cooling_device *cdev;
> +       unsigned int idx;
> +};
> +
>  struct mlxsw_thermal_module {
>         struct mlxsw_thermal *parent;
>         struct thermal_zone_device *tzdev;
> @@ -123,7 +129,7 @@ struct mlxsw_thermal {
>         const struct mlxsw_bus_info *bus_info;
>         struct thermal_zone_device *tzdev;
>         int polling_delay;
> -       struct thermal_cooling_device *cdevs[MLXSW_MFCR_PWMS_MAX];
> +       struct mlxsw_thermal_cooling_device cdevs[MLXSW_MFCR_PWMS_MAX];
>         struct thermal_trip trips[MLXSW_THERMAL_NUM_TRIPS];
>         struct mlxsw_cooling_states cooling_states[MLXSW_THERMAL_NUM_TRIPS];
>         struct mlxsw_thermal_area line_cards[];
> @@ -147,7 +153,7 @@ static int mlxsw_get_cooling_device_idx(struct mlxsw_thermal *thermal,
>         int i;
>
>         for (i = 0; i < MLXSW_MFCR_PWMS_MAX; i++)
> -               if (thermal->cdevs[i] == cdev)
> +               if (thermal->cdevs[i].cdev == cdev)
>                         return i;
>
>         /* Allow mlxsw thermal zone binding to an external cooling device */
> @@ -352,17 +358,14 @@ static int mlxsw_thermal_get_cur_state(struct thermal_cooling_device *cdev,
>                                        unsigned long *p_state)
>
>  {
> -       struct mlxsw_thermal *thermal = cdev->devdata;
> +       struct mlxsw_thermal_cooling_device *mlxsw_cdev = cdev->devdata;
> +       struct mlxsw_thermal *thermal = mlxsw_cdev->thermal;
>         struct device *dev = thermal->bus_info->dev;
>         char mfsc_pl[MLXSW_REG_MFSC_LEN];
> -       int err, idx;
>         u8 duty;
> +       int err;
>
> -       idx = mlxsw_get_cooling_device_idx(thermal, cdev);
> -       if (idx < 0)
> -               return idx;
> -
> -       mlxsw_reg_mfsc_pack(mfsc_pl, idx, 0);
> +       mlxsw_reg_mfsc_pack(mfsc_pl, mlxsw_cdev->idx, 0);
>         err = mlxsw_reg_query(thermal->core, MLXSW_REG(mfsc), mfsc_pl);
>         if (err) {
>                 dev_err(dev, "Failed to query PWM duty\n");
> @@ -378,22 +381,19 @@ static int mlxsw_thermal_set_cur_state(struct thermal_cooling_device *cdev,
>                                        unsigned long state)
>
>  {
> -       struct mlxsw_thermal *thermal = cdev->devdata;
> +       struct mlxsw_thermal_cooling_device *mlxsw_cdev = cdev->devdata;
> +       struct mlxsw_thermal *thermal = mlxsw_cdev->thermal;
>         struct device *dev = thermal->bus_info->dev;
>         char mfsc_pl[MLXSW_REG_MFSC_LEN];
> -       int idx;
>         int err;
>
>         if (state > MLXSW_THERMAL_MAX_STATE)
>                 return -EINVAL;
>
> -       idx = mlxsw_get_cooling_device_idx(thermal, cdev);
> -       if (idx < 0)
> -               return idx;
> -
>         /* Normalize the state to the valid speed range. */
>         state = max_t(unsigned long, MLXSW_THERMAL_MIN_STATE, state);
> -       mlxsw_reg_mfsc_pack(mfsc_pl, idx, mlxsw_state_to_duty(state));
> +       mlxsw_reg_mfsc_pack(mfsc_pl, mlxsw_cdev->idx,
> +                           mlxsw_state_to_duty(state));
>         err = mlxsw_reg_write(thermal->core, MLXSW_REG(mfsc), mfsc_pl);
>         if (err) {
>                 dev_err(dev, "Failed to write PWM duty\n");
> @@ -753,17 +753,21 @@ int mlxsw_thermal_init(struct mlxsw_core *core,
>         }
>         for (i = 0; i < MLXSW_MFCR_PWMS_MAX; i++) {
>                 if (pwm_active & BIT(i)) {
> +                       struct mlxsw_thermal_cooling_device *mlxsw_cdev;
>                         struct thermal_cooling_device *cdev;
>
> +                       mlxsw_cdev = &thermal->cdevs[i];
> +                       mlxsw_cdev->thermal = thermal;
> +                       mlxsw_cdev->idx = i;
>                         cdev = thermal_cooling_device_register("mlxsw_fan",
> -                                                              thermal,
> +                                                              mlxsw_cdev,
>                                                                &mlxsw_cooling_ops);
>                         if (IS_ERR(cdev)) {
>                                 err = PTR_ERR(cdev);
>                                 dev_err(dev, "Failed to register cooling device\n");
>                                 goto err_thermal_cooling_device_register;
>                         }
> -                       thermal->cdevs[i] = cdev;
> +                       mlxsw_cdev->cdev = cdev;
>                 }
>         }
>
> @@ -824,8 +828,8 @@ int mlxsw_thermal_init(struct mlxsw_core *core,
>  err_thermal_zone_device_register:
>  err_thermal_cooling_device_register:
>         for (i = 0; i < MLXSW_MFCR_PWMS_MAX; i++)
> -               if (thermal->cdevs[i])
> -                       thermal_cooling_device_unregister(thermal->cdevs[i]);
> +               if (thermal->cdevs[i].cdev)
> +                       thermal_cooling_device_unregister(thermal->cdevs[i].cdev);
>  err_reg_write:
>  err_reg_query:
>         kfree(thermal);
> @@ -848,10 +852,8 @@ void mlxsw_thermal_fini(struct mlxsw_thermal *thermal)
>         }
>
>         for (i = 0; i < MLXSW_MFCR_PWMS_MAX; i++) {
> -               if (thermal->cdevs[i]) {
> -                       thermal_cooling_device_unregister(thermal->cdevs[i]);
> -                       thermal->cdevs[i] = NULL;
> -               }
> +               if (thermal->cdevs[i].cdev)
> +                       thermal_cooling_device_unregister(thermal->cdevs[i].cdev);
>         }
>
>         kfree(thermal);
> --
> 2.45.0
>
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH net-next 3/3] mlxsw: pci: Lock configuration space of upstream bridge during reset
  2024-07-01 16:41 [PATCH net-next 0/3] mlxsw: Improvements Petr Machata
  2024-07-01 16:41 ` [PATCH net-next 1/3] mlxsw: Warn about invalid accesses to array fields Petr Machata
  2024-07-01 16:41 ` [PATCH net-next 2/3] mlxsw: core_thermal: Report valid current state during cooling device registration Petr Machata
@ 2024-07-01 16:41 ` Petr Machata
  2024-07-02  7:35   ` Przemek Kitszel
  2 siblings, 1 reply; 13+ messages in thread
From: Petr Machata @ 2024-07-01 16:41 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev
  Cc: Petr Machata, Ido Schimmel, mlxsw, linux-pci

From: Ido Schimmel <idosch@nvidia.com>

The driver triggers a "Secondary Bus Reset" (SBR) by calling
__pci_reset_function_locked() which asserts the SBR bit in the "Bridge
Control Register" in the configuration space of the upstream bridge for
2ms. This is done without locking the configuration space of the
upstream bridge port, allowing user space to access it concurrently.

Linux 6.11 will start warning about such unlocked resets [1][2]:

pcieport 0000:00:01.0: unlocked secondary bus reset via: pci_reset_bus_function+0x51c/0x6a0

Avoid the warning by locking the configuration space of the upstream
bridge prior to the reset and unlocking it afterwards.

[1] https://lore.kernel.org/all/171711746953.1628941.4692125082286867825.stgit@dwillia2-xfh.jf.intel.com/
[2] https://lore.kernel.org/all/20240531213150.GA610983@bhelgaas/

Cc: linux-pci@vger.kernel.org
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlxsw/pci.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/pci.c b/drivers/net/ethernet/mellanox/mlxsw/pci.c
index 0320dabd1380..060e5b939211 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/pci.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/pci.c
@@ -1784,6 +1784,7 @@ static int mlxsw_pci_reset_at_pci_disable(struct mlxsw_pci *mlxsw_pci,
 {
 	struct pci_dev *pdev = mlxsw_pci->pdev;
 	char mrsr_pl[MLXSW_REG_MRSR_LEN];
+	struct pci_dev *bridge;
 	int err;
 
 	if (!pci_reset_sbr_supported) {
@@ -1800,6 +1801,9 @@ static int mlxsw_pci_reset_at_pci_disable(struct mlxsw_pci *mlxsw_pci,
 sbr:
 	device_lock_assert(&pdev->dev);
 
+	bridge = pci_upstream_bridge(pdev);
+	if (bridge)
+		pci_cfg_access_lock(bridge);
 	pci_cfg_access_lock(pdev);
 	pci_save_state(pdev);
 
@@ -1809,6 +1813,8 @@ static int mlxsw_pci_reset_at_pci_disable(struct mlxsw_pci *mlxsw_pci,
 
 	pci_restore_state(pdev);
 	pci_cfg_access_unlock(pdev);
+	if (bridge)
+		pci_cfg_access_unlock(bridge);
 
 	return err;
 }
-- 
2.45.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH net-next 3/3] mlxsw: pci: Lock configuration space of upstream bridge during reset
  2024-07-01 16:41 ` [PATCH net-next 3/3] mlxsw: pci: Lock configuration space of upstream bridge during reset Petr Machata
@ 2024-07-02  7:35   ` Przemek Kitszel
  2024-07-03 14:42     ` Ido Schimmel
  0 siblings, 1 reply; 13+ messages in thread
From: Przemek Kitszel @ 2024-07-02  7:35 UTC (permalink / raw)
  To: Petr Machata, Ido Schimmel
  Cc: mlxsw, linux-pci, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, netdev

On 7/1/24 18:41, Petr Machata wrote:
> From: Ido Schimmel <idosch@nvidia.com>
> 
> The driver triggers a "Secondary Bus Reset" (SBR) by calling
> __pci_reset_function_locked() which asserts the SBR bit in the "Bridge
> Control Register" in the configuration space of the upstream bridge for
> 2ms. This is done without locking the configuration space of the
> upstream bridge port, allowing user space to access it concurrently.

This means your patch is a bugfix.

> 
> Linux 6.11 will start warning about such unlocked resets [1][2]:
> 
> pcieport 0000:00:01.0: unlocked secondary bus reset via: pci_reset_bus_function+0x51c/0x6a0
> 
> Avoid the warning by locking the configuration space of the upstream
> bridge prior to the reset and unlocking it afterwards.

You are not avoiding the warning but protecting concurrent access,
please add a Fixes tag.

> 
> [1] https://lore.kernel.org/all/171711746953.1628941.4692125082286867825.stgit@dwillia2-xfh.jf.intel.com/
> [2] https://lore.kernel.org/all/20240531213150.GA610983@bhelgaas/
> 
> Cc: linux-pci@vger.kernel.org
> Signed-off-by: Ido Schimmel <idosch@nvidia.com>
> Signed-off-by: Petr Machata <petrm@nvidia.com>
> ---
>   drivers/net/ethernet/mellanox/mlxsw/pci.c | 6 ++++++
>   1 file changed, 6 insertions(+)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlxsw/pci.c b/drivers/net/ethernet/mellanox/mlxsw/pci.c
> index 0320dabd1380..060e5b939211 100644
> --- a/drivers/net/ethernet/mellanox/mlxsw/pci.c
> +++ b/drivers/net/ethernet/mellanox/mlxsw/pci.c
> @@ -1784,6 +1784,7 @@ static int mlxsw_pci_reset_at_pci_disable(struct mlxsw_pci *mlxsw_pci,
>   {
>   	struct pci_dev *pdev = mlxsw_pci->pdev;
>   	char mrsr_pl[MLXSW_REG_MRSR_LEN];
> +	struct pci_dev *bridge;
>   	int err;
>   
>   	if (!pci_reset_sbr_supported) {
> @@ -1800,6 +1801,9 @@ static int mlxsw_pci_reset_at_pci_disable(struct mlxsw_pci *mlxsw_pci,
>   sbr:
>   	device_lock_assert(&pdev->dev);
>   
> +	bridge = pci_upstream_bridge(pdev);
> +	if (bridge)
> +		pci_cfg_access_lock(bridge);
>   	pci_cfg_access_lock(pdev);
>   	pci_save_state(pdev);
>   
> @@ -1809,6 +1813,8 @@ static int mlxsw_pci_reset_at_pci_disable(struct mlxsw_pci *mlxsw_pci,
>   
>   	pci_restore_state(pdev);
>   	pci_cfg_access_unlock(pdev);
> +	if (bridge)
> +		pci_cfg_access_unlock(bridge);
>   
>   	return err;
>   }


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net-next 3/3] mlxsw: pci: Lock configuration space of upstream bridge during reset
  2024-07-02  7:35   ` Przemek Kitszel
@ 2024-07-03 14:42     ` Ido Schimmel
  2024-07-12 21:21       ` Bjorn Helgaas
  0 siblings, 1 reply; 13+ messages in thread
From: Ido Schimmel @ 2024-07-03 14:42 UTC (permalink / raw)
  To: Przemek Kitszel, helgaas
  Cc: Petr Machata, mlxsw, linux-pci, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, netdev

On Tue, Jul 02, 2024 at 09:35:50AM +0200, Przemek Kitszel wrote:
> On 7/1/24 18:41, Petr Machata wrote:
> > From: Ido Schimmel <idosch@nvidia.com>
> > 
> > The driver triggers a "Secondary Bus Reset" (SBR) by calling
> > __pci_reset_function_locked() which asserts the SBR bit in the "Bridge
> > Control Register" in the configuration space of the upstream bridge for
> > 2ms. This is done without locking the configuration space of the
> > upstream bridge port, allowing user space to access it concurrently.
> 
> This means your patch is a bugfix.
> 
> > 
> > Linux 6.11 will start warning about such unlocked resets [1][2]:
> > 
> > pcieport 0000:00:01.0: unlocked secondary bus reset via: pci_reset_bus_function+0x51c/0x6a0
> > 
> > Avoid the warning by locking the configuration space of the upstream
> > bridge prior to the reset and unlocking it afterwards.
> 
> You are not avoiding the warning but protecting concurrent access,
> please add a Fixes tag.

The patch that added the missing lock in PCI core was posted without a
Fixes tag and merged as part of the 6.10 PR. See commit 7e89efc6e9e4
("PCI: Lock upstream bridge for pci_reset_function()").

I don't see a good reason for root to poke in the configuration space of
the upstream bridge during SBR, but AFAICT the worst that can happen is
that reset will fail and while it is a bug, it is not a regression.

Bjorn, do you see a reason to post this as a fix?

Thanks

> 
> > 
> > [1] https://lore.kernel.org/all/171711746953.1628941.4692125082286867825.stgit@dwillia2-xfh.jf.intel.com/
> > [2] https://lore.kernel.org/all/20240531213150.GA610983@bhelgaas/
> > 
> > Cc: linux-pci@vger.kernel.org
> > Signed-off-by: Ido Schimmel <idosch@nvidia.com>
> > Signed-off-by: Petr Machata <petrm@nvidia.com>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net-next 3/3] mlxsw: pci: Lock configuration space of upstream bridge during reset
  2024-07-03 14:42     ` Ido Schimmel
@ 2024-07-12 21:21       ` Bjorn Helgaas
  2024-07-14 11:29         ` Ido Schimmel
  0 siblings, 1 reply; 13+ messages in thread
From: Bjorn Helgaas @ 2024-07-12 21:21 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Przemek Kitszel, Petr Machata, mlxsw, linux-pci, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, netdev, Dan Williams

[+cc Dan]

On Wed, Jul 03, 2024 at 05:42:05PM +0300, Ido Schimmel wrote:
> On Tue, Jul 02, 2024 at 09:35:50AM +0200, Przemek Kitszel wrote:
> > On 7/1/24 18:41, Petr Machata wrote:
> > > From: Ido Schimmel <idosch@nvidia.com>
> > > 
> > > The driver triggers a "Secondary Bus Reset" (SBR) by calling
> > > __pci_reset_function_locked() which asserts the SBR bit in the "Bridge
> > > Control Register" in the configuration space of the upstream bridge for
> > > 2ms. This is done without locking the configuration space of the
> > > upstream bridge port, allowing user space to access it concurrently.
> > 
> > This means your patch is a bugfix.
> > 
> > > Linux 6.11 will start warning about such unlocked resets [1][2]:
> > > 
> > > pcieport 0000:00:01.0: unlocked secondary bus reset via: pci_reset_bus_function+0x51c/0x6a0
> > > 
> > > Avoid the warning by locking the configuration space of the upstream
> > > bridge prior to the reset and unlocking it afterwards.
> > 
> > You are not avoiding the warning but protecting concurrent access,
> > please add a Fixes tag.
> 
> The patch that added the missing lock in PCI core was posted without a
> Fixes tag and merged as part of the 6.10 PR. See commit 7e89efc6e9e4
> ("PCI: Lock upstream bridge for pci_reset_function()").
> 
> I don't see a good reason for root to poke in the configuration space of
> the upstream bridge during SBR, but AFAICT the worst that can happen is
> that reset will fail and while it is a bug, it is not a regression.
> 
> Bjorn, do you see a reason to post this as a fix?

Sorry, I was on vacation and missed this when I returned.

mlxsw is one of the few users of __pci_reset_function_locked().
Others are liquidio (octeon), VFIO, and Xen.

You need __pci_reset_function_locked() if you're already holding the
device mutex, i.e., device_lock(&pdev->dev).  I looked at the
mlxsw_pci_reset_at_pci_disable() path, and didn't see where it holds
that device lock, but I probably missed it.

The usual pci_reset_function() path, which would be preferable if you
can use it, does basically this:

  pci_dev_lock(bridge)
    device_lock(&bridge->dev)
    pci_cfg_access_lock(bridge)
  pci_dev_lock(pdev)
    device_lock(&pdev->dev)
    pci_cfg_access_lock(pdev)
  pci_dev_save_and_disable(dev)
  __pci_reset_function_locked(pdev)

This patch adds pci_cfg_access_lock(bridge), but doesn't acquire the
device_lock for the bridge.

It looks like you always reset the device at mlxsw_pci_probe()-time,
which is quite unusual in the first place, but I suppose there's some
good reason for it.

If you can use pci_reset_function() directly (or avoid the reset
altogether), it would be far preferable and would avoid potential
issues like the warning here.

Bjorn

> > > [1] https://lore.kernel.org/all/171711746953.1628941.4692125082286867825.stgit@dwillia2-xfh.jf.intel.com/
> > > [2] https://lore.kernel.org/all/20240531213150.GA610983@bhelgaas/
> > > 
> > > Cc: linux-pci@vger.kernel.org
> > > Signed-off-by: Ido Schimmel <idosch@nvidia.com>
> > > Signed-off-by: Petr Machata <petrm@nvidia.com>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net-next 3/3] mlxsw: pci: Lock configuration space of upstream bridge during reset
  2024-07-12 21:21       ` Bjorn Helgaas
@ 2024-07-14 11:29         ` Ido Schimmel
  0 siblings, 0 replies; 13+ messages in thread
From: Ido Schimmel @ 2024-07-14 11:29 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Przemek Kitszel, Petr Machata, mlxsw, linux-pci, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, netdev, Dan Williams

Hi Bjorn,

On Fri, Jul 12, 2024 at 04:21:57PM -0500, Bjorn Helgaas wrote:
> [+cc Dan]
> 
> On Wed, Jul 03, 2024 at 05:42:05PM +0300, Ido Schimmel wrote:
> > On Tue, Jul 02, 2024 at 09:35:50AM +0200, Przemek Kitszel wrote:
> > > On 7/1/24 18:41, Petr Machata wrote:
> > > > From: Ido Schimmel <idosch@nvidia.com>
> > > > 
> > > > The driver triggers a "Secondary Bus Reset" (SBR) by calling
> > > > __pci_reset_function_locked() which asserts the SBR bit in the "Bridge
> > > > Control Register" in the configuration space of the upstream bridge for
> > > > 2ms. This is done without locking the configuration space of the
> > > > upstream bridge port, allowing user space to access it concurrently.
> > > 
> > > This means your patch is a bugfix.
> > > 
> > > > Linux 6.11 will start warning about such unlocked resets [1][2]:
> > > > 
> > > > pcieport 0000:00:01.0: unlocked secondary bus reset via: pci_reset_bus_function+0x51c/0x6a0
> > > > 
> > > > Avoid the warning by locking the configuration space of the upstream
> > > > bridge prior to the reset and unlocking it afterwards.
> > > 
> > > You are not avoiding the warning but protecting concurrent access,
> > > please add a Fixes tag.
> > 
> > The patch that added the missing lock in PCI core was posted without a
> > Fixes tag and merged as part of the 6.10 PR. See commit 7e89efc6e9e4
> > ("PCI: Lock upstream bridge for pci_reset_function()").
> > 
> > I don't see a good reason for root to poke in the configuration space of
> > the upstream bridge during SBR, but AFAICT the worst that can happen is
> > that reset will fail and while it is a bug, it is not a regression.
> > 
> > Bjorn, do you see a reason to post this as a fix?
> 
> Sorry, I was on vacation and missed this when I returned.
> 
> mlxsw is one of the few users of __pci_reset_function_locked().
> Others are liquidio (octeon), VFIO, and Xen.
> 
> You need __pci_reset_function_locked() if you're already holding the
> device mutex, i.e., device_lock(&pdev->dev).  I looked at the
> mlxsw_pci_reset_at_pci_disable() path, and didn't see where it holds
> that device lock, but I probably missed it.

It is locked. There is device_lock_assert(&pdev->dev) before the call to
__pci_reset_function_locked(pdev) to make sure this is the case.

> The usual pci_reset_function() path, which would be preferable if you
> can use it, does basically this:
> 
>   pci_dev_lock(bridge)
>     device_lock(&bridge->dev)
>     pci_cfg_access_lock(bridge)
>   pci_dev_lock(pdev)
>     device_lock(&pdev->dev)
>     pci_cfg_access_lock(pdev)
>   pci_dev_save_and_disable(dev)
>   __pci_reset_function_locked(pdev)
> 
> This patch adds pci_cfg_access_lock(bridge), but doesn't acquire the
> device_lock for the bridge.
> 
> It looks like you always reset the device at mlxsw_pci_probe()-time,
> which is quite unusual in the first place, but I suppose there's some
> good reason for it.

The driver resets the device to bring it to a known and clean state.
Older devices can be reset using a firmware command, but current
generation requires a PCI reset.

> If you can use pci_reset_function() directly (or avoid the reset
> altogether), it would be far preferable and would avoid potential
> issues like the warning here.

We couldn't use pci_reset_function() even if we didn't reset during
probe. Another call path that triggers the reset is "devlink reload"
which holds the devlink instance lock. Trying to acquire the device lock
while holding the devlink instance lock would result in lock inversion.
We modified devlink to acquire the device lock before the instance lock
so that we could call PCI APIs with the device lock held. See:

https://lore.kernel.org/netdev/20231017074257.3389177-1-idosch@nvidia.com/

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2024-07-14 11:30 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-07-01 16:41 [PATCH net-next 0/3] mlxsw: Improvements Petr Machata
2024-07-01 16:41 ` [PATCH net-next 1/3] mlxsw: Warn about invalid accesses to array fields Petr Machata
2024-07-02  7:08   ` Przemek Kitszel
2024-07-03 12:40     ` Ido Schimmel
2024-07-08  9:45       ` Petr Machata
2024-07-01 16:41 ` [PATCH net-next 2/3] mlxsw: core_thermal: Report valid current state during cooling device registration Petr Machata
2024-07-02  7:27   ` Przemek Kitszel
2024-07-09 16:06   ` Rafael J. Wysocki
2024-07-01 16:41 ` [PATCH net-next 3/3] mlxsw: pci: Lock configuration space of upstream bridge during reset Petr Machata
2024-07-02  7:35   ` Przemek Kitszel
2024-07-03 14:42     ` Ido Schimmel
2024-07-12 21:21       ` Bjorn Helgaas
2024-07-14 11:29         ` Ido Schimmel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).