Linux Documentation
* Re: [PATCH v10 07/21] gpu: nova-core: mm: Add TLB flush support
From: Alexandre Courbot @ 2026-04-08  7:40 UTC (permalink / raw)
  To: Matthew Brost
  Cc: Joel Fernandes, linux-kernel, Miguel Ojeda, Boqun Feng, Gary Guo,
	Bjorn Roy Baron, Benno Lossin, Andreas Hindborg, Alice Ryhl,
	Trevor Gross, Danilo Krummrich, Dave Airlie, Daniel Almeida,
	Koen Koning, dri-devel, rust-for-linux, Nikola Djukic,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Simona Vetter, Jonathan Corbet, Alex Deucher, Christian Koenig,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	Huang Rui, Matthew Auld, Lucas De Marchi, Thomas Hellstrom,
	Helge Deller, Alex Gaynor, Boqun Feng, John Hubbard,
	Alistair Popple, Timur Tabi, Edwin Peer, Andrea Righi,
	Andy Ritger, Zhi Wang, Balbir Singh, Philipp Stanner,
	Elle Rhumsaa, alexeyi, Eliot Courtney, joel, linux-doc, amd-gfx,
	intel-gfx, intel-xe, linux-fbdev
In-Reply-To: <adSSrZp6a551xNTu@gsse-cloud1.jf.intel.com>

On Tue Apr 7, 2026 at 2:14 PM JST, Matthew Brost wrote:
> On Mon, Apr 06, 2026 at 06:10:07PM -0400, Joel Fernandes wrote:
>> 
>> 
>> On 4/6/2026 5:24 PM, Joel Fernandes wrote:
>> > 
>> > 
>> > On 4/2/2026 1:59 AM, Matthew Brost wrote:
>> >> On Tue, Mar 31, 2026 at 05:20:34PM -0400, Joel Fernandes wrote:
>> >>> Add TLB (Translation Lookaside Buffer) flush support for GPU MMU.
>> >>>
>> >>> After modifying page table entries, the GPU's TLB must be invalidated
>> >>> to ensure the new mappings take effect. The Tlb struct provides flush
>> >>> functionality through BAR0 registers.
>> >>>
>> >>> The flush operation writes the page directory base address and triggers
>> >>> an invalidation, polling for completion with a 2 second timeout matching
>> >>> the Nouveau driver.
>> >>>
>> >>> Cc: Nikola Djukic <ndjukic@nvidia.com>
>> >>> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
>> >>> ---
>> >>>  drivers/gpu/nova-core/mm.rs     |  1 +
>> >>>  drivers/gpu/nova-core/mm/tlb.rs | 95 +++++++++++++++++++++++++++++++++
>> >>>  drivers/gpu/nova-core/regs.rs   | 42 +++++++++++++++
>> >>>  3 files changed, 138 insertions(+)
>> >>>  create mode 100644 drivers/gpu/nova-core/mm/tlb.rs
>> >>>
>> >>> diff --git a/drivers/gpu/nova-core/mm.rs b/drivers/gpu/nova-core/mm.rs
>> >>> index 8f3089a5fa88..cfe9cbe11d57 100644
>> >>> --- a/drivers/gpu/nova-core/mm.rs
>> >>> +++ b/drivers/gpu/nova-core/mm.rs
>> >>> @@ -5,6 +5,7 @@
>> >>>  #![expect(dead_code)]
>> >>>  
>> >>>  pub(crate) mod pramin;
>> >>> +pub(crate) mod tlb;
>> >>>  
>> >>>  use kernel::sizes::SZ_4K;
>> >>>  
>> >>> diff --git a/drivers/gpu/nova-core/mm/tlb.rs b/drivers/gpu/nova-core/mm/tlb.rs
>> >>> new file mode 100644
>> >>> index 000000000000..cd3cbcf4c739
>> >>> --- /dev/null
>> >>> +++ b/drivers/gpu/nova-core/mm/tlb.rs
>> >>> @@ -0,0 +1,95 @@
>> >>> +// SPDX-License-Identifier: GPL-2.0
>> >>> +
>> >>> +//! TLB (Translation Lookaside Buffer) flush support for GPU MMU.
>> >>> +//!
>> >>> +//! After modifying page table entries, the GPU's TLB must be flushed to
>> >>> +//! ensure the new mappings take effect. This module provides TLB flush
>> >>> +//! functionality for virtual memory managers.
>> >>> +//!
>> >>> +//! # Example
>> >>> +//!
>> >>> +//! ```ignore
>> >>> +//! use crate::mm::tlb::Tlb;
>> >>> +//!
>> >>> +//! fn page_table_update(tlb: &Tlb, pdb_addr: VramAddress) -> Result<()> {
>> >>> +//!     // ... modify page tables ...
>> >>> +//!
>> >>> +//!     // Flush TLB to make changes visible (polls for completion).
>> >>> +//!     tlb.flush(pdb_addr)?;
>> >>> +//!
>> >>> +//!     Ok(())
>> >>> +//! }
>> >>> +//! ```
>> >>> +
>> >>> +use kernel::{
>> >>> +    devres::Devres,
>> >>> +    io::poll::read_poll_timeout,
>> >>> +    io::Io,
>> >>> +    new_mutex,
>> >>> +    prelude::*,
>> >>> +    sync::{
>> >>> +        Arc,
>> >>> +        Mutex, //
>> >>> +    },
>> >>> +    time::Delta, //
>> >>> +};
>> >>> +
>> >>> +use crate::{
>> >>> +    driver::Bar0,
>> >>> +    mm::VramAddress,
>> >>> +    regs, //
>> >>> +};
>> >>> +
>> >>> +/// TLB manager for GPU translation buffer operations.
>> >>> +#[pin_data]
>> >>> +pub(crate) struct Tlb {
>> >>> +    bar: Arc<Devres<Bar0>>,
>> >>> +    /// TLB flush serialization lock: This lock is acquired during the
>> >>> +    /// DMA fence signalling critical path. It must NEVER be held across any
>> >>> +    /// reclaimable CPU memory allocations because the memory reclaim path can
>> >>> +    /// call `dma_fence_wait()`, which would deadlock with this lock held.
>> >>> +    #[pin]
>> >>> +    lock: Mutex<()>,
>> >>> +}
>> >>> +
>> >>> +impl Tlb {
>> >>> +    /// Create a new TLB manager.
>> >>> +    pub(super) fn new(bar: Arc<Devres<Bar0>>) -> impl PinInit<Self> {
>> >>> +        pin_init!(Self {
>> >>> +            bar,
>> >>> +            lock <- new_mutex!((), "tlb_flush"),
>> >>> +        })
>> >>> +    }
>> >>> +
>> >>> +    /// Flush the GPU TLB for a specific page directory base.
>> >>> +    ///
>> >>> +    /// This invalidates all TLB entries associated with the given PDB address.
>> >>> +    /// Must be called after modifying page table entries to ensure the GPU sees
>> >>> +    /// the updated mappings.
>> >>> +    pub(crate) fn flush(&self, pdb_addr: VramAddress) -> Result {
>> >>
>> >> This landed on my list randomly, so I took a look.
>> >>
>> >> Wouldn’t you want to virtualize the invalidation based on your device?
>> >> For example, what if the register interface changes on future hardware?
>> > 
>> > Good point, for future hardware it indeed makes sense. I will do that.
>> Actually, as far as I can see into the future, the register definitions
>> for TLB invalidation are the same, so we are good and I will not be
>> making any change in this regard.
>> 
>> But, thanks for raising the point and forcing me to double check!
>> 
>
> Not my driver, but this looks like a classic “works now” change that may
> not hold up later, which is why I replied to something that isn’t really
> my business.
>
> Again, not my area, but I’ve been through this before. Generally,
> getting the abstractions right up front pays off.

Our policy so far has been to introduce abstractions when there is a
justifiable hardware difference within the (reduced) set of GPUs that we
support. Adding HALs implies using virtual calls, which are harder to
trace, or monomorphization using generic types, which complicates the
code (and is hard to justify when there is only one implementation).

`Tlb` is already an abstraction of sorts; the register accesses are
well-contained within this leaf module, so if the need arises to support
a different scheme this can be done transparently to callers, which is
what matters imho.
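
To illustrate the containment argument, here is a minimal standalone sketch,
not nova-core code. The `FlushScheme` enum, the `RegWrite` type, and every
register offset and value are made up for illustration; register writes are
recorded rather than performed so the example runs without hardware. The
point is only that a second register scheme can be added inside the leaf
module while `Tlb::flush()` keeps the same signature for callers.

```rust
/// Hypothetical register schemes. Only `Current` corresponds to anything
/// discussed in the thread; `Future` is a placeholder for a different layout.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum FlushScheme {
    Current,
    Future,
}

/// A register write, recorded instead of issued (assumption for the sketch).
#[derive(Debug, PartialEq)]
pub struct RegWrite {
    pub offset: u32,
    pub value: u32,
}

/// Leaf abstraction: callers never see which scheme is in use.
pub struct Tlb {
    scheme: FlushScheme,
}

impl Tlb {
    pub fn new(scheme: FlushScheme) -> Self {
        Self { scheme }
    }

    /// Same signature regardless of the scheme; offsets below are invented.
    pub fn flush(&self, pdb_addr: u64) -> Vec<RegWrite> {
        // Page directory base expressed in 4K units (illustrative encoding).
        let pdb = (pdb_addr >> 12) as u32;
        match self.scheme {
            FlushScheme::Current => vec![
                RegWrite { offset: 0x100, value: pdb },
                RegWrite { offset: 0x104, value: 1 },
            ],
            FlushScheme::Future => vec![
                RegWrite { offset: 0x200, value: pdb },
                RegWrite { offset: 0x208, value: 0x8000_0001 },
            ],
        }
    }
}
```

A caller written against `Tlb::flush()` would not need to change if the
`Future` arm were added later, which is the property being argued for.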


* Re: [PATCH] docs: proc: document ProtectionKey in smaps
From: David Hildenbrand (Arm) @ 2026-04-08  7:39 UTC (permalink / raw)
  To: Kevin Brodsky, Randy Dunlap, Dave Hansen, linux-doc
  Cc: linux-kernel, Yury Khrustalev, Jonathan Corbet, Shuah Khan,
	Dave Hansen, Andrew Morton, Lorenzo Stoakes, Vlastimil Babka,
	Mark Rutland, linux-fsdevel, linux-mm
In-Reply-To: <fee59f61-cf62-4a60-9d8a-4543a02a9c48@arm.com>

On 4/8/26 09:15, Kevin Brodsky wrote:
> On 08/04/2026 09:05, David Hildenbrand (Arm) wrote:
>>> I think that "system" is too nebulous there, so I would prefer to see
>>> "hardware" instead.
>> What if you're running in a VM where the feature is hidden ... ?
> 
> Of course that's also possible, "hardware" has to be interpreted in the
> context of virtualisation... But granted it is possible to hide features
> even on the host with the right kernel parameter, on arm64 at least.
> 
> "If the kernel supports protection keys (pkeys) and the hardware feature
> is detected"? Still vague but a little more accurate.

Can we just talk about CPU support, to avoid using "system" or "hardware" ?

-- 
Cheers,

David


* Re: [PATCH v10 01/21] gpu: nova-core: gsp: Return GspStaticInfo from boot()
From: Alexandre Courbot @ 2026-04-08  7:34 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, Miguel Ojeda, Boqun Feng, Gary Guo, Bjorn Roy Baron,
	Benno Lossin, Andreas Hindborg, Alice Ryhl, Trevor Gross,
	Danilo Krummrich, Dave Airlie, Daniel Almeida, Koen Koning,
	dri-devel, rust-for-linux, Nikola Djukic, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	Jonathan Corbet, Alex Deucher, Christian Koenig, Jani Nikula,
	Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin, Huang Rui,
	Matthew Auld, Matthew Brost, Lucas De Marchi, Thomas Hellstrom,
	Helge Deller, Alex Gaynor, Boqun Feng, John Hubbard,
	Alistair Popple, Timur Tabi, Edwin Peer, Andrea Righi,
	Andy Ritger, Zhi Wang, Balbir Singh, Philipp Stanner,
	Elle Rhumsaa, alexeyi, Eliot Courtney, joel, linux-doc, amd-gfx,
	intel-gfx, intel-xe, linux-fbdev
In-Reply-To: <20260331212048.2229260-2-joelagnelf@nvidia.com>

On Wed Apr 1, 2026 at 6:20 AM JST, Joel Fernandes wrote:
> Refactor the GSP boot function to return only the GspStaticInfo,
> removing the FbLayout from the return tuple.

We are not returning the `FbLayout` - that bit was introduced in an
earlier revision of this series and is not in the original code.


* Re: [PATCH v10 02/21] gpu: nova-core: gsp: Extract usable FB region from GSP
From: Alexandre Courbot @ 2026-04-08  7:33 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, Miguel Ojeda, Boqun Feng, Gary Guo, Bjorn Roy Baron,
	Benno Lossin, Andreas Hindborg, Alice Ryhl, Trevor Gross,
	Danilo Krummrich, Dave Airlie, Daniel Almeida, Koen Koning,
	dri-devel, rust-for-linux, Nikola Djukic, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter,
	Jonathan Corbet, Alex Deucher, Christian Koenig, Jani Nikula,
	Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin, Huang Rui,
	Matthew Auld, Matthew Brost, Lucas De Marchi, Thomas Hellstrom,
	Helge Deller, Alex Gaynor, Boqun Feng, John Hubbard,
	Alistair Popple, Timur Tabi, Edwin Peer, Andrea Righi,
	Andy Ritger, Zhi Wang, Balbir Singh, Philipp Stanner,
	Elle Rhumsaa, alexeyi, Eliot Courtney, joel, linux-doc, amd-gfx,
	intel-gfx, intel-xe, linux-fbdev
In-Reply-To: <20260331212048.2229260-3-joelagnelf@nvidia.com>

On Wed Apr 1, 2026 at 6:20 AM JST, Joel Fernandes wrote:
> Add first_usable_fb_region() to GspStaticConfigInfo to extract the first
> usable FB region from GSP's fbRegionInfoParams. Usable regions are those
> that are not reserved or protected.
>
> The extracted region is stored in GetGspStaticInfoReply and exposed as
> usable_fb_region field for use by the memory subsystem.
>
> Cc: Nikola Djukic <ndjukic@nvidia.com>
> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
> ---
>  drivers/gpu/nova-core/gsp/commands.rs    | 11 ++++--
>  drivers/gpu/nova-core/gsp/fw/commands.rs | 44 +++++++++++++++++++++++-
>  2 files changed, 52 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/nova-core/gsp/commands.rs b/drivers/gpu/nova-core/gsp/commands.rs
> index c89c7b57a751..41742c1633c8 100644
> --- a/drivers/gpu/nova-core/gsp/commands.rs
> +++ b/drivers/gpu/nova-core/gsp/commands.rs
> @@ -4,6 +4,7 @@
>      array,
>      convert::Infallible,
>      ffi::FromBytesUntilNulError,
> +    ops::Range,
>      str::Utf8Error, //
>  };
>  
> @@ -189,22 +190,28 @@ fn init(&self) -> impl Init<Self::Command, Self::InitError> {
>      }
>  }
>  
> -/// The reply from the GSP to the [`GetGspInfo`] command.
> +/// The reply from the GSP to the [`GetGspStaticInfo`] command.
>  pub(crate) struct GetGspStaticInfoReply {
>      gpu_name: [u8; 64],
> +    /// Usable FB (VRAM) region for driver memory allocation.
> +    #[expect(dead_code)]
> +    pub(crate) usable_fb_region: Range<u64>,

Let's print the region when creating the GPU (using `dev_info` or
`dev_dbg`) - this can be useful to the user, and lets us remove this
`dead_code`.
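
A minimal standalone sketch of what such a message could contain. The helper
name `describe_fb_region` and the message wording are assumptions, not part
of the patch; in the driver the resulting text would presumably be passed to
`dev_info!` or `dev_dbg!` rather than built as a `String`.

```rust
use std::ops::Range;

/// Hypothetical helper: renders a usable FB region as a log-style message.
fn describe_fb_region(region: &Range<u64>) -> String {
    // saturating_sub avoids an underflow panic on a malformed (end < start) range.
    let size_mib = region.end.saturating_sub(region.start) >> 20;
    format!(
        "usable FB region: {:#x}..{:#x} ({} MiB)",
        region.start, region.end, size_mib
    )
}
```

Reading the field for this purpose is also what would allow the
`#[expect(dead_code)]` attribute to be dropped.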


* Re: [PATCH] docs: proc: document ProtectionKey in smaps
From: Kevin Brodsky @ 2026-04-08  7:15 UTC (permalink / raw)
  To: David Hildenbrand (Arm), Randy Dunlap, Dave Hansen, linux-doc
  Cc: linux-kernel, Yury Khrustalev, Jonathan Corbet, Shuah Khan,
	Dave Hansen, Andrew Morton, Lorenzo Stoakes, Vlastimil Babka,
	Mark Rutland, linux-fsdevel, linux-mm
In-Reply-To: <8a5e4afd-cd0a-400a-8624-79c1dc9e3ff3@kernel.org>

On 08/04/2026 09:05, David Hildenbrand (Arm) wrote:
>>>> To me "system" is a bit ambiguous here but _can_ refer to the whole
>>>> hardware/software system as a whole. To avoid redundancy, I'd say either:
>>>>
>>>> 	If both the kernel and the processor support protection keys...
>>>>
>>>> or
>>>>
>>>> 	If the system supports protection keys...
>>> I see your point. By "system" I essentially mean the hardware (the SoC).
>>> In general I would tend to avoid "processor" because not all CPUs in a
>>> system necessarily have the same features, and some features require
>>> hardware support beyond the CPU itself. Terminology is hard...
>>>
>>> Happy to replace "system" with "hardware" if that's clearer 🙂
>> I think that "system" is too nebulous there, so I would prefer to see
>> "hardware" instead.
> What if you're running in a VM where the feature is hidden ... ?

Of course that's also possible, "hardware" has to be interpreted in the
context of virtualisation... But granted it is possible to hide features
even on the host with the right kernel parameter, on arm64 at least.

"If the kernel supports protection keys (pkeys) and the hardware feature
is detected"? Still vague but a little more accurate.

- Kevin


* Re: [PATCH] docs: proc: document ProtectionKey in smaps
From: Kevin Brodsky @ 2026-04-08  7:06 UTC (permalink / raw)
  To: Randy Dunlap, Dave Hansen, linux-doc
  Cc: linux-kernel, Yury Khrustalev, Jonathan Corbet, Shuah Khan,
	Dave Hansen, Andrew Morton, Lorenzo Stoakes, Vlastimil Babka,
	David Hildenbrand, Mark Rutland, linux-fsdevel, linux-mm
In-Reply-To: <18e2042c-d414-40fb-8819-5e930d5b1584@infradead.org>

On 07/04/2026 20:58, Randy Dunlap wrote:
>>> To me "system" is a bit ambiguous here but _can_ refer to the whole
>>> hardware/software system as a whole. To avoid redundancy, I'd say either:
>>>
>>> 	If both the kernel and the processor support protection keys...
>>>
>>> or
>>>
>>> 	If the system supports protection keys...
>> I see your point. By "system" I essentially mean the hardware (the SoC).
>> In general I would tend to avoid "processor" because not all CPUs in a
>> system necessarily have the same features, and some features require
>> hardware support beyond the CPU itself. Terminology is hard...
>>
>> Happy to replace "system" with "hardware" if that's clearer 🙂
> I think that "system" is too nebulous there, so I would prefer to see
> "hardware" instead.

Ack, will send a v2.

- Kevin


* Re: [PATCH] docs: proc: document ProtectionKey in smaps
From: David Hildenbrand (Arm) @ 2026-04-08  7:05 UTC (permalink / raw)
  To: Randy Dunlap, Kevin Brodsky, Dave Hansen, linux-doc
  Cc: linux-kernel, Yury Khrustalev, Jonathan Corbet, Shuah Khan,
	Dave Hansen, Andrew Morton, Lorenzo Stoakes, Vlastimil Babka,
	Mark Rutland, linux-fsdevel, linux-mm
In-Reply-To: <18e2042c-d414-40fb-8819-5e930d5b1584@infradead.org>

On 4/7/26 20:58, Randy Dunlap wrote:
> 
> 
> On 4/7/26 8:12 AM, Kevin Brodsky wrote:
>> On 07/04/2026 16:42, Dave Hansen wrote:
>>> I think you're trying to get across the point here that the kernel needs
>>> to know about protection keys, have it enabled, and be running on a CPU
>>> with pkey support.
>>
>> Indeed.
>>
>>> To me "system" is a bit ambiguous here but _can_ refer to the whole
>>> hardware/software system as a whole. To avoid redundancy, I'd say either:
>>>
>>> 	If both the kernel and the processor support protection keys...
>>>
>>> or
>>>
>>> 	If the system supports protection keys...
>>
>> I see your point. By "system" I essentially mean the hardware (the SoC).
>> In general I would tend to avoid "processor" because not all CPUs in a
>> system necessarily have the same features, and some features require
>> hardware support beyond the CPU itself. Terminology is hard...
>>
>> Happy to replace "system" with "hardware" if that's clearer :)
> 
> I think that "system" is too nebulous there, so I would prefer to see
> "hardware" instead.

What if you're running in a VM where the feature is hidden ... ?
-- 
Cheers,

David


* Re: [PATCH v10 12/21] gpu: nova-core: mm: Add unified page table entry wrapper enums
From: Alexandre Courbot @ 2026-04-08  7:03 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Eliot Courtney, linux-kernel, Miguel Ojeda, Boqun Feng, Gary Guo,
	Bjorn Roy Baron, Benno Lossin, Andreas Hindborg, Alice Ryhl,
	Trevor Gross, Danilo Krummrich, Dave Airlie, Daniel Almeida,
	Koen Koning, dri-devel, rust-for-linux, Nikola Djukic,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Simona Vetter, Jonathan Corbet, Alex Deucher, Christian Koenig,
	Jani Nikula, Joonas Lahtinen, Rodrigo Vivi, Tvrtko Ursulin,
	Huang Rui, Matthew Auld, Matthew Brost, Lucas De Marchi,
	Thomas Hellstrom, Helge Deller, Alex Gaynor, Boqun Feng,
	John Hubbard, Alistair Popple, Timur Tabi, Edwin Peer,
	Andrea Righi, Andy Ritger, Zhi Wang, Balbir Singh,
	Philipp Stanner, Elle Rhumsaa, alexeyi, joel, linux-doc, amd-gfx,
	intel-gfx, intel-xe, linux-fbdev
In-Reply-To: <537a8c5a-3885-4c47-99f6-963b48ddf87d@nvidia.com>

On Tue Apr 7, 2026 at 10:59 PM JST, Joel Fernandes wrote:
> Hi Eliot,
>
> On 4/7/2026 9:42 AM, Eliot Courtney wrote:
>> On Tue Apr 7, 2026 at 6:55 AM JST, Joel Fernandes wrote:
>>>>> +    /// Compute upper bound on page table pages needed for `num_virt_pages`.
>>>>> +    ///
>>>>> +    /// Walks from PTE level up through PDE levels, accumulating the tree.
>>>>> +    pub(crate) fn pt_pages_upper_bound(&self, num_virt_pages: usize) -> usize {
>>>>> +        let mut total = 0;
>>>>> +
>>>>> +        // PTE pages at the leaf level.
>>>>> +        let pte_epp = self.entries_per_page(self.pte_level());
>>>>> +        let mut pages_at_level = num_virt_pages.div_ceil(pte_epp);
>>>>> +        total += pages_at_level;
>>>>> +
>>>>> +        // Walk PDE levels bottom-up (reverse of pde_levels()).
>>>>> +        for &level in self.pde_levels().iter().rev() {
>>>>> +            let epp = self.entries_per_page(level);
>>>>> +
>>>>> +            // How many pages at this level do we need to point to
>>>>> +            // the previous pages_at_level?
>>>>> +            pages_at_level = pages_at_level.div_ceil(epp);
>>>>> +            total += pages_at_level;
>>>>> +        }
>>>>> +
>>>>> +        total
>>>>> +    }
>>>>> +}
>>>>> +
>>>>
>>>> We have a lot of matches on the MMU version here (and below in Pte, Pde,
>>>> DualPde). What about making MmuVersion into a trait (e.g. Mmu) with
>>>> associated types for Pte, Pde, DualPde which can implement traits
>>>> defining their common operations too?
>>>
>>> I coded this up and it did not look pretty, there's not much LOC savings and the
>>> code becomes harder to read because of parametrization of several functions. Also:
>> 
>> Thanks for looking into it. Sorry to be a bother, but would you have a
>> branch around with the code? I'm curious what didn't look good about it.
>
> Sorry but I already mentioned that above, the parameterizing of dozens of
> function call sites, 3-4 new traits (because each struct like
> Pte/Pde/DualPde etc each need their own trait which different MMU versions
> implement) etc. The code becomes hard to read and readability is the top
> criterion for me - I am personally strictly against "Let's use shiny
> features in the language at the cost of making code unreadable". Because that
> translates into bugs and a nightmare for maintainability.

After a quick look I'd say that having a trait here would actually be
*good* for correctness and maintainability.

The current design implies that every operation on a page table (most
likely using the walker) goes through a branching point. Just looking at
`PtWalk::read_pte_at_level`, there are already at least 6
`if version == 2 { } else { }` branches that all resolve to the same
result. Include walking down the PDEs and you have at least a dozen of
these just to resolve a virtual address. I know CPUs are fast, but this
is still wasted cycles for no good reason.

If you use a trait here, and make `PtWalk` generic over it, you can
optimize this away. We had a similar situation when we introduced Turing
support and the v2 ucode header, and tried both approaches: the
trait-based one was slightly shorter, and arguably more readable.

But the main argument to use a trait here IMO is that it enables
associated types and constants. That's particularly critical since some
equivalent fields have different lengths between v2 and v3. An
associated `Bounded` type for these would force the caller to validate
the length of these fields before calling a non-fallible operation,
which is exactly the level of caution that we want when dealing with
page tables.

In order to fully benefit from it, we will need the bitfield macro from
the `kernel` crate so the PDE/PTE fields can be `Bounded`, I will try to
make it available quickly in a patch that you can depend on.

But long story short, and although I need to dive deeper into the code,
this looks like a good candidate for using a trait and associated types.
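
A minimal standalone sketch of the shape this could take, under stated
assumptions: the trait name `Mmu`, the marker types `MmuV2` and `MmuV3`, the
bit widths, the level counts, and the PTE encoding are all invented for
illustration and do not reflect real hardware layouts or the series' code.
It uses a plain `Option` where the kernel's `Bounded` type would go, since
that depends on the bitfield macro mentioned above.

```rust
use std::marker::PhantomData;

/// Hypothetical per-version MMU description (all values are illustrative).
pub trait Mmu {
    /// Width of the frame-number field in a PTE for this version.
    const PTE_ADDR_BITS: u32;
    /// Number of page table levels for this version.
    const LEVELS: usize;

    /// Encode a frame number, refusing values that do not fit the field.
    fn encode_pte(frame: u64) -> Option<u64> {
        if frame >> Self::PTE_ADDR_BITS != 0 {
            return None;
        }
        // Made-up layout: frame number above bit 8, bit 0 as the valid bit.
        Some((frame << 8) | 1)
    }
}

pub struct MmuV2;
pub struct MmuV3;

impl Mmu for MmuV2 {
    const PTE_ADDR_BITS: u32 = 25;
    const LEVELS: usize = 5;
}

impl Mmu for MmuV3 {
    const PTE_ADDR_BITS: u32 = 40;
    const LEVELS: usize = 6;
}

/// Walker generic over the MMU version: the version is resolved at compile
/// time, so there is no per-operation `if version == 2` branch.
pub struct PtWalk<M: Mmu> {
    _mmu: PhantomData<M>,
}

impl<M: Mmu> PtWalk<M> {
    pub fn new() -> Self {
        Self { _mmu: PhantomData }
    }

    pub fn levels(&self) -> usize {
        M::LEVELS
    }

    pub fn make_pte(&self, frame: u64) -> Option<u64> {
        M::encode_pte(frame)
    }
}
```

With this shape, a frame number that is too wide for one version is rejected
at the encoding boundary, while the same value may be accepted by another
version, without any runtime version check in the walker itself.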


* [PATCH 0/4] docs/zh_CN: update rust/ subsystem translations
From: Ben Guo @ 2026-04-08  5:05 UTC (permalink / raw)
  To: Alex Shi, Yanteng Si, Dongliang Mu, Jonathan Corbet
  Cc: linux-doc, linux-kernel, rust-for-linux, Ben Guo

Update Chinese translations for the Rust subsystem documentation,
syncing with the latest upstream changes.

- arch-support.rst: add ARM (ARMv7) support, update RISC-V and UM notes
- coding-guidelines.rst: add imports formatting, private item docs,
  C FFI types, and Lints sections
- quick-start.rst: add distro-specific install instructions, update
  rustc/bindgen sections, remove cargo section
- index.rst: remove experimental notice and genindex

Ben Guo (4):
  docs/zh_CN: update rust/arch-support.rst translation
  docs/zh_CN: update rust/coding-guidelines.rst translation
  docs/zh_CN: update rust/quick-start.rst translation
  docs/zh_CN: update rust/index.rst translation

 .../translations/zh_CN/rust/arch-support.rst  |   9 +-
 .../zh_CN/rust/coding-guidelines.rst          | 262 +++++++++++++++++-
 .../translations/zh_CN/rust/index.rst         |  17 --
 .../translations/zh_CN/rust/quick-start.rst   | 190 ++++++++++---
 4 files changed, 401 insertions(+), 77 deletions(-)

-- 
2.53.0


* [PATCH 4/4] docs/zh_CN: update rust/index.rst translation
From: Ben Guo @ 2026-04-08  5:05 UTC (permalink / raw)
  To: Alex Shi, Yanteng Si, Dongliang Mu, Jonathan Corbet
  Cc: linux-doc, linux-kernel, rust-for-linux, Ben Guo
In-Reply-To: <cover.1775619061.git.ben.guo@openatom.club>

Update the translation of .../rust/index.rst into Chinese.

Update the translation through commit a592a36e4937
("Documentation: use a source-read extension for the index link boilerplate")

Reviewed-by: Dongliang Mu <dzm91@hust.edu.cn>
Signed-off-by: Ben Guo <ben.guo@openatom.club>
---
 Documentation/translations/zh_CN/rust/index.rst | 17 -----------------
 1 file changed, 17 deletions(-)

diff --git a/Documentation/translations/zh_CN/rust/index.rst b/Documentation/translations/zh_CN/rust/index.rst
index 5347d472958..138e057bee4 100644
--- a/Documentation/translations/zh_CN/rust/index.rst
+++ b/Documentation/translations/zh_CN/rust/index.rst
@@ -12,16 +12,6 @@ Rust
 
 与内核中的Rust有关的文档。若要开始在内核中使用Rust,请阅读 quick-start.rst 指南。
 
-Rust 实验
----------
-Rust 支持在 v6.1 版本中合并到主线,以帮助确定 Rust 作为一种语言是否适合内核,
-即是否值得进行权衡。
-
-目前,Rust 支持主要面向对 Rust 支持感兴趣的内核开发人员和维护者,
-以便他们可以开始处理抽象和驱动程序,并帮助开发基础设施和工具。
-
-如果您是终端用户,请注意,目前没有适合或旨在生产使用的内置驱动程序或模块,
-并且 Rust 支持仍处于开发/实验阶段,尤其是对于特定内核配置。
 
 代码文档
 --------
@@ -50,10 +40,3 @@ Rust 支持在 v6.1 版本中合并到主线,以帮助确定 Rust 作为一种
     testing
 
 你还可以在 :doc:`../../../process/kernel-docs` 中找到 Rust 的学习材料。
-
-.. only::  subproject and html
-
-   Indices
-   =======
-
-   * :ref:`genindex`
-- 
2.53.0


* [PATCH 2/4] docs/zh_CN: update rust/coding-guidelines.rst translation
From: Ben Guo @ 2026-04-08  5:05 UTC (permalink / raw)
  To: Alex Shi, Yanteng Si, Dongliang Mu, Jonathan Corbet
  Cc: linux-doc, linux-kernel, rust-for-linux, Ben Guo
In-Reply-To: <cover.1775619061.git.ben.guo@openatom.club>

Update the translation of .../rust/coding-guidelines.rst into Chinese.

Update the translation through commit 4a9cb2eecc78
("docs: rust: add section on imports formatting")

Reviewed-by: Dongliang Mu <dzm91@hust.edu.cn>
Signed-off-by: Ben Guo <ben.guo@openatom.club>
---
 .../zh_CN/rust/coding-guidelines.rst          | 262 +++++++++++++++++-
 1 file changed, 248 insertions(+), 14 deletions(-)

diff --git a/Documentation/translations/zh_CN/rust/coding-guidelines.rst b/Documentation/translations/zh_CN/rust/coding-guidelines.rst
index 419143b938e..54b902322db 100644
--- a/Documentation/translations/zh_CN/rust/coding-guidelines.rst
+++ b/Documentation/translations/zh_CN/rust/coding-guidelines.rst
@@ -37,6 +37,73 @@
 像内核其他部分的 ``clang-format`` 一样, ``rustfmt`` 在单个文件上工作,并且不需要
 内核配置。有时,它甚至可以与破碎的代码一起工作。
 
+导入
+~~~~
+
+``rustfmt`` 默认会以一种在合并和变基时容易产生冲突的方式格式化导入,因为在某些情况下
+它会将多个条目合并到同一行。例如:
+
+.. code-block:: rust
+
+	// Do not use this style.
+	use crate::{
+	    example1,
+	    example2::{example3, example4, example5},
+	    example6, example7,
+	    example8::example9,
+	};
+
+相反,内核使用如下所示的垂直布局:
+
+.. code-block:: rust
+
+	use crate::{
+	    example1,
+	    example2::{
+	        example3,
+	        example4,
+	        example5, //
+	    },
+	    example6,
+	    example7,
+	    example8::example9, //
+	};
+
+也就是说,每个条目占一行,只要列表中有多个条目就使用花括号。
+
+末尾的空注释可以保留这种格式。不仅如此, ``rustfmt`` 在添加空注释后实际上会将导入重
+新格式化为垂直布局。也就是说,可以通过对如下输入运行 ``rustfmt`` 来轻松地将原始示例
+重新格式化为预期的风格:
+
+.. code-block:: rust
+
+	// Do not use this style.
+	use crate::{
+	    example1,
+	    example2::{example3, example4, example5, //
+	    },
+	    example6, example7,
+	    example8::example9, //
+	};
+
+末尾的空注释适用于嵌套导入(如上所示)以及单条目导入——这有助于最小化补丁系列中的差
+异:
+
+.. code-block:: rust
+
+	use crate::{
+	    example1, //
+	};
+
+末尾的空注释可以放在花括号内的任何一行中,但建议放在最后一个条目上,因为这让人联想到其
+他格式化工具中的末尾逗号。有时在补丁系列中由于列表的变更,避免多次移动注释可能更简单。
+
+在某些情况下可能需要例外处理,即以上都不是硬性规则。也有一些代码尚未迁移到这种风格,但
+请不要引入其他风格的代码。
+
+最终目标是让 ``rustfmt`` 在稳定版本中自动支持这种格式化风格(或类似的风格),而无需
+末尾的空注释。因此,在某个时候,目标是移除这些注释。
+
 
 注释
 ----
@@ -77,6 +144,16 @@
 	    // ...
 	}
 
+这适用于公共和私有项目。这增加了与公共项目的一致性,允许在更改可见性时减少涉及的更改,
+并允许我们将来也为私有项目生成文档。换句话说,如果为私有项目编写了文档,那么仍然应该使
+用 ``///`` 。例如:
+
+.. code-block:: rust
+
+	/// My private function.
+	// TODO: ...
+	fn f() {}
+
 一种特殊的注释是 ``// SAFETY:`` 注释。这些注释必须出现在每个 ``unsafe`` 块之前,它们
 解释了为什么该块内的代码是正确/健全的,即为什么它在任何情况下都不会触发未定义行为,例如:
 
@@ -131,27 +208,27 @@ https://commonmark.org/help/
 
 这个例子展示了一些 ``rustdoc`` 的特性和内核中遵循的一些惯例:
 
-  - 第一段必须是一个简单的句子,简要地描述被记录的项目的作用。进一步的解释必须放在额
-    外的段落中。
+- 第一段必须是一个简单的句子,简要地描述被记录的项目的作用。进一步的解释必须放在额
+  外的段落中。
 
-  - 不安全的函数必须在 ``# Safety`` 部分记录其安全前提条件。
+- 不安全的函数必须在 ``# Safety`` 部分记录其安全前提条件。
 
-  - 虽然这里没有显示,但如果一个函数可能会恐慌,那么必须在 ``# Panics`` 部分描述发
-    生这种情况的条件。
+- 虽然这里没有显示,但如果一个函数可能会恐慌,那么必须在 ``# Panics`` 部分描述发
+  生这种情况的条件。
 
-    请注意,恐慌应该是非常少见的,只有在有充分理由的情况下才会使用。几乎在所有的情况下,
-    都应该使用一个可失败的方法,通常是返回一个 ``Result``。
+  请注意,恐慌应该是非常少见的,只有在有充分理由的情况下才会使用。几乎在所有的情况下,
+  都应该使用一个可失败的方法,通常是返回一个 ``Result``。
 
-  - 如果提供使用实例对读者有帮助的话,必须写在一个叫做``# Examples``的部分。
+- 如果提供使用实例对读者有帮助的话,必须写在一个叫做``# Examples``的部分。
 
-  - Rust项目(函数、类型、常量……)必须有适当的链接(``rustdoc`` 会自动创建一个
-    链接)。
+- Rust项目(函数、类型、常量……)必须有适当的链接(``rustdoc`` 会自动创建一个
+  链接)。
 
-  - 任何 ``unsafe`` 的代码块都必须在前面加上一个 ``// SAFETY:`` 的注释,描述里面
-    的代码为什么是正确的。
+- 任何 ``unsafe`` 的代码块都必须在前面加上一个 ``// SAFETY:`` 的注释,描述里面
+  的代码为什么是正确的。
 
-    虽然有时原因可能看起来微不足道,但写这些注释不仅是记录已经考虑到的问题的好方法,
-    最重要的是,它提供了一种知道没有额外隐含约束的方法。
+  虽然有时原因可能看起来微不足道,但写这些注释不仅是记录已经考虑到的问题的好方法,
+  最重要的是,它提供了一种知道没有额外隐含约束的方法。
 
 要了解更多关于如何编写Rust和拓展功能的文档,请看看 ``rustdoc`` 这本书,网址是:
 
@@ -170,6 +247,22 @@ https://commonmark.org/help/
        /// [`struct mutex`]: srctree/include/linux/mutex.h
 
 
+C FFI 类型
+----------
+
+Rust 内核代码使用类型别名(如 ``c_int``)来引用 C 类型(如 ``int``),这些别名可
+以直接从 ``kernel`` 预导入(prelude)中获取。请不要使用 ``core::ffi`` 中的别
+名——它们可能无法映射到正确的类型。
+
+这些别名通常应该直接通过其标识符引用,即作为单段路径。例如:
+
+.. code-block:: rust
+
+	fn f(p: *const c_char) -> c_int {
+	    // ...
+	}
+
+
 命名
 ----
 
@@ -202,3 +295,144 @@ Rust内核代码遵循通常的Rust命名空间:
 
 也就是说, ``GPIO_LINE_DIRECTION_IN`` 的等价物将被称为 ``gpio::LineDirection::In`` 。
 特别是,它不应该被命名为 ``gpio::gpio_line_direction::GPIO_LINE_DIRECTION_IN`` 。
+
+
+代码检查提示(Lints)
+---------------------
+
+在 Rust 中,可以在局部 ``allow`` 特定的警告(诊断信息、代码检查提示(lint)),
+使编译器忽略给定函数、模块、代码块等中给定警告的实例。
+
+这类似于 C 中的 ``#pragma GCC diagnostic push`` + ``ignored`` + ``pop``
+[#]_:
+
+.. code-block:: c
+
+	#pragma GCC diagnostic push
+	#pragma GCC diagnostic ignored "-Wunused-function"
+	static void f(void) {}
+	#pragma GCC diagnostic pop
+
+.. [#] 在这个特定情况下,可以使用内核的 ``__{always,maybe}_unused`` 属性
+       (C23 的 ``[[maybe_unused]]``);然而,此示例旨在反映下文讨论的 Rust 中
+       的等效代码检查提示。
+
+但要简洁得多:
+
+.. code-block:: rust
+
+	#[allow(dead_code)]
+	fn f() {}
+
+凭借这一点,可以更方便地默认启用更多诊断(即在 ``W=`` 级别之外)。特别是那些可能有
+一些误报但在其他方面非常有用的诊断,保持启用可以捕获潜在的错误。
+
+在此基础上,Rust 提供了 ``expect`` 属性,更进一步。如果警告没有产生,它会让编译器
+发出警告。例如,以下代码将确保当 ``f()`` 在某处被调用时,我们必须移除该属性:
+
+.. code-block:: rust
+
+	#[expect(dead_code)]
+	fn f() {}
+
+如果我们不这样做,编译器会发出警告::
+
+	warning: this lint expectation is unfulfilled
+	 --> x.rs:3:10
+	  |
+	3 | #[expect(dead_code)]
+	  |          ^^^^^^^^^
+	  |
+	  = note: `#[warn(unfulfilled_lint_expectations)]` on by default
+
+这意味着 ``expect`` 不会在不需要时被遗忘,这可能发生在以下几种情况中:
+
+- 开发过程中添加的临时属性。
+
+- 编译器、Clippy 或自定义工具中代码检查提示的改进可能消除误报。
+
+- 当代码检查提示不再需要时,因为预期它会在某个时候被移除,例如上面的
+  ``dead_code`` 示例。
+
+这也增加了剩余 ``allow`` 的可见性,并减少了误用的可能性。
+
+因此,优先使用 ``expect`` 而不是 ``allow``,除非:
+
+- 条件编译在某些情况下触发警告,在其他情况下不触发。
+
+  如果与总的相比,只有少数情况触发(或不触发)警告,那么可以考虑使用条件
+  ``expect``(即 ``cfg_attr(..., expect(...))``)。否则,使用 ``allow`` 可
+  能更简单。
+
+- 在宏内部,不同的调用可能会创建在某些情况下触发警告而在其他情况下不触发的展开代码。
+
+- 当代码可能在某些架构上触发警告但在其他架构上不触发时,例如到 C FFI 类型的 ``as``
+  转换。
+
+作为一个更详细的示例,考虑以下程序:
+
+.. code-block:: rust
+
+	fn g() {}
+
+	fn main() {
+	    #[cfg(CONFIG_X)]
+	    g();
+	}
+
+这里,如果 ``CONFIG_X`` 未设置,函数 ``g()`` 是死代码。我们可以在这里使用
+``expect`` 吗?
+
+.. code-block:: rust
+
+	#[expect(dead_code)]
+	fn g() {}
+
+	fn main() {
+	    #[cfg(CONFIG_X)]
+	    g();
+	}
+
+如果 ``CONFIG_X`` 被设置,这将产生代码检查提示,因为在该配置中它不是死代码。因
+此,在这种情况下,我们不能直接使用 ``expect``。
+
+一个简单的可能性是使用 ``allow``:
+
+.. code-block:: rust
+
+	#[allow(dead_code)]
+	fn g() {}
+
+	fn main() {
+	    #[cfg(CONFIG_X)]
+	    g();
+	}
+
+另一种方法是使用条件 ``expect``:
+
+.. code-block:: rust
+
+	#[cfg_attr(not(CONFIG_X), expect(dead_code))]
+	fn g() {}
+
+	fn main() {
+	    #[cfg(CONFIG_X)]
+	    g();
+	}
+
+这将确保如果有人在某处引入了对 ``g()`` 的另一个调用(例如无条件的),那么将会被发现
+它不再是死代码。然而, ``cfg_attr`` 比简单的 ``allow`` 更复杂。
+
+因此,当涉及多个配置或者代码检查提示可能由于非局部更改(如 ``dead_code``)而触发
+时,使用条件 ``expect`` 可能不值得。
+
+有关 Rust 中诊断的更多信息,请参阅:
+
+	https://doc.rust-lang.org/stable/reference/attributes/diagnostics.html
+
+错误处理
+--------
+
+有关 Rust for Linux 特定错误处理的背景和指南,请参阅:
+
+	https://rust.docs.kernel.org/kernel/error/type.Result.html#error-codes-in-c-and-rust
-- 
2.53.0


* [PATCH 3/4] docs/zh_CN: update rust/quick-start.rst translation
From: Ben Guo @ 2026-04-08  5:05 UTC (permalink / raw)
  To: Alex Shi, Yanteng Si, Dongliang Mu, Jonathan Corbet
  Cc: linux-doc, linux-kernel, rust-for-linux, Ben Guo
In-Reply-To: <cover.1775619061.git.ben.guo@openatom.club>

Update the translation of .../rust/quick-start.rst into Chinese.

Update the translation through commit 5935461b4584
("docs: rust: quick-start: add Debian 13 (Trixie)")

Reviewed-by: Dongliang Mu <dzm91@hust.edu.cn>
Signed-off-by: Ben Guo <ben.guo@openatom.club>
---
 .../translations/zh_CN/rust/quick-start.rst   | 190 ++++++++++++++----
 1 file changed, 148 insertions(+), 42 deletions(-)

diff --git a/Documentation/translations/zh_CN/rust/quick-start.rst b/Documentation/translations/zh_CN/rust/quick-start.rst
index 8616556ae4d..5f0ece6411f 100644
--- a/Documentation/translations/zh_CN/rust/quick-start.rst
+++ b/Documentation/translations/zh_CN/rust/quick-start.rst
@@ -13,16 +13,138 @@
 
 本文介绍了如何开始使用Rust进行内核开发。
 
+安装内核开发所需的 Rust 工具链有几种方式。一种简单的方式是使用 Linux 发行版的软件包
+(如果它们合适的话)——下面的第一节解释了这种方法。这种方法的一个优势是,通常发行版会
+匹配 Rust 和 Clang 所使用的 LLVM。
+
+另一种方式是使用 `kernel.org <https://kernel.org/pub/tools/llvm/rust/>`_ 上提
+供的预构建稳定版本的 LLVM+Rust。这些与 :ref:`获取 LLVM <zh_cn_getting_llvm>` 中的精
+简快速 LLVM 工具链相同,并添加了 Rust for Linux 支持的 Rust 版本。提供了两套工具
+链:"最新 LLVM" 和 "匹配 LLVM"(请参阅链接了解更多信息)。
+
+或者,接下来的两个 "依赖" 章节将解释每个组件以及如何通过 ``rustup``、Rust 的独立
+安装程序或从源码构建来安装它们。
+
+本文档的其余部分解释了有关如何入门的其他方面。
+
+
+发行版
+------
+
+Arch Linux
+**********
+
+Arch Linux 提供较新的 Rust 版本,因此通常开箱即用,例如::
+
+	pacman -S rust rust-src rust-bindgen
+
+
+Debian
+******
+
+Debian 13(Trixie)以及 Testing 和 Debian Unstable(Sid)提供较新的 Rust 版
+本,因此通常开箱即用,例如::
+
+	apt install rustc rust-src bindgen rustfmt rust-clippy
+
+
+Fedora Linux
+************
+
+Fedora Linux 提供较新的 Rust 版本,因此通常开箱即用,例如::
+
+	dnf install rust rust-src bindgen-cli rustfmt clippy
+
+
+Gentoo Linux
+************
+
+Gentoo Linux(尤其是 testing 分支)提供较新的 Rust 版本,因此通常开箱即用,
+例如::
+
+	USE='rust-src rustfmt clippy' emerge dev-lang/rust dev-util/bindgen
+
+可能需要设置 ``LIBCLANG_PATH``。
+
+
+Nix
+***
+
+Nix(unstable 频道)提供较新的 Rust 版本,因此通常开箱即用,例如::
+
+	{ pkgs ? import <nixpkgs> {} }:
+	pkgs.mkShell {
+	  nativeBuildInputs = with pkgs; [ rustc rust-bindgen rustfmt clippy ];
+	  RUST_LIB_SRC = "${pkgs.rust.packages.stable.rustPlatform.rustLibSrc}";
+	}
+
+
+openSUSE
+********
+
+openSUSE Slowroll 和 openSUSE Tumbleweed 提供较新的 Rust 版本,因此通常开箱
+即用,例如::
+
+	zypper install rust rust1.79-src rust-bindgen clang
+
+
+Ubuntu
+******
+
+25.04
+~~~~~
+
+最新的 Ubuntu 版本提供较新的 Rust 版本,因此通常开箱即用,例如::
+
+	apt install rustc rust-src bindgen rustfmt rust-clippy
+
+此外,需要设置 ``RUST_LIB_SRC``,例如::
+
+	RUST_LIB_SRC=/usr/src/rustc-$(rustc --version | cut -d' ' -f2)/library
+
+为方便起见,可以将 ``RUST_LIB_SRC`` 导出到全局环境中。
+
+
+24.04 LTS 及更早版本
+~~~~~~~~~~~~~~~~~~~~
+
+虽然 Ubuntu 24.04 LTS 及更早版本仍然提供较新的 Rust 版本,但它们需要一些额外的配
+置,使用带版本号的软件包,例如::
+
+	apt install rustc-1.80 rust-1.80-src bindgen-0.65 rustfmt-1.80 \
+		rust-1.80-clippy
+	ln -s /usr/lib/rust-1.80/bin/rustfmt /usr/bin/rustfmt-1.80
+	ln -s /usr/lib/rust-1.80/bin/clippy-driver /usr/bin/clippy-driver-1.80
+
+这些软件包都不会将其工具设置为默认值;因此应该显式指定它们,例如::
+
+	make LLVM=1 RUSTC=rustc-1.80 RUSTDOC=rustdoc-1.80 RUSTFMT=rustfmt-1.80 \
+		CLIPPY_DRIVER=clippy-driver-1.80 BINDGEN=bindgen-0.65
+
+或者,修改 ``PATH`` 变量将 Rust 1.80 的二进制文件放在前面,并将 ``bindgen`` 设
+置为默认值,例如::
+
+	PATH=/usr/lib/rust-1.80/bin:$PATH
+	update-alternatives --install /usr/bin/bindgen bindgen \
+		/usr/bin/bindgen-0.65 100
+	update-alternatives --set bindgen /usr/bin/bindgen-0.65
+
+使用带版本号的软件包时需要设置 ``RUST_LIB_SRC``,例如::
+
+	RUST_LIB_SRC=/usr/src/rustc-$(rustc-1.80 --version | cut -d' ' -f2)/library
+
+为方便起见,可以将 ``RUST_LIB_SRC`` 导出到全局环境中。
+
+此外, ``bindgen-0.65`` 在较新的版本(24.04 LTS 和 24.10)中可用,但在更早的版
+本(20.04 LTS 和 22.04 LTS)中可能不可用,因此可能需要手动构建 ``bindgen``
+(请参见下文)。
+
 
 构建依赖
 --------
 
 本节描述了如何获取构建所需的工具。
 
-其中一些依赖也许可以从Linux发行版中获得,包名可能是 ``rustc`` , ``rust-src`` ,
-``rust-bindgen`` 等。然而,在写这篇文章的时候,它们很可能还不够新,除非发行版跟踪最
-新的版本。
-
 为了方便检查是否满足要求,可以使用以下目标::
 
 	make LLVM=1 rustavailable
@@ -34,15 +156,14 @@
 rustc
 *****
 
-需要一个特定版本的Rust编译器。较新的版本可能会也可能不会工作,因为就目前而言,内核依赖
-于一些不稳定的Rust特性。
+需要一个较新版本的Rust编译器。
 
 如果使用的是 ``rustup`` ,请进入内核编译目录(或者用 ``--path=<build-dir>`` 参数
-来 ``设置`` sub-command)并运行::
+来 ``设置`` sub-command),例如运行::
 
-	rustup override set $(scripts/min-tool-version.sh rustc)
+	rustup override set stable
 
-+这将配置你的工作目录使用正确版本的 ``rustc``,而不影响你的默认工具链。
+这将配置你的工作目录使用给定版本的 ``rustc``,而不影响你的默认工具链。
 
 请注意覆盖应用当前的工作目录(和它的子目录)。
 
@@ -54,7 +175,7 @@ rustc
 Rust标准库源代码
 ****************
 
-Rust标准库的源代码是必需的,因为构建系统会交叉编译 ``core`` 和 ``alloc`` 。
+Rust标准库的源代码是必需的,因为构建系统会交叉编译 ``core`` 。
 
 如果正在使用 ``rustup`` ,请运行::
 
@@ -64,10 +185,10 @@ Rust标准库的源代码是必需的,因为构建系统会交叉编译 ``core
 
 否则,如果使用独立的安装程序,可以将Rust源码树下载到安装工具链的文件夹中::
 
-       curl -L "https://static.rust-lang.org/dist/rust-src-$(scripts/min-tool-version.sh rustc).tar.gz" |
-               tar -xzf - -C "$(rustc --print sysroot)/lib" \
-               "rust-src-$(scripts/min-tool-version.sh rustc)/rust-src/lib/" \
-               --strip-components=3
+	curl -L "https://static.rust-lang.org/dist/rust-src-$(rustc --version | cut -d' ' -f2).tar.gz" |
+		tar -xzf - -C "$(rustc --print sysroot)/lib" \
+		"rust-src-$(rustc --version | cut -d' ' -f2)/rust-src/lib/" \
+		--strip-components=3
 
 在这种情况下,以后升级Rust编译器版本需要手动更新这个源代码树(这可以通过移除
 ``$(rustc --print sysroot)/lib/rustlib/src/rust`` ,然后重新执行上
@@ -97,24 +218,21 @@ Linux发行版中可能会有合适的包,所以最好先检查一下。
 bindgen
 *******
 
-内核的C端绑定是在构建时使用 ``bindgen`` 工具生成的。这需要特定的版本。
-
-通过以下方式安装它(注意,这将从源码下载并构建该工具)::
-
-	cargo install --locked --version $(scripts/min-tool-version.sh bindgen) bindgen-cli
+内核的C端绑定是在构建时使用 ``bindgen`` 工具生成的。
 
-``bindgen`` 需要找到合适的 ``libclang`` 才能工作。如果没有找到(或者找到的
-``libclang`` 与应该使用的 ``libclang`` 不同),则可以使用 ``clang-sys``
-理解的环境变量(Rust绑定创建的 ``bindgen`` 用来访问 ``libclang``):
+例如,通过以下方式安装它(注意,这将从源码下载并构建该工具)::
 
+	cargo install --locked bindgen-cli
 
-* ``LLVM_CONFIG_PATH`` 可以指向一个 ``llvm-config`` 可执行文件。
+``bindgen`` 使用 ``clang-sys`` crate 来查找合适的 ``libclang`` (可以静态链
+接、动态链接或在运行时加载)。默认情况下,上面的 ``cargo`` 命令会生成一个在运行时
+加载 ``libclang`` 的 ``bindgen`` 二进制文件。如果没有找到(或者应该使用与找到的
+不同的 ``libclang``),可以调整该过程,例如使用 ``LIBCLANG_PATH`` 环境变量。详
+情请参阅 ``clang-sys`` 的文档:
 
-* 或者 ``LIBCLANG_PATH`` 可以指向 ``libclang`` 共享库或包含它的目录。
+	https://github.com/KyleMayes/clang-sys#linking
 
-* 或者 ``CLANG_PATH`` 可以指向 ``clang`` 可执行文件。
-
-详情请参阅 ``clang-sys`` 的文档:
+	https://github.com/KyleMayes/clang-sys#environment-variables
 
 
 开发依赖
@@ -151,18 +269,6 @@ clippy
 独立的安装程序也带有 ``clippy`` 。
 
 
-cargo
-*****
-
-``cargo`` 是Rust的本地构建系统。目前需要它来运行测试,因为它被用来构建一个自定义的标准
-库,其中包含了内核中自定义 ``alloc`` 所提供的设施。测试可以使用 ``rusttest`` Make 目标
-来运行。
-
-如果使用的是 ``rustup`` ,所有的配置文件都已经安装了该工具,因此不需要再做什么。
-
-独立的安装程序也带有 ``cargo`` 。
-
-
 rustdoc
 *******
 
@@ -223,7 +329,7 @@ Rust支持(CONFIG_RUST)需要在 ``General setup`` 菜单中启用。在其
 如果使用的是GDB/Binutils,而Rust符号没有被demangled,原因是工具链还不支持Rust的新v0
 mangling方案。有几个办法可以解决:
 
-  - 安装一个较新的版本(GDB >= 10.2, Binutils >= 2.36)。
+- 安装一个较新的版本(GDB >= 10.2, Binutils >= 2.36)。
 
-  - 一些版本的GDB(例如vanilla GDB 10.1)能够使用嵌入在调试信息(``CONFIG_DEBUG_INFO``)
-    中的pre-demangled的名字。
+- 一些版本的GDB(例如vanilla GDB 10.1)能够使用嵌入在调试信息(``CONFIG_DEBUG_INFO``)
+  中的pre-demangled的名字。
-- 
2.53.0

^ permalink raw reply related

* [PATCH 1/4] docs/zh_CN: update rust/arch-support.rst translation
From: Ben Guo @ 2026-04-08  5:05 UTC (permalink / raw)
  To: Alex Shi, Yanteng Si, Dongliang Mu, Jonathan Corbet
  Cc: linux-doc, linux-kernel, rust-for-linux, Ben Guo
In-Reply-To: <cover.1775619061.git.ben.guo@openatom.club>

Update the translation of .../rust/arch-support.rst into Chinese.

Update the translation through commit ccb8ce526807
("ARM: 9441/1: rust: Enable Rust support for ARMv7")

Reviewed-by: Dongliang Mu <dzm91@hust.edu.cn>
Signed-off-by: Ben Guo <ben.guo@openatom.club>
---
 Documentation/translations/zh_CN/rust/arch-support.rst | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/Documentation/translations/zh_CN/rust/arch-support.rst b/Documentation/translations/zh_CN/rust/arch-support.rst
index abd708d48f8..f5ae44588a5 100644
--- a/Documentation/translations/zh_CN/rust/arch-support.rst
+++ b/Documentation/translations/zh_CN/rust/arch-support.rst
@@ -19,9 +19,10 @@
 =============  ================  ==============================================
 架构           支持水平           限制因素
 =============  ================  ==============================================
-``arm64``      Maintained        只有小端序
+``arm``        Maintained        仅 ARMv7 小端序。
+``arm64``      Maintained        仅小端序。
 ``loongarch``  Maintained        \-
-``riscv``      Maintained        只有 ``riscv64``
-``um``         Maintained        只有 ``x86_64``
-``x86``        Maintained        只有 ``x86_64``
+``riscv``      Maintained        仅 ``riscv64``,且仅限 LLVM/Clang。
+``um``         Maintained        \-
+``x86``        Maintained        仅 ``x86_64``。
 =============  ================  ==============================================
-- 
2.53.0

^ permalink raw reply related

* Re: [PATCH v2 00/16] fs,x86/resctrl: Add kernel-mode (e.g., PLZA) support to the resctrl subsystem
From: Reinette Chatre @ 2026-04-08  4:45 UTC (permalink / raw)
  To: Babu Moger, corbet@lwn.net, tony.luck@intel.com,
	Dave.Martin@arm.com, james.morse@arm.com, tglx@kernel.org,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com
  Cc: skhan@linuxfoundation.org, x86@kernel.org, hpa@zytor.com,
	peterz@infradead.org, juri.lelli@redhat.com,
	vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
	rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
	vschneid@redhat.com, kas@kernel.org, rick.p.edgecombe@intel.com,
	akpm@linux-foundation.org, pmladek@suse.com,
	rdunlap@infradead.org, dapeng1.mi@linux.intel.com,
	kees@kernel.org, elver@google.com, paulmck@kernel.org,
	lirongqing@baidu.com, safinaskar@gmail.com, fvdl@google.com,
	seanjc@google.com, pawan.kumar.gupta@linux.intel.com,
	xin@zytor.com, tiala@microsoft.com, Neeraj.Upadhyay@amd.com,
	chang.seok.bae@intel.com, Lendacky, Thomas,
	elena.reshetova@intel.com, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev,
	kvm@vger.kernel.org, eranian@google.com, peternewman@google.com
In-Reply-To: <c6f574b7-fe5f-49ae-9865-0e4dbb2f9803@amd.com>

Hi Babu,

On 4/7/26 6:01 PM, Babu Moger wrote:
> Hi Reinette,
> 
> On 4/7/26 12:48, Reinette Chatre wrote:
>> Hi Babu,
>>
>> On 4/6/26 3:45 PM, Babu Moger wrote:
>>> Hi Reinette,
>>>
>>> Sorry for the late response. I was trying to get confirmation about the use case.
>>
>> No problem. I appreciate that you did this so that we can make sure resctrl supports
>> needed use cases.
>>
>>>
>>> On 3/31/26 17:24, Reinette Chatre wrote:
>>>> On 3/30/26 11:46 AM, Babu Moger wrote:
>>>>> On 3/27/26 17:11, Reinette Chatre wrote:
>>>>>> On 3/26/26 10:12 AM, Babu Moger wrote:
>>>>>>> On 3/24/26 17:51, Reinette Chatre wrote:
>>>>>>>> On 3/12/26 1:36 PM, Babu Moger wrote:
>>
>>>> can have domains that span different CPUs. There thus seem to be a built in assumption of what a "domain"
>>>> means for PQR_PLZA_ASSOC so it sounds to me as though, instead of saying that "PQR_PLZA_ASSOC needs
>>>> to be the same in QoS domain" it may be more accurate to, for example, say that "PQR_PLZA_ASSOC has L3 scope"?
>>>
>>> Yes.
>>
>> Above is about L3 scope ...
> 
> Yes. The scope for PQR_PLZA_ASSOC is L3.
> 
> Is that what you are asking here?

I was trying to point out that there appears to be a mismatch between the actual scope and
the planned implementation. As highlighted below during the discussion about "global" this is
fine with me and I just wanted to confirm that this matches your intentions.

> 
>>  
>>>>
>>>> This seems to be what this implementation does since it hardcodes PQR_PLZA_ASSOC scope to the L3
>>>> resource but that creates dependency to the L3 resource that would make PLZA unusable if, for example,
>>>> the user boots with "rdt=!l3cat" while wanting to use PLZA to manage MBA allocations when in kernel?
>>>
>>> Yes. that is correct. It should not be attached to one resource. We need to change it to global scope.
>>
>> Can I interpret "global scope" as "all online CPUs"? Doing so will simplify
> 
> Yes. That is correct.
> 
> 
>> supporting this feature. It does not sound practical for a user wanting to assign
>> different resource groups to kernel work done in different domains ... the guidance should
>> instead be to just set the allocations of one resource group to what is needed in the different
>> domains? There may be more flexibility when supporting per-domain RMIDs though but so far
>> it sounds as though the focus is global. We can consider what needs to be done to support
>> some type of "per-domain" assignment as exercise whether current interface could support it
>> in the future.
> 
> Yes. Makes sense.
> 
>>

...

>>> The PLZA MSR is updated when user changes the association to the
>>> file. No context switch code changes are needed. This will be
>>> dedicated group. The current resctrl group files, "cpus, cpus_list
>>
>> Why does this have to be a dedicated group? One of the conclusions from v1
>> discussion was that the "PLZA group" need *not* be a dedicated group. I repeated that
>> in my earlier response that I left quoted above. You did not respond to these
>> conclusions and statements in this regard while you keep coming back to this
>> needing to be a dedicated group without providing a motivation to do so.
>> Could you please elaborate why a dedicated group is required?
> 
> If the same group applies identical limits to both user and kernel
> space, it essentially behaves like a current resctrl group. In that
> sense, it’s not really a PLZA group. PLZA’s key value is the ability
> to separate allocations between user space and kernel space. A

The plan has never been to force identical allocations for user and kernel
space since that would go against this feature entirely. Even so, just as
user and kernel space cannot be forced to have identical allocations they
also cannot be forced to have different allocations. Specifically,
a task *can* use the same CLOSID for user and kernel space work just as easily
as it can use *different* CLOSID for user and kernel space work. There
should not be any CLOSID reserved just for kernel work. Or am I missing something?

> single CPU can belong to two groups: one group manages the user-
> space allocation for that CPU, while another manages the kernel-mode
> allocation.

Exactly. This is why it is important to have two files for this CPU association
within a resource group. The cpus/cpus_list file continues to be used as today
while the new kernel_mode_cpus/kernel_mode_cpus_list is used for kernel work.
With this a task can be associated with any resource group for its user space
allocations but when it runs on one of the CPUs within kernel_mode_cpus then
its kernel work will be done with allocations of the resource group the
kernel_mode_cpus file belongs to, which may or may not be the same
resource group that the user space task belongs to.

> This approach also simplifies file handling, which is another reason
> I prefer it.

I *think* we have different interpretations of "dedicated group":
It sounds as though you interpret "dedicated group" as a way that enforces
the same allocations to user space and kernel work.
I interpret "dedicated group" essentially as a CLOSID reserved for kernel
work. Since I do not see why resctrl should dedicate a CLOSID/resource group
to kernel work, I have been pushing against such a "dedicated group".

> That said, I’m open to not having a dedicated group if we can still support all the features that PLZA provides without it.

I find that enabling user space to share a CLOSID/RMID between user space
and kernel space does indeed support what PLZA provides. I think I am missing
something here since the proposal below again attempts to isolate a resource
group (CLOSID) for kernel work.

>>> Add a file, "info/kmode_monitor", to describe how kmode is monitored.
>>>
>>> # cat info/kmode_monitor
>>> [inherit_ctrl_and_mon] <- Kernel uses the same CLOSID/RMID as user. Default option for the "global"
>>> assign_ctrl_inherit_mon <- One CLOSID for all kernel work; RMID inherited from user.
>>> assign_ctrl_assign_mon <- One resource group (CLOSID+RMID) for all kernel work. Default option for "cpu" type.
>>
>> My first thought is that the naming is confusing. resctrl has a very strong relationship between
>> "RMID" and "monitoring" so naming a file "monitor" that deals with allocation/ctrl/CLOSID is
>> potentially confusion.
>>
>> Apart from that, while I think I understand where you are going by separating the mode into
>> two files I am concerned about future complications needing to accommodate all different
>> combinations of the (now) essentially two modes. My preference is thus to keep this simple by
>> keeping the mode within one file.
>>
>> Even so, when stepping back, it does not really look like we need to separate the "global"
>> and "per CPU" modes. We could just have a single "per CPU" mode and the "global" is just
>> its default of "all CPUs", no?
> 
> Yes. That correct.
> 
>>
>> Consider, for example, the implementation just consisting of:
>>
>>     # cat info/kernel_mode
>>     [inherit_ctrl_and_mon]
>>     global_assign_ctrl_inherit_mon_per_cpu
>>     global_assign_ctrl_assign_mon_per_cpu
>>  
>>>
>>> Rename “kernel_mode_assignment” to “kmode_group” to assign the specific group to kmode. This file usage is same as before.
>>>
>>> #cat info/kmode_groups (Renamed "kernel_mode_assignment")
>>> //
>>
>> Please consider the intent of this file when thinking about names. The idea is that "info/kernel_mode"
>> specifies the "mode" of how kernel work is handled and it determines the configuration files used in that
>> mode as well as the syntax when interacting with those files. By renaming "kernel_mode_assignment" to
>> "kmode_groups" it implicitly requires all future kernel mode enhancements to need some data related to "groups".
>>
>> In summary, I think this can be simplified by introducing just two new files in info/ that enables the
>> user to (a) select and (b) configure the "kernel mode". To start there can be just two modes,
>> global_assign_ctrl_inherit_mon_per_cpu and global_assign_ctrl_assign_mon_per_cpu.
>> global_assign_ctrl_inherit_mon_per_cpu mode requires a control group in kernel_mode_assignment while
>> global_assign_ctrl_assign_mon_per_cpu requires a control and monitoring group.
>>
>> The resource group in info/kernel_mode_assignment gets two additional files "kernel_mode_cpus" and
>> "kernel_mode_cpus_list" that contains the CPUs enabled with the kernel mode configuration, by default
>> it will be all online CPUs. The resource group can continue to be used to manage allocations of and
>> monitor user space tasks. Specifically, the "cpus", "cpus_list", and "tasks" files remain.
>>
>> A user wanting just "global" settings will get just that when writing the group to
>> info/kernel_mode_assignment. A user wanting "per CPU" settings can follow the
>> info/kernel_mode_assignment setting with changes to that resource group's kernel_mode_cpus/kernel_mode_cpus_list
>> files. Any task running on a CPU that is *not* in kernel_mode_cpus/kernel_mode_cpus_list can be
>> expected to inherit both CLOSID and RMID from user space for all kernel work.
> 
> After further consideration, I don’t think the info/kernel_mode file
> is necessary. There’s no need to enforce a specific mode for all the
> PLZA groups. Avoiding this constraint makes the design more
> flexible, particularly as we move toward supporting multiple PLZA
> groups in the future. MPAM already appears capable of handling more
> than one group—for example, one group could use
> inherit_ctrl_and_mon, while another could use
> global_assign_ctrl_inherit_mon_per_cpu.

You are looking ahead at future capabilities for which we do not know all requirements
at this time. I think it is very good to consider how things may progress and your example
of MPAM is of course on point. I believe the current design does consider this progression.
Please see https://lore.kernel.org/lkml/2ab556af-095b-422b-9396-f845c6fd0342@intel.com/ 
(search for "per_group_assign_ctrl_assign_mon"). In that exploration per-group assignment
is actually accomplished with global files. I thus think we should not make such a big
architectural decision that does not benefit the immediate feature using partial information.
As it is, a "info/kernel_mode" gives the flexibility to expand to, if needed, configuration
files within a resource group. That is why the intention is to associate the mode within
info/kernel_mode with the presence/absence of info/kernel_mode_assignment (search for
"Visibility depends on active mode in info/kernel_mode" in linked email) since in the
future resctrl may need to enable a mode that needs configuration files within each
resource group and when enabling such mode the per-resource group files will appear
instead of the global info/kernel_mode_assignment.

> 
> The mode can simply be determined on a per-group basis. We can introduce two new files—kernel_mode_cpus and kernel_mode_cpus_list—within each resctrl group when kmode (or PLZA) is supported.

I think having these files in every resource group is confusing since user can only interact
with these files in one resource group for current PLZA. Why not *just* have the files in the
resource group that matches the group in info/kernel_mode_assignment?
 
> 
> The info/kernel_mode_assignment file would indicate which resctrl
> group(or groups) is used for PLZA. The files—kernel_mode_cpus and
> kernel_mode_cpus_list would indicate how the plza is applied which
> each group.

The "how PLZA is applied" should be learned from info/kernel_mode, where user
space learns whether the RMID is inherited or not. I see kernel_mode_cpus
and kernel_mode_cpus_list as being just for configuration, and present only in
the resource group listed in info/kernel_mode_assignment.

> 
> Files and behavior:
> - cpus / cpus_list:
> 
> CPUs listed here use the same allocation for both user and kernel space.

Both user and kernel space?
Monitoring would depend on info/kernel_mode_assignment ("inherit_mon")
and kernel space allocation would depend on whether the CPU on which the task runs
can be found in kernel_mode_cpus, no?


> There is no change to the current semantics of these files.
> If these files are empty, the group effectively becomes a PLZA-dedicated group.

I do not see it this way. If the cpus/cpus_list files are empty then it means that the
tasks in the group will use their own CLOSID/RMID for user space allocation and
monitoring. What allocations/monitoring is used by tasks when in kernel mode depends
on whether the CPU the task is running on can be found in a kernel_mode_cpus/kernel_mode_cpus_list
file. If the CPU the task is running on can be found in a kernel_mode_cpus/kernel_mode_cpus_list
file then it will inherit the PQR_PLZA setting of that CPU, which is the allocation
associated with the resource group to which that kernel_mode_cpus/kernel_mode_cpus_list belongs.
If the CPU the task is running on cannot be found in kernel_mode_cpus/kernel_mode_cpus_list
then its kernel work will inherit its user space allocations and monitoring.
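
As a rough illustration of the rule above, here is a minimal userspace sketch.
It assumes a single group is listed in info/kernel_mode_assignment and models
kernel_mode_cpus as a 64-bit mask. The struct and function names are
hypothetical and do not correspond to existing resctrl code.

```c
#include <stdint.h>

/* Hypothetical model of the group named in info/kernel_mode_assignment. */
struct kmode_group {
	uint32_t closid;		/* allocation used for kernel work */
	uint64_t kernel_mode_cpus;	/* bit N set: CPU N is listed */
};

/*
 * Pick the CLOSID used while a task runs in kernel mode on @cpu.
 * If @cpu is listed in the group's kernel_mode_cpus, the group's CLOSID
 * applies; otherwise kernel work inherits the task's user space CLOSID.
 */
static uint32_t kernel_work_closid(uint32_t user_closid,
				   const struct kmode_group *grp,
				   unsigned int cpu)
{
	if (grp && cpu < 64 &&
	    (grp->kernel_mode_cpus & (UINT64_C(1) << cpu)))
		return grp->closid;

	return user_closid;
}
```

Note that nothing in this sketch prevents user_closid from being equal to
grp->closid, which is the point: the group is not reserved for kernel work.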

> 
> - kernel_mode_cpus / kernel_mode_cpus_list:
> 
> These files determine whether a separate kernel allocation is applied.
> If empty, user and kernel share the same allocation.
> If non-empty, the kernel uses a separate allocation.
> 
> The group can be CTL_MON or MON group. Based on type the group the CLOSID and RMID will be used to enable PLZA. If it is MON, then rmid_en = 1 when writing PLZA MSR.

This will be difficult to get right since CTRL_MON groups also have RMID assigned.

> Here’s the proposed flow:
> 
> # mount -t resctrl resctrl /sys/fs/resctrl/
> # cd /sys/fs/resctrl/
> # cat info/kernel_mode_assignment
> //
> 
> By default, the root (default) group is PLZA-enabled when resctrl is mounted. All CPUs use CLOSID 0 for both user and kernel-mode allocation.
> 
> # cat cpus_list
> 1-64
> # cat kmode_cpus_list
> 1-64
> 
> Next, create a new group for PLZA:
> 
> # mkdir plza_group
> 
> # echo "plza_group//" > info/kernel_mode_assignment
> 
> At this point, plza_group becomes the new PLZA-enabled group, and the PLZA-related MSRs are updated accordingly.

It really looks like you are getting back to trying to dedicate a resource group to
kernel work and that is not something that resctrl should enforce.

> 
> # cat plza_group/cpus_list
> <empty>
> 
> # cat plza_group/kmode_cpus_list
> 1-64
> 
> The user can then update kmode_cpus_list to apply PLZA only to a specific subset of CPUs, if desired.
> 
> 
> What do you think of this approach?

It is difficult to predict what the "next" PLZA will actually end up looking like, and I find resctrl creating a complicated
interface to support this to be risky. Instead I would prefer to focus on efficiently supporting what PLZA can do today
and make it extensible. Apart from that, I find the implicit interface, "If it is MON, then rmid_en = 1", to be too
architecture specific for a generic interface while also not being able to accurately capture the user's intent (i.e. the user may
indeed, for example, want "a CTRL_MON group to have rmid_en = 1"). Finally, I am just so confused about why the implementations
keep needing to dedicate a resource group/CLOSID to kernel work.

Reinette



^ permalink raw reply

* Re: [PATCH] crash: Support high memory reservation for range syntax
From: Sourabh Jain @ 2026-04-08  4:31 UTC (permalink / raw)
  To: Youling Tang, Andrew Morton, Baoquan He, Jonathan Corbet
  Cc: Vivek Goyal, Dave Young, kexec, linux-kernel, linux-doc,
	Youling Tang
In-Reply-To: <20260404074103.506793-1-youling.tang@linux.dev>

Hello Youling,

On 04/04/26 13:11, Youling Tang wrote:
> From: Youling Tang <tangyouling@kylinos.cn>
>
> The crashkernel range syntax (range1:size1[,range2:size2,...]) allows
> automatic size selection based on system RAM, but it always reserves
> from low memory. When a large crashkernel is selected, this can
> consume most of the low memory, causing subsequent hardware
> hotplug or drivers requiring low memory to fail due to allocation
> failures.


Support for high crashkernel reservation has been added to
address the above problem.

However, high crashkernel reservation is not supported with
range-based crashkernel kernel command-line arguments.
For example: crashkernel=0M-1G:100M,1G-4G:160M,4G-8G:192M

Many users, including some distributions, use range-based
crashkernel configuration. So, adding support for high crashkernel
reservation with range-based configuration would be useful.

>
> Add a new optional conditional suffix ",>boundary" to the crashkernel
> range syntax. When the selected crashkernel size exceeds the specified
> boundary, the kernel will automatically apply the same reservation
> policy as "crashkernel=size,high" - preferring high memory first
> and reserving the default low memory area.

I think this approach to enabling high crashkernel reservation
with range-based configuration makes the crashkernel command-line
argument more complex.

If the goal is to support high crashkernel reservation with
range-based kernel command-line arguments, how about:

crashkernel=<range1>:<size1>[,<range2>:<size2>,...][@offset],high

instead of using >boundary?
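
To make the difference concrete, here is a minimal userspace sketch (not the
kernel parser) of how the two suffixes would decide on a high reservation once
a range has matched. The struct and helper names are made up for illustration,
and the range match is assumed to be start <= RAM < end. The only behaviour
taken from the patch is the "selected size > boundary" comparison.

```c
#include <stdbool.h>
#include <stddef.h>

/* One "start-[end]:size" entry; end == 0 models an open-ended range. */
struct ck_range {
	unsigned long long start;
	unsigned long long end;
	unsigned long long size;
};

/* Return the size of the first range containing system_ram, or 0 if none. */
static unsigned long long ck_select_size(const struct ck_range *r, size_t n,
					 unsigned long long system_ram)
{
	for (size_t i = 0; i < n; i++) {
		if (system_ram >= r[i].start &&
		    (r[i].end == 0 || system_ram < r[i].end))
			return r[i].size;
	}
	return 0;
}

/* ",>boundary": prefer high memory only when the size exceeds boundary. */
static bool ck_high_by_boundary(unsigned long long size,
				unsigned long long boundary)
{
	return size > boundary;
}

/* ",high": prefer high memory for any non-zero selected size. */
static bool ck_high_by_suffix(unsigned long long size)
{
	return size != 0;
}
```

With the example from the patch (2G-16G:512M,16G-:1G,>512M), an 8G machine
selects 512M and stays with the normal reservation, while a 32G machine
selects 1G and goes high. With a plain ",high" suffix both would go high.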

>
> Syntax:
>      crashkernel=<range1>:<size1>[,<range2>:<size2>,...][@offset],>boundary
>
> Example:
>      crashkernel=2G-16G:512M,16G-:1G,>512M
>
> This means:
>    - For 2G-16G RAM: reserve 512M normally
>    - For >16G RAM: reserve 1G with high memory preference (since 1G > 512M)
>
> For systems with >16G RAM, 1G is selected which exceeds 512M, so it
> will be reserved from high memory instead of consuming 1G of
> precious low memory.
>
> Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
> ---
>   Documentation/admin-guide/kdump/kdump.rst     | 25 ++++++++-
>   .../admin-guide/kernel-parameters.txt         |  2 +-
>   kernel/crash_reserve.c                        | 56 ++++++++++++++++---
>   3 files changed, 73 insertions(+), 10 deletions(-)
>
> diff --git a/Documentation/admin-guide/kdump/kdump.rst b/Documentation/admin-guide/kdump/kdump.rst
> index 7587caadbae1..b5ae4556e9ca 100644
> --- a/Documentation/admin-guide/kdump/kdump.rst
> +++ b/Documentation/admin-guide/kdump/kdump.rst
> @@ -293,7 +293,28 @@ crashkernel syntax
>          2) if the RAM size is between 512M and 2G (exclusive), then reserve 64M
>          3) if the RAM size is larger than 2G, then reserve 128M
>   
> -3) crashkernel=size,high and crashkernel=size,low
> +3) range1:size1[,range2:size2,...][@offset],>boundary
> +   Optionally, the range list can be followed by a conditional suffix
> +   `,>boundary`. When the selected crashkernel size matches the
> +   condition, the kernel will reserve memory using the same policy as
> +   `crashkernel=size,high` (i.e. prefer high memory first and reserve the
> +   default low memory area).
> +
> +   The syntax is::
> +
> +        crashkernel=<range1>:<size1>[,<range2>:<size2>,...][@offset],>boundary
> +        range=start-[end]
> +
> +   For example::
> +
> +        crashkernel=2G-16G:512M,16G-:1G,>512M
> +
> +   This would mean:
> +       1) if the RAM size is between 2G and 16G (exclusive), then reserve 512M.
> +       2) if the RAM size is larger than 16G, allocation will behave like
> +          `crashkernel=1G,high`.
> +
> +4) crashkernel=size,high and crashkernel=size,low
>   
>      If memory above 4G is preferred, crashkernel=size,high can be used to
>      fulfill that. With it, physical memory is allowed to be allocated from top,
> @@ -311,7 +332,7 @@ crashkernel syntax
>   
>               crashkernel=0,low
>   
> -4) crashkernel=size,cma
> +5) crashkernel=size,cma
>   
>   	Reserve additional crash kernel memory from CMA. This reservation is
>   	usable by the first system's userspace memory and kernel movable
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 03a550630644..b2e1892ab4d8 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -1087,7 +1087,7 @@ Kernel parameters
>   			4G when '@offset' hasn't been specified.
>   			See Documentation/admin-guide/kdump/kdump.rst for further details.
>   
> -	crashkernel=range1:size1[,range2:size2,...][@offset]
> +	crashkernel=range1:size1[,range2:size2,...][@offset][,>boundary]
>   			[KNL] Same as above, but depends on the memory
>   			in the running system. The syntax of range is
>   			start-[end] where start and end are both
> diff --git a/kernel/crash_reserve.c b/kernel/crash_reserve.c
> index 62e60e0223cf..917738412390 100644
> --- a/kernel/crash_reserve.c
> +++ b/kernel/crash_reserve.c
> @@ -254,15 +254,47 @@ static __init char *get_last_crashkernel(char *cmdline,
>   	return ck_cmdline;
>   }
>   
> +/*
> + * This function parses command lines in the format
> + *
> + *   crashkernel=ramsize-range:size[,...][@offset],>boundary
> + */
> +static void __init parse_crashkernel_boundary(char *ck_cmdline,
> +					unsigned long long *boundary)
> +{
> +	char *cur = ck_cmdline, *next;
> +	char *first_gt = false;
> +
> +	first_gt = strchr(cur, '>');
> +	if (!first_gt)
> +		return;
> +
> +	cur = first_gt + 1;
> +	if (*cur == '\0' || *cur == ' ' || *cur == ',') {
> +		pr_warn("crashkernel: '>' specified without boundary size, ignoring\n");
> +		return;
> +	}
> +
> +	*boundary = memparse(cur, &next);
> +	if (cur == next) {
> +		pr_warn("crashkernel: invalid boundary size after '>'\n");
> +		return;
> +	}
> +}
> +
>   static int __init __parse_crashkernel(char *cmdline,
>   			     unsigned long long system_ram,
>   			     unsigned long long *crash_size,
>   			     unsigned long long *crash_base,
> -			     const char *suffix)
> +			     const char *suffix,
> +			     bool *high,
> +			     unsigned long long *low_size)
>   {
>   	char *first_colon, *first_space;
>   	char *ck_cmdline;
>   	char *name = "crashkernel=";
> +	unsigned long long boundary = 0;
> +	int ret;
>   
>   	BUG_ON(!crash_size || !crash_base);
>   	*crash_size = 0;
> @@ -283,10 +315,20 @@ static int __init __parse_crashkernel(char *cmdline,
>   	 */
>   	first_colon = strchr(ck_cmdline, ':');
>   	first_space = strchr(ck_cmdline, ' ');
> -	if (first_colon && (!first_space || first_colon < first_space))
> -		return parse_crashkernel_mem(ck_cmdline, system_ram,
> +	if (first_colon && (!first_space || first_colon < first_space)) {
> +		ret = parse_crashkernel_mem(ck_cmdline, system_ram,
>   				crash_size, crash_base);
>   
> +		/* Handle optional ',>boundary' condition for range ':' syntax only. */
> +		parse_crashkernel_boundary(ck_cmdline, &boundary);
> +		if (!ret && *crash_size > boundary) {
> +			*high = true;
> +			*low_size = DEFAULT_CRASH_KERNEL_LOW_SIZE;
> +		}
> +
> +		return ret;
> +	}
> +
>   	return parse_crashkernel_simple(ck_cmdline, crash_size, crash_base);
>   }
>   
> @@ -310,7 +352,7 @@ int __init parse_crashkernel(char *cmdline,
>   
>   	/* crashkernel=X[@offset] */
>   	ret = __parse_crashkernel(cmdline, system_ram, crash_size,
> -				crash_base, NULL);
> +				crash_base, NULL, high, low_size);
>   #ifdef CONFIG_ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION
>   	/*
>   	 * If non-NULL 'high' passed in and no normal crashkernel
> @@ -318,7 +360,7 @@ int __init parse_crashkernel(char *cmdline,
>   	 */
>   	if (high && ret == -ENOENT) {
>   		ret = __parse_crashkernel(cmdline, 0, crash_size,
> -				crash_base, suffix_tbl[SUFFIX_HIGH]);
> +				crash_base, suffix_tbl[SUFFIX_HIGH], high, low_size);
>   		if (ret || !*crash_size)
>   			return -EINVAL;
>   
> @@ -327,7 +369,7 @@ int __init parse_crashkernel(char *cmdline,
>   		 * is not allowed.
>   		 */
>   		ret = __parse_crashkernel(cmdline, 0, low_size,
> -				crash_base, suffix_tbl[SUFFIX_LOW]);
> +				crash_base, suffix_tbl[SUFFIX_LOW], high, low_size);
>   		if (ret == -ENOENT) {
>   			*low_size = DEFAULT_CRASH_KERNEL_LOW_SIZE;
>   			ret = 0;
> @@ -344,7 +386,7 @@ int __init parse_crashkernel(char *cmdline,
>   	 */
>   	if (cma_size)
>   		__parse_crashkernel(cmdline, 0, cma_size,
> -			&cma_base, suffix_tbl[SUFFIX_CMA]);
> +			&cma_base, suffix_tbl[SUFFIX_CMA], high, low_size);
>   #endif
>   	if (!*crash_size)
>   		ret = -EINVAL;


^ permalink raw reply
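
For readers following the quoted hunk, the optional `,>boundary` suffix handling can be sketched in user space as below. This is only an illustration under stated assumptions: `demo_memparse()` is a hypothetical stand-in for the kernel's `memparse()` that handles only K/M/G suffixes, `demo_parse_boundary()` is not the function from the patch, and the caller is assumed to pass a single `crashkernel=` value rather than the whole command line.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical user-space stand-in for the kernel's memparse():
 * parses a number followed by an optional K/M/G suffix.
 */
static unsigned long long demo_memparse(const char *s, char **end)
{
	unsigned long long v = strtoull(s, end, 0);

	switch (**end) {
	case 'G': case 'g':
		v <<= 10;
		/* fall through */
	case 'M': case 'm':
		v <<= 10;
		/* fall through */
	case 'K': case 'k':
		v <<= 10;
		(*end)++;
		break;
	default:
		break;
	}
	return v;
}

/*
 * Look for '>' in a single crashkernel value and parse the size after it.
 * Returns true and stores the size only when a valid boundary is present;
 * *boundary is left untouched otherwise.
 */
static bool demo_parse_boundary(const char *arg, unsigned long long *boundary)
{
	const char *gt;
	char *next;
	unsigned long long v;

	assert(arg != NULL && boundary != NULL);

	gt = strchr(arg, '>');
	if (!gt)
		return false;	/* no boundary given */

	gt++;
	if (*gt == '\0' || *gt == ' ' || *gt == ',')
		return false;	/* '>' without a size */

	v = demo_memparse(gt, &next);
	if (next == gt)
		return false;	/* nothing numeric after '>' */

	*boundary = v;
	return true;
}
```

With an input such as `1G-4G:512M,>4G` the sketch stores 4 GiB in `*boundary`; inputs without `>` or with an empty or non-numeric size are rejected and leave the output unchanged.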

* [PATCH net-next v04 4/6] hinic3: Add ethtool rss ops
From: Fan Gong @ 2026-04-08  4:03 UTC (permalink / raw)
  To: Fan Gong, Zhu Yikai, netdev, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Andrew Lunn,
	Ioana Ciornei, Mohsin Bashir
  Cc: linux-kernel, linux-doc, luosifu, Xin Guo, Zhou Shuai, Wu Like,
	Shi Jing, Zheng Jiezhen, Maxime Chevallier
In-Reply-To: <cover.1775618797.git.zhuyikai1@h-partners.com>

  Implement the following ethtool callbacks:
.get_rxnfc
.set_rxnfc
.get_channels
.set_channels
.get_rxfh_indir_size
.get_rxfh_key_size
.get_rxfh
.set_rxfh

  These callbacks allow users to configure and monitor RSS
parameters in detail through ethtool.

Co-developed-by: Zhu Yikai <zhuyikai1@h-partners.com>
Signed-off-by: Zhu Yikai <zhuyikai1@h-partners.com>
Signed-off-by: Fan Gong <gongfan1@huawei.com>
---
 .../ethernet/huawei/hinic3/hinic3_ethtool.c   |   9 +
 .../huawei/hinic3/hinic3_mgmt_interface.h     |   2 +
 .../net/ethernet/huawei/hinic3/hinic3_rss.c   | 487 +++++++++++++++++-
 .../net/ethernet/huawei/hinic3/hinic3_rss.h   |  19 +
 4 files changed, 515 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_ethtool.c b/drivers/net/ethernet/huawei/hinic3/hinic3_ethtool.c
index a4b2d5ba81f8..8cd7dd9da67b 100644
--- a/drivers/net/ethernet/huawei/hinic3/hinic3_ethtool.c
+++ b/drivers/net/ethernet/huawei/hinic3/hinic3_ethtool.c
@@ -15,6 +15,7 @@
 #include "hinic3_hw_comm.h"
 #include "hinic3_nic_dev.h"
 #include "hinic3_nic_cfg.h"
+#include "hinic3_rss.h"
 
 #define HINIC3_MGMT_VERSION_MAX_LEN     32
 /* Coalesce time properties in microseconds */
@@ -1231,6 +1232,14 @@ static const struct ethtool_ops hinic3_ethtool_ops = {
 	.get_pause_stats                = hinic3_get_pause_stats,
 	.get_coalesce                   = hinic3_get_coalesce,
 	.set_coalesce                   = hinic3_set_coalesce,
+	.get_rxnfc                      = hinic3_get_rxnfc,
+	.set_rxnfc                      = hinic3_set_rxnfc,
+	.get_channels                   = hinic3_get_channels,
+	.set_channels                   = hinic3_set_channels,
+	.get_rxfh_indir_size            = hinic3_get_rxfh_indir_size,
+	.get_rxfh_key_size              = hinic3_get_rxfh_key_size,
+	.get_rxfh                       = hinic3_get_rxfh,
+	.set_rxfh                       = hinic3_set_rxfh,
 };
 
 void hinic3_set_ethtool_ops(struct net_device *netdev)
diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_mgmt_interface.h b/drivers/net/ethernet/huawei/hinic3/hinic3_mgmt_interface.h
index 76c691f82703..3c1263ff99ff 100644
--- a/drivers/net/ethernet/huawei/hinic3/hinic3_mgmt_interface.h
+++ b/drivers/net/ethernet/huawei/hinic3/hinic3_mgmt_interface.h
@@ -282,6 +282,7 @@ enum l2nic_cmd {
 	L2NIC_CMD_SET_VLAN_FILTER_EN  = 26,
 	L2NIC_CMD_SET_RX_VLAN_OFFLOAD = 27,
 	L2NIC_CMD_CFG_RSS             = 60,
+	L2NIC_CMD_GET_RSS_CTX_TBL     = 62,
 	L2NIC_CMD_CFG_RSS_HASH_KEY    = 63,
 	L2NIC_CMD_CFG_RSS_HASH_ENGINE = 64,
 	L2NIC_CMD_SET_RSS_CTX_TBL     = 65,
@@ -301,6 +302,7 @@ enum l2nic_ucode_cmd {
 	L2NIC_UCODE_CMD_MODIFY_QUEUE_CTX  = 0,
 	L2NIC_UCODE_CMD_CLEAN_QUEUE_CTX   = 1,
 	L2NIC_UCODE_CMD_SET_RSS_INDIR_TBL = 4,
+	L2NIC_UCODE_CMD_GET_RSS_INDIR_TBL = 6,
 };
 
 /* hilink mac group command */
diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_rss.c b/drivers/net/ethernet/huawei/hinic3/hinic3_rss.c
index 25db74d8c7dd..b40d5fa885c2 100644
--- a/drivers/net/ethernet/huawei/hinic3/hinic3_rss.c
+++ b/drivers/net/ethernet/huawei/hinic3/hinic3_rss.c
@@ -155,7 +155,7 @@ static int hinic3_set_rss_type(struct hinic3_hwdev *hwdev,
 				       L2NIC_CMD_SET_RSS_CTX_TBL, &msg_params);
 
 	if (ctx_tbl.msg_head.status == MGMT_STATUS_CMD_UNSUPPORTED) {
-		return MGMT_STATUS_CMD_UNSUPPORTED;
+		return -EOPNOTSUPP;
 	} else if (err || ctx_tbl.msg_head.status) {
 		dev_err(hwdev->dev, "mgmt Failed to set rss context offload, err: %d, status: 0x%x\n",
 			err, ctx_tbl.msg_head.status);
@@ -165,6 +165,39 @@ static int hinic3_set_rss_type(struct hinic3_hwdev *hwdev,
 	return 0;
 }
 
+static int hinic3_get_rss_type(struct hinic3_hwdev *hwdev,
+			       struct hinic3_rss_type *rss_type)
+{
+	struct l2nic_cmd_rss_ctx_tbl ctx_tbl = {};
+	struct mgmt_msg_params msg_params = {};
+	int err;
+
+	ctx_tbl.func_id = hinic3_global_func_id(hwdev);
+
+	mgmt_msg_params_init_default(&msg_params, &ctx_tbl, sizeof(ctx_tbl));
+
+	err = hinic3_send_mbox_to_mgmt(hwdev, MGMT_MOD_L2NIC,
+				       L2NIC_CMD_GET_RSS_CTX_TBL,
+				       &msg_params);
+	if (err || ctx_tbl.msg_head.status) {
+		dev_err(hwdev->dev, "Failed to get hash type, err: %d, status: 0x%x\n",
+			err, ctx_tbl.msg_head.status);
+		return -EINVAL;
+	}
+
+	rss_type->ipv4         = L2NIC_RSS_TYPE_GET(ctx_tbl.context, IPV4);
+	rss_type->ipv6         = L2NIC_RSS_TYPE_GET(ctx_tbl.context, IPV6);
+	rss_type->ipv6_ext     = L2NIC_RSS_TYPE_GET(ctx_tbl.context, IPV6_EXT);
+	rss_type->tcp_ipv4     = L2NIC_RSS_TYPE_GET(ctx_tbl.context, TCP_IPV4);
+	rss_type->tcp_ipv6     = L2NIC_RSS_TYPE_GET(ctx_tbl.context, TCP_IPV6);
+	rss_type->tcp_ipv6_ext = L2NIC_RSS_TYPE_GET(ctx_tbl.context,
+						    TCP_IPV6_EXT);
+	rss_type->udp_ipv4     = L2NIC_RSS_TYPE_GET(ctx_tbl.context, UDP_IPV4);
+	rss_type->udp_ipv6     = L2NIC_RSS_TYPE_GET(ctx_tbl.context, UDP_IPV6);
+
+	return 0;
+}
+
 static int hinic3_rss_cfg_hash_type(struct hinic3_hwdev *hwdev, u8 opcode,
 				    enum hinic3_rss_hash_type *type)
 {
@@ -264,7 +297,8 @@ static int hinic3_set_hw_rss_parameters(struct net_device *netdev, u8 rss_en)
 	if (err)
 		return err;
 
-	hinic3_fillout_indir_tbl(netdev, nic_dev->rss_indir);
+	if (!netif_is_rxfh_configured(netdev))
+		hinic3_fillout_indir_tbl(netdev, nic_dev->rss_indir);
 
 	err = hinic3_config_rss_hw_resource(netdev, nic_dev->rss_indir);
 	if (err)
@@ -334,3 +368,452 @@ void hinic3_try_to_enable_rss(struct net_device *netdev)
 	clear_bit(HINIC3_RSS_ENABLE, &nic_dev->flags);
 	nic_dev->q_params.num_qps = nic_dev->max_qps;
 }
+
+static int hinic3_set_l4_rss_hash_ops(const struct ethtool_rxnfc *cmd,
+				      struct hinic3_rss_type *rss_type)
+{
+	u8 rss_l4_en;
+
+	switch (cmd->data & (RXH_L4_B_0_1 | RXH_L4_B_2_3)) {
+	case 0:
+		rss_l4_en = 0;
+		break;
+	case (RXH_L4_B_0_1 | RXH_L4_B_2_3):
+		rss_l4_en = 1;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	switch (cmd->flow_type) {
+	case TCP_V4_FLOW:
+		rss_type->tcp_ipv4 = rss_l4_en;
+		break;
+	case TCP_V6_FLOW:
+		rss_type->tcp_ipv6 = rss_l4_en;
+		break;
+	case UDP_V4_FLOW:
+		rss_type->udp_ipv4 = rss_l4_en;
+		break;
+	case UDP_V6_FLOW:
+		rss_type->udp_ipv6 = rss_l4_en;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int hinic3_update_rss_hash_opts(struct net_device *netdev,
+				       struct ethtool_rxnfc *cmd,
+				       struct hinic3_rss_type *rss_type)
+{
+	int err;
+
+	switch (cmd->flow_type) {
+	case TCP_V4_FLOW:
+	case TCP_V6_FLOW:
+	case UDP_V4_FLOW:
+	case UDP_V6_FLOW:
+		err = hinic3_set_l4_rss_hash_ops(cmd, rss_type);
+		if (err)
+			return err;
+
+		break;
+	case IPV4_FLOW:
+		rss_type->ipv4 = 1;
+		break;
+	case IPV6_FLOW:
+		rss_type->ipv6 = 1;
+		break;
+	default:
+		netdev_err(netdev, "Unsupported flow type\n");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int hinic3_set_rss_hash_opts(struct net_device *netdev,
+				    struct ethtool_rxnfc *cmd)
+{
+	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
+	struct hinic3_rss_type rss_type;
+	int err;
+
+	if (!test_bit(HINIC3_RSS_ENABLE, &nic_dev->flags)) {
+		cmd->data = 0;
+		netdev_err(netdev, "RSS is disabled, setting flow-hash is not supported\n");
+		return -EOPNOTSUPP;
+	}
+
+	/* RSS only supports hashing of IP addresses and L4 ports */
+	if (cmd->data & ~(RXH_IP_SRC | RXH_IP_DST |
+			  RXH_L4_B_0_1 | RXH_L4_B_2_3))
+		return -EINVAL;
+
+	/* Both IP addresses must be part of the hash tuple */
+	if (!(cmd->data & RXH_IP_SRC) || !(cmd->data & RXH_IP_DST))
+		return -EINVAL;
+
+	err = hinic3_get_rss_type(nic_dev->hwdev, &rss_type);
+	if (err) {
+		netdev_err(netdev, "Failed to get rss type\n");
+		return err;
+	}
+
+	err = hinic3_update_rss_hash_opts(netdev, cmd, &rss_type);
+	if (err)
+		return err;
+
+	err = hinic3_set_rss_type(nic_dev->hwdev, rss_type);
+	if (err) {
+		netdev_err(netdev, "Failed to set rss type\n");
+		return err;
+	}
+
+	nic_dev->rss_type = rss_type;
+
+	return 0;
+}
+
+static void convert_rss_type(u8 rss_opt, struct ethtool_rxnfc *cmd)
+{
+	if (rss_opt)
+		cmd->data |= RXH_L4_B_0_1 | RXH_L4_B_2_3;
+}
+
+static int hinic3_convert_rss_type(struct net_device *netdev,
+				   struct hinic3_rss_type *rss_type,
+				   struct ethtool_rxnfc *cmd)
+{
+	cmd->data = RXH_IP_SRC | RXH_IP_DST;
+	switch (cmd->flow_type) {
+	case TCP_V4_FLOW:
+		convert_rss_type(rss_type->tcp_ipv4, cmd);
+		break;
+	case TCP_V6_FLOW:
+		convert_rss_type(rss_type->tcp_ipv6, cmd);
+		break;
+	case UDP_V4_FLOW:
+		convert_rss_type(rss_type->udp_ipv4, cmd);
+		break;
+	case UDP_V6_FLOW:
+		convert_rss_type(rss_type->udp_ipv6, cmd);
+		break;
+	case IPV4_FLOW:
+	case IPV6_FLOW:
+		break;
+	default:
+		netdev_err(netdev, "Unsupported flow type\n");
+		cmd->data = 0;
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int hinic3_get_rss_hash_opts(struct net_device *netdev,
+				    struct ethtool_rxnfc *cmd)
+{
+	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
+	struct hinic3_rss_type rss_type;
+	int err;
+
+	cmd->data = 0;
+
+	if (!test_bit(HINIC3_RSS_ENABLE, &nic_dev->flags))
+		return 0;
+
+	err = hinic3_get_rss_type(nic_dev->hwdev, &rss_type);
+	if (err) {
+		netdev_err(netdev, "Failed to get rss type\n");
+		return err;
+	}
+
+	return hinic3_convert_rss_type(netdev, &rss_type, cmd);
+}
+
+int hinic3_get_rxnfc(struct net_device *netdev,
+		     struct ethtool_rxnfc *cmd, u32 *rule_locs)
+{
+	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
+	int err = 0;
+
+	switch (cmd->cmd) {
+	case ETHTOOL_GRXRINGS:
+		cmd->data = nic_dev->q_params.num_qps;
+		break;
+	case ETHTOOL_GRXFH:
+		err = hinic3_get_rss_hash_opts(netdev, cmd);
+		break;
+	default:
+		err = -EOPNOTSUPP;
+		break;
+	}
+
+	return err;
+}
+
+int hinic3_set_rxnfc(struct net_device *netdev, struct ethtool_rxnfc *cmd)
+{
+	int err;
+
+	switch (cmd->cmd) {
+	case ETHTOOL_SRXFH:
+		err = hinic3_set_rss_hash_opts(netdev, cmd);
+		break;
+	default:
+		err = -EOPNOTSUPP;
+		break;
+	}
+
+	return err;
+}
+
+static u16 hinic3_max_channels(struct net_device *netdev)
+{
+	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
+	u8 tcs = netdev_get_num_tc(netdev);
+
+	return tcs ? nic_dev->max_qps / tcs : nic_dev->max_qps;
+}
+
+static u16 hinic3_curr_channels(struct net_device *netdev)
+{
+	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
+
+	if (netif_running(netdev))
+		return nic_dev->q_params.num_qps ?
+				nic_dev->q_params.num_qps : 1;
+	else
+		return min_t(u16, hinic3_max_channels(netdev),
+			     nic_dev->q_params.num_qps);
+}
+
+void hinic3_get_channels(struct net_device *netdev,
+			 struct ethtool_channels *channels)
+{
+	channels->max_rx = 0;
+	channels->max_tx = 0;
+	channels->max_other = 0;
+	/* report maximum channels */
+	channels->max_combined = hinic3_max_channels(netdev);
+	channels->rx_count = 0;
+	channels->tx_count = 0;
+	channels->other_count = 0;
+	/* report the current number of combined channels */
+	channels->combined_count = hinic3_curr_channels(netdev);
+}
+
+static int
+hinic3_validate_channel_parameter(struct net_device *netdev,
+				  const struct ethtool_channels *channels)
+{
+	u16 max_channel = hinic3_max_channels(netdev);
+	unsigned int count = channels->combined_count;
+
+	if (!count) {
+		netdev_err(netdev, "Unsupported combined_count=0\n");
+		return -EINVAL;
+	}
+
+	if (channels->tx_count || channels->rx_count || channels->other_count) {
+		netdev_err(netdev, "Setting rx/tx/other count not supported\n");
+		return -EINVAL;
+	}
+
+	if (count > max_channel) {
+		netdev_err(netdev, "Combined count %u exceeds limit %u\n", count,
+			   max_channel);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+int hinic3_set_channels(struct net_device *netdev,
+			struct ethtool_channels *channels)
+{
+	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
+	unsigned int count = channels->combined_count;
+	struct hinic3_dyna_txrxq_params q_params;
+	int err;
+
+	if (hinic3_validate_channel_parameter(netdev, channels))
+		return -EINVAL;
+
+	if (!test_bit(HINIC3_RSS_ENABLE, &nic_dev->flags)) {
+		netdev_err(netdev, "RSS is not enabled on this function, only 1 queue pair is supported\n");
+		return -EOPNOTSUPP;
+	}
+
+	netdev_dbg(netdev, "Set max combined queue number from %u to %u\n",
+		   nic_dev->q_params.num_qps, count);
+
+	if (netif_running(netdev)) {
+		q_params = nic_dev->q_params;
+		q_params.num_qps = (u16)count;
+		q_params.txqs_res = NULL;
+		q_params.rxqs_res = NULL;
+		q_params.irq_cfg = NULL;
+
+		err = hinic3_change_channel_settings(netdev, &q_params);
+		if (err) {
+			netdev_err(netdev, "Failed to change channel settings\n");
+			return err;
+		}
+	} else {
+		nic_dev->q_params.num_qps = (u16)count;
+	}
+
+	return 0;
+}
+
+u32 hinic3_get_rxfh_indir_size(struct net_device *netdev)
+{
+	return L2NIC_RSS_INDIR_SIZE;
+}
+
+static int hinic3_set_rss_rxfh(struct net_device *netdev,
+			       const u32 *indir, u8 *key)
+{
+	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
+	int err;
+	u32 i;
+
+	if (indir) {
+		for (i = 0; i < L2NIC_RSS_INDIR_SIZE; i++)
+			nic_dev->rss_indir[i] = (u16)indir[i];
+
+		err = hinic3_rss_set_indir_tbl(nic_dev->hwdev,
+					       nic_dev->rss_indir);
+		if (err) {
+			netdev_err(netdev, "Failed to set rss indir table\n");
+			return err;
+		}
+	}
+
+	if (key) {
+		err = hinic3_rss_set_hash_key(nic_dev->hwdev, key);
+		if (err) {
+			netdev_err(netdev, "Failed to set rss key\n");
+			return err;
+		}
+
+		memcpy(nic_dev->rss_hkey, key, L2NIC_RSS_KEY_SIZE);
+	}
+
+	return 0;
+}
+
+u32 hinic3_get_rxfh_key_size(struct net_device *netdev)
+{
+	return L2NIC_RSS_KEY_SIZE;
+}
+
+static int hinic3_rss_get_indir_tbl(struct hinic3_hwdev *hwdev,
+				    u32 *indir_table)
+{
+	struct hinic3_cmd_buf_pair pair;
+	__le16 *indir_tbl = NULL;
+	int err, i;
+
+	err = hinic3_cmd_buf_pair_init(hwdev, &pair);
+	if (err) {
+		dev_err(hwdev->dev, "Failed to allocate cmd_buf.\n");
+		return err;
+	}
+
+	err = hinic3_cmdq_detail_resp(hwdev, MGMT_MOD_L2NIC,
+				      L2NIC_UCODE_CMD_GET_RSS_INDIR_TBL,
+				      pair.in, pair.out, NULL);
+	if (err) {
+		dev_err(hwdev->dev, "Failed to get rss indir table\n");
+		goto err_get_indir_tbl;
+	}
+
+	indir_tbl = (__le16 *)pair.out->buf;
+	for (i = 0; i < L2NIC_RSS_INDIR_SIZE; i++)
+		indir_table[i] = le16_to_cpu(*(indir_tbl + i));
+
+err_get_indir_tbl:
+	hinic3_cmd_buf_pair_uninit(hwdev, &pair);
+
+	return err;
+}
+
+int hinic3_get_rxfh(struct net_device *netdev,
+		    struct ethtool_rxfh_param *rxfh)
+{
+	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
+	int err = 0;
+
+	if (!test_bit(HINIC3_RSS_ENABLE, &nic_dev->flags)) {
+		netdev_err(netdev, "RSS is disabled\n");
+		return -EOPNOTSUPP;
+	}
+
+	rxfh->hfunc =
+		nic_dev->rss_hash_type == HINIC3_RSS_HASH_ENGINE_TYPE_XOR ?
+		ETH_RSS_HASH_XOR : ETH_RSS_HASH_TOP;
+
+	if (rxfh->indir) {
+		err = hinic3_rss_get_indir_tbl(nic_dev->hwdev, rxfh->indir);
+		if (err)
+			return err;
+	}
+
+	if (rxfh->key)
+		memcpy(rxfh->key, nic_dev->rss_hkey, L2NIC_RSS_KEY_SIZE);
+
+	return err;
+}
+
+static int hinic3_update_hash_func_type(struct net_device *netdev, u8 hfunc)
+{
+	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
+	enum hinic3_rss_hash_type new_rss_hash_type;
+
+	switch (hfunc) {
+	case ETH_RSS_HASH_NO_CHANGE:
+		return 0;
+	case ETH_RSS_HASH_XOR:
+		new_rss_hash_type = HINIC3_RSS_HASH_ENGINE_TYPE_XOR;
+		break;
+	case ETH_RSS_HASH_TOP:
+		new_rss_hash_type = HINIC3_RSS_HASH_ENGINE_TYPE_TOEP;
+		break;
+	default:
+		netdev_err(netdev, "Unsupported hash func %u\n", hfunc);
+		return -EOPNOTSUPP;
+	}
+
+	if (new_rss_hash_type == nic_dev->rss_hash_type)
+		return 0;
+
+	nic_dev->rss_hash_type = new_rss_hash_type;
+	return hinic3_rss_set_hash_type(nic_dev->hwdev, nic_dev->rss_hash_type);
+}
+
+int hinic3_set_rxfh(struct net_device *netdev,
+		    struct ethtool_rxfh_param *rxfh,
+		    struct netlink_ext_ack *extack)
+{
+	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
+	int err;
+
+	if (!test_bit(HINIC3_RSS_ENABLE, &nic_dev->flags)) {
+		netdev_err(netdev, "Setting RSS parameters is not supported when RSS is disabled\n");
+		return -EOPNOTSUPP;
+	}
+
+	err = hinic3_update_hash_func_type(netdev, rxfh->hfunc);
+	if (err)
+		return err;
+
+	err = hinic3_set_rss_rxfh(netdev, rxfh->indir, rxfh->key);
+
+	return err;
+}
diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_rss.h b/drivers/net/ethernet/huawei/hinic3/hinic3_rss.h
index 78d82c2aca06..9f1b77780cd4 100644
--- a/drivers/net/ethernet/huawei/hinic3/hinic3_rss.h
+++ b/drivers/net/ethernet/huawei/hinic3/hinic3_rss.h
@@ -5,10 +5,29 @@
 #define _HINIC3_RSS_H_
 
 #include <linux/netdevice.h>
+#include <linux/ethtool.h>
 
 int hinic3_rss_init(struct net_device *netdev);
 void hinic3_rss_uninit(struct net_device *netdev);
 void hinic3_try_to_enable_rss(struct net_device *netdev);
 void hinic3_clear_rss_config(struct net_device *netdev);
 
+int hinic3_get_rxnfc(struct net_device *netdev,
+		     struct ethtool_rxnfc *cmd, u32 *rule_locs);
+int hinic3_set_rxnfc(struct net_device *netdev, struct ethtool_rxnfc *cmd);
+
+void hinic3_get_channels(struct net_device *netdev,
+			 struct ethtool_channels *channels);
+int hinic3_set_channels(struct net_device *netdev,
+			struct ethtool_channels *channels);
+
+u32 hinic3_get_rxfh_indir_size(struct net_device *netdev);
+u32 hinic3_get_rxfh_key_size(struct net_device *netdev);
+
+int hinic3_get_rxfh(struct net_device *netdev,
+		    struct ethtool_rxfh_param *rxfh);
+int hinic3_set_rxfh(struct net_device *netdev,
+		    struct ethtool_rxfh_param *rxfh,
+		    struct netlink_ext_ack *extack);
+
 #endif
-- 
2.43.0
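
The flow-hash checks that the patch applies to TCP/UDP flow types in the set_rxnfc path can be summarized in a small stand-alone sketch. The `DEMO_RXH_*` macros are local stand-ins for the `RXH_*` bits from `<linux/ethtool.h>`, and `demo_check_flow_hash()` is a hypothetical helper written for illustration, not a function from the driver.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for the RXH_* hash-field bits (assumed values). */
#define DEMO_RXH_IP_SRC   (1ULL << 4)
#define DEMO_RXH_IP_DST   (1ULL << 5)
#define DEMO_RXH_L4_B_0_1 (1ULL << 6)
#define DEMO_RXH_L4_B_2_3 (1ULL << 7)

/*
 * Validate a requested hash-field mask for a TCP/UDP flow type:
 *  - only IP address and L4 port bits may be set,
 *  - both IP address bits must be set,
 *  - the two L4 port bits must be either both set or both clear.
 * On success returns 0 and reports whether L4 hashing is requested.
 */
static int demo_check_flow_hash(uint64_t data, int *l4_en)
{
	const uint64_t ip = DEMO_RXH_IP_SRC | DEMO_RXH_IP_DST;
	const uint64_t l4 = DEMO_RXH_L4_B_0_1 | DEMO_RXH_L4_B_2_3;

	assert(l4_en != NULL);

	if (data & ~(ip | l4))
		return -EINVAL;		/* unsupported hash field */
	if ((data & ip) != ip)
		return -EINVAL;		/* both addresses are required */

	if ((data & l4) == 0)
		*l4_en = 0;
	else if ((data & l4) == l4)
		*l4_en = 1;
	else
		return -EINVAL;		/* half of the port pair only */

	return 0;
}
```

In ethtool terms this roughly corresponds to accepting `sd` and `sdfn` as rx-flow-hash field sets and rejecting combinations such as `sdf` or `s` alone.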


^ permalink raw reply related
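
As a reading aid for the channel handling in the patch above, the following sketch mirrors the combined-channel limit and the validation order in plain C. `struct demo_channels`, `demo_max_channels()` and `demo_validate_channels()` are hypothetical names used only here; the real code operates on `struct ethtool_channels` and the driver's private data.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified stand-in for the count fields of struct ethtool_channels. */
struct demo_channels {
	unsigned int rx_count;
	unsigned int tx_count;
	unsigned int other_count;
	unsigned int combined_count;
};

/* Queue pairs are split evenly across traffic classes when any are set. */
static unsigned int demo_max_channels(unsigned int max_qps, unsigned int num_tc)
{
	return num_tc ? max_qps / num_tc : max_qps;
}

/*
 * Accept only a non-zero combined count that fits the limit;
 * separate rx/tx/other counts are not supported.
 */
static int demo_validate_channels(const struct demo_channels *ch,
				  unsigned int max_combined)
{
	assert(ch != NULL);

	if (!ch->combined_count)
		return -EINVAL;
	if (ch->rx_count || ch->tx_count || ch->other_count)
		return -EINVAL;
	if (ch->combined_count > max_combined)
		return -EINVAL;

	return 0;
}
```

For example, with 16 queue pairs and 4 traffic classes the limit is 4 combined channels, so a request for 5 is rejected.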

* [PATCH net-next v04 6/6] hinic3: Remove unneeded coalesce parameters
From: Fan Gong @ 2026-04-08  4:03 UTC (permalink / raw)
  To: Fan Gong, Zhu Yikai, netdev, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Andrew Lunn,
	Ioana Ciornei, Mohsin Bashir
  Cc: linux-kernel, linux-doc, luosifu, Xin Guo, Zhou Shuai, Wu Like,
	Shi Jing, Zheng Jiezhen, Maxime Chevallier
In-Reply-To: <cover.1775618797.git.zhuyikai1@h-partners.com>

  Remove unneeded coalesce parameters in irq handling.

Co-developed-by: Zhu Yikai <zhuyikai1@h-partners.com>
Signed-off-by: Zhu Yikai <zhuyikai1@h-partners.com>
Signed-off-by: Fan Gong <gongfan1@huawei.com>
---
 drivers/net/ethernet/huawei/hinic3/hinic3_irq.c | 6 +-----
 drivers/net/ethernet/huawei/hinic3/hinic3_rx.h  | 3 ---
 2 files changed, 1 insertion(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_irq.c b/drivers/net/ethernet/huawei/hinic3/hinic3_irq.c
index d3b3927b5408..42464c007174 100644
--- a/drivers/net/ethernet/huawei/hinic3/hinic3_irq.c
+++ b/drivers/net/ethernet/huawei/hinic3/hinic3_irq.c
@@ -156,13 +156,9 @@ static int hinic3_set_interrupt_moder(struct net_device *netdev, u16 q_id,
 	spin_unlock_irqrestore(&nic_dev->channel_res_lock, flags);
 
 	err = hinic3_set_interrupt_cfg(nic_dev->hwdev, info);
-	if (err) {
+	if (err)
 		netdev_err(netdev,
 			   "Failed to modify moderation for Queue: %u\n", q_id);
-	} else {
-		nic_dev->rxqs[q_id].last_coalesc_timer_cfg = coalesc_timer_cfg;
-		nic_dev->rxqs[q_id].last_pending_limit = pending_limit;
-	}
 
 	return err;
 }
diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_rx.h b/drivers/net/ethernet/huawei/hinic3/hinic3_rx.h
index c11d080408a7..2ab691ed11a9 100644
--- a/drivers/net/ethernet/huawei/hinic3/hinic3_rx.h
+++ b/drivers/net/ethernet/huawei/hinic3/hinic3_rx.h
@@ -111,9 +111,6 @@ struct hinic3_rxq {
 	dma_addr_t             cqe_start_paddr;
 
 	struct dim             dim;
-
-	u8                     last_coalesc_timer_cfg;
-	u8                     last_pending_limit;
 } ____cacheline_aligned;
 
 struct hinic3_dyna_rxq_res {
-- 
2.43.0


^ permalink raw reply related

* [PATCH net-next v04 3/6] hinic3: Add ethtool coalesce ops
From: Fan Gong @ 2026-04-08  4:03 UTC (permalink / raw)
  To: Fan Gong, Zhu Yikai, netdev, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Andrew Lunn,
	Ioana Ciornei, Mohsin Bashir
  Cc: linux-kernel, linux-doc, luosifu, Xin Guo, Zhou Shuai, Wu Like,
	Shi Jing, Zheng Jiezhen, Maxime Chevallier
In-Reply-To: <cover.1775618797.git.zhuyikai1@h-partners.com>

  Implement the following ethtool callbacks:
.get_coalesce
.set_coalesce

  These callbacks allow users to configure and monitor RX
coalescing in detail through ethtool.

Co-developed-by: Zhu Yikai <zhuyikai1@h-partners.com>
Signed-off-by: Zhu Yikai <zhuyikai1@h-partners.com>
Signed-off-by: Fan Gong <gongfan1@huawei.com>
---
 .../ethernet/huawei/hinic3/hinic3_ethtool.c   | 232 +++++++++++++++++-
 1 file changed, 230 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_ethtool.c b/drivers/net/ethernet/huawei/hinic3/hinic3_ethtool.c
index be26698fc658..a4b2d5ba81f8 100644
--- a/drivers/net/ethernet/huawei/hinic3/hinic3_ethtool.c
+++ b/drivers/net/ethernet/huawei/hinic3/hinic3_ethtool.c
@@ -17,6 +17,11 @@
 #include "hinic3_nic_cfg.h"
 
 #define HINIC3_MGMT_VERSION_MAX_LEN     32
+/* Coalesce time properties in microseconds */
+#define COALESCE_PENDING_LIMIT_UNIT     8
+#define COALESCE_TIMER_CFG_UNIT         5
+#define COALESCE_MAX_PENDING_LIMIT      (255 * COALESCE_PENDING_LIMIT_UNIT)
+#define COALESCE_MAX_TIMER_CFG          (255 * COALESCE_TIMER_CFG_UNIT)
 
 static void hinic3_get_drvinfo(struct net_device *netdev,
 			       struct ethtool_drvinfo *info)
@@ -985,9 +990,230 @@ static void hinic3_get_pause_stats(struct net_device *netdev,
 	kfree(ps);
 }
 
+static int hinic3_set_queue_coalesce(struct net_device *netdev, u16 q_id,
+				     struct hinic3_intr_coal_info *coal)
+{
+	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
+	struct hinic3_intr_coal_info *intr_coal;
+	struct hinic3_interrupt_info info = {};
+	int err;
+
+	intr_coal = &nic_dev->intr_coalesce[q_id];
+
+	intr_coal->coalesce_timer_cfg = coal->coalesce_timer_cfg;
+	intr_coal->pending_limit = coal->pending_limit;
+	intr_coal->rx_pending_limit_low = coal->rx_pending_limit_low;
+	intr_coal->rx_pending_limit_high = coal->rx_pending_limit_high;
+
+	if (!test_bit(HINIC3_INTF_UP, &nic_dev->flags) ||
+	    q_id >= nic_dev->q_params.num_qps || nic_dev->adaptive_rx_coal)
+		return 0;
+
+	info.msix_index = nic_dev->q_params.irq_cfg[q_id].msix_entry_idx;
+	info.interrupt_coalesc_set = 1;
+	info.coalesc_timer_cfg = intr_coal->coalesce_timer_cfg;
+	info.pending_limit = intr_coal->pending_limit;
+	info.resend_timer_cfg = intr_coal->resend_timer_cfg;
+	err = hinic3_set_interrupt_cfg(nic_dev->hwdev, info);
+	if (err) {
+		netdev_warn(netdev, "Failed to set queue%u coalesce\n", q_id);
+		return err;
+	}
+
+	return 0;
+}
+
+static int is_coalesce_exceed_limit(struct net_device *netdev,
+				    const struct ethtool_coalesce *coal)
+{
+	const struct {
+		const char *name;
+		u32 value;
+		u32 limit;
+	} coalesce_limits[] = {
+		{"rx_coalesce_usecs",
+		 coal->rx_coalesce_usecs,
+		 COALESCE_MAX_TIMER_CFG},
+		{"rx_max_coalesced_frames",
+		 coal->rx_max_coalesced_frames,
+		 COALESCE_MAX_PENDING_LIMIT},
+		{"rx_max_coalesced_frames_low",
+		 coal->rx_max_coalesced_frames_low,
+		 COALESCE_MAX_PENDING_LIMIT},
+		{"rx_max_coalesced_frames_high",
+		 coal->rx_max_coalesced_frames_high,
+		 COALESCE_MAX_PENDING_LIMIT},
+	};
+
+	for (int i = 0; i < ARRAY_SIZE(coalesce_limits); i++) {
+		if (coalesce_limits[i].value > coalesce_limits[i].limit) {
+			netdev_err(netdev, "%s out of range %d-%u\n",
+				   coalesce_limits[i].name, 0,
+				   coalesce_limits[i].limit);
+			return -ERANGE;
+		}
+	}
+	return 0;
+}
+
+static int is_coalesce_legal(struct net_device *netdev,
+			     const struct ethtool_coalesce *coal)
+{
+	int err;
+
+	err = is_coalesce_exceed_limit(netdev, coal);
+	if (err)
+		return err;
+
+	if (coal->rx_max_coalesced_frames_low >
+	    coal->rx_max_coalesced_frames_high) {
+		netdev_err(netdev, "invalid coalesce frame high %u, low %u, unit %d\n",
+			   coal->rx_max_coalesced_frames_high,
+			   coal->rx_max_coalesced_frames_low,
+			   COALESCE_PENDING_LIMIT_UNIT);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static void check_coalesce_align(struct net_device *netdev,
+				 u32 item, u32 unit, const char *str)
+{
+	if (item % unit)
+		netdev_warn(netdev, "%s must be a multiple of %u, rounding down to %u\n",
+			    str, unit, item - item % unit);
+}
+
+#define CHECK_COALESCE_ALIGN(member, unit) \
+	check_coalesce_align(netdev, member, unit, #member)
+
+static void check_coalesce_changed(struct net_device *netdev,
+				   u32 item, u32 unit, u32 ori_val,
+				   const char *obj_str, const char *str)
+{
+	if ((item / unit) != ori_val)
+		netdev_dbg(netdev, "Change %s from %u to %u %s\n",
+			   str, ori_val * unit, item - item % unit, obj_str);
+}
+
+#define CHECK_COALESCE_CHANGED(member, unit, ori_val, obj_str) \
+	check_coalesce_changed(netdev, member, unit, ori_val, obj_str, #member)
+
+static int hinic3_set_hw_coal_param(struct net_device *netdev,
+				    struct hinic3_intr_coal_info *intr_coal)
+{
+	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
+	int err;
+	u16 i;
+
+	for (i = 0; i < nic_dev->max_qps; i++) {
+		err = hinic3_set_queue_coalesce(netdev, i, intr_coal);
+		if (err)
+			return err;
+	}
+
+	return 0;
+}
+
+static int hinic3_get_coalesce(struct net_device *netdev,
+			       struct ethtool_coalesce *coal,
+			       struct kernel_ethtool_coalesce *kernel_coal,
+			       struct netlink_ext_ack *extack)
+{
+	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
+	struct hinic3_intr_coal_info *interrupt_info;
+
+	interrupt_info = &nic_dev->intr_coalesce[0];
+
+	/* TX/RX uses the same interrupt.
+	 * So we only declare RX ethtool_coalesce parameters.
+	 */
+	coal->rx_coalesce_usecs = interrupt_info->coalesce_timer_cfg *
+				  COALESCE_TIMER_CFG_UNIT;
+	coal->rx_max_coalesced_frames = interrupt_info->pending_limit *
+					COALESCE_PENDING_LIMIT_UNIT;
+
+	coal->use_adaptive_rx_coalesce = nic_dev->adaptive_rx_coal;
+
+	coal->rx_max_coalesced_frames_high =
+		interrupt_info->rx_pending_limit_high *
+		COALESCE_PENDING_LIMIT_UNIT;
+
+	coal->rx_max_coalesced_frames_low =
+		interrupt_info->rx_pending_limit_low *
+		COALESCE_PENDING_LIMIT_UNIT;
+
+	return 0;
+}
+
+static int hinic3_set_coalesce(struct net_device *netdev,
+			       struct ethtool_coalesce *coal,
+			       struct kernel_ethtool_coalesce *kernel_coal,
+			       struct netlink_ext_ack *extack)
+{
+	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
+	struct hinic3_intr_coal_info *ori_intr_coal;
+	struct hinic3_intr_coal_info intr_coal = {};
+	char obj_str[32];
+	int err;
+
+	err = is_coalesce_legal(netdev, coal);
+	if (err)
+		return err;
+
+	CHECK_COALESCE_ALIGN(coal->rx_coalesce_usecs, COALESCE_TIMER_CFG_UNIT);
+	CHECK_COALESCE_ALIGN(coal->rx_max_coalesced_frames,
+			     COALESCE_PENDING_LIMIT_UNIT);
+	CHECK_COALESCE_ALIGN(coal->rx_max_coalesced_frames_high,
+			     COALESCE_PENDING_LIMIT_UNIT);
+	CHECK_COALESCE_ALIGN(coal->rx_max_coalesced_frames_low,
+			     COALESCE_PENDING_LIMIT_UNIT);
+
+	ori_intr_coal = &nic_dev->intr_coalesce[0];
+	snprintf(obj_str, sizeof(obj_str), "for netdev");
+
+	CHECK_COALESCE_CHANGED(coal->rx_coalesce_usecs, COALESCE_TIMER_CFG_UNIT,
+			       ori_intr_coal->coalesce_timer_cfg, obj_str);
+	CHECK_COALESCE_CHANGED(coal->rx_max_coalesced_frames,
+			       COALESCE_PENDING_LIMIT_UNIT,
+			       ori_intr_coal->pending_limit, obj_str);
+	CHECK_COALESCE_CHANGED(coal->rx_max_coalesced_frames_high,
+			       COALESCE_PENDING_LIMIT_UNIT,
+			       ori_intr_coal->rx_pending_limit_high, obj_str);
+	CHECK_COALESCE_CHANGED(coal->rx_max_coalesced_frames_low,
+			       COALESCE_PENDING_LIMIT_UNIT,
+			       ori_intr_coal->rx_pending_limit_low, obj_str);
+
+	intr_coal.coalesce_timer_cfg =
+		(u8)(coal->rx_coalesce_usecs / COALESCE_TIMER_CFG_UNIT);
+	intr_coal.pending_limit = (u8)(coal->rx_max_coalesced_frames /
+				      COALESCE_PENDING_LIMIT_UNIT);
+
+	nic_dev->adaptive_rx_coal = coal->use_adaptive_rx_coalesce;
+
+	intr_coal.rx_pending_limit_high =
+		(u8)(coal->rx_max_coalesced_frames_high /
+		     COALESCE_PENDING_LIMIT_UNIT);
+
+	intr_coal.rx_pending_limit_low =
+		(u8)(coal->rx_max_coalesced_frames_low /
+		     COALESCE_PENDING_LIMIT_UNIT);
+
+	/* setting the coalesce timer or pending limit to zero disables coalescing */
+	if (!nic_dev->adaptive_rx_coal &&
+	    (!intr_coal.coalesce_timer_cfg || !intr_coal.pending_limit))
+		netdev_warn(netdev, "Coalesce will be disabled\n");
+
+	return hinic3_set_hw_coal_param(netdev, &intr_coal);
+}
+
 static const struct ethtool_ops hinic3_ethtool_ops = {
-	.supported_coalesce_params      = ETHTOOL_COALESCE_USECS |
-					  ETHTOOL_COALESCE_PKT_RATE_RX_USECS,
+	.supported_coalesce_params      = ETHTOOL_COALESCE_RX_USECS |
+					  ETHTOOL_COALESCE_RX_MAX_FRAMES |
+					  ETHTOOL_COALESCE_USE_ADAPTIVE_RX |
+					  ETHTOOL_COALESCE_RX_MAX_FRAMES_LOW |
+					  ETHTOOL_COALESCE_RX_MAX_FRAMES_HIGH,
 	.get_link_ksettings             = hinic3_get_link_ksettings,
 	.get_drvinfo                    = hinic3_get_drvinfo,
 	.get_msglevel                   = hinic3_get_msglevel,
@@ -1003,6 +1229,8 @@ static const struct ethtool_ops hinic3_ethtool_ops = {
 	.get_eth_ctrl_stats             = hinic3_get_eth_ctrl_stats,
 	.get_rmon_stats                 = hinic3_get_rmon_stats,
 	.get_pause_stats                = hinic3_get_pause_stats,
+	.get_coalesce                   = hinic3_get_coalesce,
+	.set_coalesce                   = hinic3_set_coalesce,
 };
 
 void hinic3_set_ethtool_ops(struct net_device *netdev)
-- 
2.43.0
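
The unit handling in the coalesce patch above (timer values in steps of 5 microseconds, frame limits in steps of 8, both stored in an 8-bit hardware field) can be illustrated with a short sketch. The `DEMO_*` constants copy the values of the patch's `COALESCE_*` macros, while `demo_to_hw_units()` and `demo_from_hw_units()` are hypothetical helpers written for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_TIMER_UNIT    5u	/* microseconds per hardware timer step */
#define DEMO_PENDING_UNIT  8u	/* frames per hardware pending-limit step */
#define DEMO_MAX_HW_VALUE  255u	/* hardware fields are 8 bits wide */

/*
 * Convert a user-visible value to hardware units, rounding down to a
 * multiple of the unit. Values above 255 * unit are rejected.
 */
static int demo_to_hw_units(uint32_t value, uint32_t unit, uint8_t *hw)
{
	assert(unit != 0 && hw != NULL);

	if (value > DEMO_MAX_HW_VALUE * unit)
		return -ERANGE;

	*hw = (uint8_t)(value / unit);
	return 0;
}

/* Convert a hardware field back to the value reported to user space. */
static uint32_t demo_from_hw_units(uint8_t hw, uint32_t unit)
{
	return (uint32_t)hw * unit;
}
```

A request of 27 microseconds is therefore stored as 5 timer steps and read back as 25 microseconds, which is the rounding that the alignment warning in the patch reports.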


^ permalink raw reply related

* [PATCH net-next v04 5/6] hinic3: Configure netdev->watchdog_timeo to set nic tx timeout
From: Fan Gong @ 2026-04-08  4:03 UTC (permalink / raw)
  To: Fan Gong, Zhu Yikai, netdev, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Andrew Lunn,
	Ioana Ciornei, Mohsin Bashir
  Cc: linux-kernel, linux-doc, luosifu, Xin Guo, Zhou Shuai, Wu Like,
	Shi Jing, Zheng Jiezhen, Maxime Chevallier
In-Reply-To: <cover.1775618797.git.zhuyikai1@h-partners.com>

  Configure the netdev watchdog timeout (netdev->watchdog_timeo) to
5 seconds to improve transmission reliability.

Co-developed-by: Zhu Yikai <zhuyikai1@h-partners.com>
Signed-off-by: Zhu Yikai <zhuyikai1@h-partners.com>
Signed-off-by: Fan Gong <gongfan1@huawei.com>
---
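Reader's note: watchdog_timeo is expressed in jiffies, which is why the patch multiplies the 5-second constant by HZ. The user-space sketch below shows the conversion and a wrap-safe "deadline passed" comparison of the kind the kernel's time_after() macro performs. All example_* names are hypothetical, and the tick rates used in the usage note are example values, not something the patch specifies.

```c
/* Mirrors HINIC3_WATCHDOG_TIMEOUT from the patch (seconds). */
#define EXAMPLE_WATCHDOG_TIMEOUT_SEC 5

/* Convert the timeout to ticks for a given tick rate. In the kernel the
 * tick rate is the build-time constant HZ; here it is a parameter. */
static unsigned long example_timeout_ticks(unsigned long hz)
{
	return EXAMPLE_WATCHDOG_TIMEOUT_SEC * hz;
}

/* Wrap-safe check that 'now' is later than last_tx + timeout. The unsigned
 * subtraction wraps modulo 2^N and the signed cast recovers the ordering,
 * which is the same idea as time_after(now, last_tx + timeout). */
static int example_tx_timed_out(unsigned long now, unsigned long last_tx,
				unsigned long timeout)
{
	return (long)((last_tx + timeout) - now) < 0;
}
```

At a tick rate of 250 the timeout is 1250 ticks, and at 1000 it is 5000 ticks. The comparison keeps working when the tick counter wraps between the last transmit and the check.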
 drivers/net/ethernet/huawei/hinic3/hinic3_main.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_main.c b/drivers/net/ethernet/huawei/hinic3/hinic3_main.c
index 3b470978714a..7e09b4b2da9f 100644
--- a/drivers/net/ethernet/huawei/hinic3/hinic3_main.c
+++ b/drivers/net/ethernet/huawei/hinic3/hinic3_main.c
@@ -33,6 +33,8 @@
 #define HINIC3_RX_PENDING_LIMIT_LOW   2
 #define HINIC3_RX_PENDING_LIMIT_HIGH  8
 
+#define HINIC3_WATCHDOG_TIMEOUT       5
+
 static void init_intr_coal_param(struct net_device *netdev)
 {
 	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
@@ -246,6 +248,8 @@ static void hinic3_assign_netdev_ops(struct net_device *netdev)
 {
 	hinic3_set_netdev_ops(netdev);
 	hinic3_set_ethtool_ops(netdev);
+
+	netdev->watchdog_timeo = HINIC3_WATCHDOG_TIMEOUT * HZ;
 }
 
 static void netdev_feature_init(struct net_device *netdev)
-- 
2.43.0



* [PATCH net-next v04 2/6] hinic3: Add ethtool statistic ops
From: Fan Gong @ 2026-04-08  4:03 UTC (permalink / raw)
  To: Fan Gong, Zhu Yikai, netdev, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Andrew Lunn,
	Ioana Ciornei, Mohsin Bashir
  Cc: linux-kernel, linux-doc, luosifu, Xin Guo, Zhou Shuai, Wu Like,
	Shi Jing, Zheng Jiezhen, Maxime Chevallier
In-Reply-To: <cover.1775618797.git.zhuyikai1@h-partners.com>

Add PF/VF statistics collection to the TX and RX paths and implement the
following ethtool callbacks:

  .get_sset_count
  .get_ethtool_stats
  .get_strings
  .get_eth_phy_stats
  .get_eth_mac_stats
  .get_eth_ctrl_stats
  .get_rmon_stats
  .get_pause_stats

These callbacks let users monitor detailed TX and RX netdev statistics
with ethtool.

Co-developed-by: Zhu Yikai <zhuyikai1@h-partners.com>
Signed-off-by: Zhu Yikai <zhuyikai1@h-partners.com>
Signed-off-by: Fan Gong <gongfan1@huawei.com>
---
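Reader's note: the ethtool statistics in this patch are table-driven. Each entry records a counter's name, size and offset inside a stats struct, and a generic reader widens the field to u64 (see struct hinic3_stats and get_val_of_ptr() in the diff). The stand-alone sketch below shows the same technique in user space. The struct, field names and example_* helpers are hypothetical and are not taken from the driver, and memcpy() is used here in place of the driver's pointer casts.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stats struct with mixed field widths. */
struct example_q_stats {
	uint64_t busy;
	uint32_t drops;
	uint16_t errs;
};

/* One descriptor per exported counter: name, field size, field offset. */
struct example_stat_desc {
	const char *name;
	size_t size;
	size_t offset;
};

#define EXAMPLE_STAT(_type, _field) { \
	#_field, sizeof(((_type *)0)->_field), offsetof(_type, _field) }

static const struct example_stat_desc example_descs[] = {
	EXAMPLE_STAT(struct example_q_stats, busy),
	EXAMPLE_STAT(struct example_q_stats, drops),
	EXAMPLE_STAT(struct example_q_stats, errs),
};

#define EXAMPLE_NUM_STATS (sizeof(example_descs) / sizeof(example_descs[0]))

/* Read one field by descriptor and widen it to 64 bits. */
static uint64_t example_read_field(const void *base,
				   const struct example_stat_desc *d)
{
	const unsigned char *p = (const unsigned char *)base + d->offset;
	uint64_t v64;
	uint32_t v32;
	uint16_t v16;

	switch (d->size) {
	case sizeof(uint64_t):
		memcpy(&v64, p, sizeof(v64));
		return v64;
	case sizeof(uint32_t):
		memcpy(&v32, p, sizeof(v32));
		return v32;
	case sizeof(uint16_t):
		memcpy(&v16, p, sizeof(v16));
		return v16;
	default:
		return *p;
	}
}

/* Fill the flat u64 array that an ethtool-style consumer expects and
 * return the number of slots written. */
static size_t example_fill(const struct example_q_stats *s, uint64_t *data)
{
	size_t i;

	for (i = 0; i < EXAMPLE_NUM_STATS; i++)
		data[i] = example_read_field(s, &example_descs[i]);

	return i;
}
```

Because names and values are produced from the same table in the same order, the string set and the value array stay aligned, which is the property the .get_strings and .get_ethtool_stats pair relies on.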
 .../ethernet/huawei/hinic3/hinic3_ethtool.c   | 485 ++++++++++++++++++
 .../ethernet/huawei/hinic3/hinic3_hw_intf.h   |  13 +-
 .../huawei/hinic3/hinic3_mgmt_interface.h     |  37 ++
 .../ethernet/huawei/hinic3/hinic3_nic_cfg.c   |  64 +++
 .../ethernet/huawei/hinic3/hinic3_nic_cfg.h   | 109 ++++
 .../net/ethernet/huawei/hinic3/hinic3_rx.c    |  59 ++-
 .../net/ethernet/huawei/hinic3/hinic3_rx.h    |  15 +-
 .../net/ethernet/huawei/hinic3/hinic3_tx.c    |  71 ++-
 .../net/ethernet/huawei/hinic3/hinic3_tx.h    |   2 +
 9 files changed, 845 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_ethtool.c b/drivers/net/ethernet/huawei/hinic3/hinic3_ethtool.c
index e47c3f43e7b9..be26698fc658 100644
--- a/drivers/net/ethernet/huawei/hinic3/hinic3_ethtool.c
+++ b/drivers/net/ethernet/huawei/hinic3/hinic3_ethtool.c
@@ -508,6 +508,483 @@ static int hinic3_set_ringparam(struct net_device *netdev,
 	return 0;
 }
 
+struct hinic3_stats {
+	char name[ETH_GSTRING_LEN];
+	u32  size;
+	int  offset;
+};
+
+#define HINIC3_RXQ_STAT(_stat_item) { \
+	.name   = "rxq%d_"#_stat_item, \
+	.size   = sizeof_field(struct hinic3_rxq_stats, _stat_item), \
+	.offset = offsetof(struct hinic3_rxq_stats, _stat_item) \
+}
+
+#define HINIC3_TXQ_STAT(_stat_item) { \
+	.name   = "txq%d_"#_stat_item, \
+	.size   = sizeof_field(struct hinic3_txq_stats, _stat_item), \
+	.offset = offsetof(struct hinic3_txq_stats, _stat_item) \
+}
+
+static struct hinic3_stats hinic3_rx_queue_stats[] = {
+	HINIC3_RXQ_STAT(csum_errors),
+	HINIC3_RXQ_STAT(other_errors),
+	HINIC3_RXQ_STAT(rx_buf_empty),
+	HINIC3_RXQ_STAT(alloc_skb_err),
+	HINIC3_RXQ_STAT(alloc_rx_buf_err),
+};
+
+static struct hinic3_stats hinic3_tx_queue_stats[] = {
+	HINIC3_TXQ_STAT(busy),
+	HINIC3_TXQ_STAT(skb_pad_err),
+	HINIC3_TXQ_STAT(frag_len_overflow),
+	HINIC3_TXQ_STAT(offload_cow_skb_err),
+	HINIC3_TXQ_STAT(map_frag_err),
+	HINIC3_TXQ_STAT(unknown_tunnel_pkt),
+	HINIC3_TXQ_STAT(frag_size_err),
+};
+
+#define HINIC3_FUNC_STAT(_stat_item) {	\
+	.name   = #_stat_item, \
+	.size   = sizeof_field(struct l2nic_vport_stats, _stat_item), \
+	.offset = offsetof(struct l2nic_vport_stats, _stat_item) \
+}
+
+static struct hinic3_stats hinic3_function_stats[] = {
+	HINIC3_FUNC_STAT(tx_unicast_pkts_vport),
+	HINIC3_FUNC_STAT(tx_unicast_bytes_vport),
+	HINIC3_FUNC_STAT(tx_multicast_pkts_vport),
+	HINIC3_FUNC_STAT(tx_multicast_bytes_vport),
+	HINIC3_FUNC_STAT(tx_broadcast_pkts_vport),
+	HINIC3_FUNC_STAT(tx_broadcast_bytes_vport),
+
+	HINIC3_FUNC_STAT(rx_unicast_pkts_vport),
+	HINIC3_FUNC_STAT(rx_unicast_bytes_vport),
+	HINIC3_FUNC_STAT(rx_multicast_pkts_vport),
+	HINIC3_FUNC_STAT(rx_multicast_bytes_vport),
+	HINIC3_FUNC_STAT(rx_broadcast_pkts_vport),
+	HINIC3_FUNC_STAT(rx_broadcast_bytes_vport),
+
+	HINIC3_FUNC_STAT(tx_discard_vport),
+	HINIC3_FUNC_STAT(rx_discard_vport),
+	HINIC3_FUNC_STAT(tx_err_vport),
+	HINIC3_FUNC_STAT(rx_err_vport),
+};
+
+#define HINIC3_PORT_STAT(_stat_item) { \
+	.name   = #_stat_item, \
+	.size   = sizeof_field(struct mag_cmd_port_stats, _stat_item), \
+	.offset = offsetof(struct mag_cmd_port_stats, _stat_item) \
+}
+
+static struct hinic3_stats hinic3_port_stats[] = {
+	HINIC3_PORT_STAT(mac_tx_fragment_pkt_num),
+	HINIC3_PORT_STAT(mac_tx_undersize_pkt_num),
+	HINIC3_PORT_STAT(mac_tx_undermin_pkt_num),
+	HINIC3_PORT_STAT(mac_tx_1519_max_bad_pkt_num),
+	HINIC3_PORT_STAT(mac_tx_1519_max_good_pkt_num),
+	HINIC3_PORT_STAT(mac_tx_oversize_pkt_num),
+	HINIC3_PORT_STAT(mac_tx_jabber_pkt_num),
+	HINIC3_PORT_STAT(mac_tx_bad_pkt_num),
+	HINIC3_PORT_STAT(mac_tx_bad_oct_num),
+	HINIC3_PORT_STAT(mac_tx_good_oct_num),
+	HINIC3_PORT_STAT(mac_tx_total_pkt_num),
+	HINIC3_PORT_STAT(mac_tx_uni_pkt_num),
+	HINIC3_PORT_STAT(mac_tx_pfc_pkt_num),
+	HINIC3_PORT_STAT(mac_tx_pfc_pri0_pkt_num),
+	HINIC3_PORT_STAT(mac_tx_pfc_pri1_pkt_num),
+	HINIC3_PORT_STAT(mac_tx_pfc_pri2_pkt_num),
+	HINIC3_PORT_STAT(mac_tx_pfc_pri3_pkt_num),
+	HINIC3_PORT_STAT(mac_tx_pfc_pri4_pkt_num),
+	HINIC3_PORT_STAT(mac_tx_pfc_pri5_pkt_num),
+	HINIC3_PORT_STAT(mac_tx_pfc_pri6_pkt_num),
+	HINIC3_PORT_STAT(mac_tx_pfc_pri7_pkt_num),
+	HINIC3_PORT_STAT(mac_tx_err_all_pkt_num),
+	HINIC3_PORT_STAT(mac_tx_from_app_good_pkt_num),
+	HINIC3_PORT_STAT(mac_tx_from_app_bad_pkt_num),
+
+	HINIC3_PORT_STAT(mac_rx_undermin_pkt_num),
+	HINIC3_PORT_STAT(mac_rx_1519_max_bad_pkt_num),
+	HINIC3_PORT_STAT(mac_rx_1519_max_good_pkt_num),
+	HINIC3_PORT_STAT(mac_rx_bad_pkt_num),
+	HINIC3_PORT_STAT(mac_rx_bad_oct_num),
+	HINIC3_PORT_STAT(mac_rx_good_oct_num),
+	HINIC3_PORT_STAT(mac_rx_total_pkt_num),
+	HINIC3_PORT_STAT(mac_rx_uni_pkt_num),
+	HINIC3_PORT_STAT(mac_rx_pfc_pkt_num),
+	HINIC3_PORT_STAT(mac_rx_pfc_pri0_pkt_num),
+	HINIC3_PORT_STAT(mac_rx_pfc_pri1_pkt_num),
+	HINIC3_PORT_STAT(mac_rx_pfc_pri2_pkt_num),
+	HINIC3_PORT_STAT(mac_rx_pfc_pri3_pkt_num),
+	HINIC3_PORT_STAT(mac_rx_pfc_pri4_pkt_num),
+	HINIC3_PORT_STAT(mac_rx_pfc_pri5_pkt_num),
+	HINIC3_PORT_STAT(mac_rx_pfc_pri6_pkt_num),
+	HINIC3_PORT_STAT(mac_rx_pfc_pri7_pkt_num),
+	HINIC3_PORT_STAT(mac_rx_send_app_good_pkt_num),
+	HINIC3_PORT_STAT(mac_rx_send_app_bad_pkt_num),
+	HINIC3_PORT_STAT(mac_rx_unfilter_pkt_num),
+};
+
+static int hinic3_get_sset_count(struct net_device *netdev, int sset)
+{
+	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
+	int count, q_num;
+
+	switch (sset) {
+	case ETH_SS_STATS:
+		q_num = nic_dev->q_params.num_qps;
+		count = ARRAY_SIZE(hinic3_function_stats) +
+			(ARRAY_SIZE(hinic3_tx_queue_stats) +
+			 ARRAY_SIZE(hinic3_rx_queue_stats)) *
+			q_num;
+
+		if (!HINIC3_IS_VF(nic_dev->hwdev))
+			count += ARRAY_SIZE(hinic3_port_stats);
+
+		return count;
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+
+static u64 get_val_of_ptr(u32 size, const void *ptr)
+{
+	u64 ret = size == sizeof(u64) ? *(const u64 *)ptr :
+		  size == sizeof(u32) ? *(const u32 *)ptr :
+		  size == sizeof(u16) ? *(const u16 *)ptr :
+		  *(const u8 *)ptr;
+
+	return ret;
+}
+
+static void hinic3_get_drv_queue_stats(struct net_device *netdev, u64 *data)
+{
+	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
+	struct hinic3_txq_stats txq_stats = {};
+	struct hinic3_rxq_stats rxq_stats = {};
+	u16 i = 0, j, qid;
+	char *p;
+
+	u64_stats_init(&txq_stats.syncp);
+	u64_stats_init(&rxq_stats.syncp);
+
+	for (qid = 0; qid < nic_dev->q_params.num_qps; qid++) {
+		if (!nic_dev->txqs)
+			break;
+
+		hinic3_txq_get_stats(&nic_dev->txqs[qid], &txq_stats);
+		for (j = 0; j < ARRAY_SIZE(hinic3_tx_queue_stats); j++, i++) {
+			p = (char *)&txq_stats +
+			    hinic3_tx_queue_stats[j].offset;
+			data[i] = get_val_of_ptr(hinic3_tx_queue_stats[j].size,
+						 p);
+		}
+	}
+
+	for (qid = 0; qid < nic_dev->q_params.num_qps; qid++) {
+		if (!nic_dev->rxqs)
+			break;
+
+		hinic3_rxq_get_stats(&nic_dev->rxqs[qid], &rxq_stats);
+		for (j = 0; j < ARRAY_SIZE(hinic3_rx_queue_stats); j++, i++) {
+			p = (char *)&rxq_stats +
+			    hinic3_rx_queue_stats[j].offset;
+			data[i] = get_val_of_ptr(hinic3_rx_queue_stats[j].size,
+						 p);
+		}
+	}
+}
+
+static u16 hinic3_get_ethtool_port_stats(struct net_device *netdev, u64 *data)
+{
+	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
+	struct mag_cmd_port_stats *ps;
+	u16 i = 0, j;
+	char *p;
+	int err;
+
+	ps = kmalloc_obj(*ps);
+	if (!ps)
+		goto err_zero_stats;
+
+	err = hinic3_get_phy_port_stats(nic_dev->hwdev, ps);
+	if (err) {
+		kfree(ps);
+		netdev_err(netdev, "Failed to get port stats from fw\n");
+		goto err_zero_stats;
+	}
+
+	for (j = 0; j < ARRAY_SIZE(hinic3_port_stats); j++, i++) {
+		p = (char *)ps + hinic3_port_stats[j].offset;
+		data[i] = get_val_of_ptr(hinic3_port_stats[j].size, p);
+	}
+
+	kfree(ps);
+
+	return i;
+
+err_zero_stats:
+	memset(&data[i], 0, ARRAY_SIZE(hinic3_port_stats) * sizeof(*data));
+
+	return i + ARRAY_SIZE(hinic3_port_stats);
+}
+
+static void hinic3_get_ethtool_stats(struct net_device *netdev,
+				     struct ethtool_stats *stats, u64 *data)
+{
+	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
+	struct l2nic_vport_stats vport_stats = {};
+	u16 i = 0, j;
+	char *p;
+	int err;
+
+	err = hinic3_get_vport_stats(nic_dev->hwdev,
+				     hinic3_global_func_id(nic_dev->hwdev),
+				     &vport_stats);
+	if (err)
+		netdev_err(netdev, "Failed to get function stats from fw\n");
+
+	for (j = 0; j < ARRAY_SIZE(hinic3_function_stats); j++, i++) {
+		p = (char *)&vport_stats + hinic3_function_stats[j].offset;
+		data[i] = get_val_of_ptr(hinic3_function_stats[j].size, p);
+	}
+
+	if (!HINIC3_IS_VF(nic_dev->hwdev))
+		i += hinic3_get_ethtool_port_stats(netdev, data + i);
+
+	hinic3_get_drv_queue_stats(netdev, data + i);
+}
+
+static u16 hinic3_get_hw_stats_strings(struct net_device *netdev, char *p)
+{
+	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
+	u16 i, cnt = 0;
+
+	for (i = 0; i < ARRAY_SIZE(hinic3_function_stats); i++) {
+		memcpy(p, hinic3_function_stats[i].name, ETH_GSTRING_LEN);
+		p += ETH_GSTRING_LEN;
+		cnt++;
+	}
+
+	if (!HINIC3_IS_VF(nic_dev->hwdev)) {
+		for (i = 0; i < ARRAY_SIZE(hinic3_port_stats); i++) {
+			memcpy(p, hinic3_port_stats[i].name, ETH_GSTRING_LEN);
+			p += ETH_GSTRING_LEN;
+			cnt++;
+		}
+	}
+
+	return cnt;
+}
+
+static void hinic3_get_qp_stats_strings(struct net_device *netdev, char *p)
+{
+	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
+	u8 *data = (u8 *)p;
+	u16 i, j;
+
+	for (i = 0; i < nic_dev->q_params.num_qps; i++) {
+		for (j = 0; j < ARRAY_SIZE(hinic3_tx_queue_stats); j++)
+			ethtool_sprintf(&data,
+					hinic3_tx_queue_stats[j].name, i);
+	}
+
+	for (i = 0; i < nic_dev->q_params.num_qps; i++) {
+		for (j = 0; j < ARRAY_SIZE(hinic3_rx_queue_stats); j++)
+			ethtool_sprintf(&data,
+					hinic3_rx_queue_stats[j].name, i);
+	}
+}
+
+static void hinic3_get_strings(struct net_device *netdev,
+			       u32 stringset, u8 *data)
+{
+	char *p = (char *)data;
+	u16 offset;
+
+	switch (stringset) {
+	case ETH_SS_STATS:
+		offset = hinic3_get_hw_stats_strings(netdev, p);
+		hinic3_get_qp_stats_strings(netdev,
+					    p + offset * ETH_GSTRING_LEN);
+
+		return;
+	default:
+		netdev_err(netdev, "Invalid string set %u.\n", stringset);
+		return;
+	}
+}
+
+static void hinic3_get_eth_phy_stats(struct net_device *netdev,
+				     struct ethtool_eth_phy_stats *phy_stats)
+{
+	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
+	struct mag_cmd_port_stats *ps;
+	int err;
+
+	ps = kmalloc_obj(*ps);
+	if (!ps)
+		return;
+
+	err = hinic3_get_phy_port_stats(nic_dev->hwdev, ps);
+	if (err) {
+		kfree(ps);
+		netdev_err(netdev, "Failed to get eth phy stats from fw\n");
+		return;
+	}
+
+	phy_stats->SymbolErrorDuringCarrier = ps->mac_rx_sym_err_pkt_num;
+
+	kfree(ps);
+}
+
+static void hinic3_get_eth_mac_stats(struct net_device *netdev,
+				     struct ethtool_eth_mac_stats *mac_stats)
+{
+	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
+	struct mag_cmd_port_stats *ps;
+	int err;
+
+	ps = kmalloc_obj(*ps);
+	if (!ps)
+		return;
+
+	err = hinic3_get_phy_port_stats(nic_dev->hwdev, ps);
+	if (err) {
+		kfree(ps);
+		netdev_err(netdev, "Failed to get eth mac stats from fw\n");
+		return;
+	}
+
+	mac_stats->FramesTransmittedOK = ps->mac_tx_good_pkt_num;
+	mac_stats->FramesReceivedOK = ps->mac_rx_good_pkt_num;
+	mac_stats->FrameCheckSequenceErrors = ps->mac_rx_fcs_err_pkt_num;
+	mac_stats->OctetsTransmittedOK = ps->mac_tx_total_oct_num;
+	mac_stats->OctetsReceivedOK = ps->mac_rx_total_oct_num;
+	mac_stats->MulticastFramesXmittedOK = ps->mac_tx_multi_pkt_num;
+	mac_stats->BroadcastFramesXmittedOK = ps->mac_tx_broad_pkt_num;
+	mac_stats->MulticastFramesReceivedOK = ps->mac_rx_multi_pkt_num;
+	mac_stats->BroadcastFramesReceivedOK = ps->mac_rx_broad_pkt_num;
+
+	kfree(ps);
+}
+
+static void hinic3_get_eth_ctrl_stats(struct net_device *netdev,
+				      struct ethtool_eth_ctrl_stats *ctrl_stats)
+{
+	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
+	struct mag_cmd_port_stats *ps;
+	int err;
+
+	ps = kmalloc_obj(*ps);
+	if (!ps)
+		return;
+
+	err = hinic3_get_phy_port_stats(nic_dev->hwdev, ps);
+	if (err) {
+		kfree(ps);
+		netdev_err(netdev, "Failed to get eth ctrl stats from fw\n");
+		return;
+	}
+
+	ctrl_stats->MACControlFramesTransmitted = ps->mac_tx_control_pkt_num;
+	ctrl_stats->MACControlFramesReceived = ps->mac_rx_control_pkt_num;
+
+	kfree(ps);
+}
+
+static const struct ethtool_rmon_hist_range hinic3_rmon_ranges[] = {
+	{     0,    64 },
+	{    65,   127 },
+	{   128,   255 },
+	{   256,   511 },
+	{   512,  1023 },
+	{  1024,  1518 },
+	{  1519,  2047 },
+	{  2048,  4095 },
+	{  4096,  8191 },
+	{  8192,  9216 },
+	{  9217, 12287 },
+	{}
+};
+
+static void hinic3_get_rmon_stats(struct net_device *netdev,
+				  struct ethtool_rmon_stats *rmon_stats,
+				  const struct ethtool_rmon_hist_range **ranges)
+{
+	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
+	struct mag_cmd_port_stats *ps;
+	int err;
+
+	ps = kmalloc_obj(*ps);
+	if (!ps)
+		return;
+
+	err = hinic3_get_phy_port_stats(nic_dev->hwdev, ps);
+	if (err) {
+		kfree(ps);
+		netdev_err(netdev, "Failed to get eth rmon stats from fw\n");
+		return;
+	}
+
+	rmon_stats->undersize_pkts	= ps->mac_rx_undersize_pkt_num;
+	rmon_stats->oversize_pkts	= ps->mac_rx_oversize_pkt_num;
+	rmon_stats->fragments		= ps->mac_rx_fragment_pkt_num;
+	rmon_stats->jabbers		= ps->mac_rx_jabber_pkt_num;
+
+	rmon_stats->hist[0]		= ps->mac_rx_64_oct_pkt_num;
+	rmon_stats->hist[1]		= ps->mac_rx_65_127_oct_pkt_num;
+	rmon_stats->hist[2]		= ps->mac_rx_128_255_oct_pkt_num;
+	rmon_stats->hist[3]		= ps->mac_rx_256_511_oct_pkt_num;
+	rmon_stats->hist[4]		= ps->mac_rx_512_1023_oct_pkt_num;
+	rmon_stats->hist[5]		= ps->mac_rx_1024_1518_oct_pkt_num;
+	rmon_stats->hist[6]		= ps->mac_rx_1519_2047_oct_pkt_num;
+	rmon_stats->hist[7]		= ps->mac_rx_2048_4095_oct_pkt_num;
+	rmon_stats->hist[8]		= ps->mac_rx_4096_8191_oct_pkt_num;
+	rmon_stats->hist[9]		= ps->mac_rx_8192_9216_oct_pkt_num;
+	rmon_stats->hist[10]		= ps->mac_rx_9217_12287_oct_pkt_num;
+
+	rmon_stats->hist_tx[0]		= ps->mac_tx_64_oct_pkt_num;
+	rmon_stats->hist_tx[1]		= ps->mac_tx_65_127_oct_pkt_num;
+	rmon_stats->hist_tx[2]		= ps->mac_tx_128_255_oct_pkt_num;
+	rmon_stats->hist_tx[3]		= ps->mac_tx_256_511_oct_pkt_num;
+	rmon_stats->hist_tx[4]		= ps->mac_tx_512_1023_oct_pkt_num;
+	rmon_stats->hist_tx[5]		= ps->mac_tx_1024_1518_oct_pkt_num;
+	rmon_stats->hist_tx[6]		= ps->mac_tx_1519_2047_oct_pkt_num;
+	rmon_stats->hist_tx[7]		= ps->mac_tx_2048_4095_oct_pkt_num;
+	rmon_stats->hist_tx[8]		= ps->mac_tx_4096_8191_oct_pkt_num;
+	rmon_stats->hist_tx[9]		= ps->mac_tx_8192_9216_oct_pkt_num;
+	rmon_stats->hist_tx[10]		= ps->mac_tx_9217_12287_oct_pkt_num;
+
+	*ranges = hinic3_rmon_ranges;
+
+	kfree(ps);
+}
+
+static void hinic3_get_pause_stats(struct net_device *netdev,
+				   struct ethtool_pause_stats *pause_stats)
+{
+	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
+	struct mag_cmd_port_stats *ps;
+	int err;
+
+	ps = kmalloc_obj(*ps);
+	if (!ps)
+		return;
+
+	err = hinic3_get_phy_port_stats(nic_dev->hwdev, ps);
+	if (err) {
+		kfree(ps);
+		netdev_err(netdev, "Failed to get eth pause stats from fw\n");
+		return;
+	}
+
+	pause_stats->tx_pause_frames = ps->mac_tx_pause_num;
+	pause_stats->rx_pause_frames = ps->mac_rx_pause_num;
+
+	kfree(ps);
+}
+
 static const struct ethtool_ops hinic3_ethtool_ops = {
 	.supported_coalesce_params      = ETHTOOL_COALESCE_USECS |
 					  ETHTOOL_COALESCE_PKT_RATE_RX_USECS,
@@ -518,6 +995,14 @@ static const struct ethtool_ops hinic3_ethtool_ops = {
 	.get_link                       = ethtool_op_get_link,
 	.get_ringparam                  = hinic3_get_ringparam,
 	.set_ringparam                  = hinic3_set_ringparam,
+	.get_sset_count                 = hinic3_get_sset_count,
+	.get_ethtool_stats              = hinic3_get_ethtool_stats,
+	.get_strings                    = hinic3_get_strings,
+	.get_eth_phy_stats              = hinic3_get_eth_phy_stats,
+	.get_eth_mac_stats              = hinic3_get_eth_mac_stats,
+	.get_eth_ctrl_stats             = hinic3_get_eth_ctrl_stats,
+	.get_rmon_stats                 = hinic3_get_rmon_stats,
+	.get_pause_stats                = hinic3_get_pause_stats,
 };
 
 void hinic3_set_ethtool_ops(struct net_device *netdev)
diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_hw_intf.h b/drivers/net/ethernet/huawei/hinic3/hinic3_hw_intf.h
index cfc9daa3034f..0b2ebef04c02 100644
--- a/drivers/net/ethernet/huawei/hinic3/hinic3_hw_intf.h
+++ b/drivers/net/ethernet/huawei/hinic3/hinic3_hw_intf.h
@@ -51,7 +51,18 @@ static inline void mgmt_msg_params_init_default(struct mgmt_msg_params *msg_para
 	msg_params->in_size = buf_size;
 	msg_params->expected_out_size = buf_size;
 	msg_params->timeout_ms = 0;
 }
+
+static inline void
+mgmt_msg_params_init_in_out(struct mgmt_msg_params *msg_params, void *in_buf,
+			    void *out_buf, u32 in_buf_size, u32 out_buf_size)
+{
+	msg_params->buf_in = in_buf;
+	msg_params->buf_out = out_buf;
+	msg_params->in_size = in_buf_size;
+	msg_params->expected_out_size = out_buf_size;
+	msg_params->timeout_ms = 0;
+}
 
 enum cfg_cmd {
 	CFG_CMD_GET_DEV_CAP = 0,
diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_mgmt_interface.h b/drivers/net/ethernet/huawei/hinic3/hinic3_mgmt_interface.h
index c5bca3c4af96..76c691f82703 100644
--- a/drivers/net/ethernet/huawei/hinic3/hinic3_mgmt_interface.h
+++ b/drivers/net/ethernet/huawei/hinic3/hinic3_mgmt_interface.h
@@ -143,6 +143,41 @@ struct l2nic_cmd_set_dcb_state {
 	u8                   rsvd[7];
 };
 
+struct l2nic_port_stats_info {
+	struct mgmt_msg_head msg_head;
+	u16                  func_id;
+	u16                  rsvd1;
+};
+
+struct l2nic_vport_stats {
+	u64 tx_unicast_pkts_vport;
+	u64 tx_unicast_bytes_vport;
+	u64 tx_multicast_pkts_vport;
+	u64 tx_multicast_bytes_vport;
+	u64 tx_broadcast_pkts_vport;
+	u64 tx_broadcast_bytes_vport;
+
+	u64 rx_unicast_pkts_vport;
+	u64 rx_unicast_bytes_vport;
+	u64 rx_multicast_pkts_vport;
+	u64 rx_multicast_bytes_vport;
+	u64 rx_broadcast_pkts_vport;
+	u64 rx_broadcast_bytes_vport;
+
+	u64 tx_discard_vport;
+	u64 rx_discard_vport;
+	u64 tx_err_vport;
+	u64 rx_err_vport;
+};
+
+struct l2nic_cmd_vport_stats {
+	struct mgmt_msg_head     msg_head;
+	u32                      stats_size;
+	u32                      rsvd1;
+	struct l2nic_vport_stats stats;
+	u64                      rsvd2[6];
+};
+
 struct l2nic_cmd_lro_config {
 	struct mgmt_msg_head msg_head;
 	u16                  func_id;
@@ -234,6 +269,7 @@ enum l2nic_cmd {
 	L2NIC_CMD_SET_VPORT_ENABLE    = 6,
 	L2NIC_CMD_SET_RX_MODE         = 7,
 	L2NIC_CMD_SET_SQ_CI_ATTR      = 8,
+	L2NIC_CMD_GET_VPORT_STAT      = 9,
 	L2NIC_CMD_CLEAR_QP_RESOURCE   = 11,
 	L2NIC_CMD_CFG_RX_LRO          = 13,
 	L2NIC_CMD_CFG_LRO_TIMER       = 14,
@@ -272,6 +308,7 @@ enum mag_cmd {
 	MAG_CMD_SET_PORT_ENABLE = 6,
 	MAG_CMD_GET_LINK_STATUS = 7,
 
+	MAG_CMD_GET_PORT_STAT   = 151,
 	MAG_CMD_GET_PORT_INFO   = 153,
 };
 
diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_nic_cfg.c b/drivers/net/ethernet/huawei/hinic3/hinic3_nic_cfg.c
index de5a7984d2cb..1b14dc824ce1 100644
--- a/drivers/net/ethernet/huawei/hinic3/hinic3_nic_cfg.c
+++ b/drivers/net/ethernet/huawei/hinic3/hinic3_nic_cfg.c
@@ -639,6 +639,42 @@ int hinic3_get_link_status(struct hinic3_hwdev *hwdev, bool *link_status_up)
 	return 0;
 }
 
+int hinic3_get_phy_port_stats(struct hinic3_hwdev *hwdev,
+			      struct mag_cmd_port_stats *stats)
+{
+	struct mag_cmd_port_stats_info stats_info = {};
+	struct mag_cmd_get_port_stat *ps;
+	struct mgmt_msg_params msg_params = {};
+	int err;
+
+	ps = kzalloc_obj(*ps);
+	if (!ps)
+		return -ENOMEM;
+
+	stats_info.port_id = hinic3_physical_port_id(hwdev);
+
+	mgmt_msg_params_init_in_out(&msg_params, &stats_info, ps,
+				    sizeof(stats_info), sizeof(*ps));
+
+	err = hinic3_send_mbox_to_mgmt(hwdev, MGMT_MOD_HILINK,
+				       MAG_CMD_GET_PORT_STAT, &msg_params);
+
+	if (err || ps->head.status) {
+		dev_err(hwdev->dev,
+			"Failed to get port statistics, err: %d, status: 0x%x\n",
+			err, ps->head.status);
+		err = -EFAULT;
+		goto out;
+	}
+
+	memcpy(stats, &ps->counter, sizeof(*stats));
+
+out:
+	kfree(ps);
+
+	return err;
+}
+
 int hinic3_get_port_info(struct hinic3_hwdev *hwdev,
 			 struct hinic3_nic_port_info *port_info)
 {
@@ -738,3 +774,31 @@ int hinic3_get_pause_info(struct hinic3_nic_dev *nic_dev,
 	return hinic3_cfg_hw_pause(nic_dev->hwdev, MGMT_MSG_CMD_OP_GET,
 				   nic_pause);
 }
+
+int hinic3_get_vport_stats(struct hinic3_hwdev *hwdev, u16 func_id,
+			   struct l2nic_vport_stats *stats)
+{
+	struct l2nic_cmd_vport_stats vport_stats = {};
+	struct l2nic_port_stats_info stats_info = {};
+	struct mgmt_msg_params msg_params = {};
+	int err;
+
+	stats_info.func_id = func_id;
+
+	mgmt_msg_params_init_in_out(&msg_params, &stats_info, &vport_stats,
+				    sizeof(stats_info), sizeof(vport_stats));
+
+	err = hinic3_send_mbox_to_mgmt(hwdev, MGMT_MOD_L2NIC,
+				       L2NIC_CMD_GET_VPORT_STAT, &msg_params);
+
+	if (err || vport_stats.msg_head.status) {
+		dev_err(hwdev->dev,
+			"Failed to get function statistics, err: %d, status: 0x%x\n",
+			err, vport_stats.msg_head.status);
+		return -EFAULT;
+	}
+
+	memcpy(stats, &vport_stats.stats, sizeof(*stats));
+
+	return 0;
+}
diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_nic_cfg.h b/drivers/net/ethernet/huawei/hinic3/hinic3_nic_cfg.h
index 5d52202a8d4e..80573c121539 100644
--- a/drivers/net/ethernet/huawei/hinic3/hinic3_nic_cfg.h
+++ b/drivers/net/ethernet/huawei/hinic3/hinic3_nic_cfg.h
@@ -129,6 +129,110 @@ struct mag_cmd_get_xsfp_present {
 	u8                   rsvd[2];
 };
 
+struct mag_cmd_port_stats {
+	u64 mac_tx_fragment_pkt_num;
+	u64 mac_tx_undersize_pkt_num;
+	u64 mac_tx_undermin_pkt_num;
+	u64 mac_tx_64_oct_pkt_num;
+	u64 mac_tx_65_127_oct_pkt_num;
+	u64 mac_tx_128_255_oct_pkt_num;
+	u64 mac_tx_256_511_oct_pkt_num;
+	u64 mac_tx_512_1023_oct_pkt_num;
+	u64 mac_tx_1024_1518_oct_pkt_num;
+	u64 mac_tx_1519_2047_oct_pkt_num;
+	u64 mac_tx_2048_4095_oct_pkt_num;
+	u64 mac_tx_4096_8191_oct_pkt_num;
+	u64 mac_tx_8192_9216_oct_pkt_num;
+	u64 mac_tx_9217_12287_oct_pkt_num;
+	u64 mac_tx_12288_16383_oct_pkt_num;
+	u64 mac_tx_1519_max_bad_pkt_num;
+	u64 mac_tx_1519_max_good_pkt_num;
+	u64 mac_tx_oversize_pkt_num;
+	u64 mac_tx_jabber_pkt_num;
+	u64 mac_tx_bad_pkt_num;
+	u64 mac_tx_bad_oct_num;
+	u64 mac_tx_good_pkt_num;
+	u64 mac_tx_good_oct_num;
+	u64 mac_tx_total_pkt_num;
+	u64 mac_tx_total_oct_num;
+	u64 mac_tx_uni_pkt_num;
+	u64 mac_tx_multi_pkt_num;
+	u64 mac_tx_broad_pkt_num;
+	u64 mac_tx_pause_num;
+	u64 mac_tx_pfc_pkt_num;
+	u64 mac_tx_pfc_pri0_pkt_num;
+	u64 mac_tx_pfc_pri1_pkt_num;
+	u64 mac_tx_pfc_pri2_pkt_num;
+	u64 mac_tx_pfc_pri3_pkt_num;
+	u64 mac_tx_pfc_pri4_pkt_num;
+	u64 mac_tx_pfc_pri5_pkt_num;
+	u64 mac_tx_pfc_pri6_pkt_num;
+	u64 mac_tx_pfc_pri7_pkt_num;
+	u64 mac_tx_control_pkt_num;
+	u64 mac_tx_err_all_pkt_num;
+	u64 mac_tx_from_app_good_pkt_num;
+	u64 mac_tx_from_app_bad_pkt_num;
+
+	u64 mac_rx_fragment_pkt_num;
+	u64 mac_rx_undersize_pkt_num;
+	u64 mac_rx_undermin_pkt_num;
+	u64 mac_rx_64_oct_pkt_num;
+	u64 mac_rx_65_127_oct_pkt_num;
+	u64 mac_rx_128_255_oct_pkt_num;
+	u64 mac_rx_256_511_oct_pkt_num;
+	u64 mac_rx_512_1023_oct_pkt_num;
+	u64 mac_rx_1024_1518_oct_pkt_num;
+	u64 mac_rx_1519_2047_oct_pkt_num;
+	u64 mac_rx_2048_4095_oct_pkt_num;
+	u64 mac_rx_4096_8191_oct_pkt_num;
+	u64 mac_rx_8192_9216_oct_pkt_num;
+	u64 mac_rx_9217_12287_oct_pkt_num;
+	u64 mac_rx_12288_16383_oct_pkt_num;
+	u64 mac_rx_1519_max_bad_pkt_num;
+	u64 mac_rx_1519_max_good_pkt_num;
+	u64 mac_rx_oversize_pkt_num;
+	u64 mac_rx_jabber_pkt_num;
+	u64 mac_rx_bad_pkt_num;
+	u64 mac_rx_bad_oct_num;
+	u64 mac_rx_good_pkt_num;
+	u64 mac_rx_good_oct_num;
+	u64 mac_rx_total_pkt_num;
+	u64 mac_rx_total_oct_num;
+	u64 mac_rx_uni_pkt_num;
+	u64 mac_rx_multi_pkt_num;
+	u64 mac_rx_broad_pkt_num;
+	u64 mac_rx_pause_num;
+	u64 mac_rx_pfc_pkt_num;
+	u64 mac_rx_pfc_pri0_pkt_num;
+	u64 mac_rx_pfc_pri1_pkt_num;
+	u64 mac_rx_pfc_pri2_pkt_num;
+	u64 mac_rx_pfc_pri3_pkt_num;
+	u64 mac_rx_pfc_pri4_pkt_num;
+	u64 mac_rx_pfc_pri5_pkt_num;
+	u64 mac_rx_pfc_pri6_pkt_num;
+	u64 mac_rx_pfc_pri7_pkt_num;
+	u64 mac_rx_control_pkt_num;
+	u64 mac_rx_sym_err_pkt_num;
+	u64 mac_rx_fcs_err_pkt_num;
+	u64 mac_rx_send_app_good_pkt_num;
+	u64 mac_rx_send_app_bad_pkt_num;
+	u64 mac_rx_unfilter_pkt_num;
+};
+
+struct mag_cmd_port_stats_info {
+	struct mgmt_msg_head head;
+
+	u8                   port_id;
+	u8                   rsvd0[3];
+};
+
+struct mag_cmd_get_port_stat {
+	struct mgmt_msg_head      head;
+
+	struct mag_cmd_port_stats counter;
+	u64                       rsvd1[15];
+};
+
 enum link_err_type {
 	LINK_ERR_MODULE_UNRECOGENIZED,
 	LINK_ERR_NUM,
@@ -209,6 +313,11 @@ int hinic3_get_port_info(struct hinic3_hwdev *hwdev,
 			 struct hinic3_nic_port_info *port_info);
 int hinic3_set_vport_enable(struct hinic3_hwdev *hwdev, u16 func_id,
 			    bool enable);
+int hinic3_get_phy_port_stats(struct hinic3_hwdev *hwdev,
+			      struct mag_cmd_port_stats *stats);
+int hinic3_get_vport_stats(struct hinic3_hwdev *hwdev, u16 func_id,
+			   struct l2nic_vport_stats *stats);
+
 int hinic3_add_vlan(struct hinic3_hwdev *hwdev, u16 vlan_id, u16 func_id);
 int hinic3_del_vlan(struct hinic3_hwdev *hwdev, u16 vlan_id, u16 func_id);
 
diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_rx.c b/drivers/net/ethernet/huawei/hinic3/hinic3_rx.c
index 309ab5901379..7fadb88ff722 100644
--- a/drivers/net/ethernet/huawei/hinic3/hinic3_rx.c
+++ b/drivers/net/ethernet/huawei/hinic3/hinic3_rx.c
@@ -29,7 +29,7 @@
 #define HINIC3_LRO_PKT_HDR_LEN_IPV4     66
 #define HINIC3_LRO_PKT_HDR_LEN_IPV6     86
 #define HINIC3_LRO_PKT_HDR_LEN(cqe) \
-	(RQ_CQE_OFFOLAD_TYPE_GET((cqe)->offload_type, IP_TYPE) == \
+	(RQ_CQE_OFFOLAD_TYPE_GET(le32_to_cpu((cqe)->offload_type), IP_TYPE) == \
 	 HINIC3_RX_IPV6_PKT ? HINIC3_LRO_PKT_HDR_LEN_IPV6 : \
 	 HINIC3_LRO_PKT_HDR_LEN_IPV4)
 
@@ -46,7 +46,6 @@ static void hinic3_rxq_clean_stats(struct hinic3_rxq_stats *rxq_stats)
 
 	rxq_stats->alloc_skb_err = 0;
 	rxq_stats->alloc_rx_buf_err = 0;
-	rxq_stats->restore_drop_sge = 0;
 	u64_stats_update_end(&rxq_stats->syncp);
 }
 
@@ -155,8 +154,12 @@ static u32 hinic3_rx_fill_buffers(struct hinic3_rxq *rxq)
 
 		err = rx_alloc_mapped_page(rxq->page_pool, rx_info,
 					   rxq->buf_len);
-		if (unlikely(err))
+		if (unlikely(err)) {
+			u64_stats_update_begin(&rxq->rxq_stats.syncp);
+			rxq->rxq_stats.alloc_rx_buf_err++;
+			u64_stats_update_end(&rxq->rxq_stats.syncp);
 			break;
+		}
 
 		dma_addr = page_pool_get_dma_addr(rx_info->page) +
 			rx_info->page_offset;
@@ -170,6 +173,10 @@ static u32 hinic3_rx_fill_buffers(struct hinic3_rxq *rxq)
 				rxq->next_to_update << HINIC3_NORMAL_RQ_WQE);
 		rxq->delta -= i;
 		rxq->next_to_alloc = rxq->next_to_update;
+	} else if (free_wqebbs == rxq->q_depth - 1) {
+		u64_stats_update_begin(&rxq->rxq_stats.syncp);
+		rxq->rxq_stats.rx_buf_empty++;
+		u64_stats_update_end(&rxq->rxq_stats.syncp);
 	}
 
 	return i;
@@ -330,11 +337,23 @@ static void hinic3_rx_csum(struct hinic3_rxq *rxq, u32 offload_type,
 	struct net_device *netdev = rxq->netdev;
 	bool l2_tunnel;
 
+	if (unlikely(csum_err == HINIC3_RX_CSUM_IPSU_OTHER_ERR)) {
+		u64_stats_update_begin(&rxq->rxq_stats.syncp);
+		rxq->rxq_stats.other_errors++;
+		u64_stats_update_end(&rxq->rxq_stats.syncp);
+	}
+
 	if (!(netdev->features & NETIF_F_RXCSUM))
 		return;
 
 	if (unlikely(csum_err)) {
 		/* pkt type is recognized by HW, and csum is wrong */
+		if (!(csum_err & (HINIC3_RX_CSUM_HW_CHECK_NONE |
+				  HINIC3_RX_CSUM_IPSU_OTHER_ERR))) {
+			u64_stats_update_begin(&rxq->rxq_stats.syncp);
+			rxq->rxq_stats.csum_errors++;
+			u64_stats_update_end(&rxq->rxq_stats.syncp);
+		}
 		skb->ip_summed = CHECKSUM_NONE;
 		return;
 	}
@@ -387,8 +406,12 @@ static int recv_one_pkt(struct hinic3_rxq *rxq, struct hinic3_rq_cqe *rx_cqe,
 	u16 num_lro;
 
 	skb = hinic3_fetch_rx_buffer(rxq, pkt_len);
-	if (unlikely(!skb))
+	if (unlikely(!skb)) {
+		u64_stats_update_begin(&rxq->rxq_stats.syncp);
+		rxq->rxq_stats.alloc_skb_err++;
+		u64_stats_update_end(&rxq->rxq_stats.syncp);
 		return -ENOMEM;
+	}
 
 	/* place header in linear portion of buffer */
 	if (skb_is_nonlinear(skb))
@@ -550,11 +573,28 @@ int hinic3_configure_rxqs(struct net_device *netdev, u16 num_rq,
 	return 0;
 }
 
+void hinic3_rxq_get_stats(struct hinic3_rxq *rxq,
+			  struct hinic3_rxq_stats *stats)
+{
+	struct hinic3_rxq_stats *rxq_stats = &rxq->rxq_stats;
+	unsigned int start;
+
+	do {
+		start = u64_stats_fetch_begin(&rxq_stats->syncp);
+		stats->csum_errors = rxq_stats->csum_errors;
+		stats->other_errors = rxq_stats->other_errors;
+		stats->rx_buf_empty = rxq_stats->rx_buf_empty;
+		stats->alloc_skb_err = rxq_stats->alloc_skb_err;
+		stats->alloc_rx_buf_err = rxq_stats->alloc_rx_buf_err;
+	} while (u64_stats_fetch_retry(&rxq_stats->syncp, start));
+}
+
 int hinic3_rx_poll(struct hinic3_rxq *rxq, int budget)
 {
 	struct hinic3_nic_dev *nic_dev = netdev_priv(rxq->netdev);
 	u32 sw_ci, status, pkt_len, vlan_len;
 	struct hinic3_rq_cqe *rx_cqe;
+	u64 rx_bytes = 0;
 	u32 num_wqe = 0;
 	int nr_pkts = 0;
 	u16 num_lro;
@@ -574,10 +614,14 @@ int hinic3_rx_poll(struct hinic3_rxq *rxq, int budget)
 		if (recv_one_pkt(rxq, rx_cqe, pkt_len, vlan_len, status))
 			break;
 
+		rx_bytes += pkt_len;
 		nr_pkts++;
 		num_lro = RQ_CQE_STATUS_GET(status, NUM_LRO);
-		if (num_lro)
+		if (num_lro) {
+			rx_bytes += (num_lro - 1) *
+				    HINIC3_LRO_PKT_HDR_LEN(rx_cqe);
 			num_wqe += hinic3_get_sge_num(rxq, pkt_len);
+		}
 
 		rx_cqe->status = 0;
 
@@ -588,5 +632,10 @@ int hinic3_rx_poll(struct hinic3_rxq *rxq, int budget)
 	if (rxq->delta >= HINIC3_RX_BUFFER_WRITE)
 		hinic3_rx_fill_buffers(rxq);
 
+	u64_stats_update_begin(&rxq->rxq_stats.syncp);
+	rxq->rxq_stats.packets += (u64)nr_pkts;
+	rxq->rxq_stats.bytes += rx_bytes;
+	u64_stats_update_end(&rxq->rxq_stats.syncp);
+
 	return nr_pkts;
 }
diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_rx.h b/drivers/net/ethernet/huawei/hinic3/hinic3_rx.h
index 06d1b3299e7c..c11d080408a7 100644
--- a/drivers/net/ethernet/huawei/hinic3/hinic3_rx.h
+++ b/drivers/net/ethernet/huawei/hinic3/hinic3_rx.h
@@ -8,6 +8,17 @@
 #include <linux/dim.h>
 #include <linux/netdevice.h>
 
+/* rx cqe checksum err */
+#define HINIC3_RX_CSUM_IP_CSUM_ERR      BIT(0)
+#define HINIC3_RX_CSUM_TCP_CSUM_ERR     BIT(1)
+#define HINIC3_RX_CSUM_UDP_CSUM_ERR     BIT(2)
+#define HINIC3_RX_CSUM_IGMP_CSUM_ERR    BIT(3)
+#define HINIC3_RX_CSUM_ICMPV4_CSUM_ERR  BIT(4)
+#define HINIC3_RX_CSUM_ICMPV6_CSUM_ERR  BIT(5)
+#define HINIC3_RX_CSUM_SCTP_CRC_ERR     BIT(6)
+#define HINIC3_RX_CSUM_HW_CHECK_NONE    BIT(7)
+#define HINIC3_RX_CSUM_IPSU_OTHER_ERR   BIT(8)
+
 #define RQ_CQE_OFFOLAD_TYPE_PKT_TYPE_MASK           GENMASK(4, 0)
 #define RQ_CQE_OFFOLAD_TYPE_IP_TYPE_MASK            GENMASK(6, 5)
 #define RQ_CQE_OFFOLAD_TYPE_TUNNEL_PKT_FORMAT_MASK  GENMASK(11, 8)
@@ -39,7 +50,6 @@ struct hinic3_rxq_stats {
 	u64                   rx_buf_empty;
 	u64                   alloc_skb_err;
 	u64                   alloc_rx_buf_err;
-	u64                   restore_drop_sge;
 	struct u64_stats_sync syncp;
 };
 
@@ -123,6 +133,9 @@ void hinic3_free_rxqs_res(struct net_device *netdev, u16 num_rq,
 			  u32 rq_depth, struct hinic3_dyna_rxq_res *rxqs_res);
 int hinic3_configure_rxqs(struct net_device *netdev, u16 num_rq,
 			  u32 rq_depth, struct hinic3_dyna_rxq_res *rxqs_res);
+
+void hinic3_rxq_get_stats(struct hinic3_rxq *rxq,
+			  struct hinic3_rxq_stats *stats);
 int hinic3_rx_poll(struct hinic3_rxq *rxq, int budget);
 
 #endif
diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_tx.c b/drivers/net/ethernet/huawei/hinic3/hinic3_tx.c
index 9306bf0020ca..3fbbfa5d96b6 100644
--- a/drivers/net/ethernet/huawei/hinic3/hinic3_tx.c
+++ b/drivers/net/ethernet/huawei/hinic3/hinic3_tx.c
@@ -97,8 +97,12 @@ static int hinic3_tx_map_skb(struct net_device *netdev, struct sk_buff *skb,
 
 	dma_info[0].dma = dma_map_single(&pdev->dev, skb->data,
 					 skb_headlen(skb), DMA_TO_DEVICE);
-	if (dma_mapping_error(&pdev->dev, dma_info[0].dma))
+	if (dma_mapping_error(&pdev->dev, dma_info[0].dma)) {
+		u64_stats_update_begin(&txq->txq_stats.syncp);
+		txq->txq_stats.map_frag_err++;
+		u64_stats_update_end(&txq->txq_stats.syncp);
 		return -EFAULT;
+	}
 
 	dma_info[0].len = skb_headlen(skb);
 
@@ -117,6 +121,9 @@ static int hinic3_tx_map_skb(struct net_device *netdev, struct sk_buff *skb,
 						     skb_frag_size(frag),
 						     DMA_TO_DEVICE);
 		if (dma_mapping_error(&pdev->dev, dma_info[idx].dma)) {
+			u64_stats_update_begin(&txq->txq_stats.syncp);
+			txq->txq_stats.map_frag_err++;
+			u64_stats_update_end(&txq->txq_stats.syncp);
 			err = -EFAULT;
 			goto err_unmap_page;
 		}
@@ -260,6 +267,9 @@ static int hinic3_tx_csum(struct hinic3_txq *txq, struct hinic3_sq_task *task,
 		if (l4_proto != IPPROTO_UDP ||
 		    ((struct udphdr *)skb_transport_header(skb))->dest !=
 		    VXLAN_OFFLOAD_PORT_LE) {
+			u64_stats_update_begin(&txq->txq_stats.syncp);
+			txq->txq_stats.unknown_tunnel_pkt++;
+			u64_stats_update_end(&txq->txq_stats.syncp);
 			/* Unsupported tunnel packet, disable csum offload */
 			skb_checksum_help(skb);
 			return 0;
@@ -433,6 +443,27 @@ static u32 hinic3_tx_offload(struct sk_buff *skb, struct hinic3_sq_task *task,
 	return offload;
 }
 
+static void hinic3_get_pkt_stats(struct hinic3_txq *txq, struct sk_buff *skb)
+{
+	u32 hdr_len, tx_bytes;
+	unsigned short pkts;
+
+	if (skb_is_gso(skb)) {
+		hdr_len = (skb_shinfo(skb)->gso_segs - 1) *
+			  skb_tcp_all_headers(skb);
+		tx_bytes = skb->len + hdr_len;
+		pkts = skb_shinfo(skb)->gso_segs;
+	} else {
+		tx_bytes = skb->len > ETH_ZLEN ? skb->len : ETH_ZLEN;
+		pkts = 1;
+	}
+
+	u64_stats_update_begin(&txq->txq_stats.syncp);
+	txq->txq_stats.bytes += tx_bytes;
+	txq->txq_stats.packets += pkts;
+	u64_stats_update_end(&txq->txq_stats.syncp);
+}
+
 static u16 hinic3_get_and_update_sq_owner(struct hinic3_io_queue *sq,
 					  u16 curr_pi, u16 wqebb_cnt)
 {
@@ -539,8 +570,12 @@ static netdev_tx_t hinic3_send_one_skb(struct sk_buff *skb,
 	int err;
 
 	if (unlikely(skb->len < MIN_SKB_LEN)) {
-		if (skb_pad(skb, MIN_SKB_LEN - skb->len))
+		if (skb_pad(skb, MIN_SKB_LEN - skb->len)) {
+			u64_stats_update_begin(&txq->txq_stats.syncp);
+			txq->txq_stats.skb_pad_err++;
+			u64_stats_update_end(&txq->txq_stats.syncp);
 			goto err_out;
+		}
 
 		skb->len = MIN_SKB_LEN;
 	}
@@ -595,6 +630,7 @@ static netdev_tx_t hinic3_send_one_skb(struct sk_buff *skb,
 				  txq->tx_stop_thrs,
 				  txq->tx_start_thrs);
 
+	hinic3_get_pkt_stats(txq, skb);
 	hinic3_prepare_sq_ctrl(&wqe_combo, queue_info, num_sge, owner);
 	hinic3_write_db(txq->sq, 0, DB_CFLAG_DP_SQ,
 			hinic3_get_sq_local_pi(txq->sq));
@@ -604,6 +640,10 @@ static netdev_tx_t hinic3_send_one_skb(struct sk_buff *skb,
 err_drop_pkt:
 	dev_kfree_skb_any(skb);
 err_out:
+	u64_stats_update_begin(&txq->txq_stats.syncp);
+	txq->txq_stats.dropped++;
+	u64_stats_update_end(&txq->txq_stats.syncp);
+
 	return NETDEV_TX_OK;
 }
 
@@ -611,12 +651,19 @@ netdev_tx_t hinic3_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
 {
 	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
 	u16 q_id = skb_get_queue_mapping(skb);
+	struct hinic3_txq *txq;
 
 	if (unlikely(!netif_carrier_ok(netdev)))
 		goto err_drop_pkt;
 
-	if (unlikely(q_id >= nic_dev->q_params.num_qps))
+	if (unlikely(q_id >= nic_dev->q_params.num_qps)) {
+		txq = &nic_dev->txqs[0];
+		u64_stats_update_begin(&txq->txq_stats.syncp);
+		txq->txq_stats.dropped++;
+		u64_stats_update_end(&txq->txq_stats.syncp);
+
 		goto err_drop_pkt;
+	}
 
 	return hinic3_send_one_skb(skb, netdev, &nic_dev->txqs[q_id]);
 
@@ -754,6 +801,24 @@ int hinic3_configure_txqs(struct net_device *netdev, u16 num_sq,
 	return 0;
 }
 
+void hinic3_txq_get_stats(struct hinic3_txq *txq,
+			  struct hinic3_txq_stats *stats)
+{
+	struct hinic3_txq_stats *txq_stats = &txq->txq_stats;
+	unsigned int start;
+
+	do {
+		start = u64_stats_fetch_begin(&txq_stats->syncp);
+		stats->busy = txq_stats->busy;
+		stats->skb_pad_err = txq_stats->skb_pad_err;
+		stats->frag_len_overflow = txq_stats->frag_len_overflow;
+		stats->offload_cow_skb_err = txq_stats->offload_cow_skb_err;
+		stats->map_frag_err = txq_stats->map_frag_err;
+		stats->unknown_tunnel_pkt = txq_stats->unknown_tunnel_pkt;
+		stats->frag_size_err = txq_stats->frag_size_err;
+	} while (u64_stats_fetch_retry(&txq_stats->syncp, start));
+}
+
 bool hinic3_tx_poll(struct hinic3_txq *txq, int budget)
 {
 	struct net_device *netdev = txq->netdev;
diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_tx.h b/drivers/net/ethernet/huawei/hinic3/hinic3_tx.h
index 00194f2a1bcc..0a21c423618f 100644
--- a/drivers/net/ethernet/huawei/hinic3/hinic3_tx.h
+++ b/drivers/net/ethernet/huawei/hinic3/hinic3_tx.h
@@ -157,6 +157,8 @@ int hinic3_configure_txqs(struct net_device *netdev, u16 num_sq,
 			  u32 sq_depth, struct hinic3_dyna_txq_res *txqs_res);
 
 netdev_tx_t hinic3_xmit_frame(struct sk_buff *skb, struct net_device *netdev);
+void hinic3_txq_get_stats(struct hinic3_txq *txq,
+			  struct hinic3_txq_stats *stats);
 bool hinic3_tx_poll(struct hinic3_txq *txq, int budget);
 void hinic3_flush_txqs(struct net_device *netdev);
 
-- 
2.43.0
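
A note on the byte accounting in hinic3_get_pkt_stats() above: for a GSO
skb the headers appear once in skb->len but are replicated on the wire for
every segment, so the patch adds (gso_segs - 1) header lengths. Below is a
minimal userspace sketch of that arithmetic. The function name and
parameters are hypothetical and not part of the driver, and using
gso_segs == 0 to mean "not GSO" is a simplification made here for
illustration.

```c
#include <assert.h>
#include <stdint.h>

#define ETH_ZLEN 60u	/* minimum Ethernet frame length, excluding FCS */

/*
 * Bytes accounted for one transmitted skb (illustrative only).
 * len:      total skb length (headers counted once, plus all payload)
 * gso_segs: number of segments to be emitted, 0 if the skb is not GSO
 * hdr_len:  length of the headers replicated in front of each segment
 */
static uint64_t tx_wire_bytes(uint32_t len, uint16_t gso_segs, uint32_t hdr_len)
{
	if (gso_segs > 0) {
		/* The headers are part of len, so they cannot exceed it. */
		assert(hdr_len <= len);
		return (uint64_t)len + (uint64_t)(gso_segs - 1) * hdr_len;
	}

	/* Short frames are padded up to the Ethernet minimum. */
	return len > ETH_ZLEN ? len : ETH_ZLEN;
}
```

For example, a 2950-byte skb with 54 bytes of headers split into two
segments is accounted as 2950 + 54 = 3004 bytes under this sketch.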


^ permalink raw reply related

* [PATCH net-next v04 1/6] hinic3: Add ethtool queue ops
From: Fan Gong @ 2026-04-08  4:03 UTC (permalink / raw)
  To: Fan Gong, Zhu Yikai, netdev, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Andrew Lunn,
	Ioana Ciornei, Mohsin Bashir
  Cc: linux-kernel, linux-doc, luosifu, Xin Guo, Zhou Shuai, Wu Like,
	Shi Jing, Zheng Jiezhen, Maxime Chevallier
In-Reply-To: <cover.1775618797.git.zhuyikai1@h-partners.com>

Implement the following ethtool callback functions:
.get_ringparam
.set_ringparam

These callbacks allow users to configure and monitor queue depth
through ethtool.

Co-developed-by: Zhu Yikai <zhuyikai1@h-partners.com>
Signed-off-by: Zhu Yikai <zhuyikai1@h-partners.com>
Signed-off-by: Fan Gong <gongfan1@huawei.com>
---
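The range check and depth trimming done by hinic3_check_ringparam_valid()
and hinic3_set_ringparam() below can be sketched in userspace C as follows.
This is only an illustration under stated assumptions: the macro and
function names are hypothetical stand-ins, and round_down_pow2() mimics
what "1U << ilog2(v)" computes in the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the limits added to hinic3_nic_io.h. */
#define MAX_TX_DEPTH 65536u
#define MAX_RX_DEPTH 16384u
#define MIN_DEPTH    128u

/* Round v down to the nearest power of two. */
static uint32_t round_down_pow2(uint32_t v)
{
	uint32_t p = 1;

	assert(v != 0);	/* the logarithm of 0 is undefined */
	while (p <= v / 2)
		p <<= 1;
	return p;
}

/* Returns 0 and stores the trimmed depths, or -1 if a request is out of range. */
static int trim_ring_depths(uint32_t tx_req, uint32_t rx_req,
			    uint32_t *tx, uint32_t *rx)
{
	if (tx_req > MAX_TX_DEPTH || tx_req < MIN_DEPTH ||
	    rx_req > MAX_RX_DEPTH || rx_req < MIN_DEPTH)
		return -1;

	*tx = round_down_pow2(tx_req);
	*rx = round_down_pow2(rx_req);
	return 0;
}
```

Under this sketch a request of tx=1000, rx=300 is trimmed to 512/256,
which is the case the "Requested Tx/Rx depth trimmed" messages report.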
 .../ethernet/huawei/hinic3/hinic3_ethtool.c   | 101 +++++++++++++++++
 .../net/ethernet/huawei/hinic3/hinic3_irq.c   |  10 +-
 .../net/ethernet/huawei/hinic3/hinic3_main.c  |  11 ++
 .../huawei/hinic3/hinic3_netdev_ops.c         | 103 +++++++++++++++++-
 .../ethernet/huawei/hinic3/hinic3_nic_dev.h   |  16 +++
 .../ethernet/huawei/hinic3/hinic3_nic_io.h    |   4 +
 6 files changed, 240 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_ethtool.c b/drivers/net/ethernet/huawei/hinic3/hinic3_ethtool.c
index 90fc16288de9..e47c3f43e7b9 100644
--- a/drivers/net/ethernet/huawei/hinic3/hinic3_ethtool.c
+++ b/drivers/net/ethernet/huawei/hinic3/hinic3_ethtool.c
@@ -409,6 +409,105 @@ hinic3_get_link_ksettings(struct net_device *netdev,
 	return 0;
 }
 
+static void hinic3_get_ringparam(struct net_device *netdev,
+				 struct ethtool_ringparam *ring,
+				 struct kernel_ethtool_ringparam *kernel_ring,
+				 struct netlink_ext_ack *extack)
+{
+	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
+
+	ring->rx_max_pending = HINIC3_MAX_RX_QUEUE_DEPTH;
+	ring->tx_max_pending = HINIC3_MAX_TX_QUEUE_DEPTH;
+	ring->rx_pending = nic_dev->rxqs[0].q_depth;
+	ring->tx_pending = nic_dev->txqs[0].q_depth;
+}
+
+static void hinic3_update_qp_depth(struct net_device *netdev,
+				   u32 sq_depth, u32 rq_depth)
+{
+	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
+	u16 i;
+
+	nic_dev->q_params.sq_depth = sq_depth;
+	nic_dev->q_params.rq_depth = rq_depth;
+	for (i = 0; i < nic_dev->max_qps; i++) {
+		nic_dev->txqs[i].q_depth = sq_depth;
+		nic_dev->txqs[i].q_mask = sq_depth - 1;
+		nic_dev->rxqs[i].q_depth = rq_depth;
+		nic_dev->rxqs[i].q_mask = rq_depth - 1;
+	}
+}
+
+static int hinic3_check_ringparam_valid(struct net_device *netdev,
+					const struct ethtool_ringparam *ring)
+{
+	if (ring->rx_jumbo_pending || ring->rx_mini_pending) {
+		netdev_err(netdev, "Unsupported rx_jumbo_pending/rx_mini_pending\n");
+		return -EINVAL;
+	}
+
+	if (ring->tx_pending > HINIC3_MAX_TX_QUEUE_DEPTH ||
+	    ring->tx_pending < HINIC3_MIN_QUEUE_DEPTH ||
+	    ring->rx_pending > HINIC3_MAX_RX_QUEUE_DEPTH ||
+	    ring->rx_pending < HINIC3_MIN_QUEUE_DEPTH) {
+		netdev_err(netdev,
+			   "Queue depth out of range tx[%d-%d] rx[%d-%d]\n",
+			   HINIC3_MIN_QUEUE_DEPTH, HINIC3_MAX_TX_QUEUE_DEPTH,
+			   HINIC3_MIN_QUEUE_DEPTH, HINIC3_MAX_RX_QUEUE_DEPTH);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int hinic3_set_ringparam(struct net_device *netdev,
+				struct ethtool_ringparam *ring,
+				struct kernel_ethtool_ringparam *kernel_ring,
+				struct netlink_ext_ack *extack)
+{
+	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
+	struct hinic3_dyna_txrxq_params q_params = {};
+	u32 new_sq_depth, new_rq_depth;
+	int err;
+
+	err = hinic3_check_ringparam_valid(netdev, ring);
+	if (err)
+		return err;
+
+	new_sq_depth = 1U << ilog2(ring->tx_pending);
+	new_rq_depth = 1U << ilog2(ring->rx_pending);
+	if (new_sq_depth == nic_dev->q_params.sq_depth &&
+	    new_rq_depth == nic_dev->q_params.rq_depth)
+		return 0;
+
+	if (new_sq_depth != ring->tx_pending)
+		netdev_info(netdev, "Requested Tx depth trimmed to %d\n",
+			    new_sq_depth);
+	if (new_rq_depth != ring->rx_pending)
+		netdev_info(netdev, "Requested Rx depth trimmed to %d\n",
+			    new_rq_depth);
+
+	netdev_info(netdev, "Change Tx/Rx ring depth from %u/%u to %u/%u\n",
+		    nic_dev->q_params.sq_depth, nic_dev->q_params.rq_depth,
+		    new_sq_depth, new_rq_depth);
+
+	if (!netif_running(netdev)) {
+		hinic3_update_qp_depth(netdev, new_sq_depth, new_rq_depth);
+	} else {
+		q_params = nic_dev->q_params;
+		q_params.sq_depth = new_sq_depth;
+		q_params.rq_depth = new_rq_depth;
+
+		err = hinic3_change_channel_settings(netdev, &q_params);
+		if (err) {
+			netdev_err(netdev, "Failed to change channel settings\n");
+			return err;
+		}
+	}
+
+	return 0;
+}
+
 static const struct ethtool_ops hinic3_ethtool_ops = {
 	.supported_coalesce_params      = ETHTOOL_COALESCE_USECS |
 					  ETHTOOL_COALESCE_PKT_RATE_RX_USECS,
@@ -417,6 +516,8 @@ static const struct ethtool_ops hinic3_ethtool_ops = {
 	.get_msglevel                   = hinic3_get_msglevel,
 	.set_msglevel                   = hinic3_set_msglevel,
 	.get_link                       = ethtool_op_get_link,
+	.get_ringparam                  = hinic3_get_ringparam,
+	.set_ringparam                  = hinic3_set_ringparam,
 };
 
 void hinic3_set_ethtool_ops(struct net_device *netdev)
diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_irq.c b/drivers/net/ethernet/huawei/hinic3/hinic3_irq.c
index e7d6c2033b45..d3b3927b5408 100644
--- a/drivers/net/ethernet/huawei/hinic3/hinic3_irq.c
+++ b/drivers/net/ethernet/huawei/hinic3/hinic3_irq.c
@@ -135,10 +135,16 @@ static int hinic3_set_interrupt_moder(struct net_device *netdev, u16 q_id,
 {
 	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
 	struct hinic3_interrupt_info info = {};
+	unsigned long flags;
 	int err;
 
-	if (q_id >= nic_dev->q_params.num_qps)
+	spin_lock_irqsave(&nic_dev->channel_res_lock, flags);
+
+	if (!HINIC3_CHANNEL_RES_VALID(nic_dev) ||
+	    q_id >= nic_dev->q_params.num_qps) {
+		spin_unlock_irqrestore(&nic_dev->channel_res_lock, flags);
 		return 0;
+	}
 
 	info.interrupt_coalesc_set = 1;
 	info.coalesc_timer_cfg = coalesc_timer_cfg;
@@ -147,6 +153,8 @@ static int hinic3_set_interrupt_moder(struct net_device *netdev, u16 q_id,
 	info.resend_timer_cfg =
 		nic_dev->intr_coalesce[q_id].resend_timer_cfg;
 
+	spin_unlock_irqrestore(&nic_dev->channel_res_lock, flags);
+
 	err = hinic3_set_interrupt_cfg(nic_dev->hwdev, info);
 	if (err) {
 		netdev_err(netdev,
diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_main.c b/drivers/net/ethernet/huawei/hinic3/hinic3_main.c
index 0a888fe4c975..3b470978714a 100644
--- a/drivers/net/ethernet/huawei/hinic3/hinic3_main.c
+++ b/drivers/net/ethernet/huawei/hinic3/hinic3_main.c
@@ -179,6 +179,8 @@ static int hinic3_sw_init(struct net_device *netdev)
 	int err;
 
 	mutex_init(&nic_dev->port_state_mutex);
+	mutex_init(&nic_dev->channel_cfg_lock);
+	spin_lock_init(&nic_dev->channel_res_lock);
 
 	nic_dev->q_params.sq_depth = HINIC3_SQ_DEPTH;
 	nic_dev->q_params.rq_depth = HINIC3_RQ_DEPTH;
@@ -314,6 +316,15 @@ static void hinic3_link_status_change(struct net_device *netdev,
 				      bool link_status_up)
 {
 	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
+	unsigned long flags;
+	bool valid;
+
+	spin_lock_irqsave(&nic_dev->channel_res_lock, flags);
+	valid = HINIC3_CHANNEL_RES_VALID(nic_dev);
+	spin_unlock_irqrestore(&nic_dev->channel_res_lock, flags);
+
+	if (!valid)
+		return;
 
 	if (link_status_up) {
 		if (netif_carrier_ok(netdev))
diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_netdev_ops.c b/drivers/net/ethernet/huawei/hinic3/hinic3_netdev_ops.c
index da73811641a9..cec501a9dd43 100644
--- a/drivers/net/ethernet/huawei/hinic3/hinic3_netdev_ops.c
+++ b/drivers/net/ethernet/huawei/hinic3/hinic3_netdev_ops.c
@@ -428,6 +428,84 @@ static void hinic3_vport_down(struct net_device *netdev)
 	}
 }
 
+int
+hinic3_change_channel_settings(struct net_device *netdev,
+			       struct hinic3_dyna_txrxq_params *trxq_params)
+{
+	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
+	struct hinic3_dyna_txrxq_params old_qp_params = {};
+	struct hinic3_dyna_qp_params new_qp_params = {};
+	struct hinic3_dyna_qp_params cur_qp_params = {};
+	bool need_teardown = false;
+	unsigned long flags;
+	int err;
+
+	mutex_lock(&nic_dev->channel_cfg_lock);
+
+	hinic3_config_num_qps(netdev, trxq_params);
+
+	err = hinic3_alloc_channel_resources(netdev, &new_qp_params,
+					     trxq_params);
+	if (err) {
+		netdev_err(netdev, "Failed to alloc channel resources\n");
+		mutex_unlock(&nic_dev->channel_cfg_lock);
+		return err;
+	}
+
+	spin_lock_irqsave(&nic_dev->channel_res_lock, flags);
+	if (!test_and_set_bit(HINIC3_CHANGE_RES_INVALID, &nic_dev->flags))
+		need_teardown = true;
+	spin_unlock_irqrestore(&nic_dev->channel_res_lock, flags);
+
+	if (need_teardown) {
+		hinic3_vport_down(netdev);
+		hinic3_close_channel(netdev);
+		hinic3_uninit_qps(nic_dev, &cur_qp_params);
+		hinic3_free_channel_resources(netdev, &cur_qp_params,
+					      &nic_dev->q_params);
+	}
+
+	if (nic_dev->num_qp_irq > trxq_params->num_qps)
+		hinic3_qp_irq_change(netdev, trxq_params->num_qps);
+
+	spin_lock_irqsave(&nic_dev->channel_res_lock, flags);
+	old_qp_params = nic_dev->q_params;
+	nic_dev->q_params = *trxq_params;
+	spin_unlock_irqrestore(&nic_dev->channel_res_lock, flags);
+
+	hinic3_init_qps(nic_dev, &new_qp_params);
+
+	err = hinic3_open_channel(netdev);
+	if (err)
+		goto err_uninit_qps;
+
+	err = hinic3_vport_up(netdev);
+	if (err)
+		goto err_close_channel;
+
+	spin_lock_irqsave(&nic_dev->channel_res_lock, flags);
+	clear_bit(HINIC3_CHANGE_RES_INVALID, &nic_dev->flags);
+	spin_unlock_irqrestore(&nic_dev->channel_res_lock, flags);
+
+	mutex_unlock(&nic_dev->channel_cfg_lock);
+
+	return 0;
+
+err_close_channel:
+	hinic3_close_channel(netdev);
+err_uninit_qps:
+	spin_lock_irqsave(&nic_dev->channel_res_lock, flags);
+	nic_dev->q_params = old_qp_params;
+	spin_unlock_irqrestore(&nic_dev->channel_res_lock, flags);
+
+	hinic3_uninit_qps(nic_dev, &new_qp_params);
+	hinic3_free_channel_resources(netdev, &new_qp_params, trxq_params);
+
+	mutex_unlock(&nic_dev->channel_cfg_lock);
+
+	return err;
+}
+
 static int hinic3_open(struct net_device *netdev)
 {
 	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
@@ -487,16 +565,33 @@ static int hinic3_close(struct net_device *netdev)
 {
 	struct hinic3_nic_dev *nic_dev = netdev_priv(netdev);
 	struct hinic3_dyna_qp_params qp_params;
+	bool need_teardown = false;
+	unsigned long flags;
 
 	if (!test_and_clear_bit(HINIC3_INTF_UP, &nic_dev->flags)) {
 		netdev_dbg(netdev, "Netdev already close, do nothing\n");
 		return 0;
 	}
 
-	hinic3_vport_down(netdev);
-	hinic3_close_channel(netdev);
-	hinic3_uninit_qps(nic_dev, &qp_params);
-	hinic3_free_channel_resources(netdev, &qp_params, &nic_dev->q_params);
+	mutex_lock(&nic_dev->channel_cfg_lock);
+
+	spin_lock_irqsave(&nic_dev->channel_res_lock, flags);
+	if (!test_and_set_bit(HINIC3_CHANGE_RES_INVALID, &nic_dev->flags))
+		need_teardown = true;
+	spin_unlock_irqrestore(&nic_dev->channel_res_lock, flags);
+
+	if (need_teardown) {
+		hinic3_vport_down(netdev);
+		hinic3_close_channel(netdev);
+		hinic3_uninit_qps(nic_dev, &qp_params);
+		hinic3_free_channel_resources(netdev, &qp_params,
+					      &nic_dev->q_params);
+	}
+
+	hinic3_free_nicio_res(nic_dev);
+	hinic3_destroy_num_qps(netdev);
+
+	mutex_unlock(&nic_dev->channel_cfg_lock);
 
 	return 0;
 }
diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_nic_dev.h b/drivers/net/ethernet/huawei/hinic3/hinic3_nic_dev.h
index 9502293ff710..55b280888ad8 100644
--- a/drivers/net/ethernet/huawei/hinic3/hinic3_nic_dev.h
+++ b/drivers/net/ethernet/huawei/hinic3/hinic3_nic_dev.h
@@ -10,6 +10,9 @@
 #include "hinic3_hw_cfg.h"
 #include "hinic3_hwdev.h"
 #include "hinic3_mgmt_interface.h"
+#include "hinic3_nic_io.h"
+#include "hinic3_tx.h"
+#include "hinic3_rx.h"
 
 #define HINIC3_VLAN_BITMAP_BYTE_SIZE(nic_dev)  (sizeof(*(nic_dev)->vlan_bitmap))
 #define HINIC3_VLAN_BITMAP_SIZE(nic_dev)  \
@@ -20,8 +23,13 @@ enum hinic3_flags {
 	HINIC3_MAC_FILTER_CHANGED,
 	HINIC3_RSS_ENABLE,
 	HINIC3_UPDATE_MAC_FILTER,
+	HINIC3_CHANGE_RES_INVALID,
 };
 
+#define HINIC3_CHANNEL_RES_VALID(nic_dev) \
+	(test_bit(HINIC3_INTF_UP, &(nic_dev)->flags) && \
+	 !test_bit(HINIC3_CHANGE_RES_INVALID, &(nic_dev)->flags))
+
 enum hinic3_event_work_flags {
 	HINIC3_EVENT_WORK_TX_TIMEOUT,
 };
@@ -129,6 +137,10 @@ struct hinic3_nic_dev {
 	struct work_struct              rx_mode_work;
 	/* lock for enable/disable port */
 	struct mutex                    port_state_mutex;
+	/* lock for channel configuration */
+	struct mutex                    channel_cfg_lock;
+	/* lock for channel resources */
+	spinlock_t                      channel_res_lock;
 
 	struct list_head                uc_filter_list;
 	struct list_head                mc_filter_list;
@@ -143,6 +155,10 @@ struct hinic3_nic_dev {
 
 void hinic3_set_netdev_ops(struct net_device *netdev);
 int hinic3_set_hw_features(struct net_device *netdev);
+int
+hinic3_change_channel_settings(struct net_device *netdev,
+			       struct hinic3_dyna_txrxq_params *trxq_params);
+
 int hinic3_qps_irq_init(struct net_device *netdev);
 void hinic3_qps_irq_uninit(struct net_device *netdev);
 
diff --git a/drivers/net/ethernet/huawei/hinic3/hinic3_nic_io.h b/drivers/net/ethernet/huawei/hinic3/hinic3_nic_io.h
index 12eefabcf1db..3791b9bc865b 100644
--- a/drivers/net/ethernet/huawei/hinic3/hinic3_nic_io.h
+++ b/drivers/net/ethernet/huawei/hinic3/hinic3_nic_io.h
@@ -14,6 +14,10 @@ struct hinic3_nic_dev;
 #define HINIC3_RQ_WQEBB_SHIFT      3
 #define HINIC3_SQ_WQEBB_SIZE       BIT(HINIC3_SQ_WQEBB_SHIFT)
 
+#define HINIC3_MAX_TX_QUEUE_DEPTH  65536
+#define HINIC3_MAX_RX_QUEUE_DEPTH  16384
+#define HINIC3_MIN_QUEUE_DEPTH     128
+
 /* ******************** RQ_CTRL ******************** */
 enum hinic3_rq_wqe_type {
 	HINIC3_NORMAL_RQ_WQE = 1,
-- 
2.43.0


^ permalink raw reply related

* [PATCH net-next v04 0/6] net: hinic3: PF initialization
From: Fan Gong @ 2026-04-08  4:03 UTC (permalink / raw)
  To: Fan Gong, Zhu Yikai, netdev, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Andrew Lunn,
	Ioana Ciornei, Mohsin Bashir
  Cc: linux-kernel, linux-doc, luosifu, Xin Guo, Zhou Shuai, Wu Like,
	Shi Jing, Zheng Jiezhen, Maxime Chevallier

This is part [3/3] of the second submission of the hinic3 Ethernet driver.
With this series hinic3 becomes a complete Ethernet driver with
PF and VF support.

Add 20 ethtool ops for queue, rss, coalesce and eth data information.
Add MTU size validation.
Configure netdev watchdog timeout.
Remove unneeded coalesce parameters.

Changes:

PATCH 03 V01: https://lore.kernel.org/netdev/cover.1773387649.git.zhuyikai1@h-partners.com/
* Add rmon/pause/phy/mac/ctrl stats (Ioana Ciornei)

PATCH 03 V02: https://lore.kernel.org/netdev/cover.1774684571.git.zhuyikai1@h-partners.com/
* Fix "return -EINVAL" indentation problem (AI review)
* Use le16_to_cpu for rss_indir pair.out->buf (AI review)
* Use u32 instead of int in coalesce_limits to avoid overflow (AI review)
* Remove redundant u64_stats_update_begin/end when reading stats without
  concurrent reader (AI review)
* Modify nic_dev->stats.syncp logic (AI review)
* Complete rxq/txq stats fields in hinic3_rx/txq_get_stats (AI review)
* Remove statistics values in rtnl_link_stats64 from ethtool statistics
  values (AI review)
* Add channel_cfg_lock & channel_res_lock to protect resources access (AI review)
* Remove OutOfRangeLengthField, FrameToolong and InRangeLengthErrors (Ioana Ciornei)
* Remove redundant mtu commit (Maxime Chevallier)

PATCH 03 V03: https://lore.kernel.org/netdev/cover.1774940117.git.zhuyikai1@h-partners.com/
* Change unnedd to unneeded (AI review)
* Remove packets,bytes,errors and dropped in hinic3_rx/tx_queue_stats (AI review)
* Remove duplicated entries in hinic3_port_stats[] (AI review)
* Change stats_info.head.status to ps->head.status (AI review)

PATCH 03 V04:
* Remove restore_drop_sge in hinic3_rx_queue_stats (AI review)
* Remove hinic3_nic_stats (AI review)
* Use old_q_param to store old config and use it in error handling (Mohsin Bashir)
* Add netdev_info to inform the user that depth is trimmed (Mohsin Bashir)
* Remove const in hinic3_get_qp_stats_strings parameters (Mohsin Bashir)
* Change EOPNOTSUPP to ERANGE in is_coalesce_exceed_limit (Mohsin Bashir)
* Update nic_dev->rss_type after hinic3_set_rss_type (Mohsin Bashir)
* Modify MGMT_STATUS_CMD_UNSUPPORTED to EOPNOTSUPP for complying with the
  error code specifications  (Mohsin Bashir)

Fan Gong (6):
  hinic3: Add ethtool queue ops
  hinic3: Add ethtool statistic ops
  hinic3: Add ethtool coalesce ops
  hinic3: Add ethtool rss ops
  hinic3: Configure netdev->watchdog_timeo to set nic tx timeout
  hinic3: Remove unneeded coalesce parameters

 .../ethernet/huawei/hinic3/hinic3_ethtool.c   | 827 +++++++++++++++++-
 .../ethernet/huawei/hinic3/hinic3_hw_intf.h   |  13 +-
 .../net/ethernet/huawei/hinic3/hinic3_irq.c   |  16 +-
 .../net/ethernet/huawei/hinic3/hinic3_main.c  |  15 +
 .../huawei/hinic3/hinic3_mgmt_interface.h     |  39 +
 .../huawei/hinic3/hinic3_netdev_ops.c         | 103 ++-
 .../ethernet/huawei/hinic3/hinic3_nic_cfg.c   |  64 ++
 .../ethernet/huawei/hinic3/hinic3_nic_cfg.h   | 109 +++
 .../ethernet/huawei/hinic3/hinic3_nic_dev.h   |  16 +
 .../ethernet/huawei/hinic3/hinic3_nic_io.h    |   4 +
 .../net/ethernet/huawei/hinic3/hinic3_rss.c   | 487 ++++++++++-
 .../net/ethernet/huawei/hinic3/hinic3_rss.h   |  19 +
 .../net/ethernet/huawei/hinic3/hinic3_rx.c    |  59 +-
 .../net/ethernet/huawei/hinic3/hinic3_rx.h    |  18 +-
 .../net/ethernet/huawei/hinic3/hinic3_tx.c    |  71 +-
 .../net/ethernet/huawei/hinic3/hinic3_tx.h    |   2 +
 16 files changed, 1835 insertions(+), 27 deletions(-)


base-commit: 8e7adcf81564a3fe886a6270eea7558f063e5538
-- 
2.43.0


^ permalink raw reply

* [PATCH] doc: watchdog: fix typos etc.
From: Randy Dunlap @ 2026-04-08  3:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Randy Dunlap, Andrew Morton, Jonathan Corbet, Shuah Khan,
	linux-doc

Correct grammar, plurality, and typos in lockup-watchdogs.rst.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
---
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: linux-doc@vger.kernel.org

 Documentation/admin-guide/lockup-watchdogs.rst |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- linux-next-20260406.orig/Documentation/admin-guide/lockup-watchdogs.rst
+++ linux-next-20260406/Documentation/admin-guide/lockup-watchdogs.rst
@@ -18,7 +18,7 @@ provided for this.
 A 'hardlockup' is defined as a bug that causes the CPU to loop in
 kernel mode for several seconds (see "Implementation" below for
 details), without letting other interrupts have a chance to run.
-Similarly to the softlockup case, the current stack trace is displayed
+Similar to the softlockup case, the current stack trace is displayed
 upon detection and the system will stay locked up unless the default
 behavior is changed, which can be done through a sysctl,
 'hardlockup_panic', a compile time knob, "BOOTPARAM_HARDLOCKUP_PANIC",
@@ -41,7 +41,7 @@ is a trade-off between fast response to
 Implementation
 ==============
 
-The soft and hard lockup detectors are built around a hrtimer.
+The soft and hard lockup detectors are built around an hrtimer.
 In addition, the softlockup detector regularly schedules a job, and
 the hard lockup detector might use Perf/NMI events on architectures
 that support it.
@@ -49,7 +49,7 @@ that support it.
 Frequency and Heartbeats
 ------------------------
 
-The core of the detectors in a hrtimer. It servers multiple purpose:
+The core of the detectors is an hrtimer. It serves multiple purposes:
 
 - schedules watchdog job for the softlockup detector
 - bumps the interrupt counter for hardlockup detectors (heartbeat)

^ permalink raw reply

* Re: (sashiko review) [PATCH v6 1/1] mm/damon: add node_eligible_mem_bp and node_ineligible_mem_bp goal metrics
From: Ravi Jonnalagadda @ 2026-04-08  2:33 UTC (permalink / raw)
  To: SeongJae Park
  Cc: damon, linux-mm, linux-kernel, linux-doc, akpm, corbet, bijan311,
	ajayjoshi, honggyu.kim, yunjeong.mun
In-Reply-To: <20260407160546.52220-1-sj@kernel.org>

On Tue, Apr 7, 2026 at 9:05 AM SeongJae Park <sj@kernel.org> wrote:
>
> Adding another thought at the end of the mail without cutting the previous
> unrelated questions, so that Ravi can answer all my questions at once.
>
> On Mon,  6 Apr 2026 17:13:08 -0700 SeongJae Park <sj@kernel.org> wrote:
>
> > On Mon, 6 Apr 2026 12:47:56 -0700 Ravi Jonnalagadda <ravis.opensrc@gmail.com> wrote:
> >
> > > On Sun, Apr 5, 2026 at 3:45 PM SeongJae Park <sj@kernel.org> wrote:
> > > >
> > > >
> > > > Ravi, thank you for reposting this patch after the rebase.  This time sashiko
> > > > was able to review this, and found good points including things that deserve
> > > > another revision of this patch.
> > > >
> > > > Forwarding full sashiko review in a reply format with my inline comments below,
> > > > for sharing details of my view and doing followup discussions via mails.  Ravi,
> > > > could you please reply?
> > > >
> > >
> > > Thanks SJ, providing your comments on top of sashiko's review is very helpful.
> >
> > I'm glad to hear that it is working for you :)
> >
> > [...]
> > > > > +static unsigned long damos_calc_eligible_bytes(struct damon_ctx *c,
> > > > > > +           struct damos *s, int nid, unsigned long *total)
> > > > > > +{
> > [...]
> > > > > > +                           struct folio *folio;
> > > > > > +                           unsigned long folio_sz, counted;
> > > > > > +
> > > > > > +                           folio = damon_get_folio(PHYS_PFN(addr));
> > > > >
> > > > > What happens if this metric is assigned to a DAMON context configured for
> > > > > virtual address space monitoring? If the context uses DAMON_OPS_VADDR,
> > > > > passing a user-space virtual address to PHYS_PFN() might cause invalid
> > > > > memory accesses or out-of-bounds page struct reads. Should this code
> > > > > explicitly verify the operations type first?
> > > >
> > > > Good finding.  We intend to support only paddr ops.  But there is no guard for
> > > > using this on vaddr ops configuration.  Ravi, could we add underlying ops
> > > > check?  I think damon_commit_ctx() is a good place to add that.  The check
> > > > could be something like below?
> > > >
> > >
> > > I plan to add the ops type check directly in the metric functions
> > > (damos_get_node_eligible_mem_bp and its counterpart) rather than in
> > > damon_commit_ctx(). The functions will return 0 early
> > > if c->ops.id != DAMON_OPS_PADDR.
> > >
> > > That said, if you prefer the damon_commit_ctx() validation approach to
> > > reject the configuration outright, I can implement it that way instead.
> > > Please let me know your preference.
> >
> > I'd prefer damon_commit_ctx() validation approach since it would give users
> > more clear message of the failure.
> >
> > >
> > > > '''
> > > > --- a/mm/damon/core.c
> > > > +++ b/mm/damon/core.c
> > > > @@ -1515,10 +1515,23 @@ static int damon_commit_sample_control(
> > > >  int damon_commit_ctx(struct damon_ctx *dst, struct damon_ctx *src)
> > > >  {
> > > >         int err;
> > > > +       struct damos *scheme;
> > > > +       struct damos_quota_goal *goal;
> > > >
> > > >         dst->maybe_corrupted = true;
> > > >         if (!is_power_of_2(src->min_region_sz))
> > > >                 return -EINVAL;
> > > > +       if (src->ops.id != DAMON_OPS_PADDR) {
> > > > +               damon_for_each_scheme(scheme, src) {
> > > > +                       damos_for_each_quota_goal(goal, &scheme->quota) {
> > > > +                               switch (goal->metric) {
> > > > +                               case DAMOS_QUOTA_NODE_ELIGIBLE_MEM_BP:
> > > > +                               case DAMOS_QUOTA_NODE_INELIGIBLE_MEM_BP:
> > > > +                                       return -EINVAL;
> > > > +                               }
> > > > +                       }
> > > > +               }
> > > > +       }
> > > >
> > > >         err = damon_commit_schemes(dst, src);
> > > >         if (err)
> > > > '''
> > [...]
> > > > > > +   /* Compute ineligible ratio directly: 10000 - eligible_bp */
> > > > > > +   return 10000 - mult_frac(node_eligible, 10000, total_eligible);
> > > > > > +}
> > > > >
> > > > > Does this return value match the documented metric? The formula computes the
> > > > > percentage of the system's eligible memory located on other NUMA nodes,
> > > > > rather than the amount of actual ineligible (filtered out) memory residing
> > > > > on the target node. Could this semantic mismatch cause confusion when
> > > > > configuring quota policies?
> > > >
> > > > Nice catch.  The name and the documentation are confusing.  We were actually
> > > > confused a few times in previous revisions, and I'm again confused now.  IIUC,
> > > > the current implementation is the intended and right one for the given use
> > > > case, though.  If my understanding is correct, how about renaming
> > > > DAMOS_QUOTA_NODE_INELIGIBLE_MEM_BP to
> > > > DAMOS_QUOTA_NODE_ELIGIBLE_MEM_BP_COMPLEMENT, and updating the documentation
> > > > together?  Ravi, what do you think?
> > > >
> > >
> > > Agreed, the current name is confusing. How about
> > > DAMOS_QUOTA_NODE_ELIGIBLE_MEM_BP_OFFNODE?
> > >
> > > The rationale is that this metric measures "eligible memory that is off
> > > this node" (i.e., on other nodes).
> > >
> > > I think "offnode" conveys the physical meaning more directly than "complement".
> > > That said, I'm happy to go with "complement" if you prefer.
> > > Both are clearer than "ineligible".
> >
> > Thank you for the nice suggestion.  I like the "offnode" term.  But I think having
> > "node" twice in the name is not really efficient for people who print code on
> > paper.  What about DAMOS_QUOTA_OFFNODE_ELIGIBLE_MEM_BP?
> >
> > But...  Maybe more importantly...  Now I realize this means that
> > offnode_eligible_mem_bp with target nid 0 is just the same as node_eligible_mem_bp
> > with target nid 1, on your test setup.  Maybe we don't really need
> > offnode_eligible_mem_bp?  That is, your test setup could be like below.
> >
> > '''
> > For maintaining hot memory on DRAM (node 0) and CXL (node 1) in a 7:3
> > ratio:
> >
> >     PUSH scheme: migrate_hot from node 0 -> node 1
> >       goal: node_eligible_mem_bp, nid=1, target=3000
> >       "Move hot pages from DRAM to CXL if less than 30% of hot data is
> >        in CXL"
> >
> >     PULL scheme: migrate_hot from node 1 -> node 0
> >       goal: node_eligible_mem_bp, nid=0, target=7000
> >       "Move hot pages from CXL to DRAM if less than 70% of hot data is
> >        in DRAM"
> > '''
> >
> > And the schemes are easier for me to read and understand.  This even seems
> > straightforward to scale to >2 nodes.  For example, if we want hot memory
> > distribution of 5:3:2 to nodes 0:1:2,
> >
> >       Two schemes for migrating hot pages out of node 0
> >       - migrate_hot from node 0 -> node 1
> >         - goal: node_eligible_mem_bp, nid=1, target=3000
> >       - migrate_hot from node 0 -> node 2
> >         - goal: node_eligible_mem_bp, nid=2, target=2000
> >
> >       Two schemes for migrating hot pages out of node 1
> >       - migrate_hot from node 1 -> node 0
> >         - goal: node_eligible_mem_bp, nid=0, target=5000
> >       - migrate_hot from node 1 -> node 2
> >         - goal: node_eligible_mem_bp, nid=2, target=2000
> >
> >       Two schemes for migrating hot pages out of node 2
> >       - migrate_hot from node 2 -> node 0
> >         - goal: node_eligible_mem_bp, nid=0, target=5000
> >       - migrate_hot from node 2 -> node 1
> >         - goal: node_eligible_mem_bp, nid=1, target=3000
> >
> > Do you think this makes sense?  If it makes sense and works for your use case,
> > what about dropping the offnode goal type?
>
> Now I recall I suggested the offnode metric because I suggested running a
> kdamond per node.  That is, having one kdamond that monitors only node 0 and
> migrates hot memory to node 1, and another kdamond that monitors only node 1 and
> migrates hot memory to node 0.  And I suggested doing so because I knew it is
> suboptimal to run DAMOS schemes with a node filter.
>
> We made a change [1] for making that more optimal, though.  The change is now
> in mm-stable, so hopefully it will be available from 7.1-rc1.  So I believe the
> single quota goal metric should work now.  Ravi, could you share what you
> think?
>

Yes SJ. I think we can make it work with a single goal now that the
below commit is part of mainline. I will give it a try and post an
update.

> [1] commit e1ace69c33ec ("mm/damon/core: set quota-score histogram with core filters")
>
>
> Thanks,
> SJ
>
> [...]
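
For reference, the single-goal layout discussed above (one migrate_hot scheme per ordered node pair, each with a node_eligible_mem_bp goal on the destination node) can be sketched in plain userspace C. This is only an illustration, not DAMON code: `struct hot_scheme`, `node_eligible_bp()` and `build_schemes()` are hypothetical names, and the weights are the desired hot-memory ratio (e.g. 5:3:2).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model of one migrate_hot scheme and its quota goal. */
struct hot_scheme {
	int src_nid;        /* node the scheme migrates hot pages out of */
	int dst_nid;        /* node the scheme migrates hot pages into */
	unsigned target_bp; /* node_eligible_mem_bp target for dst_nid */
};

/*
 * Share of eligible memory on one node, in basis points (1/10000).
 * Plain integer division truncates; the multiplication can overflow for
 * very large inputs, which a kernel helper like mult_frac() avoids.
 */
unsigned node_eligible_bp(uint64_t node_eligible, uint64_t total_eligible)
{
	if (!total_eligible)
		return 0;
	return (unsigned)(node_eligible * 10000 / total_eligible);
}

/*
 * Emit one scheme per ordered (src, dst) pair with src != dst.
 * weights[i] is node i's share of the desired hot-memory ratio.
 * Returns the number of schemes written, or 0 if the weights sum to
 * zero or the output array is too small.
 */
size_t build_schemes(const unsigned *weights, int nr_nodes,
		     struct hot_scheme *out, size_t out_len)
{
	uint64_t total = 0;
	size_t n = 0;
	int src, dst;

	assert(weights && out && nr_nodes > 0);
	for (src = 0; src < nr_nodes; src++)
		total += weights[src];
	if (!total || out_len < (size_t)nr_nodes * (size_t)(nr_nodes - 1))
		return 0;

	for (src = 0; src < nr_nodes; src++) {
		for (dst = 0; dst < nr_nodes; dst++) {
			if (dst == src)
				continue;
			out[n].src_nid = src;
			out[n].dst_nid = dst;
			/* The goal depends only on the destination node. */
			out[n].target_bp = node_eligible_bp(weights[dst], total);
			n++;
		}
	}
	return n;
}
```

With weights {7, 3} this yields the PUSH (nid=1, target=3000) and PULL (nid=0, target=7000) pair from the two-node example. Note that because of truncation, 10000 minus one node's value can differ by 1 bp from the other node's value (1/3 vs 2/3 gives 6667 vs 6666), so the "offnode" and per-node goals are equivalent only up to rounding.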


* Re: [PATCH v2 00/16] fs,x86/resctrl: Add kernel-mode (e.g., PLZA) support to the resctrl subsystem
From: Babu Moger @ 2026-04-08  1:01 UTC (permalink / raw)
  To: Reinette Chatre, corbet@lwn.net, tony.luck@intel.com,
	Dave.Martin@arm.com, james.morse@arm.com, tglx@kernel.org,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com
  Cc: skhan@linuxfoundation.org, x86@kernel.org, hpa@zytor.com,
	peterz@infradead.org, juri.lelli@redhat.com,
	vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
	rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
	vschneid@redhat.com, kas@kernel.org, rick.p.edgecombe@intel.com,
	akpm@linux-foundation.org, pmladek@suse.com,
	rdunlap@infradead.org, dapeng1.mi@linux.intel.com,
	kees@kernel.org, elver@google.com, paulmck@kernel.org,
	lirongqing@baidu.com, safinaskar@gmail.com, fvdl@google.com,
	seanjc@google.com, pawan.kumar.gupta@linux.intel.com,
	xin@zytor.com, tiala@microsoft.com, Neeraj.Upadhyay@amd.com,
	chang.seok.bae@intel.com, Lendacky, Thomas,
	elena.reshetova@intel.com, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev,
	kvm@vger.kernel.org, eranian@google.com, peternewman@google.com
In-Reply-To: <3305c18e-9e50-4df0-b9f1-c61028628967@intel.com>

Hi Reinette,

On 4/7/26 12:48, Reinette Chatre wrote:
> Hi Babu,
> 
> On 4/6/26 3:45 PM, Babu Moger wrote:
>> Hi Reinette,
>>
>> Sorry for the late response. I was trying to get confirmation about the use case.
> 
> No problem. I appreciate that you did this so that we can make sure resctrl supports
> needed use cases.
> 
>>
>> On 3/31/26 17:24, Reinette Chatre wrote:
>>> On 3/30/26 11:46 AM, Babu Moger wrote:
>>>> On 3/27/26 17:11, Reinette Chatre wrote:
>>>>> On 3/26/26 10:12 AM, Babu Moger wrote:
>>>>>> On 3/24/26 17:51, Reinette Chatre wrote:
>>>>>>> On 3/12/26 1:36 PM, Babu Moger wrote:
> 
>>> can have domains that span different CPUs. There thus seem to be a built in assumption of what a "domain"
>>> means for PQR_PLZA_ASSOC so it sounds to me as though, instead of saying that "PQR_PLZA_ASSOC needs
>>> to be the same in QoS domain" it may be more accurate to, for example, say that "PQR_PLZA_ASSOC has L3 scope"?
>>
>> Yes.
> 
> Above is about L3 scope ...

Yes. The scope for PQR_PLZA_ASSOC is L3.

Is that what you are asking here?

>   
>>>
>>> This seems to be what this implementation does since it hardcodes PQR_PLZA_ASSOC scope to the L3
>>> resource but that creates dependency to the L3 resource that would make PLZA unusable if, for example,
>>> the user boots with "rdt=!l3cat" while wanting to use PLZA to manage MBA allocations when in kernel?
>>
>> Yes. That is correct. It should not be attached to one resource. We need to change it to global scope.
> 
> Can I interpret "global scope" as "all online CPUs"? Doing so will simplify

Yes. That is correct.


> supporting this feature. It does not sound practical for a user wanting to assign
> different resource groups to kernel work done in different domains ... the guidance should
> instead be to just set the allocations of one resource group to what is needed in the different
> domains? There may be more flexibility when supporting per-domain RMIDs though but so far
> it sounds as though the focus is global. We can consider what needs to be done to support
> some type of "per-domain" assignment as exercise whether current interface could support it
> in the future.

Yes. Makes sense.

> 
> ...
> 
>>>> There are multiple ways this feature can be applied. For simplicity, the discussion below focuses only on CLOSID.
>>>>
>>>>
>>>>        1. Global PLZA enablement
>>>>
>>>> PLZA can be configured as a global feature by setting |PQR_PLZA_ASSOC.closid = CLOSID| and |PQR_PLZA_ASSOC.plza_en = 1| on all threads in the system. A dedicated CLOSID is reserved for this purpose,
>>>
>>> Also discussed during v1 is that there is no need to dedicate a CLOSID for this purpose.
>>> There could be an "unthrottled" CLOSID to which all high priority user space tasks as
>>> well as all kernel work of all tasks are assigned.
>>> If user space chooses to dedicate a CLOSID for kernel work then that should be supported and
>>> interface can allow that, but there is no need for resctrl to enforce this.
> 
> (above is comment about dedicated group - please see below)
> 
>   
>> Yes. I agree. The changes in context switch code are a concern.
>>
>> You covered some of the cases I was thinking of (xx_set_individual).
>>
>> How about this idea?
>>
>> I suggest splitting the PLZA into two distinct aspects:
>>
>> 1. How PLZA is applied within a resource group
>>
>> 2. How PLZA is monitored
> 
> I think I see where you are going here. While the "How PLZA is monitored" naming
> refers to "monitoring" I *think* what you are separating here is (a) how PLZA is configured
> (CLOSID and RMID settings) and (b) how that PLZA configuration is assigned to tasks/CPUs,
> not just within a resource group but across the system. Please see below.
> 
> 
>> Introduce a new file, "info/kmode_type", to describe how kmode applies in the system.
> 
> ack. "in the system" as you have above, not "within a resource group" as mentioned
> before that.
> 
>>
>> # cat info/kmode_type
>> [global] <- Kernel mode applies to the entire system (all CPUs/tasks)
>>    cpus   <- Kernel mode applies only to the CPUs in the group
>>    tasks  <- Kernel mode applies only to the tasks in the group
>>
>> The "global" option is the default right now and it is the current common use case.
>>
>> The "info/kmode_type -> cpus" option introduces new files
>> "kmode_cpus" and "kmode_cpus_list" for users to apply kmode to
>> a specific set of CPUs. This lets users change the CPU set for PLZA.
> Where were you thinking about placing these files in the hierarchy?

It needs to be inside the resctrl group (in struct rdtgroup).


> 
>> The PLZA MSR is updated when the user changes the association to the
>> file. No context switch code changes are needed. This will be a
>> dedicated group. The current resctrl group files, "cpus, cpus_list
> 
> Why does this have to be a dedicated group? One of the conclusions from v1
> discussion was that the "PLZA group" need *not* be a dedicated group. I repeated that
> in my earlier response that I left quoted above. You did not respond to these
> conclusions and statements in this regard while you keep coming back to this
> needing to be a dedicated group without providing a motivation to do so.
> Could you please elaborate why a dedicated group is required?

If the same group applies identical limits to both user and kernel 
space, it essentially behaves like a current resctrl group. In that 
sense, it’s not really a PLZA group. PLZA’s key value is the ability to 
separate allocations between user space and kernel space. A single CPU 
can belong to two groups: one group manages the user-space allocation 
for that CPU, while another manages the kernel-mode allocation.
This approach also simplifies file handling, which is another reason I 
prefer it.

That said, I’m open to not having a dedicated group if we can still 
support all the features that PLZA provides without it.


> 
> 
>> and tasks" will not be accessible in this mode. This option gives
> 
> These files can continue to be accessible.

ok.

> 
>> some flexibility for the user without the context switch overhead.
> 
> Dedicating a resource group to PLZA removes flexibility though, no?

Yes. But it makes it easy to handle the files, as I mentioned above.

> 
>>
>> The "info/kmode_type -> tasks" option introduces a new file,
>> "kmode_tasks", for users to apply kmode to a specific set of tasks.
>> This requires context switch changes. This will be a dedicated group.
>> The current resctrl group files, "cpus, cpus_list and tasks" will
>> not be accessible in this mode. We currently have no use case for
>> this, so it will not be supported now.
> 
> Thank you for confirming. This is a relief.
> 
>>
>>
>> Add a file, "info/kmode_monitor", to describe how kmode is monitored.
>>
>> # cat info/kmode_monitor
>> [inherit_ctrl_and_mon] <- Kernel uses the same CLOSID/RMID as user. Default option for the "global"
>> assign_ctrl_inherit_mon <- One CLOSID for all kernel work; RMID inherited from user.
>> assign_ctrl_assign_mon <- One resource group (CLOSID+RMID) for all kernel work. Default option for "cpu" type.
> 
> My first thought is that the naming is confusing. resctrl has a very strong relationship between
> "RMID" and "monitoring" so naming a file "monitor" that deals with allocation/ctrl/CLOSID is
> potentially confusing.
> 
> Apart from that, while I think I understand where you are going by separating the mode into
> two files I am concerned about future complications needing to accommodate all different
> combinations of the (now) essentially two modes. My preference is thus to keep this simple by
> keeping the mode within one file.
> 
> Even so, when stepping back, it does not really look like we need to separate the "global"
> and "per CPU" modes. We could just have a single "per CPU" mode and the "global" is just
> its default of "all CPUs", no?

Yes. That is correct.

> 
> Consider, for example, the implementation just consisting of:
> 
> 	# cat info/kernel_mode
> 	[inherit_ctrl_and_mon]
> 	global_assign_ctrl_inherit_mon_per_cpu
> 	global_assign_ctrl_assign_mon_per_cpu
>   
>>
>> Rename “kernel_mode_assignment” to “kmode_groups” to assign the specific group to kmode. This file's usage is the same as before.
>>
>> #cat info/kmode_groups (Renamed "kernel_mode_assignment")
>> //
> 
> Please consider the intent of this file when thinking about names. The idea is that "info/kernel_mode"
> specifies the "mode" of how kernel work is handled and it determines the configuration files used in that
> mode as well as the syntax when interacting with those files. By renaming "kernel_mode_assignment" to
> "kmode_groups" it implicitly requires all future kernel mode enhancements to need some data related to "groups".
> 
> In summary, I think this can be simplified by introducing just two new files in info/ that enables the
> user to (a) select and (b) configure the "kernel mode". To start there can be just two modes,
> global_assign_ctrl_inherit_mon_per_cpu and global_assign_ctrl_assign_mon_per_cpu.
> global_assign_ctrl_inherit_mon_per_cpu mode requires a control group in kernel_mode_assignment while
> global_assign_ctrl_assign_mon_per_cpu requires a control and monitoring group.
> 
> The resource group in info/kernel_mode_assignment gets two additional files "kernel_mode_cpus" and
> "kernel_mode_cpus_list" that contains the CPUs enabled with the kernel mode configuration, by default
> it will be all online CPUs. The resource group can continue to be used to manage allocations of and
> monitor user space tasks. Specifically, the "cpus", "cpus_list", and "tasks" files remain.
> 
> A user wanting just "global" settings will get just that when writing the group to
> info/kernel_mode_assignment. A user wanting "per CPU" settings can follow the
> info/kernel_mode_assignment setting with changes to that resource group's kernel_mode_cpus/kernel_mode_cpus_list
> files. Any task running on a CPU that is *not* in kernel_mode_cpus/kernel_mode_cpus_list can be
> expected to inherit both CLOSID and RMID from user space for all kernel work.

After further consideration, I don’t think the info/kernel_mode file is 
necessary. There’s no need to enforce a specific mode for all the PLZA 
groups. Avoiding this constraint makes the design more flexible, 
particularly as we move toward supporting multiple PLZA groups in the 
future. MPAM already appears capable of handling more than one group—for 
example, one group could use inherit_ctrl_and_mon, while another could 
use global_assign_ctrl_inherit_mon_per_cpu.

The mode can simply be determined on a per-group basis. We can introduce 
two new files—kernel_mode_cpus and kernel_mode_cpus_list—within each 
resctrl group when kmode (or PLZA) is supported.

The info/kernel_mode_assignment file would indicate which resctrl 
group (or groups) is used for PLZA. The files kernel_mode_cpus and 
kernel_mode_cpus_list would indicate how PLZA is applied within each 
group.

Files and behavior:
- cpus / cpus_list:

CPUs listed here use the same allocation for both user and kernel space.
There is no change to the current semantics of these files.
If these files are empty, the group effectively becomes a PLZA-dedicated 
group.

- kernel_mode_cpus / kernel_mode_cpus_list:

These files determine whether a separate kernel allocation is applied.
If empty, user and kernel share the same allocation.
If non-empty, the kernel uses a separate allocation.

The group can be a CTRL_MON or a MON group. Based on the type of the 
group, the CLOSID and RMID will be used to enable PLZA. If it is MON, 
then rmid_en = 1 when writing the PLZA MSR.
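
To make the per-CPU rules above concrete, here is a rough userspace sketch in C. It is not resctrl code: `struct grp`, `struct plza_cfg` and `plza_cfg_for_cpu()` are hypothetical, the CPU mask is limited to CPUs 0-63 for simplicity, the real MSR bit layout is not modelled, and treating a CTRL_MON group as "CLOSID only, RMID inherited" is my assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical group kinds, named after resctrl's CTRL_MON and MON groups. */
enum grp_type { GRP_CTRL_MON, GRP_MON };

/* Hypothetical model of a resource group with the proposed kernel-mode mask. */
struct grp {
	enum grp_type type;
	unsigned closid;
	unsigned rmid;
	uint64_t kernel_mode_cpus; /* bit n set: CPU n uses this group in kernel mode */
};

/* Logical fields named in this thread; bit positions are deliberately omitted. */
struct plza_cfg {
	bool plza_en;
	bool rmid_en;
	unsigned closid;
	unsigned rmid;
};

/* Decide what would be programmed for one CPU under the rules above. */
struct plza_cfg plza_cfg_for_cpu(const struct grp *g, unsigned cpu)
{
	struct plza_cfg cfg = { false, false, 0, 0 };

	assert(g && cpu < 64);

	/* CPU not in kernel_mode_cpus: kernel work shares the user allocation. */
	if (!(g->kernel_mode_cpus & (UINT64_C(1) << cpu)))
		return cfg;

	/* CPU listed: kernel work gets a separate allocation from this group. */
	cfg.plza_en = true;
	cfg.closid = g->closid;

	/* MON group: also assign the RMID (rmid_en = 1). */
	if (g->type == GRP_MON) {
		cfg.rmid_en = true;
		cfg.rmid = g->rmid;
	}
	return cfg;
}
```

An empty kernel_mode_cpus mask therefore leaves every CPU with plza_en = 0, which matches the "user and kernel share the same allocation" case.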


Here’s the proposed flow:

# mount -t resctrl resctrl /sys/fs/resctrl/
# cd /sys/fs/resctrl/
# cat info/kernel_mode_assignment
//

By default, the root (default) group is PLZA-enabled when resctrl is 
mounted. All CPUs use CLOSID 0 for both user and kernel-mode allocation.

# cat cpus_list
1-64
# cat kmode_cpus_list
1-64

Next, create a new group for PLZA:

# mkdir plza_group

# echo "plza_group//" > info/kernel_mode_assignment

At this point, plza_group becomes the new PLZA-enabled group, and the 
PLZA-related MSRs are updated accordingly.

# cat plza_group/cpus_list
<empty>

# cat plza_group/kmode_cpus_list
1-64

The user can then update kmode_cpus_list to apply PLZA only to a 
specific subset of CPUs, if desired.


What do you think of this approach?


Thanks
Babu


