Linux Documentation

Linux Documentation
 help / color / mirror / Atom feed

* [RESEND/FOLLOW-UP] Documentation/translations/pt_BR patches
From: Daniel Pereira @ 2026-05-14 22:02 UTC (permalink / raw)
  To: Jonathan Corbet, linux-doc

Hi Corbet,

I hope you’re doing well.

I’m reaching out to follow up on the status of the patches I submitted
about 12 days ago regarding the Portuguese (pt_BR) documentation
translations.

Please let me know if you’ve had a chance to review them or if there
are any adjustments needed on my end to move forward with the merge.

Thanks for your time and assistance.

Best regards,

Daniel Pereira

^ permalink raw reply

* Re: [PATCH RFC 4/5] selinux: Restrict cross-cgroup dma-heap charging
From: Paul Moore @ 2026-05-14 20:44 UTC (permalink / raw)
  To: Albert Esteve, Tejun Heo, Johannes Weiner, Michal Koutný,
	Jonathan Corbet, Shuah Khan, Sumit Semwal, Christian König,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Benjamin Gaignard, Brian Starkey, John Stultz,
	T.J. Mercier, Christian Brauner, James Morris, Serge E. Hallyn,
	Stephen Smalley, Ondrej Mosnacek, Shuah Khan
  Cc: cgroups, linux-doc, linux-kernel, linux-media, dri-devel,
	linaro-mm-sig, linux-mm, linux-security-module, selinux,
	linux-kselftest, Albert Esteve, mripard, echanude
In-Reply-To: <20260512-v2_20230123_tjmercier_google_com-v1-4-6326701c3691@redhat.com>

On May 12, 2026 Albert Esteve <aesteve@redhat.com> wrote:
> 
> The security_dma_heap_alloc() hook allows security modules
> to control which processes may charge dma-buf allocations
> to another process's cgroup via the charge_pid_fd field of
> DMA_HEAP_IOCTL_ALLOC. Without a policy implementation, the
> hook is a no-op and the restriction is not enforced.
> 
> On SELinux-managed systems any domain with access to a
> dma-heap device node can therefore exhaust another cgroup's
> memory budget without restriction.
> 
> Implement selinux_dma_heap_alloc() using avc_has_perm() with
> a new dma_heap object class and a charge_to permission. Policy
> authors can then grant cross-cgroup charging selectively,
> for example:
> 
>   allow allocator_app_t client_app_t:dma_heap charge_to;
> 
> Signed-off-by: Albert Esteve <aesteve@redhat.com>
> ---
>  security/selinux/hooks.c            | 7 +++++++
>  security/selinux/include/classmap.h | 1 +
>  2 files changed, 8 insertions(+)
> 
> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> index 0f704380a8c81..ea1f410b9f619 100644
> --- a/security/selinux/hooks.c
> +++ b/security/selinux/hooks.c
> @@ -2189,6 +2189,12 @@ static int selinux_capable(const struct cred *cred, struct user_namespace *ns,
>  	return cred_has_capability(cred, cap, opts, ns == &init_user_ns);
>  }
>  
> +static int selinux_dma_heap_alloc(const struct cred *from, const struct cred *to)
> +{
> +	return avc_has_perm(cred_sid(from), cred_sid(to),
> +			    SECCLASS_DMA_HEAP, DMA_HEAP__CHARGE_TO, NULL);
> +}
> +
>  static int selinux_quotactl(int cmds, int type, int id, const struct super_block *sb)
>  {
>  	const struct cred *cred = current_cred();
> @@ -7541,6 +7547,7 @@ static struct security_hook_list selinux_hooks[] __ro_after_init = {
>  	LSM_HOOK_INIT(capget, selinux_capget),
>  	LSM_HOOK_INIT(capset, selinux_capset),
>  	LSM_HOOK_INIT(capable, selinux_capable),
> +	LSM_HOOK_INIT(dma_heap_alloc, selinux_dma_heap_alloc),
>  	LSM_HOOK_INIT(quotactl, selinux_quotactl),
>  	LSM_HOOK_INIT(quota_on, selinux_quota_on),
>  	LSM_HOOK_INIT(syslog, selinux_syslog),
> diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
> index 90cb61b164256..d232f7808f6b8 100644
> --- a/security/selinux/include/classmap.h
> +++ b/security/selinux/include/classmap.h
> @@ -181,6 +181,7 @@ const struct security_class_mapping secclass_map[] = {
>  	{ "user_namespace", { "create", NULL } },
>  	{ "memfd_file",
>  	  { COMMON_FILE_PERMS, "execute_no_trans", "entrypoint", NULL } },
> +	{ "dma_heap", { "charge_to", NULL } },
>  	/* last one */ { NULL, {} }
>  };

While we have seen some one-off patches to add specific resource/cgroups
controls in the past, much like this one, we've yet to see a patchset
that provides a more comprehensive set of resource/cgroup access controls
for SELinux.

I'm not opposed to a patch like this, but I would like to see it as part
of a larger effort to introduce access controls across all of the
existing cgroup control points where it makes sense.  In other words,
let's see a design for cgroup access controls so that we can ensure we
have something that is meaningful and makes sense from a policy
developer's perspective.

--
paul-moore.com

^ permalink raw reply

* Re: [PATCH v3 1/2] dt-bindings: trivial-devices: Add Murata D1U74T PSU
From: Abdurrahman Hussain @ 2026-05-14 20:01 UTC (permalink / raw)
  To: Krzysztof Kozlowski, Abdurrahman Hussain
  Cc: Guenter Roeck, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Jonathan Corbet, Shuah Khan, linux-hwmon, devicetree,
	linux-kernel, linux-doc
In-Reply-To: <20260514-dazzling-ethereal-bumblebee-d9b69e@quoll>

On Thu May 14, 2026 at 4:43 AM PDT, Krzysztof Kozlowski wrote:
> On Wed, May 13, 2026 at 03:33:02AM -0700, Abdurrahman Hussain wrote:
>> The Murata D1U74T-W is a PMBus-compliant AC/DC power supply unit. The
>> binding only declares the compatible string and i2c reg, with no
>
> Describe the hardware, not binding. What does the hardware have?
> Supplies? Pins? Clocks? Interrupts?
>

Hi Krzysztof,

The Murata D1U74T-W series are hot-pluggable 1U AC/DC front-end
power supplies in the Intel CRPS-185 / OCP M-CRPS form factor.
Each variant delivers a 12 V main output plus a 12 V standby output
from a wide AC input (90-264 Vac) or HVDC supply, and includes an
internal variable-speed cooling fan and on-board voltage, current,
power, fan-speed, and temperature telemetry.

The host-side digital interface is a PMBus 1.2 port on I2C.  The
PSU's other electrical signals (status, alert, current-share) live
on the CRPS edge connector and are consumed by the chassis
controller rather than the host SoC, so there are no host-described
supplies, gpios, clocks, or interrupts.

If the above two paragraphs provide the adequate description of the
hardware I will include them verbatim in v4.

Best regards,
Abdurrahman

^ permalink raw reply

* Re: [PATCH v7 13/20] KVM: arm64: Apply dynamic guest counter reservations
From: Colton Lewis @ 2026-05-14 19:05 UTC (permalink / raw)
  To: James Clark
  Cc: alexandru.elisei, pbonzini, corbet, linux, catalin.marinas, will,
	maz, oliver.upton, mizhang, joey.gouly, suzuki.poulose, yuzenghui,
	mark.rutland, shuah, gankulkarni, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm, linux-perf-users, linux-kselftest, kvm
In-Reply-To: <66797cb2-18f7-4782-9370-68d0c10c35f4@linaro.org>

James Clark <james.clark@linaro.org> writes:

> On 13/05/2026 5:45 pm, Colton Lewis wrote:
>> James Clark <james.clark@linaro.org> writes:

>>> On 04/05/2026 10:18 pm, Colton Lewis wrote:
>>>> Apply dynamic guest counter reservations by checking if the requested
>>>> guest mask collides with any events the host has scheduled and calling
>>>> pmu_perf_resched_update() with a hook that updates the mask of
>>>> available counters in between schedule out and schedule in.

>>>> Signed-off-by: Colton Lewis <coltonlewis@google.com>
>>>> ---
>>>>    arch/arm64/kvm/pmu-direct.c  | 69 ++++++++++++++++++++++++++++++++
>>>> ++++
>>>>    include/linux/perf/arm_pmu.h |  1 +
>>>>    2 files changed, 70 insertions(+)

>>>> diff --git a/arch/arm64/kvm/pmu-direct.c b/arch/arm64/kvm/pmu-direct.c
>>>> index 2252d3b905db9..14cc419dbafad 100644
>>>> --- a/arch/arm64/kvm/pmu-direct.c
>>>> +++ b/arch/arm64/kvm/pmu-direct.c
>>>> @@ -100,6 +100,73 @@ u8 kvm_pmu_hpmn(struct kvm_vcpu *vcpu)
>>>>        return *host_data_ptr(nr_event_counters);
>>>>    }

>>>> +/* Callback to update counter mask between perf scheduling */
>>>> +static void kvm_pmu_update_mask(struct pmu *pmu, void *data)
>>>> +{
>>>> +    struct arm_pmu *arm_pmu = to_arm_pmu(pmu);
>>>> +    unsigned long *new_mask = data;
>>>> +
>>>> +    bitmap_copy(arm_pmu->cntr_mask, new_mask, ARMPMU_MAX_HWEVENTS);
>>>> +}
>>>> +
>>>> +/**
>>>> + * kvm_pmu_set_guest_counters() - Handle dynamic counter reservations
>>>> + * @cpu_pmu: struct arm_pmu to potentially modify
>>>> + * @guest_mask: new guest mask for the pmu
>>>> + *
>>>> + * Check if guest counters will interfere with current host events and
>>>> + * call into perf_pmu_resched_update if a reschedule is required.
>>>> + */
>>>> +static void kvm_pmu_set_guest_counters(struct arm_pmu *cpu_pmu, u64
>>>> guest_mask)
>>>> +{
>>>> +    struct pmu_hw_events *cpuc = this_cpu_ptr(cpu_pmu->hw_events);
>>>> +    DECLARE_BITMAP(guest_bitmap, ARMPMU_MAX_HWEVENTS);
>>>> +    DECLARE_BITMAP(new_mask, ARMPMU_MAX_HWEVENTS);
>>>> +    bool need_resched = false;
>>>> +
>>>> +    bitmap_from_arr64(guest_bitmap, &guest_mask, ARMPMU_MAX_HWEVENTS);
>>>> +    bitmap_copy(new_mask, cpu_pmu->hw_cntr_mask, ARMPMU_MAX_HWEVENTS);
>>>> +
>>>> +    if (guest_mask) {
>>>> +        /* Subtract guest counters from available host mask */
>>>> +        bitmap_andnot(new_mask, new_mask, guest_bitmap,
>>>> ARMPMU_MAX_HWEVENTS);
>>>> +
>>>> +        /* Did we collide with an active host event? */
>>>> +        if (bitmap_intersects(cpuc->used_mask, guest_bitmap,
>>>> ARMPMU_MAX_HWEVENTS)) {
>>>> +            int idx;
>>>> +
>>>> +            need_resched = true;
>>>> +            cpuc->host_squeezed = true;
>>>> +
>>>> +            /* Look for pinned events that are about to be preempted  
>>>> */
>>>> +            for_each_set_bit(idx, guest_bitmap, ARMPMU_MAX_HWEVENTS) {
>>>> +                if (test_bit(idx, cpuc->used_mask) && cpuc-
>>>> >events[idx] &&
>>>> +                    cpuc->events[idx]->attr.pinned) {
>>>> +                    pr_warn_ratelimited("perf: Pinned host event
>>>> squeezed out by KVM guest PMU partition\n");

>>> Hi Colton,

>>> I get "perf: Pinned host event squeezed out by KVM guest PMU partition"
>>> even with arm_pmuv3.reserved_host_counters=3 for example. I would have
>>> expected any non zero value to stop the warning.

>>> I think armv8pmu_get_single_idx() needs to be changed to allocate from
>>> the high end host counters first. A more complicated option would be
>>> checking to see if there are any non-pinned counters in the host
>>> reserved half when a new pinned counter is opened, then swapping the
>>> places of the new pinned and existing non-pinned counters so pinned
>>> always prefer being put into the host half. But it's probably not worth
>>> doing that.

>>> James


>> I agree it makes the most sense to allocate from the top, but I'm happy
>> the basic idea works.


> Another thing I forgot to mention is that even with the ratelimited
> warning, this spams the logs any time the host and guest are both using
> the PMU and I'm not sure how useful that is.

I'm sure it does. I'll delete it.

>>>> +                    break;
>>>> +                }
>>>> +            }
>>>> +        }
>>>> +    } else {
>>>> +        /*
>>>> +         * Restoring to hw_cntr_mask.
>>>> +         * Only resched if we previously squeezed an event.
>>>> +         */
>>>> +        if (cpuc->host_squeezed) {
>>>> +            need_resched = true;
>>>> +            cpuc->host_squeezed = false;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    if (need_resched) {
>>>> +        /* Collision: run full perf reschedule */
>>>> +        perf_pmu_resched_update(&cpu_pmu->pmu, kvm_pmu_update_mask,
>>>> new_mask);
>>>> +    } else {
>>>> +        /* Host was never using guest counters anyway */
>>>> +        bitmap_copy(cpu_pmu->cntr_mask, new_mask,  
>>>> ARMPMU_MAX_HWEVENTS);
>>>> +    }
>>>> +}
>>>> +
>>>>    /**
>>>>     * kvm_pmu_host_counter_mask() - Compute bitmask of host-reserved
>>>> counters
>>>>     * @pmu: Pointer to arm_pmu struct
>>>> @@ -218,6 +285,7 @@ void kvm_pmu_load(struct kvm_vcpu *vcpu)

>>>>        pmu = vcpu->kvm->arch.arm_pmu;
>>>>        guest_counters = kvm_pmu_guest_counter_mask(pmu);
>>>> +    kvm_pmu_set_guest_counters(pmu, guest_counters);
>>>>        kvm_pmu_apply_event_filter(vcpu);

>>>>        for_each_set_bit(i, &guest_counters, ARMPMU_MAX_HWEVENTS) {
>>>> @@ -319,5 +387,6 @@ void kvm_pmu_put(struct kvm_vcpu *vcpu)
>>>>        val = read_sysreg(pmintenset_el1);
>>>>        __vcpu_assign_sys_reg(vcpu, PMINTENSET_EL1, val & mask);

>>>> +    kvm_pmu_set_guest_counters(pmu, 0);
>>>>        preempt_enable();
>>>>    }
>>>> diff --git a/include/linux/perf/arm_pmu.h  
>>>> b/include/linux/perf/arm_pmu.h
>>>> index f7b000bb3eca8..63f88fec5e80f 100644
>>>> --- a/include/linux/perf/arm_pmu.h
>>>> +++ b/include/linux/perf/arm_pmu.h
>>>> @@ -75,6 +75,7 @@ struct pmu_hw_events {

>>>>        /* Active events requesting branch records */
>>>>        unsigned int        branch_users;
>>>> +    bool host_squeezed;
>>>>    };

>>>>    enum armpmu_attr_groups {

^ permalink raw reply

* Re: [PATCH v7 10/20] KVM: arm64: Context swap Partitioned PMU guest registers
From: Colton Lewis @ 2026-05-14 18:59 UTC (permalink / raw)
  To: Oliver Upton
  Cc: kvm, alexandru.elisei, pbonzini, corbet, linux, catalin.marinas,
	will, maz, oliver.upton, mizhang, joey.gouly, suzuki.poulose,
	yuzenghui, mark.rutland, shuah, gankulkarni, james.clark,
	linux-doc, linux-kernel, linux-arm-kernel, kvmarm,
	linux-perf-users, linux-kselftest
In-Reply-To: <agRBzkVcR-qZZdx2@kernel.org>

Oliver Upton <oupton@kernel.org> writes:

> On Mon, May 04, 2026 at 09:18:03PM +0000, Colton Lewis wrote:
>> +
>> +/**
>> + * kvm_pmu_host_counter_mask() - Compute bitmask of host-reserved  
>> counters
>> + * @pmu: Pointer to arm_pmu struct
>> + *
>> + * Compute the bitmask that selects the host-reserved counters in the
>> + * {PMCNTEN,PMINTEN,PMOVS}{SET,CLR} registers. These are the counters
>> + * in HPMN..N
>> + *
>> + * Return: Bitmask
>> + */
>> +u64 kvm_pmu_host_counter_mask(struct arm_pmu *pmu)
>> +{
>> +	u8 nr_counters = *host_data_ptr(nr_event_counters);
>> +
>> +	if (kvm_pmu_is_partitioned(pmu))
>> +		return GENMASK(nr_counters - 1, pmu->max_guest_counters);
>> +
>> +	return ARMV8_PMU_CNT_MASK_ALL;
>> +}
>> +
>> +/**
>> + * kvm_pmu_guest_counter_mask() - Compute bitmask of guest-reserved  
>> counters
>> + * @pmu: Pointer to arm_pmu struct
>> + *
>> + * Compute the bitmask that selects the guest-reserved counters in the
>> + * {PMCNTEN,PMINTEN,PMOVS}{SET,CLR} registers. These are the counters
>> + * in 0..HPMN and the cycle and instruction counters.
>> + *
>> + * Return: Bitmask
>> + */
>> +u64 kvm_pmu_guest_counter_mask(struct arm_pmu *pmu)
>> +{
>> +	if (kvm_pmu_is_partitioned(pmu))
>> +		return ARMV8_PMU_CNT_MASK_C | GENMASK(pmu->max_guest_counters - 1, 0);
>> +
>> +	return 0;
>> +}
>> +
>> +/**
>> + * kvm_pmu_load() - Load untrapped PMU registers
>> + * @vcpu: Pointer to struct kvm_vcpu
>> + *
>> + * Load all untrapped PMU registers from the VCPU into the PCPU. Mask
>> + * to only bits belonging to guest-reserved counters and leave
>> + * host-reserved counters alone in bitmask registers.
>> + */
>> +void kvm_pmu_load(struct kvm_vcpu *vcpu)
>> +{
>> +	struct arm_pmu *pmu;
>> +	unsigned long guest_counters;
>> +	u64 mask;
>> +	u8 i;
>> +	u64 val;
>> +
>> +	/*
>> +	 * If we aren't guest-owned then we know the guest isn't using
>> +	 * the PMU anyway, so no need to bother with the swap.
>> +	 */
>> +	if (!kvm_vcpu_pmu_is_partitioned(vcpu))
>> +		return;
>> +
>> +	preempt_disable();
>> +
>> +	pmu = vcpu->kvm->arch.arm_pmu;
>> +	guest_counters = kvm_pmu_guest_counter_mask(pmu);
>> +
>> +	for_each_set_bit(i, &guest_counters, ARMPMU_MAX_HWEVENTS) {
>> +		val = __vcpu_sys_reg(vcpu, PMEVCNTR0_EL0 + i);
>> +
>> +		if (i == ARMV8_PMU_CYCLE_IDX) {
>> +			write_sysreg(val, pmccntr_el0);
>> +		} else {
>> +			write_sysreg(i, pmselr_el0);
>> +			write_sysreg(val, pmxevcntr_el0);

> This is wrong, you would need an intervening ISB. It'd be better to
> avoid the ISB altogether and just use {read,write}_pmevcntrn().

Good catch, I was using {read,write}_pmevcntrn here before but changed
it after your feedback that:

> I'd prefer KVM directly accessed the PMU registers to
> avoid the possibility of taking some instrumented codepath in the
> future.

https://lore.kernel.org/kvm/aUH7oC41XaEMsXf_@kernel.org/

I assume this is a compromise with that.

^ permalink raw reply

* Re: [PATCH v7 09/20] KVM: arm64: Set up MDCR_EL2 to handle a Partitioned PMU
From: Colton Lewis @ 2026-05-14 18:43 UTC (permalink / raw)
  To: Oliver Upton
  Cc: kvm, alexandru.elisei, pbonzini, corbet, linux, catalin.marinas,
	will, maz, oliver.upton, mizhang, joey.gouly, suzuki.poulose,
	yuzenghui, mark.rutland, shuah, gankulkarni, james.clark,
	linux-doc, linux-kernel, linux-arm-kernel, kvmarm,
	linux-perf-users, linux-kselftest
In-Reply-To: <agQu9kLnbHjSni-C@kernel.org>

Oliver Upton <oupton@kernel.org> writes:

> On Mon, May 04, 2026 at 09:18:02PM +0000, Colton Lewis wrote:
>> diff --git a/arch/arm64/kvm/debug.c b/arch/arm64/kvm/debug.c
>> index 3ad6b7c6e4ba7..0ab89c91e19cb 100644
>> --- a/arch/arm64/kvm/debug.c
>> +++ b/arch/arm64/kvm/debug.c
>> @@ -36,20 +36,43 @@ static int cpu_has_spe(u64 dfr0)
>>    */
>>   static void kvm_arm_setup_mdcr_el2(struct kvm_vcpu *vcpu)
>>   {
>> +	int hpmn = kvm_pmu_hpmn(vcpu);
>> +
>>   	preempt_disable();

>>   	/*
>>   	 * This also clears MDCR_EL2_E2PB_MASK and MDCR_EL2_E2TB_MASK
>>   	 * to disable guest access to the profiling and trace buffers
>>   	 */
>> -	vcpu->arch.mdcr_el2 = FIELD_PREP(MDCR_EL2_HPMN,
>> -					 *host_data_ptr(nr_event_counters));
>> +
>> +	vcpu->arch.mdcr_el2 = FIELD_PREP(MDCR_EL2_HPMN, hpmn);
>>   	vcpu->arch.mdcr_el2 |= (MDCR_EL2_TPM |
>>   				MDCR_EL2_TPMS |
>>   				MDCR_EL2_TTRF |
>>   				MDCR_EL2_TPMCR |
>>   				MDCR_EL2_TDRA |
>> -				MDCR_EL2_TDOSA);
>> +				MDCR_EL2_TDOSA |
>> +				MDCR_EL2_HPME);
>> +
>> +	if (kvm_vcpu_pmu_is_partitioned(vcpu)) {
>> +		/*
>> +		 * Filtering these should be redundant because we trap
>> +		 * all the TYPER and FILTR registers anyway and ensure
>> +		 * they filter EL2, but set the bits if they are here.
>> +		 */
>> +		if (is_pmuv3p1(read_pmuver()))
>> +			vcpu->arch.mdcr_el2 |= MDCR_EL2_HPMD;
>> +		if (is_pmuv3p5(read_pmuver()))
>> +			vcpu->arch.mdcr_el2 |= MDCR_EL2_HCCD;

> Neither of these controls are of any consequence on unsupported
> hardware (RES0). Set them unconditionally?

Sure.

>> +		/*
>> +		 * Take out the coarse grain traps if we are using
>> +		 * fine grain traps.
>> +		 */
>> +		if (kvm_vcpu_pmu_use_fgt(vcpu))

> I think open coding the check here would actually improve readability.

> 		if (cpus_have_final_cap(ARM64_HAS_FGT) &&
> 		    (cpus_have_final_cap(ARM64_HAS_HPMN0) ||
> 		     vcpu->kvm->arch.nr_pmu_counters != 0))
> 			vcpu->arch.mdcr_el2 &= ~(MDCR_EL2_TPM | MDCR_EL2_TPMCR);

I disagree but I'll do it.

>> +
>> +/**
>> + * kvm_pmu_hpmn() - Calculate HPMN field value
>> + * @vcpu: Pointer to struct kvm_vcpu
>> + *
>> + * Calculate the appropriate value to set for MDCR_EL2.HPMN. If
>> + * partitioned, this is the number of counters set for the guest if
>> + * supported, falling back to max_guest_counters if needed. If we are  
>> not
>> + * partitioned or can't set the implied HPMN value, fall back to the
>> + * host value.
>> + *
>> + * Return: A valid HPMN value
>> + */
>> +u8 kvm_pmu_hpmn(struct kvm_vcpu *vcpu)
>> +{
>> +	u8 nr_guest_cntr = vcpu->kvm->arch.nr_pmu_counters;
>> +
>> +	if (kvm_vcpu_pmu_is_partitioned(vcpu)
>> +	    && !vcpu_on_unsupported_cpu(vcpu)
>> +	    && (cpus_have_final_cap(ARM64_HAS_HPMN0) || nr_guest_cntr > 0))
>> +		return nr_guest_cntr;
>> +
>> +	return *host_data_ptr(nr_event_counters);
>> +}

> This helper isn't helpful. Just open code it in the place where we are
> computing MDCR_EL2.

I disagree but I'll do it.

>> @@ -542,6 +542,13 @@ u8 kvm_arm_pmu_get_max_counters(struct kvm *kvm)
>>   	if (cpus_have_final_cap(ARM64_WORKAROUND_PMUV3_IMPDEF_TRAPS))
>>   		return 1;

>> +	/*
>> +	 * If partitioned then we are limited by the max counters in
>> +	 * the guest partition.
>> +	 */
>> +	if (kvm_pmu_is_partitioned(arm_pmu))
>> +		return arm_pmu->max_guest_counters;
>> +

> Ok, this is exactly what I was getting at earlier. What about a VM with
> an emulated PMU? It should use cntr_mask calculation, not the guest
> range.

True. I should use something different here.


> Thanks,
> Oliver

^ permalink raw reply

* Re: [PATCH v7 08/20] KVM: arm64: Add Partitioned PMU register trap handlers
From: Colton Lewis @ 2026-05-14 18:18 UTC (permalink / raw)
  To: Oliver Upton
  Cc: kvm, alexandru.elisei, pbonzini, corbet, linux, catalin.marinas,
	will, maz, oliver.upton, mizhang, joey.gouly, suzuki.poulose,
	yuzenghui, mark.rutland, shuah, gankulkarni, james.clark,
	linux-doc, linux-kernel, linux-arm-kernel, kvmarm,
	linux-perf-users, linux-kselftest
In-Reply-To: <agQsM7XFsbxbFRLO@kernel.org>

Oliver Upton <oupton@kernel.org> writes:

> On Mon, May 04, 2026 at 09:18:01PM +0000, Colton Lewis wrote:
>> We may want a partitioned PMU but not have FEAT_FGT to untrap the
>> specific registers that would normally be untrapped. Add handling for
>> those trapped register accesses that does the right thing if the PMU
>> is partitioned.

>> For registers that shouldn't be written to hardware because they
>> require special handling (PMEVTYPER and PMOVS), write to the virtual
>> register. A later patch will ensure these are handled correctly at
>> vcpu_load time.

>> Signed-off-by: Colton Lewis <coltonlewis@google.com>

> I'd prefer an approach that provides a single accessor helper that takes
> a vcpu_sysreg enum as an argument and internally handles the dispatch
> between partitioned and emulated PMUs. That goes for all of the PMU
> sysregs.

That seems ugly to me. It'll need a giant switch or two to re-dispatch
to the correct sysreg handling when we were already dispatched courtesy
of the function we are in.

Are you thinking:

single_accessor(vcpu_sysreg)
{
         if (is_partitioned) {
            switch (vcpu_sysreg) {
            ...
            }
            return;
        }

        switch (vcpu_sysreg) {
        ...
        }
}

or I could do the switch on the outside and duplicate the is_partitioned
check but that's the same as what happens now with extra steps.


> This will help you reuse some of the PMU emuation code that you'll still
> need for things like nested...

I'm not seeing what you mean. Could you explain further please?

> Thanks,
> Oliver

^ permalink raw reply

* Re: [PATCH v2 1/3] Doc: deprecated.rst: add strlcat()
From: Randy Dunlap @ 2026-05-14 17:51 UTC (permalink / raw)
  To: Kees Cook, Manuel Ebner
  Cc: Andy Shevchenko, Jonathan Corbet, Shuah Khan, Andy Whitcroft,
	Joe Perches, Dwaipayan Ray, Lukas Bulwahn, Geert Uytterhoeven,
	David Laight, Jani Nikula, Heiko Carstens,
	open list:DOCUMENTATION PROCESS, open list:DOCUMENTATION,
	open list
In-Reply-To: <202605140931.913048A68B@keescook>



On 5/14/26 9:31 AM, Kees Cook wrote:
> On Thu, May 14, 2026 at 06:26:53PM +0200, Manuel Ebner wrote:
>> add strlcat and alternatives
>>
>> Signed-off-by: Manuel Ebner <manuelebner@mailbox.org>
>> ---
>>  Documentation/process/deprecated.rst | 7 +++++++
>>  1 file changed, 7 insertions(+)
>>
>> diff --git a/Documentation/process/deprecated.rst b/Documentation/process/deprecated.rst
>> index fed56864d036..06e802f4bbfd 100644
>> --- a/Documentation/process/deprecated.rst
>> +++ b/Documentation/process/deprecated.rst
>> @@ -153,6 +153,13 @@ used, and the destinations should be marked with the `__nonstring
>>  attribute to avoid future compiler warnings. For cases still needing
>>  NUL-padding, strtomem_pad() can be used.
>>  
>> +strlcat()
>> +---------
>> +strlcat() must re-scan the destination string from the beginning on each
>> +call (O(n^2) behavior). Alternatives are seq_buf_puts() and seq_buf_printf().
>> +snprintf(), scnprintf() and sysfs_emit() are possible aswell, but the adoption
>> +of the arguments needs to be taken care off.
>> +
> 
> How about just:
> 
> strlcat() must re-scan the destination string from the beginning on each
> call (O(n^2) behavior). Use the seq_buf API or similar instead.
> 

Yeah, that avoids the "aswell" (should be "as well").

> 
>>  strlcpy()
>>  ---------
>>  strlcpy() reads the entire source buffer first (since the return value
>> -- 
>> 2.54.0
>>
> 

-- 
~Randy


^ permalink raw reply

* Re: [PATCH v7 07/20] KVM: arm64: Set up FGT for Partitioned PMU
From: Colton Lewis @ 2026-05-14 17:49 UTC (permalink / raw)
  To: Oliver Upton
  Cc: kvm, alexandru.elisei, pbonzini, corbet, linux, catalin.marinas,
	will, maz, oliver.upton, mizhang, joey.gouly, suzuki.poulose,
	yuzenghui, mark.rutland, shuah, gankulkarni, james.clark,
	linux-doc, linux-kernel, linux-arm-kernel, kvmarm,
	linux-perf-users, linux-kselftest
In-Reply-To: <agQpbiD8Fi6fzomf@kernel.org>

Hi Oliver. Thanks for the review.

Oliver Upton <oupton@kernel.org> writes:

> On Mon, May 04, 2026 at 09:18:00PM +0000, Colton Lewis wrote:
>> +static void __compute_hdfgrtr(struct kvm_vcpu *vcpu)
>> +{
>> +	__compute_fgt(vcpu, HDFGRTR_EL2);
>> +
>> +	*vcpu_fgt(vcpu, HDFGRTR_EL2) |=
>> +		HDFGRTR_EL2_PMOVS
>> +		| HDFGRTR_EL2_PMCCFILTR_EL0
>> +		| HDFGRTR_EL2_PMEVTYPERn_EL0
>> +		| HDFGRTR_EL2_PMCEIDn_EL0
>> +		| HDFGRTR_EL2_PMMIR_EL1;
>> +}
>> +

> I've given this feedback at least twice already...

> Operators go on the preceding line in the case of line continuations.

I apologize for letting that slip through again.

>> +
>> +/**
>> + * kvm_pmu_is_partitioned() - Determine if given PMU is partitioned
>> + * @pmu: Pointer to arm_pmu struct
>> + *
>> + * Determine if given PMU is partitioned by looking at hpmn field. The
>> + * PMU is partitioned if this field is less than the number of
>> + * counters in the system.
>> + *
>> + * Return: True if the PMU is partitioned, false otherwise
>> + */
>> +bool kvm_pmu_is_partitioned(struct arm_pmu *pmu)
>> +{
>> +	if (!pmu)
>> +		return false;
>> +
>> +	return pmu->max_guest_counters >= 0 &&
>> +		pmu->max_guest_counters <= *host_data_ptr(nr_event_counters);
>> +}
>> +
>> +/**
>> + * kvm_vcpu_pmu_is_partitioned() - Determine if given VCPU has a  
>> partitioned PMU
>> + * @vcpu: Pointer to kvm_vcpu struct
>> + *
>> + * Determine if given VCPU has a partitioned PMU by extracting that
>> + * field and passing it to :c:func:`kvm_pmu_is_partitioned`
>> + *
>> + * Return: True if the VCPU PMU is partitioned, false otherwise
>> + */
>> +bool kvm_vcpu_pmu_is_partitioned(struct kvm_vcpu *vcpu)
>> +{
>> +	return kvm_pmu_is_partitioned(vcpu->kvm->arch.arm_pmu) &&
>> +		false;
>> +}

> Ok, I'm thoroughly confused about these predicates.

> Whether or not a vCPU is using a partitioned PMU is a per-VM property.
> This is separate from whether or not the backing arm_pmu has a range of
> available counters for the guest to use.

> It is entirely possible that a VM *isn't* using the partitioned PMU
> feature (i.e. backed with perf events) yet the supporting arm_pmu has a
> guest counter range.

Yes and I add that to this predicate in a later patch when I introduce
the flag. I can always reorder to introduce the flag before (or along
with) this predicate.

>> +#if !defined(__KVM_NVHE_HYPERVISOR__)
>> +bool kvm_vcpu_pmu_is_partitioned(struct kvm_vcpu *vcpu);
>> +bool kvm_vcpu_pmu_use_fgt(struct kvm_vcpu *vcpu);
>> +#else
>> +static inline bool kvm_vcpu_pmu_is_partitioned(struct kvm_vcpu *vcpu)
>> +{
>> +	return false;
>> +}
>> +
>> +static inline bool kvm_vcpu_pmu_use_fgt(struct kvm_vcpu *vcpu)
>> +{
>> +	return false;
>> +}
>> +#endif
>> +

> Don't use ifdeffery for this. Aim to have a single definition and rely
> on has_vhe() to do the rest of the work.

Will do.


> Thanks,
> Oliver

^ permalink raw reply

* [PATCH v3] cgroup/dmem: introduce a peak file
From: Thadeu Lima de Souza Cascardo @ 2026-05-14 17:36 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner, Michal Koutný, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Jonathan Corbet, Shuah Khan, Maarten Lankhorst, Maxime Ripard,
	Natalie Vock, Tvrtko Ursulin
  Cc: cgroups, linux-kernel, linux-mm, linux-doc, dri-devel, kernel-dev,
	Thadeu Lima de Souza Cascardo

Just like we have memory.peak, introduce a dmem.peak, which uses the
page_counter support for that.

For now, make it read-only.

This allows for memory usage monitoring without polling dmem.current when
the information needed is the maximum device memory used. That can be used
for capacity planning, such that dmem.max can be properly setup for a given
workload. It can also be used for debugging to determine whether a given
workload would have caused eviction or system memory use.

Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
---
Changes in v3:
- EDITME: describe what is new in this series revision.
- EDITME: use bulletpoints and terse descriptions.
- Link to v2: https://patch.msgid.link/20260513-dmem_peak-v2-1-dac06999db9e@igalia.com

Changes in v2:
- Make it read-only for now and adjust documentation accordingly.
- Link to v1: https://patch.msgid.link/20260506-dmem_peak-v1-0-8d803eb3449c@igalia.com
---
 Documentation/admin-guide/cgroup-v2.rst |  6 ++++++
 kernel/cgroup/dmem.c                    | 15 +++++++++++++++
 2 files changed, 21 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 6efd0095ed99..d103623b2be4 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2808,6 +2808,12 @@ DMEM Interface Files
 	The semantics are the same as for the memory cgroup controller, and are
 	calculated in the same way.
 
+  dmem.peak
+	A read-only nested-keyed file that exists on non-root cgroups.
+
+	The max device memory usage recorded for the cgroup and its
+	descendants since the creation of the cgroup for each region.
+
   dmem.capacity
 	A read-only file that describes maximum region capacity.
 	It only exists on the root cgroup. Not all memory can be
diff --git a/kernel/cgroup/dmem.c b/kernel/cgroup/dmem.c
index 4753a67d0f0f..6430c7ce1e03 100644
--- a/kernel/cgroup/dmem.c
+++ b/kernel/cgroup/dmem.c
@@ -182,6 +182,11 @@ static u64 get_resource_current(struct dmem_cgroup_pool_state *pool)
 	return pool ? page_counter_read(&pool->cnt) : 0;
 }
 
+static u64 get_resource_peak(struct dmem_cgroup_pool_state *pool)
+{
+	return pool ? READ_ONCE(pool->cnt.watermark) : 0;
+}
+
 static void reset_all_resource_limits(struct dmem_cgroup_pool_state *rpool)
 {
 	set_resource_min(rpool, 0);
@@ -808,6 +813,11 @@ static int dmemcg_limit_show(struct seq_file *sf, void *v,
 	return 0;
 }
 
+static int dmem_cgroup_region_peak_show(struct seq_file *sf, void *v)
+{
+	return dmemcg_limit_show(sf, v, get_resource_peak);
+}
+
 static int dmem_cgroup_region_current_show(struct seq_file *sf, void *v)
 {
 	return dmemcg_limit_show(sf, v, get_resource_current);
@@ -856,6 +866,11 @@ static struct cftype files[] = {
 		.name = "current",
 		.seq_show = dmem_cgroup_region_current_show,
 	},
+	{
+		.name = "peak",
+		.seq_show = dmem_cgroup_region_peak_show,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
 	{
 		.name = "min",
 		.write = dmem_cgroup_region_min_write,

---
base-commit: d3b0a7f21119f5a66cb76aa28fb8cc13206aaf7d
change-id: 20260409-dmem_peak-3abc1be95072

Best regards,
--  
Thadeu Lima de Souza Cascardo <cascardo@igalia.com>


^ permalink raw reply related

* Re: [PATCH v13 3/4] gpio: rpmsg: add generic rpmsg GPIO driver
From: Mathieu Poirier @ 2026-05-14 17:35 UTC (permalink / raw)
  To: Shenwei Wang
  Cc: Arnaud POULIQUEN, Beleswar Prasad Padhi, Andrew Lunn,
	Linus Walleij, Bartosz Golaszewski, Jonathan Corbet, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Bjorn Andersson, Frank Li,
	Sascha Hauer, Shuah Khan, linux-gpio@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	Pengutronix Kernel Team, Fabio Estevam, Peng Fan,
	devicetree@vger.kernel.org, linux-remoteproc@vger.kernel.org,
	imx@lists.linux.dev, linux-arm-kernel@lists.infradead.org,
	dl-linux-imx, Bartosz Golaszewski
In-Reply-To: <PAXPR04MB9185BFA6E7375FAD0B15B021893C2@PAXPR04MB9185.eurprd04.prod.outlook.com>

On Thu, May 07, 2026 at 07:43:33PM +0000, Shenwei Wang wrote:
> 
> 
> > -----Original Message-----
> > From: Mathieu Poirier <mathieu.poirier@linaro.org>
> > Sent: Thursday, May 7, 2026 12:13 PM
> > To: Arnaud POULIQUEN <arnaud.pouliquen@foss.st.com>
> > Cc: Beleswar Prasad Padhi <b-padhi@ti.com>; Shenwei Wang
> > <shenwei.wang@nxp.com>; Andrew Lunn <andrew@lunn.ch>; Linus Walleij
> > <linusw@kernel.org>; Bartosz Golaszewski <brgl@kernel.org>; Jonathan Corbet
> > <corbet@lwn.net>; Rob Herring <robh@kernel.org>; Krzysztof Kozlowski
> > <krzk+dt@kernel.org>; Conor Dooley <conor+dt@kernel.org>; Bjorn Andersson
> > <andersson@kernel.org>; Frank Li <frank.li@nxp.com>; Sascha Hauer
> > <s.hauer@pengutronix.de>; Shuah Khan <skhan@linuxfoundation.org>; linux-
> > gpio@vger.kernel.org; linux-doc@vger.kernel.org; linux-kernel@vger.kernel.org;
> > Pengutronix Kernel Team <kernel@pengutronix.de>; Fabio Estevam
> > <festevam@gmail.com>; Peng Fan <peng.fan@nxp.com>;
> > devicetree@vger.kernel.org; linux-remoteproc@vger.kernel.org;
> > imx@lists.linux.dev; linux-arm-kernel@lists.infradead.org; dl-linux-imx <linux-
> > imx@nxp.com>; Bartosz Golaszewski <brgl@bgdev.pl>
> > Subject: [EXT] Re: [PATCH v13 3/4] gpio: rpmsg: add generic rpmsg GPIO driver
> > > > >  From my perspective, based on your proposal:
> > > > >   1) Linux should send a get_config message to the remote proc (0x405 ->
> > 0xD). 2) The remote processor would respond with the list of ports, associated
> > > > >      with an remote endpoint addresses.
> > > >
> > > >
> > > > Agreed, we can scale it for multiple remote endpoints like this.
> > > >
> > > > >   3) Linux would parse the response, compare it with the DT, enable the
> > GPIO
> > > > >      ports accordingly, creating it local endpoint and associating it with
> > > > >      the remote endpoint.
> > > > > Using name service to identify the ports should avoid step 1 & 2 ...
> > > >
> > > >
> > > > Yes, but won't that make a lot of hard-codings in the driver?
> > > >
> > > > +static struct rpmsg_device_id rpmsg_gpio_channel_id_table[] = {
> > > > +    { .name = "rpmsg-io-25" },
> > > > +    { .name = "rpmsg-io-32" },
> > > > +    { .name = "rpmsg-io-35" },
> > > > +    { },
> > > > +};
> > > >
> > > > What if tomorrow another vendor decides to add more remoteproc
> > > > controlled GPIO ports to Linux, they would have to update this
> > > > struct in the driver everytime. And the port indexes (25/32/35)
> > > > could also differ between vendors. We should make the driver dynamic
> > > > i.e. vendor agnostic.
> > > >
> > > > I think querying the remote firmware at runtime (step 1 & 2 above)
> > > > is a common design pattern and makes the driver vendor agnostic. But
> > > > feel free to correct me.
> > > >
> > >
> > > You are right. My proposal would require a patch in rpmsg-core. The
> > > idea of allowing a postfix in the compatible string has been discussed
> > > before, but, if I remember correctly, it was not concluded.
> > >
> > 
> > I also remember discussing this.  I even reviewed one of Arnaud's patch and
> > submitted one myself.  This must have been in 2020 and the reason why it wasn't
> > merged has escaped my memory.
> > 
> > > /* rpmsg devices and drivers are matched using the service name */
> > > static inline int rpmsg_id_match(const struct rpmsg_device *rpdev,
> > >                                 const struct rpmsg_device_id *id) {
> > >       size_t len;
> > >
> > > +     len = strnlen(id->name, RPMSG_NAME_SIZE);
> > > +     if (len && id->name[len - 1] == '*')
> > > +             return !strncmp(id->name, rpdev->id.name, len - 1);
> > >
> > >       return strncmp(id->name, rpdev->id.name, RPMSG_NAME_SIZE) == 0;
> > > }
> > >
> > > Then, in rpmsg-gpio, and possibly in other drivers such as rpmsg-tty
> > > and a future rpmsg-i2c, we could use:
> > > static struct rpmsg_device_id rpmsg_gpio_channel_id_table[] = {
> > >     { .name = "rpmsg-io" },
> > >     { .name = "rpmsg-io-*" },
> > >     { },
> > > };
> > 
> > That was my initial approach.  We don't even need an additional "rpmsg-io-*" in
> > rpmsg_gpio_channel_id_table[].  All we need is:
> > 
> > /* rpmsg devices and drivers are matched using the service name */ static inline
> > int rpmsg_id_match(const struct rpmsg_device *rpdev,
> >                                  const struct rpmsg_device_id *id) {
> >  +     size_t len = strnlen(id->name, RPMSG_NAME_SIZE);
> > 
> >  -     return strncmp(id->name, rpdev->id.name, RPMSG_NAME_SIZE) == 0;
> >  +     return strncmp(id->name, rpdev->id.name, len) == 0;
> > }
> > 
> 
> If we encode the port index directly into ept->src, for example:
> 
>     ept->src = (baseaddr << 8) | port_index;
>

There is no rpmsg_endpoint::src.  You likely meant ept->addr.  This would work
but not optimal on two front:

(1) rpms_endpoint::addr is a u32 and idr_alloc() returns an 'int'.  As such
there is a possibility of conflict.  I concede the possibility is marginal, but
it still exists.

(2) By proceeding this way, the kernel exposes the GPIO controller it knows
about.  It is preferrable to have the remote processor tell the kernel about the
GPIO controller it wants.

I am done reviewing this revision.  Given the amount of refactoring needed, I
will not look at the code.  Please refer to this reply [1] for what I am
expecting in the next revision. 

[1]. https://lwn.net/ml/all/CANLsYkwBk0KbN-k9ce+5=oT+scdZ3nU5AOr3Fz4zT=0AFzghDA@mail.gmail.com/
 
> where baseaddr can be derived from the channel address, we can avoid the possible address conflict.
> 
> With this approach, the patch to rpmsg-core would no longer be necessary.
> 
> Thanks,
> Shenwei
> 
> > And let the rpmsg-virtio-gpio driver parse @rpdev->id.name to match with a
> > GPIO controller in the DT.
> > 
> > >
> > > If exact name matching is strongly required, then this proposal would
> > > not be suitablea.
> > >
> > > A third option would be a combination of both approaches: instantiate
> > > the device using the same name service from the remote side, as done
> > > in rpmsg-tty. In that case, a get_config message, or a similar
> > > mechanism, would also be needed to retrieve the port information from the
> > remote side.
> > >
> > 
> > I'm not overly fond of a get_config message because it is one more thing we have
> > to define and maintain.
> > 
> > Arnaud: is there a get_config message already defined for rpmsg_tty?
> > 
> > Beleswar: Can you provide a link to a virtio device that would use a get_config
> > message?
> > 
> > > Tanmaya also proposed another alternative based on reserved addresses.
> > >
> > > At this point, I suggest letting Mathieu review the discussion and
> > > recommend the most suitable approach.
> > >
> > > Thanks,
> > > Arnaud
> > >
> > > > >
> > > > > At the end, whatever solution is implemented, my main concern is
> > > > > that the Linux driver design should, if possible, avoid adding
> > > > > unnecessary complexity or limitations on the remote side (for instance in
> > openAMP project).
> > > >
> > > >
> > > > Yes definitely, I want the same. Feel free to let me know if this
> > > > does not suit with the OpenAMP project.
> > > >
> > > > Thanks,
> > > > Beleswar
> > > >
> > > > >
> > > > > Thanks,
> > > > > Arnaud
> > > > >
> > > > >
> > > > > > So Linux does not need to send the port idx everytime while
> > > > > > sending a gpio message anymore.
> > > > > >
> > > > > > Thanks,
> > > > > > Beleswar
> > > > > >
> > > > > > [...]
> > > > > >
> > > > >
> > >

^ permalink raw reply

* [PATCH net-next v5 8/8] selftests: drv-net: add netkit devmem tests
From: Bobby Eshleman @ 2026-05-14 17:22 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan, Alex Shi,
	Yanteng Si, Dongliang Mu, Michael Chan, Pavan Chebbi,
	Joshua Washington, Harshitha Ramamurthy, Saeed Mahameed,
	Tariq Toukan, Mark Bloch, Leon Romanovsky, Alexander Duyck,
	kernel-team, Daniel Borkmann, Nikolay Aleksandrov, Shuah Khan,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan, Alex Shi,
	Yanteng Si, Dongliang Mu, Michael Chan, Pavan Chebbi,
	Joshua Washington, Harshitha Ramamurthy, Saeed Mahameed,
	Tariq Toukan, Mark Bloch, Leon Romanovsky, Alexander Duyck,
	kernel-team, Daniel Borkmann, Nikolay Aleksandrov, Shuah Khan
  Cc: dw, sdf.kernel, mohsin.bashr, willemb, jiang.kun2, xu.xin16,
	wang.yaxin, netdev, linux-doc, linux-kernel, linux-rdma, bpf,
	linux-kselftest, Stanislav Fomichev, Mina Almasry, netdev,
	linux-doc, linux-kernel, linux-rdma, bpf, linux-kselftest,
	Bobby Eshleman
In-Reply-To: <20260514-tcp-dm-netkit-v5-0-408c59b91e66@meta.com>

From: Bobby Eshleman <bobbyeshleman@meta.com>

Add nk_devmem.py with four tests for TCP devmem through a netkit device.

These tests are just duplicates of the original devmem tests, with some
adjusted parameters such as telling ncdevmem to avoid device setup
(since it only has access to netkit, not a phys device).

Each test uses NetDrvContEnv with primary_rx_redirect=True to set up the
BPF redirect program on the primary netkit interface, then calls a
shared run_*() helper which probes for devmem support and configures
the NIC (HDS, RSS, queue lease) before driving the test. NIC state is
restored per-test via defer() callbacks registered inside the helper.

Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
Changes in v5:
- Move require_devmem() inside test functions so ksft_run() reports it
  as a SKIP (Sashiko).
- Drop the inaccurate "mirroring the nk_qlease.py pattern" claim from
  v4 (Sashiko).

Changes in v4:
- Call configure_nic()/cleanup_nic() once around ksft_run() rather than
  relying on per-test configuration inside the run_* helpers.

Changes in v3:
- Reorder os.path expressions
- Drop @ksft_disruptive from check_nk_rx_hds to mirror the original
  check_rx_hds in devmem.py

Changes in v2:
- Add nk_devmem.py to TEST_PROGS in Makefile (Sashiko)
---
 tools/testing/selftests/drivers/net/hw/Makefile    |  1 +
 .../testing/selftests/drivers/net/hw/nk_devmem.py  | 46 ++++++++++++++++++++++
 2 files changed, 47 insertions(+)

diff --git a/tools/testing/selftests/drivers/net/hw/Makefile b/tools/testing/selftests/drivers/net/hw/Makefile
index 5e49d7bffced..c7a1206880ea 100644
--- a/tools/testing/selftests/drivers/net/hw/Makefile
+++ b/tools/testing/selftests/drivers/net/hw/Makefile
@@ -35,6 +35,7 @@ TEST_PROGS = \
 	irq.py \
 	loopback.sh \
 	nic_timestamp.py \
+	nk_devmem.py \
 	nk_netns.py \
 	nk_qlease.py \
 	ntuple.py \
diff --git a/tools/testing/selftests/drivers/net/hw/nk_devmem.py b/tools/testing/selftests/drivers/net/hw/nk_devmem.py
new file mode 100755
index 000000000000..300ed2a70ab4
--- /dev/null
+++ b/tools/testing/selftests/drivers/net/hw/nk_devmem.py
@@ -0,0 +1,46 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+"""Test devmem TCP with netkit."""
+
+import os
+from devmem_lib import setup_test, run_rx, run_tx, run_tx_chunks, run_rx_hds
+from lib.py import ksft_run, ksft_exit, ksft_disruptive
+from lib.py import NetDrvContEnv
+
+
+@ksft_disruptive
+def check_nk_rx(cfg) -> None:
+    """Run the devmem RX test through netkit."""
+    run_rx(cfg)
+
+
+@ksft_disruptive
+def check_nk_tx(cfg) -> None:
+    """Run the devmem TX test through netkit."""
+    run_tx(cfg)
+
+
+@ksft_disruptive
+def check_nk_tx_chunks(cfg) -> None:
+    """Run the devmem TX chunking test through netkit."""
+    run_tx_chunks(cfg)
+
+
+def check_nk_rx_hds(cfg) -> None:
+    """Run the HDS test through netkit."""
+    run_rx_hds(cfg)
+
+
+def main() -> None:
+    """Run the netkit devmem test cases."""
+    with NetDrvContEnv(__file__, rxqueues=2, primary_rx_redirect=True) as cfg:
+        setup_test(cfg,
+                   os.path.join(os.path.dirname(os.path.abspath(__file__)),
+                                "ncdevmem"))
+        ksft_run([check_nk_rx, check_nk_tx, check_nk_tx_chunks,
+                  check_nk_rx_hds], args=(cfg,))
+    ksft_exit()
+
+
+if __name__ == "__main__":
+    main()

-- 
2.53.0-Meta


^ permalink raw reply related

* [PATCH net-next v5 7/8] selftests: drv-net: add primary_rx_redirect support to NetDrvContEnv
From: Bobby Eshleman @ 2026-05-14 17:22 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan, Alex Shi,
	Yanteng Si, Dongliang Mu, Michael Chan, Pavan Chebbi,
	Joshua Washington, Harshitha Ramamurthy, Saeed Mahameed,
	Tariq Toukan, Mark Bloch, Leon Romanovsky, Alexander Duyck,
	kernel-team, Daniel Borkmann, Nikolay Aleksandrov, Shuah Khan,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan, Alex Shi,
	Yanteng Si, Dongliang Mu, Michael Chan, Pavan Chebbi,
	Joshua Washington, Harshitha Ramamurthy, Saeed Mahameed,
	Tariq Toukan, Mark Bloch, Leon Romanovsky, Alexander Duyck,
	kernel-team, Daniel Borkmann, Nikolay Aleksandrov, Shuah Khan
  Cc: dw, sdf.kernel, mohsin.bashr, willemb, jiang.kun2, xu.xin16,
	wang.yaxin, netdev, linux-doc, linux-kernel, linux-rdma, bpf,
	linux-kselftest, Stanislav Fomichev, Mina Almasry, netdev,
	linux-doc, linux-kernel, linux-rdma, bpf, linux-kselftest,
	Bobby Eshleman
In-Reply-To: <20260514-tcp-dm-netkit-v5-0-408c59b91e66@meta.com>

From: Bobby Eshleman <bobbyeshleman@meta.com>

When sending from a namespace that has access to a netkit device with a
leased queue, the nk primary in the host namespace needs to redirect its
RX to the physical device. This patch adds that redirection bpf program
and teaches the harness to install it.

Add primary_rx_redirect=False parameter to NetDrvContEnv.__init__().
When enabled, _attach_primary_rx_redirect_bpf() attaches a new BPF TC
program (nk_primary_rx_redirect.bpf.c) to the primary (host-side) netkit
interface. The program redirects non-ICMPv6 IPv6 packets to the physical
NIC via bpf_redirect_neigh(), with the physical ifindex configured via
the .bss map. ICMPv6 is left on the host's netkit primary so IPv6
neighbor discovery still work locally.

Extract _find_bss_map_id() from _attach_bpf() into a reusable helper so
other BPF attachment methods can use it.

Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
Changes in v5:
- Use sys.byteorder when packing phys_ifindex into the BPF .bss map
  (Sashiko).

Changes in v3:
- nk_primary_rx_redirect.bpf.c: add header includes to avoid hardcoding
  values
- update commit message explaining why ICMP is passed through
- env.py: re-use _tc_ensure_clsact() (had to add ifname paramater)
- env.py: gate the remote IPv6 host route install on primary_rx_redirect
  by moving it from _setup_ns() into _attach_primary_rx_redirect_bpf()
---
 .../drivers/net/hw/nk_primary_rx_redirect.bpf.c    | 39 +++++++++
 tools/testing/selftests/drivers/net/lib/py/env.py  | 94 +++++++++++++++++-----
 2 files changed, 115 insertions(+), 18 deletions(-)

diff --git a/tools/testing/selftests/drivers/net/hw/nk_primary_rx_redirect.bpf.c b/tools/testing/selftests/drivers/net/hw/nk_primary_rx_redirect.bpf.c
new file mode 100644
index 000000000000..46ff494b23de
--- /dev/null
+++ b/tools/testing/selftests/drivers/net/hw/nk_primary_rx_redirect.bpf.c
@@ -0,0 +1,39 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bpf.h>
+#include <linux/pkt_cls.h>
+#include <linux/if_ether.h>
+#include <linux/in.h>
+#include <linux/ipv6.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+#define ctx_ptr(field)		((void *)(long)(field))
+
+volatile __u32 phys_ifindex;
+
+SEC("tc/ingress")
+int nk_primary_rx_redirect(struct __sk_buff *skb)
+{
+	void *data_end = ctx_ptr(skb->data_end);
+	void *data = ctx_ptr(skb->data);
+	struct ethhdr *eth;
+	struct ipv6hdr *ip6h;
+
+	eth = data;
+	if ((void *)(eth + 1) > data_end)
+		return TC_ACT_OK;
+
+	if (eth->h_proto != bpf_htons(ETH_P_IPV6))
+		return TC_ACT_OK;
+
+	ip6h = data + sizeof(struct ethhdr);
+	if ((void *)(ip6h + 1) > data_end)
+		return TC_ACT_OK;
+
+	if (ip6h->nexthdr == IPPROTO_ICMPV6)
+		return TC_ACT_OK;
+
+	return bpf_redirect_neigh(phys_ifindex, NULL, 0, 0);
+}
+
+char __license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/drivers/net/lib/py/env.py b/tools/testing/selftests/drivers/net/lib/py/env.py
index 409b41922245..ef317aef3a0a 100644
--- a/tools/testing/selftests/drivers/net/lib/py/env.py
+++ b/tools/testing/selftests/drivers/net/lib/py/env.py
@@ -2,6 +2,7 @@
 
 import ipaddress
 import os
+import sys
 import time
 import json
 from pathlib import Path
@@ -336,15 +337,18 @@ class NetDrvContEnv(NetDrvEpEnv):
               +---------------+
     """
 
-    def __init__(self, src_path, rxqueues=1, **kwargs):
+    def __init__(self, src_path, rxqueues=1, primary_rx_redirect=False, **kwargs):
         self.netns = None
         self._nk_host_ifname = None
         self.nk_guest_ifname = None
         self._tc_clsact_added = False
         self._tc_attached = False
+        self._primary_rx_redirect_attached = False
+        self._primary_rx_redirect_clsact_added = False
         self._bpf_prog_pref = None
         self._bpf_prog_id = None
         self._init_ns_attached = False
+        self._remote_route_added = False
         self._old_fwd = None
         self._old_accept_ra = None
 
@@ -396,8 +400,18 @@ class NetDrvContEnv(NetDrvEpEnv):
 
         self._setup_ns()
         self._attach_bpf()
+        if primary_rx_redirect:
+            self._attach_primary_rx_redirect_bpf()
 
     def __del__(self):
+        if self._primary_rx_redirect_attached:
+            cmd(f"tc filter del dev {self._nk_host_ifname} ingress", fail=False)
+            self._primary_rx_redirect_attached = False
+
+        if self._primary_rx_redirect_clsact_added:
+            cmd(f"tc qdisc del dev {self._nk_host_ifname} clsact", fail=False)
+            self._primary_rx_redirect_clsact_added = False
+
         if self._tc_attached:
             cmd(f"tc filter del dev {self.ifname} ingress pref {self._bpf_prog_pref}")
             self._tc_attached = False
@@ -406,6 +420,11 @@ class NetDrvContEnv(NetDrvEpEnv):
             cmd(f"tc qdisc del dev {self.ifname} clsact")
             self._tc_clsact_added = False
 
+        if self._remote_route_added:
+            cmd(f"ip -6 route del {self.nk_guest_ipv6}/128",
+                host=self.remote, fail=False)
+            self._remote_route_added = False
+
         if self._nk_host_ifname:
             cmd(f"ip link del dev {self._nk_host_ifname}")
             self._nk_host_ifname = None
@@ -459,13 +478,19 @@ class NetDrvContEnv(NetDrvEpEnv):
         ip(f"-6 addr add {self.nk_guest_ipv6}/64 dev {self.nk_guest_ifname} nodad", ns=self.netns)
         ip(f"-6 route add default via fe80::1 dev {self.nk_guest_ifname}", ns=self.netns)
 
-    def _tc_ensure_clsact(self):
-        qdisc = json.loads(cmd(f"tc -j qdisc show dev {self.ifname}").stdout)
+    def _tc_ensure_clsact(self, ifname=None):
+        """Ensure a clsact qdisc exists on @ifname.
+
+        Returns True if this call added the qdisc, otherwise returns False.
+        """
+        if ifname is None:
+            ifname = self.ifname
+        qdisc = json.loads(cmd(f"tc -j qdisc show dev {ifname}").stdout)
         for q in qdisc:
             if q['kind'] == 'clsact':
-                return
-        cmd(f"tc qdisc add dev {self.ifname} clsact")
-        self._tc_clsact_added = True
+                return False
+        cmd(f"tc qdisc add dev {ifname} clsact")
+        return True
 
     def _get_bpf_prog_ids(self):
         filters = json.loads(cmd(f"tc -j filter show dev {self.ifname} ingress").stdout)
@@ -476,28 +501,28 @@ class NetDrvContEnv(NetDrvEpEnv):
                 return (bpf['pref'], bpf['options']['prog']['id'])
         raise Exception("Failed to get BPF prog ID")
 
+    def _find_bss_map_id(self, prog_id):
+        """Find the .bss map ID for a loaded BPF program."""
+        prog_info = bpftool(f"prog show id {prog_id}", json=True)
+        for map_id in prog_info.get("map_ids", []):
+            map_info = bpftool(f"map show id {map_id}", json=True)
+            if map_info.get("name", "").endswith("bss"):
+                return map_id
+        raise Exception(f"Failed to find .bss map for prog {prog_id}")
+
     def _attach_bpf(self):
         bpf_obj = self.test_dir / "nk_forward.bpf.o"
         if not bpf_obj.exists():
             raise KsftSkipEx("BPF prog not found")
 
-        self._tc_ensure_clsact()
+        if self._tc_ensure_clsact():
+            self._tc_clsact_added = True
         cmd(f"tc filter add dev {self.ifname} ingress bpf obj {bpf_obj}"
             " sec tc/ingress direct-action")
         self._tc_attached = True
 
         (self._bpf_prog_pref, self._bpf_prog_id) = self._get_bpf_prog_ids()
-        prog_info = bpftool(f"prog show id {self._bpf_prog_id}", json=True)
-        map_ids = prog_info.get("map_ids", [])
-
-        bss_map_id = None
-        for map_id in map_ids:
-            map_info = bpftool(f"map show id {map_id}", json=True)
-            if map_info.get("name").endswith("bss"):
-                bss_map_id = map_id
-
-        if bss_map_id is None:
-            raise Exception("Failed to find .bss map")
+        bss_map_id = self._find_bss_map_id(self._bpf_prog_id)
 
         ipv6_addr = ipaddress.IPv6Address(self.ipv6_prefix)
         ipv6_bytes = ipv6_addr.packed
@@ -505,3 +530,36 @@ class NetDrvContEnv(NetDrvEpEnv):
         value = ipv6_bytes + ifindex_bytes
         value_hex = ' '.join(f'{b:02x}' for b in value)
         bpftool(f"map update id {bss_map_id} key hex 00 00 00 00 value hex {value_hex}")
+
+    def _attach_primary_rx_redirect_bpf(self):
+        """Attach BPF redirect program on the primary netkit ingress."""
+        bpf_obj = self.test_dir / "nk_primary_rx_redirect.bpf.o"
+        if not bpf_obj.exists():
+            raise KsftSkipEx("Primary RX redirect BPF prog not found")
+
+        if self._tc_ensure_clsact(self._nk_host_ifname):
+            self._primary_rx_redirect_clsact_added = True
+        cmd(f"tc filter add dev {self._nk_host_ifname} ingress"
+            f" bpf obj {bpf_obj} sec tc/ingress direct-action")
+        self._primary_rx_redirect_attached = True
+
+        ip(f"-6 route add {self.nk_guest_ipv6}/128 via {self.addr_v['6']}",
+           host=self.remote)
+        self._remote_route_added = True
+
+        filters = json.loads(
+            cmd(f"tc -j filter show dev {self._nk_host_ifname} ingress").stdout)
+        redirect_prog_id = None
+        for bpf in filters:
+            if 'options' not in bpf:
+                continue
+            if bpf['options']['bpf_name'].startswith('nk_primary_rx_redirect'):
+                redirect_prog_id = bpf['options']['prog']['id']
+                break
+        if redirect_prog_id is None:
+            raise Exception("Failed to get primary RX redirect BPF prog ID")
+
+        bss_map_id = self._find_bss_map_id(redirect_prog_id)
+        phys_ifindex_bytes = self.ifindex.to_bytes(4, byteorder=sys.byteorder)
+        value_hex = ' '.join(f'{b:02x}' for b in phys_ifindex_bytes)
+        bpftool(f"map update id {bss_map_id} key hex 00 00 00 00 value hex {value_hex}")

-- 
2.53.0-Meta


^ permalink raw reply related

* [PATCH net-next v5 6/8] selftests: drv-net: refactor devmem command builders into lib module
From: Bobby Eshleman @ 2026-05-14 17:22 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan, Alex Shi,
	Yanteng Si, Dongliang Mu, Michael Chan, Pavan Chebbi,
	Joshua Washington, Harshitha Ramamurthy, Saeed Mahameed,
	Tariq Toukan, Mark Bloch, Leon Romanovsky, Alexander Duyck,
	kernel-team, Daniel Borkmann, Nikolay Aleksandrov, Shuah Khan,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan, Alex Shi,
	Yanteng Si, Dongliang Mu, Michael Chan, Pavan Chebbi,
	Joshua Washington, Harshitha Ramamurthy, Saeed Mahameed,
	Tariq Toukan, Mark Bloch, Leon Romanovsky, Alexander Duyck,
	kernel-team, Daniel Borkmann, Nikolay Aleksandrov, Shuah Khan
  Cc: dw, sdf.kernel, mohsin.bashr, willemb, jiang.kun2, xu.xin16,
	wang.yaxin, netdev, linux-doc, linux-kernel, linux-rdma, bpf,
	linux-kselftest, Stanislav Fomichev, Mina Almasry, netdev,
	linux-doc, linux-kernel, linux-rdma, bpf, linux-kselftest,
	Bobby Eshleman
In-Reply-To: <20260514-tcp-dm-netkit-v5-0-408c59b91e66@meta.com>

From: Bobby Eshleman <bobbyeshleman@meta.com>

Adding netkit-based devmem tests is a straight-forward copy of devmem
test commands plus some args for the nk cases, so this patch breaks out
these command builders into helpers used by both.

Though we tried to avoid libraries to avoid increasing the barrier of
entry/complexity (see selftests/drivers/net/README.md, section "Avoid
libraries and frameworks"), factoring out these functions seemed like
the lesser of two evils in this case of using the same commands, just
with slightly different args per environment.

I experimented with just having all of the tests in the same file to
avoid having helpers in a library file, but because ksft_run() is
limited to a single call per file, and the new tests will require
different environments (NetDrvContEnv/NetDrvEpEnv), it would have been
necessary to have each test set up its own environment instead of
sharing one for the entire ksft_run() run. This came at the cost of
ballooning the test time (from under 5s to 30s on my test system), so to
strike a balance these tests were placed in separate files so they could
keep a shared environment across a single ksft_run() run shared across
all tests using the same env type (introduced in subsequent patches).

The helpers work transparently with both plain and netkit environments
by inspecting cfg for netkit-specific attributes (netns, nk_queue,
etc...).

Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
Changes in v5:
- Place the shared helpers in devmem_lib.py next to the test scripts
  rather than under lib/py/ (Jakub).
- Add devmem_lib.py to TEST_FILES (Jakub).
- configure_nic(): register cleanup via defer() and drop the separate
  cleanup_nic() helper. (Sashiko)
- Move configure_nic() into the run_*() helpers (with an early return
  outside a netkit env) so the queue lease doesn't break
  require_devmem()'s bind-rx check.
- Make the queue lease idempotent across run_*() calls.
- Fix pylint import-error and import-order.

Changes in v4:
- Fixed bad change list version, v4 -> v3 (Stan)

Changes in v3:
- Make socat_send() always bind the source; drop its bind= parameter
  and the matching bind=not_ns at the run_rx call site.
- Drop socat_send()'s nodelay= arg; have buf_size>0 imply TCP_NODELAY
  since they are only meaningful together.
- configure_nic(): stash originals on cfg instead of using defer(); add
  paired cleanup_nic() helper. Drop the per-test configure_nic() calls
  from run_rx/run_tx/run_tx_chunks/run_rx_hds; the netkit test file
  invokes configure_nic/cleanup_nic once around ksft_run().
- make cfg.devmem_supported and cfg.devmem_probed public attrs (no '_')
  for sake of linting
- general cleanup of the code, linting fixes
- In setup_test, drop the unused cfg.listen_ns = getattr(cfg, 'netns',
  None) assignment.
- In run_rx, pass flow_steer=not_ns to ncdevmem_rx and bind=not_ns to
  socat_send to avoid changing functionality (we want just a straight
  refactor here)

Changes in v2:
- Move require_devmem() into individual test functions so KsftSkipEx goes up to
  ksft_run() (Sashiko)
- in ncdevmem_rx(), move -v 7 to take effect for both netns and
  non-netns when verify=True
---
 tools/testing/selftests/drivers/net/hw/Makefile    |   1 +
 tools/testing/selftests/drivers/net/hw/devmem.py   |  77 ++-----
 .../testing/selftests/drivers/net/hw/devmem_lib.py | 222 +++++++++++++++++++++
 3 files changed, 236 insertions(+), 64 deletions(-)

diff --git a/tools/testing/selftests/drivers/net/hw/Makefile b/tools/testing/selftests/drivers/net/hw/Makefile
index 82809d5b2478..5e49d7bffced 100644
--- a/tools/testing/selftests/drivers/net/hw/Makefile
+++ b/tools/testing/selftests/drivers/net/hw/Makefile
@@ -52,6 +52,7 @@ TEST_PROGS = \
 	#
 
 TEST_FILES := \
+	devmem_lib.py \
 	ethtool_lib.sh \
 	#
 
diff --git a/tools/testing/selftests/drivers/net/hw/devmem.py b/tools/testing/selftests/drivers/net/hw/devmem.py
index ee863e90d1e0..031cf9905f65 100755
--- a/tools/testing/selftests/drivers/net/hw/devmem.py
+++ b/tools/testing/selftests/drivers/net/hw/devmem.py
@@ -2,91 +2,40 @@
 # SPDX-License-Identifier: GPL-2.0
 
 from os import path
-from lib.py import ksft_run, ksft_exit
-from lib.py import ksft_eq, KsftSkipEx
+from devmem_lib import setup_test, run_rx, run_tx, run_tx_chunks, run_rx_hds
+from lib.py import ksft_run, ksft_exit, ksft_disruptive
 from lib.py import NetDrvEpEnv
-from lib.py import bkg, cmd, rand_port, wait_port_listen
-from lib.py import ksft_disruptive
-
-
-def require_devmem(cfg):
-    if not hasattr(cfg, "_devmem_probed"):
-        probe_command = f"{cfg.bin_local} -f {cfg.ifname}"
-        cfg._devmem_supported = cmd(probe_command, fail=False, shell=True).ret == 0
-        cfg._devmem_probed = True
-
-    if not cfg._devmem_supported:
-        raise KsftSkipEx("Test requires devmem support")
 
 
 @ksft_disruptive
 def check_rx(cfg) -> None:
-    require_devmem(cfg)
-
-    port = rand_port()
-    socat = f"socat -u - TCP{cfg.addr_ipver}:{cfg.baddr}:{port},bind={cfg.remote_baddr}:{port}"
-    listen_cmd = f"{cfg.bin_local} -l -f {cfg.ifname} -s {cfg.addr} -p {port} -c {cfg.remote_addr} -v 7"
-
-    with bkg(listen_cmd, exit_wait=True) as ncdevmem:
-        wait_port_listen(port)
-        cmd(f"yes $(echo -e \x01\x02\x03\x04\x05\x06) | \
-            head -c 1K | {socat}", host=cfg.remote, shell=True)
-
-    ksft_eq(ncdevmem.ret, 0)
+    """Run the devmem RX test."""
+    run_rx(cfg)
 
 
 @ksft_disruptive
 def check_tx(cfg) -> None:
-    require_devmem(cfg)
-
-    port = rand_port()
-    listen_cmd = f"socat -U - TCP{cfg.addr_ipver}-LISTEN:{port}"
-
-    with bkg(listen_cmd, host=cfg.remote, exit_wait=True) as socat:
-        wait_port_listen(port, host=cfg.remote)
-        cmd(f"echo -e \"hello\\nworld\"| {cfg.bin_local} -f {cfg.ifname} -s {cfg.remote_addr} -p {port}", shell=True)
-
-    ksft_eq(socat.stdout.strip(), "hello\nworld")
+    """Run the devmem TX test."""
+    run_tx(cfg)
 
 
 @ksft_disruptive
 def check_tx_chunks(cfg) -> None:
-    require_devmem(cfg)
-
-    port = rand_port()
-    listen_cmd = f"socat -U - TCP{cfg.addr_ipver}-LISTEN:{port}"
-
-    with bkg(listen_cmd, host=cfg.remote, exit_wait=True) as socat:
-        wait_port_listen(port, host=cfg.remote)
-        cmd(f"echo -e \"hello\\nworld\"| {cfg.bin_local} -f {cfg.ifname} -s {cfg.remote_addr} -p {port} -z 3", shell=True)
-
-    ksft_eq(socat.stdout.strip(), "hello\nworld")
+    """Run the devmem TX chunking test."""
+    run_tx_chunks(cfg)
 
 
 def check_rx_hds(cfg) -> None:
-    """Test HDS splitting across payload sizes."""
-    require_devmem(cfg)
-
-    for size in [1, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]:
-        port = rand_port()
-        listen_cmd = f"{cfg.bin_local} -L -l -f {cfg.ifname} -s {cfg.addr} -p {port}"
-
-        with bkg(listen_cmd, exit_wait=True) as ncdevmem:
-            wait_port_listen(port)
-            cmd(f"dd if=/dev/zero bs={size} count=1 2>/dev/null | " +
-                f"socat -b {size} -u - TCP{cfg.addr_ipver}:{cfg.baddr}:{port},nodelay",
-                host=cfg.remote, shell=True)
-
-        ksft_eq(ncdevmem.ret, 0, f"HDS failed for payload size {size}")
+    """Run the HDS test."""
+    run_rx_hds(cfg)
 
 
 def main() -> None:
+    """Run the devmem test cases."""
     with NetDrvEpEnv(__file__) as cfg:
-        cfg.bin_local = path.abspath(path.dirname(__file__) + "/ncdevmem")
-        cfg.bin_remote = cfg.remote.deploy(cfg.bin_local)
-
+        setup_test(cfg, path.abspath(path.dirname(__file__) + "/ncdevmem"))
         ksft_run([check_rx, check_tx, check_tx_chunks, check_rx_hds],
-                 args=(cfg, ))
+                 args=(cfg,))
     ksft_exit()
 
 
diff --git a/tools/testing/selftests/drivers/net/hw/devmem_lib.py b/tools/testing/selftests/drivers/net/hw/devmem_lib.py
new file mode 100644
index 000000000000..0921ff03eb81
--- /dev/null
+++ b/tools/testing/selftests/drivers/net/hw/devmem_lib.py
@@ -0,0 +1,222 @@
+# SPDX-License-Identifier: GPL-2.0
+"""Shared helpers for devmem TCP selftests."""
+
+import re
+
+from lib.py import (bkg, cmd, defer, ethtool, rand_port, wait_port_listen,
+                    ksft_eq, KsftSkipEx, NetNSEnter, EthtoolFamily,
+                    NetdevFamily)
+
+
+def require_devmem(cfg):
+    """Probe ncdevmem on cfg.ifname and SKIP the test if devmem isn't supported."""
+    if not hasattr(cfg, "devmem_probed"):
+        probe_command = f"{cfg.bin_local} -f {cfg.ifname}"
+        cfg.devmem_supported = cmd(probe_command, fail=False, shell=True).ret == 0
+        cfg.devmem_probed = True
+
+    if not cfg.devmem_supported:
+        raise KsftSkipEx("Test requires devmem support")
+
+
+def configure_nic(cfg):
+    """Channels, rings, RSS, queue lease for netkit devmem."""
+    if not hasattr(cfg, 'netns'):
+        return
+
+    cfg.require_ipver('6')
+    ethnl = EthtoolFamily()
+
+    channels = ethnl.channels_get({'header': {'dev-index': cfg.ifindex}})
+    channels = channels['combined-count']
+    if channels < 2:
+        raise KsftSkipEx(
+            'Test requires NETIF with at least 2 combined channels'
+        )
+
+    rings = ethnl.rings_get({'header': {'dev-index': cfg.ifindex}})
+    orig_rx_rings = rings['rx']
+    orig_hds_thresh = rings.get('hds-thresh', 0)
+    orig_data_split = rings.get('tcp-data-split', 'unknown')
+
+    ethnl.rings_set({'header': {'dev-index': cfg.ifindex},
+                     'tcp-data-split': 'enabled',
+                     'hds-thresh': 0,
+                     'rx': min(64, orig_rx_rings)})
+    defer(ethnl.rings_set, {'header': {'dev-index': cfg.ifindex},
+                            'tcp-data-split': orig_data_split,
+                            'hds-thresh': orig_hds_thresh,
+                            'rx': orig_rx_rings})
+
+    cfg.src_queue = channels - 1
+    ethtool(f"-X {cfg.ifname} equal {cfg.src_queue}")
+    defer(ethtool, f"-X {cfg.ifname} default")
+
+    if not hasattr(cfg, 'nk_queue'):
+        with NetNSEnter(str(cfg.netns)):
+            netdevnl = NetdevFamily()
+            lease_result = netdevnl.queue_create({
+                "ifindex": cfg.nk_guest_ifindex,
+                "type": "rx",
+                "lease": {
+                    "ifindex": cfg.ifindex,
+                    "queue": {"id": cfg.src_queue, "type": "rx"},
+                    "netns-id": 0,
+                },
+            })
+            cfg.nk_queue = lease_result['id']
+
+
+def set_flow_rule(cfg, port):
+    """Install a flow rule steering to src_queue and return the flow rule ID."""
+    output = ethtool(
+        f"-N {cfg.ifname} flow-type tcp6 dst-port {port}"
+        f" action {cfg.src_queue}"
+    ).stdout
+    return int(re.search(r'ID (\d+)', output).group(1))
+
+
+def ncdevmem_rx(cfg, port, verify=True, fail_on_linear=False, flow_steer=False):
+    """Build the ncdevmem RX listener command."""
+    if hasattr(cfg, 'netns'):
+        flow_rule_id = set_flow_rule(cfg, port)
+        defer(ethtool, f"-N {cfg.ifname} delete {flow_rule_id}")
+
+        ifname = cfg.nk_guest_ifname
+        addr = cfg.nk_guest_ipv6
+        extras = [f"-t {cfg.nk_queue}", "-q 1", "-n"]
+    else:
+        ifname = cfg.ifname
+        addr = cfg.addr
+        extras = []
+        if flow_steer:
+            extras.append(f"-c {cfg.remote_addr}")
+
+    if verify:
+        extras.append("-v 7")
+    if fail_on_linear:
+        extras.append("-L")
+
+    parts = [cfg.bin_local, "-l", f"-f {ifname}", f"-s {addr}",
+             f"-p {port}", *extras]
+    return " ".join(parts)
+
+
+def ncdevmem_tx(cfg, port, chunk_size=0):
+    """Build the ncdevmem TX send command."""
+    if hasattr(cfg, 'netns'):
+        ifname = cfg.nk_guest_ifname
+        addr = cfg.remote_addr_v['6']
+        extras = ["-t 0", "-q 1", "-n"]
+    else:
+        ifname = cfg.ifname
+        addr = cfg.remote_addr
+        extras = []
+
+    if chunk_size:
+        extras.append(f"-z {chunk_size}")
+
+    parts = [cfg.bin_local, f"-f {ifname}", f"-s {addr}",
+             f"-p {port}", *extras]
+    return " ".join(parts)
+
+
+def socat_send(cfg, port, buf_size=0):
+    """Socat command for sending to the devmem listener.
+
+    When buf_size > 0, force one TCP segment per write of exactly that size by
+    setting socat's buffer (-b) and disabling Nagle (TCP_NODELAY).
+    """
+    proto = f"TCP{cfg.addr_ipver}"
+
+    if hasattr(cfg, 'netns'):
+        addr = f"[{cfg.nk_guest_ipv6}]"
+    else:
+        addr = cfg.baddr
+
+    suffix = f",bind={cfg.remote_baddr}:{port}"
+
+    buf = ""
+    if buf_size:
+        buf = f"-b {buf_size}"
+        suffix += ",nodelay"
+
+    return f"socat {buf} -u - {proto}:{addr}:{port}{suffix}"
+
+
+def socat_listen(cfg, port):
+    """Socat listen command for TX tests."""
+    return f"socat -U - TCP{cfg.addr_ipver}-LISTEN:{port}"
+
+
+def setup_test(cfg, bin_local):
+    """Stash the local ncdevmem path on cfg and deploy it to the remote."""
+    cfg.bin_local = bin_local
+    cfg.bin_remote = cfg.remote.deploy(cfg.bin_local)
+
+
+def run_rx(cfg):
+    """Run the devmem RX test."""
+    require_devmem(cfg)
+    configure_nic(cfg)
+    port = rand_port()
+    socat = socat_send(cfg, port)
+    data_pipe = (f"yes $(echo -e \x01\x02\x03\x04\x05\x06) | head -c 1K"
+                 f" | {socat}")
+    netns = getattr(cfg, "netns", None)
+
+    listen_cmd = ncdevmem_rx(cfg, port, flow_steer=not hasattr(cfg, 'netns'))
+    with bkg(listen_cmd, exit_wait=True, ns=netns) as ncdevmem:
+        wait_port_listen(port, proto="tcp", ns=netns)
+        cmd(data_pipe, host=cfg.remote, shell=True)
+    ksft_eq(ncdevmem.ret, 0)
+
+
+def run_tx(cfg):
+    """Run the devmem TX test."""
+    require_devmem(cfg)
+    configure_nic(cfg)
+    netns = getattr(cfg, "netns", None)
+    port = rand_port()
+    tx_cmd = ncdevmem_tx(cfg, port)
+    listen_cmd = socat_listen(cfg, port)
+
+    with bkg(listen_cmd, host=cfg.remote, exit_wait=True) as socat:
+        wait_port_listen(port, host=cfg.remote)
+        cmd(f"bash -c 'echo -e \"hello\\nworld\" | {tx_cmd}'", ns=netns, shell=True)
+    ksft_eq(socat.stdout.strip(), "hello\nworld")
+
+
+def run_tx_chunks(cfg):
+    """Run the devmem TX chunking test."""
+    require_devmem(cfg)
+    configure_nic(cfg)
+    netns = getattr(cfg, "netns", None)
+    port = rand_port()
+    tx_cmd = ncdevmem_tx(cfg, port, chunk_size=3)
+    listen_cmd = socat_listen(cfg, port)
+
+    with bkg(listen_cmd, host=cfg.remote, exit_wait=True) as socat:
+        wait_port_listen(port, host=cfg.remote)
+        cmd(f"bash -c 'echo -e \"hello\\nworld\" | {tx_cmd}'", ns=netns, shell=True)
+    ksft_eq(socat.stdout.strip(), "hello\nworld")
+
+
+def run_rx_hds(cfg):
+    """Run the HDS test by running devmem RX across a segment size sweep."""
+    require_devmem(cfg)
+    configure_nic(cfg)
+    netns = getattr(cfg, "netns", None)
+
+    for size in [1, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]:
+        port = rand_port()
+
+        listen_cmd = ncdevmem_rx(cfg, port, verify=False,
+                                 fail_on_linear=True)
+        socat = socat_send(cfg, port, buf_size=size)
+
+        with bkg(listen_cmd, exit_wait=True, ns=netns) as ncdevmem:
+            wait_port_listen(port, proto="tcp", ns=netns)
+            cmd(f"dd if=/dev/zero bs={size} count=1 2>/dev/null | "
+                f"{socat}", host=cfg.remote, shell=True)
+        ksft_eq(ncdevmem.ret, 0, f"HDS failed for payload size {size}")

-- 
2.53.0-Meta


^ permalink raw reply related

* [PATCH net-next v5 5/8] selftests: drv-net: make attr _nk_guest_ifname public
From: Bobby Eshleman @ 2026-05-14 17:22 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan, Alex Shi,
	Yanteng Si, Dongliang Mu, Michael Chan, Pavan Chebbi,
	Joshua Washington, Harshitha Ramamurthy, Saeed Mahameed,
	Tariq Toukan, Mark Bloch, Leon Romanovsky, Alexander Duyck,
	kernel-team, Daniel Borkmann, Nikolay Aleksandrov, Shuah Khan,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan, Alex Shi,
	Yanteng Si, Dongliang Mu, Michael Chan, Pavan Chebbi,
	Joshua Washington, Harshitha Ramamurthy, Saeed Mahameed,
	Tariq Toukan, Mark Bloch, Leon Romanovsky, Alexander Duyck,
	kernel-team, Daniel Borkmann, Nikolay Aleksandrov, Shuah Khan
  Cc: dw, sdf.kernel, mohsin.bashr, willemb, jiang.kun2, xu.xin16,
	wang.yaxin, netdev, linux-doc, linux-kernel, linux-rdma, bpf,
	linux-kselftest, Stanislav Fomichev, Mina Almasry, netdev,
	linux-doc, linux-kernel, linux-rdma, bpf, linux-kselftest,
	Bobby Eshleman
In-Reply-To: <20260514-tcp-dm-netkit-v5-0-408c59b91e66@meta.com>

From: Bobby Eshleman <bobbyeshleman@meta.com>

Subsequent patches will use the _nk_guest_ifname as a public attr for
setting up devmem. Rename to nk_guest_ifname to avoid angering the
linter about the '_' prefix being used for a non-private attr.

Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
 tools/testing/selftests/drivers/net/hw/nk_qlease.py |  8 ++++----
 tools/testing/selftests/drivers/net/lib/py/env.py   | 16 ++++++++--------
 2 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/tools/testing/selftests/drivers/net/hw/nk_qlease.py b/tools/testing/selftests/drivers/net/hw/nk_qlease.py
index aa83dc321328..139a91ebd229 100755
--- a/tools/testing/selftests/drivers/net/hw/nk_qlease.py
+++ b/tools/testing/selftests/drivers/net/hw/nk_qlease.py
@@ -71,7 +71,7 @@ def test_iou_zcrx(cfg) -> None:
     flow_rule_id = set_flow_rule(cfg)
     defer(ethtool, f"-N {cfg.ifname} delete {flow_rule_id}")
 
-    rx_cmd = f"ip netns exec {cfg.netns.name} {cfg.bin_local} -s -p {cfg.port} -i {cfg._nk_guest_ifname} -q {cfg.nk_queue}"
+    rx_cmd = f"ip netns exec {cfg.netns.name} {cfg.bin_local} -s -p {cfg.port} -i {cfg.nk_guest_ifname} -q {cfg.nk_queue}"
     tx_cmd = f"{cfg.bin_remote} -c -h {cfg.nk_guest_ipv6} -p {cfg.port} -l 12840"
     with bkg(rx_cmd, exit_wait=True):
         wait_port_listen(cfg.port, proto="tcp", ns=cfg.netns)
@@ -128,7 +128,7 @@ def test_attach_xdp_with_mp(cfg) -> None:
 
     netdevnl = NetdevFamily()
 
-    rx_cmd = f"ip netns exec {cfg.netns.name} {cfg.bin_local} -s -p {cfg.port} -i {cfg._nk_guest_ifname} -q {cfg.nk_queue}"
+    rx_cmd = f"ip netns exec {cfg.netns.name} {cfg.bin_local} -s -p {cfg.port} -i {cfg.nk_guest_ifname} -q {cfg.nk_queue}"
     with bkg(rx_cmd):
         wait_port_listen(cfg.port, proto="tcp", ns=cfg.netns)
 
@@ -178,7 +178,7 @@ def test_destroy(cfg) -> None:
     ethtool(f"-X {cfg.ifname} equal {cfg.src_queue}")
     defer(ethtool, f"-X {cfg.ifname} default")
 
-    rx_cmd = f"ip netns exec {cfg.netns.name} {cfg.bin_local} -s -p {cfg.port} -i {cfg._nk_guest_ifname} -q {cfg.nk_queue}"
+    rx_cmd = f"ip netns exec {cfg.netns.name} {cfg.bin_local} -s -p {cfg.port} -i {cfg.nk_guest_ifname} -q {cfg.nk_queue}"
     rx_proc = cmd(rx_cmd, background=True)
     wait_port_listen(cfg.port, proto="tcp", ns=cfg.netns)
 
@@ -196,7 +196,7 @@ def test_destroy(cfg) -> None:
     ip(f"link del dev {cfg._nk_host_ifname}")
     kill_timer.join()
     cfg._nk_host_ifname = None
-    cfg._nk_guest_ifname = None
+    cfg.nk_guest_ifname = None
 
     queue_info = netdevnl.queue_get(
         {"ifindex": cfg.ifindex, "id": cfg.src_queue, "type": "rx"}
diff --git a/tools/testing/selftests/drivers/net/lib/py/env.py b/tools/testing/selftests/drivers/net/lib/py/env.py
index 24ce122abd9c..409b41922245 100644
--- a/tools/testing/selftests/drivers/net/lib/py/env.py
+++ b/tools/testing/selftests/drivers/net/lib/py/env.py
@@ -339,7 +339,7 @@ class NetDrvContEnv(NetDrvEpEnv):
     def __init__(self, src_path, rxqueues=1, **kwargs):
         self.netns = None
         self._nk_host_ifname = None
-        self._nk_guest_ifname = None
+        self.nk_guest_ifname = None
         self._tc_clsact_added = False
         self._tc_attached = False
         self._bpf_prog_pref = None
@@ -390,7 +390,7 @@ class NetDrvContEnv(NetDrvEpEnv):
 
         netkit_links.sort(key=lambda x: x['ifindex'])
         self._nk_host_ifname = netkit_links[1]['ifname']
-        self._nk_guest_ifname = netkit_links[0]['ifname']
+        self.nk_guest_ifname = netkit_links[0]['ifname']
         self.nk_host_ifindex = netkit_links[1]['ifindex']
         self.nk_guest_ifindex = netkit_links[0]['ifindex']
 
@@ -409,7 +409,7 @@ class NetDrvContEnv(NetDrvEpEnv):
         if self._nk_host_ifname:
             cmd(f"ip link del dev {self._nk_host_ifname}")
             self._nk_host_ifname = None
-            self._nk_guest_ifname = None
+            self.nk_guest_ifname = None
 
         if self._init_ns_attached:
             cmd("ip netns del init", fail=False)
@@ -448,16 +448,16 @@ class NetDrvContEnv(NetDrvEpEnv):
         cmd("ip netns attach init 1")
         self._init_ns_attached = True
         ip("netns set init 0", ns=self.netns)
-        ip(f"link set dev {self._nk_guest_ifname} netns {self.netns.name}")
+        ip(f"link set dev {self.nk_guest_ifname} netns {self.netns.name}")
         ip(f"link set dev {self._nk_host_ifname} up")
         ip(f"-6 addr add fe80::1/64 dev {self._nk_host_ifname} nodad")
         ip(f"-6 route add {self.nk_guest_ipv6}/128 via fe80::2 dev {self._nk_host_ifname}")
 
         ip("link set lo up", ns=self.netns)
-        ip(f"link set dev {self._nk_guest_ifname} up", ns=self.netns)
-        ip(f"-6 addr add fe80::2/64 dev {self._nk_guest_ifname}", ns=self.netns)
-        ip(f"-6 addr add {self.nk_guest_ipv6}/64 dev {self._nk_guest_ifname} nodad", ns=self.netns)
-        ip(f"-6 route add default via fe80::1 dev {self._nk_guest_ifname}", ns=self.netns)
+        ip(f"link set dev {self.nk_guest_ifname} up", ns=self.netns)
+        ip(f"-6 addr add fe80::2/64 dev {self.nk_guest_ifname}", ns=self.netns)
+        ip(f"-6 addr add {self.nk_guest_ipv6}/64 dev {self.nk_guest_ifname} nodad", ns=self.netns)
+        ip(f"-6 route add default via fe80::1 dev {self.nk_guest_ifname}", ns=self.netns)
 
     def _tc_ensure_clsact(self):
         qdisc = json.loads(cmd(f"tc -j qdisc show dev {self.ifname}").stdout)

-- 
2.53.0-Meta


^ permalink raw reply related

* [PATCH net-next v5 4/8] selftests: drv-net: ncdevmem: add -n flag to skip NIC configuration
From: Bobby Eshleman @ 2026-05-14 17:22 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan, Alex Shi,
	Yanteng Si, Dongliang Mu, Michael Chan, Pavan Chebbi,
	Joshua Washington, Harshitha Ramamurthy, Saeed Mahameed,
	Tariq Toukan, Mark Bloch, Leon Romanovsky, Alexander Duyck,
	kernel-team, Daniel Borkmann, Nikolay Aleksandrov, Shuah Khan,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan, Alex Shi,
	Yanteng Si, Dongliang Mu, Michael Chan, Pavan Chebbi,
	Joshua Washington, Harshitha Ramamurthy, Saeed Mahameed,
	Tariq Toukan, Mark Bloch, Leon Romanovsky, Alexander Duyck,
	kernel-team, Daniel Borkmann, Nikolay Aleksandrov, Shuah Khan
  Cc: dw, sdf.kernel, mohsin.bashr, willemb, jiang.kun2, xu.xin16,
	wang.yaxin, netdev, linux-doc, linux-kernel, linux-rdma, bpf,
	linux-kselftest, Stanislav Fomichev, Mina Almasry, netdev,
	linux-doc, linux-kernel, linux-rdma, bpf, linux-kselftest,
	Bobby Eshleman
In-Reply-To: <20260514-tcp-dm-netkit-v5-0-408c59b91e66@meta.com>

From: Bobby Eshleman <bobbyeshleman@meta.com>

Add a -n (skip_config) flag that causes ncdevmem to skip NIC
configuration when operating as an RX server. When -n is passed,
ncdevmem skips configuring header split, RSS, and flow steering, as well
as their teardown on exit.

This allows ksft tests to pre-configure the NIC in the host namespace
before launching ncdevmem in the guest namespace. This is needed for
netkit devmem tests where the test harness namespace has direct access
to the NIC and the ncdevmem namespace does not.

Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
 tools/testing/selftests/drivers/net/hw/ncdevmem.c | 58 +++++++++++++----------
 1 file changed, 34 insertions(+), 24 deletions(-)

diff --git a/tools/testing/selftests/drivers/net/hw/ncdevmem.c b/tools/testing/selftests/drivers/net/hw/ncdevmem.c
index e098d6534c3c..d96e8a3b5a65 100644
--- a/tools/testing/selftests/drivers/net/hw/ncdevmem.c
+++ b/tools/testing/selftests/drivers/net/hw/ncdevmem.c
@@ -93,6 +93,7 @@ static char *port;
 static size_t do_validation;
 static int start_queue = -1;
 static int num_queues = -1;
+static int skip_config;
 static char *ifname;
 static unsigned int ifindex;
 static unsigned int dmabuf_id;
@@ -828,7 +829,7 @@ static struct netdev_queue_id *create_queues(void)
 
 static int do_server(struct memory_buffer *mem)
 {
-	struct ethtool_rings_get_rsp *ring_config;
+	struct ethtool_rings_get_rsp *ring_config = NULL;
 	char ctrl_data[sizeof(int) * 20000];
 	size_t non_page_aligned_frags = 0;
 	struct sockaddr_in6 client_addr;
@@ -851,27 +852,29 @@ static int do_server(struct memory_buffer *mem)
 		return -1;
 	}
 
-	ring_config = get_ring_config();
-	if (!ring_config) {
-		pr_err("Failed to get current ring configuration");
-		return -1;
-	}
+	if (!skip_config) {
+		ring_config = get_ring_config();
+		if (!ring_config) {
+			pr_err("Failed to get current ring configuration");
+			return -1;
+		}
 
-	if (configure_headersplit(ring_config, 1)) {
-		pr_err("Failed to enable TCP header split");
-		goto err_free_ring_config;
-	}
+		if (configure_headersplit(ring_config, 1)) {
+			pr_err("Failed to enable TCP header split");
+			goto err_free_ring_config;
+		}
 
-	/* Configure RSS to divert all traffic from our devmem queues */
-	if (configure_rss()) {
-		pr_err("Failed to configure rss");
-		goto err_reset_headersplit;
-	}
+		/* Configure RSS to divert all traffic from our devmem queues */
+		if (configure_rss()) {
+			pr_err("Failed to configure rss");
+			goto err_reset_headersplit;
+		}
 
-	/* Flow steer our devmem flows to start_queue */
-	if (configure_flow_steering(&server_sin)) {
-		pr_err("Failed to configure flow steering");
-		goto err_reset_rss;
+		/* Flow steer our devmem flows to start_queue */
+		if (configure_flow_steering(&server_sin)) {
+			pr_err("Failed to configure flow steering");
+			goto err_reset_rss;
+		}
 	}
 
 	if (bind_rx_queue(ifindex, mem->fd, create_queues(), num_queues, &ys)) {
@@ -1052,13 +1055,17 @@ static int do_server(struct memory_buffer *mem)
 err_unbind:
 	ynl_sock_destroy(ys);
 err_reset_flow_steering:
-	reset_flow_steering();
+	if (!skip_config)
+		reset_flow_steering();
 err_reset_rss:
-	reset_rss();
+	if (!skip_config)
+		reset_rss();
 err_reset_headersplit:
-	restore_ring_config(ring_config);
+	if (!skip_config)
+		restore_ring_config(ring_config);
 err_free_ring_config:
-	ethtool_rings_get_rsp_free(ring_config);
+	if (!skip_config)
+		ethtool_rings_get_rsp_free(ring_config);
 	return err;
 }
 
@@ -1404,7 +1411,7 @@ int main(int argc, char *argv[])
 	int is_server = 0, opt;
 	int ret, err = 1;
 
-	while ((opt = getopt(argc, argv, "Lls:c:p:v:q:t:f:z:")) != -1) {
+	while ((opt = getopt(argc, argv, "Lls:c:p:v:q:t:f:z:n")) != -1) {
 		switch (opt) {
 		case 'L':
 			fail_on_linear = true;
@@ -1436,6 +1443,9 @@ int main(int argc, char *argv[])
 		case 'z':
 			max_chunk = atoi(optarg);
 			break;
+		case 'n':
+			skip_config = 1;
+			break;
 		case '?':
 			fprintf(stderr, "unknown option: %c\n", optopt);
 			break;

-- 
2.53.0-Meta


^ permalink raw reply related

* [PATCH net-next v5 3/8] net: devmem: support TX over NETMEM_TX_NO_DMA devices
From: Bobby Eshleman @ 2026-05-14 17:22 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan, Alex Shi,
	Yanteng Si, Dongliang Mu, Michael Chan, Pavan Chebbi,
	Joshua Washington, Harshitha Ramamurthy, Saeed Mahameed,
	Tariq Toukan, Mark Bloch, Leon Romanovsky, Alexander Duyck,
	kernel-team, Daniel Borkmann, Nikolay Aleksandrov, Shuah Khan,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan, Alex Shi,
	Yanteng Si, Dongliang Mu, Michael Chan, Pavan Chebbi,
	Joshua Washington, Harshitha Ramamurthy, Saeed Mahameed,
	Tariq Toukan, Mark Bloch, Leon Romanovsky, Alexander Duyck,
	kernel-team, Daniel Borkmann, Nikolay Aleksandrov, Shuah Khan
  Cc: dw, sdf.kernel, mohsin.bashr, willemb, jiang.kun2, xu.xin16,
	wang.yaxin, netdev, linux-doc, linux-kernel, linux-rdma, bpf,
	linux-kselftest, Stanislav Fomichev, Mina Almasry, netdev,
	linux-doc, linux-kernel, linux-rdma, bpf, linux-kselftest,
	Bobby Eshleman
In-Reply-To: <20260514-tcp-dm-netkit-v5-0-408c59b91e66@meta.com>

From: Bobby Eshleman <bobbyeshleman@meta.com>

When a netkit virtual device leases queues from a physical NIC, devmem
TX bindings created on the netkit device must still result in the dmabuf
being mapped for dma by the physical device. This patch accomplishes
this by teaching the bind handler to search for the underlying
DMA-capable device by looking it up via leased rx queues. The function
netdev_find_netmem_tx_dev(), used for finding the underlying DMA-capable
device, can be extended to support other non-netkit NETMEM_TX_NO_DMA
devices in the future if needed.

Additionally, this patch extends validate_xmit_unreadable_skb() to
support the netkit case, where the skb is validated twice: once on the
netkit guest device and again on the physical NIC after BPF redirect or
ip forwarding.

Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
Changes in v4:
- Fold the `NETMEM_TX_NO_DMA` check in `validate_xmit_unreadable_skb()`
  (Stan, Jakub)
- Convert `binding->vdev` to void* opaque cookie with comment (Jakub)

Changes in v3:
- Fix validate_xmit_unreadable_skb() bug for non-devmem
  unreadable niovs (should not be dropped)
- Major simplification of validate_xmit_unreadable_skb()
- Fix prematurely released lock in bind-tx handler (Jakub)

Changes in v2:
- In validate_xmit_unreadable_skb() to check netmem_tx mode before
  inspecting frags (Jakub)
- Lock bind_dev around netdev_queue_get_dma_dev() when bind_dev !=
  netdev to fix lockdep (Sashiko)
---
 net/core/dev.c         |  3 ++-
 net/core/devmem.c      |  6 +++--
 net/core/devmem.h      | 10 ++++++--
 net/core/netdev-genl.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++----
 4 files changed, 72 insertions(+), 10 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 2da2688fe490..bbc93b181ef9 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3993,7 +3993,8 @@ static struct sk_buff *validate_xmit_unreadable_skb(struct sk_buff *skb,
 	struct skb_shared_info *shinfo;
 	struct net_iov *niov;
 
-	if (likely(skb_frags_readable(skb)))
+	if (likely(skb_frags_readable(skb) ||
+		   dev->netmem_tx == NETMEM_TX_NO_DMA))
 		goto out;
 
 	if (dev->netmem_tx == NETMEM_TX_NONE)
diff --git a/net/core/devmem.c b/net/core/devmem.c
index 468344739db2..893643909f6a 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -181,7 +181,7 @@ int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
 }
 
 struct net_devmem_dmabuf_binding *
-net_devmem_bind_dmabuf(struct net_device *dev,
+net_devmem_bind_dmabuf(struct net_device *dev, void *vdev,
 		       struct device *dma_dev,
 		       enum dma_data_direction direction,
 		       unsigned int dmabuf_fd, struct netdev_nl_sock *priv,
@@ -212,6 +212,7 @@ net_devmem_bind_dmabuf(struct net_device *dev,
 	}
 
 	binding->dev = dev;
+	binding->vdev = vdev;
 	xa_init_flags(&binding->bound_rxqs, XA_FLAGS_ALLOC);
 
 	err = percpu_ref_init(&binding->ref,
@@ -396,7 +397,8 @@ struct net_devmem_dmabuf_binding *net_devmem_get_binding(struct sock *sk,
 	 */
 	dst_dev = dst_dev_rcu(dst);
 	if (unlikely(!dst_dev) ||
-	    unlikely(dst_dev != READ_ONCE(binding->dev))) {
+	    unlikely(dst_dev != READ_ONCE(binding->dev) &&
+		     dst_dev != READ_ONCE(binding->vdev))) {
 		err = -ENODEV;
 		goto out_unlock;
 	}
diff --git a/net/core/devmem.h b/net/core/devmem.h
index 1c5c18581fcb..3852a56036cb 100644
--- a/net/core/devmem.h
+++ b/net/core/devmem.h
@@ -19,7 +19,13 @@ struct net_devmem_dmabuf_binding {
 	struct dma_buf *dmabuf;
 	struct dma_buf_attachment *attachment;
 	struct sg_table *sgt;
+	/* Physical NIC that does the actual DMA for this binding. */
 	struct net_device *dev;
+	/* Opaque cookie identifying the virtual device (e.g. netkit) the user
+	 * called bind-tx on. Used only for pointer comparison. Never
+	 * dereferenced.
+	 */
+	void *vdev;
 	struct gen_pool *chunk_pool;
 	/* Protect dev */
 	struct mutex lock;
@@ -84,7 +90,7 @@ struct dmabuf_genpool_chunk_owner {
 
 void __net_devmem_dmabuf_binding_free(struct work_struct *wq);
 struct net_devmem_dmabuf_binding *
-net_devmem_bind_dmabuf(struct net_device *dev,
+net_devmem_bind_dmabuf(struct net_device *dev, void *vdev,
 		       struct device *dma_dev,
 		       enum dma_data_direction direction,
 		       unsigned int dmabuf_fd, struct netdev_nl_sock *priv,
@@ -165,7 +171,7 @@ static inline void net_devmem_put_net_iov(struct net_iov *niov)
 }
 
 static inline struct net_devmem_dmabuf_binding *
-net_devmem_bind_dmabuf(struct net_device *dev,
+net_devmem_bind_dmabuf(struct net_device *dev, void *vdev,
 		       struct device *dma_dev,
 		       enum dma_data_direction direction,
 		       unsigned int dmabuf_fd,
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index 4d2c49371cdb..b4d48f3672a5 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -1077,7 +1077,7 @@ int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info)
 		goto err_rxq_bitmap;
 	}
 
-	binding = net_devmem_bind_dmabuf(netdev, dma_dev, DMA_FROM_DEVICE,
+	binding = net_devmem_bind_dmabuf(netdev, NULL, dma_dev, DMA_FROM_DEVICE,
 					 dmabuf_fd, priv, info->extack);
 	if (IS_ERR(binding)) {
 		err = PTR_ERR(binding);
@@ -1119,9 +1119,43 @@ int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info)
 	return err;
 }
 
+/* Find the DMA-capable device for a netmem TX binding.
+ *
+ * For NETMEM_TX_DMA devices, return the device itself.
+ * For NETMEM_TX_NO_DMA devices, walk leased RX queues to find the underlying
+ * physical device and return it.
+ */
+static struct net_device *
+netdev_find_netmem_tx_dev(struct net_device *dev)
+{
+	struct netdev_rx_queue *lease_rxq;
+	struct net_device *phys_dev;
+	int i;
+
+	if (dev->netmem_tx == NETMEM_TX_DMA)
+		return dev;
+
+	if (dev->netmem_tx != NETMEM_TX_NO_DMA)
+		return NULL;
+
+	for (i = 0; i < dev->real_num_rx_queues; i++) {
+		lease_rxq = READ_ONCE(__netif_get_rx_queue(dev, i)->lease);
+		if (!lease_rxq)
+			continue;
+
+		phys_dev = lease_rxq->dev;
+		if (netif_device_present(phys_dev) &&
+		    phys_dev->netmem_tx == NETMEM_TX_DMA)
+			return phys_dev;
+	}
+
+	return NULL;
+}
+
 int netdev_nl_bind_tx_doit(struct sk_buff *skb, struct genl_info *info)
 {
 	struct net_devmem_dmabuf_binding *binding;
+	struct net_device *bind_dev;
 	struct netdev_nl_sock *priv;
 	struct net_device *netdev;
 	struct device *dma_dev;
@@ -1171,22 +1205,41 @@ int netdev_nl_bind_tx_doit(struct sk_buff *skb, struct genl_info *info)
 		goto err_unlock_netdev;
 	}
 
-	dma_dev = netdev_queue_get_dma_dev(netdev, 0, NETDEV_QUEUE_TYPE_TX);
-	binding = net_devmem_bind_dmabuf(netdev, dma_dev, DMA_TO_DEVICE,
-					 dmabuf_fd, priv, info->extack);
+	bind_dev = netdev_find_netmem_tx_dev(netdev);
+	if (!bind_dev) {
+		err = -EOPNOTSUPP;
+		NL_SET_ERR_MSG(info->extack,
+			       "No DMA-capable device found for netmem TX");
+		goto err_unlock_netdev;
+	}
+
+	if (bind_dev != netdev)
+		netdev_lock(bind_dev);
+
+	dma_dev = netdev_queue_get_dma_dev(bind_dev, 0, NETDEV_QUEUE_TYPE_TX);
+
+	binding = net_devmem_bind_dmabuf(bind_dev,
+					 bind_dev != netdev ? netdev : NULL,
+					 dma_dev, DMA_TO_DEVICE, dmabuf_fd,
+					 priv, info->extack);
 	if (IS_ERR(binding)) {
 		err = PTR_ERR(binding);
-		goto err_unlock_netdev;
+		goto err_unlock_bind_dev;
 	}
 
 	nla_put_u32(rsp, NETDEV_A_DMABUF_ID, binding->id);
 	genlmsg_end(rsp, hdr);
 
+	if (bind_dev != netdev)
+		netdev_unlock(bind_dev);
 	netdev_unlock(netdev);
 	mutex_unlock(&priv->lock);
 
 	return genlmsg_reply(rsp, info);
 
+err_unlock_bind_dev:
+	if (bind_dev != netdev)
+		netdev_unlock(bind_dev);
 err_unlock_netdev:
 	netdev_unlock(netdev);
 err_unlock_sock:

-- 
2.53.0-Meta


^ permalink raw reply related

* [PATCH net-next v5 2/8] net: netkit: declare NETMEM_TX_NO_DMA mode
From: Bobby Eshleman @ 2026-05-14 17:22 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan, Alex Shi,
	Yanteng Si, Dongliang Mu, Michael Chan, Pavan Chebbi,
	Joshua Washington, Harshitha Ramamurthy, Saeed Mahameed,
	Tariq Toukan, Mark Bloch, Leon Romanovsky, Alexander Duyck,
	kernel-team, Daniel Borkmann, Nikolay Aleksandrov, Shuah Khan,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan, Alex Shi,
	Yanteng Si, Dongliang Mu, Michael Chan, Pavan Chebbi,
	Joshua Washington, Harshitha Ramamurthy, Saeed Mahameed,
	Tariq Toukan, Mark Bloch, Leon Romanovsky, Alexander Duyck,
	kernel-team, Daniel Borkmann, Nikolay Aleksandrov, Shuah Khan
  Cc: dw, sdf.kernel, mohsin.bashr, willemb, jiang.kun2, xu.xin16,
	wang.yaxin, netdev, linux-doc, linux-kernel, linux-rdma, bpf,
	linux-kselftest, Stanislav Fomichev, Mina Almasry, netdev,
	linux-doc, linux-kernel, linux-rdma, bpf, linux-kselftest,
	Bobby Eshleman
In-Reply-To: <20260514-tcp-dm-netkit-v5-0-408c59b91e66@meta.com>

From: Bobby Eshleman <bobbyeshleman@meta.com>

Some virtual devices like netkit (or ifb) never DMA and never touch frag
contents, they just forward the skb to another device. They are unable
to forward unreadable skbs, however, because they fail to pass TX
validation checks on dev->netmem_tx. The existing two-state
NETMEM_TX_NONE / NETMEM_TX_DMA doesn't give the TX validator enough
information to differentiate devices that will attempt DMA on the
unreadable skb from those that will simply route it untouched.

Add a third mode to the enum so drivers can indicate 1) if they have
netmem TX support, and 2) if they do, whether they are DMA-capable:

NETMEM_TX_NO_DMA - pass-through, device never DMAs

Widen dev->netmem_tx from a 1-bit field to 2 bits to fit the new value,
and declare netkit as NETMEM_TX_NO_DMA. Devmem TX support over these
devices comes in a follow-up patch.

Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
Changes in v3:
- net_cachelines/net_device.rst: align the netmem_tx row's type column
  with the rest of the table by using "unsigned_long:2" instead of
  "unsigned long:2"
- Split this into a distinct patch (Jakub)
---
 Documentation/networking/net_cachelines/net_device.rst | 2 +-
 Documentation/networking/netmem.rst                    | 3 +++
 Documentation/translations/zh_CN/networking/netmem.rst | 3 +++
 drivers/net/netkit.c                                   | 1 +
 include/linux/netdevice.h                              | 3 ++-
 5 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/Documentation/networking/net_cachelines/net_device.rst b/Documentation/networking/net_cachelines/net_device.rst
index 1c19bb7705df..7b3392553fd6 100644
--- a/Documentation/networking/net_cachelines/net_device.rst
+++ b/Documentation/networking/net_cachelines/net_device.rst
@@ -10,7 +10,7 @@ Type                                Name                        fastpath_tx_acce
 =================================== =========================== =================== =================== ===================================================================================
 unsigned_long:32                    priv_flags                  read_mostly                             __dev_queue_xmit(tx)
 unsigned_long:1                     lltx                        read_mostly                             HARD_TX_LOCK,HARD_TX_TRYLOCK,HARD_TX_UNLOCK(tx)
-unsigned long:1                     netmem_tx:1;                read_mostly
+unsigned_long:2                     netmem_tx:2;                read_mostly
 char                                name[16]
 struct netdev_name_node*            name_node
 struct dev_ifalias*                 ifalias
diff --git a/Documentation/networking/netmem.rst b/Documentation/networking/netmem.rst
index 5ccadba4f373..217869d1108d 100644
--- a/Documentation/networking/netmem.rst
+++ b/Documentation/networking/netmem.rst
@@ -99,3 +99,6 @@ Driver TX Requirements
    appropriate mode:
 
    - `NETMEM_TX_DMA`: for physical devices that perform DMA.
+
+   - `NETMEM_TX_NO_DMA`: for virtual or passthrough devices that do
+     not DMA, but still support handling of netmem-backed skbs.
diff --git a/Documentation/translations/zh_CN/networking/netmem.rst b/Documentation/translations/zh_CN/networking/netmem.rst
index 9c84423b7528..320f3eacf51b 100644
--- a/Documentation/translations/zh_CN/networking/netmem.rst
+++ b/Documentation/translations/zh_CN/networking/netmem.rst
@@ -92,3 +92,6 @@ dma-mapping API 去处理。
 2. 驱动程序应将 `netdev->netmem_tx` 设置为适当的模式：
 
    - `NETMEM_TX_DMA`：适用于执行 DMA 的物理设备。
+
+   - `NETMEM_TX_NO_DMA`：适用于不执行 DMA 的虚拟或透传设备，但仍支持
+     处理 netmem 支持的 skb。
diff --git a/drivers/net/netkit.c b/drivers/net/netkit.c
index 5e2eecc3165d..0ad6a806d7d5 100644
--- a/drivers/net/netkit.c
+++ b/drivers/net/netkit.c
@@ -466,6 +466,7 @@ static void netkit_setup(struct net_device *dev)
 	dev->priv_flags |= IFF_NO_QUEUE;
 	dev->priv_flags |= IFF_DISABLE_NETPOLL;
 	dev->lltx = true;
+	dev->netmem_tx = NETMEM_TX_NO_DMA;
 
 	dev->netdev_ops     = &netkit_netdev_ops;
 	dev->ethtool_ops    = &netkit_ethtool_ops;
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index b7a4503f7cdb..bf3dd9b2c1a7 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1797,6 +1797,7 @@ enum netdev_stat_type {
 enum netmem_tx_mode {
 	NETMEM_TX_NONE,		/* no netmem TX support */
 	NETMEM_TX_DMA,		/* DMA-capable netmem TX (real HW) */
+	NETMEM_TX_NO_DMA,	/* no DMA, e.g. passthrough for virtual devs */
 };
 
 enum netdev_reg_state {
@@ -2143,7 +2144,7 @@ struct net_device {
 	struct_group(priv_flags_fast,
 		unsigned long		priv_flags:32;
 		unsigned long		lltx:1;
-		unsigned long		netmem_tx:1;
+		unsigned long		netmem_tx:2;
 	);
 	const struct net_device_ops *netdev_ops;
 	const struct header_ops *header_ops;

-- 
2.53.0-Meta


^ permalink raw reply related

* [PATCH net-next v5 1/8] net: convert netmem_tx flag to enum
From: Bobby Eshleman @ 2026-05-14 17:22 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan, Alex Shi,
	Yanteng Si, Dongliang Mu, Michael Chan, Pavan Chebbi,
	Joshua Washington, Harshitha Ramamurthy, Saeed Mahameed,
	Tariq Toukan, Mark Bloch, Leon Romanovsky, Alexander Duyck,
	kernel-team, Daniel Borkmann, Nikolay Aleksandrov, Shuah Khan,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan, Alex Shi,
	Yanteng Si, Dongliang Mu, Michael Chan, Pavan Chebbi,
	Joshua Washington, Harshitha Ramamurthy, Saeed Mahameed,
	Tariq Toukan, Mark Bloch, Leon Romanovsky, Alexander Duyck,
	kernel-team, Daniel Borkmann, Nikolay Aleksandrov, Shuah Khan
  Cc: dw, sdf.kernel, mohsin.bashr, willemb, jiang.kun2, xu.xin16,
	wang.yaxin, netdev, linux-doc, linux-kernel, linux-rdma, bpf,
	linux-kselftest, Stanislav Fomichev, Mina Almasry, netdev,
	linux-doc, linux-kernel, linux-rdma, bpf, linux-kselftest,
	Bobby Eshleman
In-Reply-To: <20260514-tcp-dm-netkit-v5-0-408c59b91e66@meta.com>

From: Bobby Eshleman <bobbyeshleman@meta.com>

Devices that support netmem TX previously set dev->netmem_tx = true.
This was checked in validate_xmit_unreadable_skb() to drop unreadable
skbs (skbs with dmabuf-backed frags) before they reach drivers that
would mishandle them or devices that would not have the iommu mappings
for them.

A subsequent patch will introduce a third state for virtual devices
that forward unreadable skbs without ever performing DMA on them. To
prepare for that, convert the boolean dev->netmem_tx into an enum:

NETMEM_TX_NONE   - no netmem TX support (drop unreadable skbs)
NETMEM_TX_DMA    - full support, device does DMA

Update the existing NIC drivers (bnxt, gve, mlx5, fbnic) and the
validators in net/core to use the new enum. No functional change.

Acked-by: Harshitha Ramamurthy <hramamurthy@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
Changes in v4:
- netdevice.h: netmem_tx enum list -> only "device netmem TX mode" in
  comment (Stan)

Changes in v3:
- Split NO_DMA changes into subsequent commit (Jakub)
- Move !netdev->netmem_tx -> netdev->netmem_tx ==
  NETMEM_TX_NONE conversions to this patch (Jakub)

Changes in v2:
- Squash driver conversion patches (2-5) into patch 1 (Jakub)
---
 Documentation/networking/netmem.rst                    | 5 ++++-
 Documentation/translations/zh_CN/networking/netmem.rst | 4 +++-
 drivers/net/ethernet/broadcom/bnxt/bnxt.c              | 2 +-
 drivers/net/ethernet/google/gve/gve_main.c             | 2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c      | 2 +-
 drivers/net/ethernet/meta/fbnic/fbnic_netdev.c         | 2 +-
 include/linux/netdevice.h                              | 7 ++++++-
 net/core/dev.c                                         | 2 +-
 net/core/netdev-genl.c                                 | 2 +-
 9 files changed, 19 insertions(+), 9 deletions(-)

diff --git a/Documentation/networking/netmem.rst b/Documentation/networking/netmem.rst
index b63aded46337..5ccadba4f373 100644
--- a/Documentation/networking/netmem.rst
+++ b/Documentation/networking/netmem.rst
@@ -95,4 +95,7 @@ Driver TX Requirements
    netdev@, or reach out to the maintainers and/or almasrymina@google.com for
    help adding the netmem API.
 
-2. Driver should declare support by setting `netdev->netmem_tx = true`
+2. Driver should declare support by setting `netdev->netmem_tx` to the
+   appropriate mode:
+
+   - `NETMEM_TX_DMA`: for physical devices that perform DMA.
diff --git a/Documentation/translations/zh_CN/networking/netmem.rst b/Documentation/translations/zh_CN/networking/netmem.rst
index fe351a240f02..9c84423b7528 100644
--- a/Documentation/translations/zh_CN/networking/netmem.rst
+++ b/Documentation/translations/zh_CN/networking/netmem.rst
@@ -89,4 +89,6 @@ dma-mapping API 去处理。
 使用某个还不存在的 netmem API，你可以自行添加并提交到 netdev@，也可以联系维护
 人员或者发送邮件至 almasrymina@google.com 寻求帮助。
 
-2. 驱动程序应通过设置 netdev->netmem_tx = true 来表明自身支持 netmem 功能。
+2. 驱动程序应将 `netdev->netmem_tx` 设置为适当的模式：
+
+   - `NETMEM_TX_DMA`：适用于执行 DMA 的物理设备。
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 945a86696f2f..d4f93e62f583 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -17123,7 +17123,7 @@ static int bnxt_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 	dev->queue_mgmt_ops = &bnxt_queue_mgmt_ops_unsupp;
 	if (BNXT_SUPPORTS_QUEUE_API(bp))
 		dev->queue_mgmt_ops = &bnxt_queue_mgmt_ops;
-	dev->netmem_tx = true;
+	dev->netmem_tx = NETMEM_TX_DMA;
 
 	rc = register_netdev(dev);
 	if (rc)
diff --git a/drivers/net/ethernet/google/gve/gve_main.c b/drivers/net/ethernet/google/gve/gve_main.c
index 00750643e614..e4d78ae52daf 100644
--- a/drivers/net/ethernet/google/gve/gve_main.c
+++ b/drivers/net/ethernet/google/gve/gve_main.c
@@ -2894,7 +2894,7 @@ static int gve_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 		goto abort_with_wq;
 
 	if (!gve_is_gqi(priv) && !gve_is_qpl(priv))
-		dev->netmem_tx = true;
+		dev->netmem_tx = NETMEM_TX_DMA;
 
 	err = register_netdev(dev);
 	if (err)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 85b1ccbd351f..90d2979f1a4f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -5966,7 +5966,7 @@ static void mlx5e_build_nic_netdev(struct net_device *netdev)
 
 	netdev->priv_flags       |= IFF_UNICAST_FLT;
 
-	netdev->netmem_tx = true;
+	netdev->netmem_tx = NETMEM_TX_DMA;
 
 	netif_set_tso_max_size(netdev, GSO_MAX_SIZE);
 	mlx5e_set_xdp_feature(priv);
diff --git a/drivers/net/ethernet/meta/fbnic/fbnic_netdev.c b/drivers/net/ethernet/meta/fbnic/fbnic_netdev.c
index 4dea2bb58d2f..f99ca551c1ce 100644
--- a/drivers/net/ethernet/meta/fbnic/fbnic_netdev.c
+++ b/drivers/net/ethernet/meta/fbnic/fbnic_netdev.c
@@ -752,7 +752,7 @@ struct net_device *fbnic_netdev_alloc(struct fbnic_dev *fbd)
 	netdev->netdev_ops = &fbnic_netdev_ops;
 	netdev->stat_ops = &fbnic_stat_ops;
 	netdev->queue_mgmt_ops = &fbnic_queue_mgmt_ops;
-	netdev->netmem_tx = true;
+	netdev->netmem_tx = NETMEM_TX_DMA;
 
 	fbnic_set_ethtool_ops(netdev);
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e7af71491a47..b7a4503f7cdb 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1794,6 +1794,11 @@ enum netdev_stat_type {
 	NETDEV_PCPU_STAT_DSTATS, /* struct pcpu_dstats */
 };
 
+enum netmem_tx_mode {
+	NETMEM_TX_NONE,		/* no netmem TX support */
+	NETMEM_TX_DMA,		/* DMA-capable netmem TX (real HW) */
+};
+
 enum netdev_reg_state {
 	NETREG_UNINITIALIZED = 0,
 	NETREG_REGISTERED,	/* completed register_netdevice */
@@ -1815,7 +1820,7 @@ enum netdev_reg_state {
  *	@lltx:		device supports lockless Tx. Deprecated for real HW
  *			drivers. Mainly used by logical interfaces, such as
  *			bonding and tunnels
- *	@netmem_tx:	device support netmem_tx.
+ *	@netmem_tx:	device netmem TX mode
  *
  *	@name:	This is the first field of the "visible" part of this structure
  *		(i.e. as seen by users in the "Space.c" file).  It is the name
diff --git a/net/core/dev.c b/net/core/dev.c
index b0691e03dd6b..2da2688fe490 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3996,7 +3996,7 @@ static struct sk_buff *validate_xmit_unreadable_skb(struct sk_buff *skb,
 	if (likely(skb_frags_readable(skb)))
 		goto out;
 
-	if (!dev->netmem_tx)
+	if (dev->netmem_tx == NETMEM_TX_NONE)
 		goto out_free;
 
 	shinfo = skb_shinfo(skb);
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index b8f6076d8007..4d2c49371cdb 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -1164,7 +1164,7 @@ int netdev_nl_bind_tx_doit(struct sk_buff *skb, struct genl_info *info)
 		goto err_unlock_netdev;
 	}
 
-	if (!netdev->netmem_tx) {
+	if (netdev->netmem_tx == NETMEM_TX_NONE) {
 		err = -EOPNOTSUPP;
 		NL_SET_ERR_MSG(info->extack,
 			       "Driver does not support netmem TX");

-- 
2.53.0-Meta


^ permalink raw reply related

* [PATCH net-next v5 0/8] net: devmem: support devmem with netkit devices
From: Bobby Eshleman @ 2026-05-14 17:22 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan, Alex Shi,
	Yanteng Si, Dongliang Mu, Michael Chan, Pavan Chebbi,
	Joshua Washington, Harshitha Ramamurthy, Saeed Mahameed,
	Tariq Toukan, Mark Bloch, Leon Romanovsky, Alexander Duyck,
	kernel-team, Daniel Borkmann, Nikolay Aleksandrov, Shuah Khan,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan, Alex Shi,
	Yanteng Si, Dongliang Mu, Michael Chan, Pavan Chebbi,
	Joshua Washington, Harshitha Ramamurthy, Saeed Mahameed,
	Tariq Toukan, Mark Bloch, Leon Romanovsky, Alexander Duyck,
	kernel-team, Daniel Borkmann, Nikolay Aleksandrov, Shuah Khan
  Cc: dw, sdf.kernel, mohsin.bashr, willemb, jiang.kun2, xu.xin16,
	wang.yaxin, netdev, linux-doc, linux-kernel, linux-rdma, bpf,
	linux-kselftest, Stanislav Fomichev, Mina Almasry, netdev,
	linux-doc, linux-kernel, linux-rdma, bpf, linux-kselftest,
	Bobby Eshleman

This series enables TCP devmem TX through netkit devices.

Netkit now supports queue leasing. A physical NIC's RX queue can be
leased to a netkit guest interface inside a container namespace. This
gives the container a devmem-capable data path on the RX side (bind-rx,
etc...). On the TX side, the container process binds to its netkit guest
interface and sends traffic that netkit redirects (via BPF or ip
forwarding) to the physical NIC for DMA.

Two things in the existing devmem TX path prevent this from working:

1. validate_xmit_unreadable_skb() requires dev->netmem_tx before it will
   forward a dmabuf-backed (unreadable) skb. This protects skbs from
   landing on devices that don't have the IOMMU mappings for the backing
   dmabuf or that don't speak netmem. Netkit, however, does not support
   DMA, doesn't attempt to read unreadable skb pages and so doesn't
   break netmem (it is pure skb routing and redirection). It is
   functionally capable of routing unreadable skbs, but there is no way
   for the TX validation pathway to distinguish between a device that
   will actually attempt DMA-ing the skb and another device
   (like netkit) that does not DMA but also does not break
   netmem.

2. bind_tx_doit uses the bound device as the DMA device.  When the user
   binds devmem TX to the netkit guest, the bind handler attempts to
   create DMA mappings against netkit, which has no DMA capability and
   no IOMMU mappings.

This series solves these problems as follows:

1. Extend netmem_tx to two bits, assigned to one of three values:

   NETMEM_TX_NONE   - netmem not supported
   NETMEM_TX_DMA    - netmem supported and performs DMA
   NETMEM_TX_NO_DMA - netmem supported, but does not DMA

   With these bits, phys devices can set NETMEM_TX_DMA and devices like
   netkit set NETMEM_TX_NO_DMA. The validation TX path ensures that any
   DMA-capable netdev exactly matches the bound device, guaranteeing the
   correct mapping of the bound dmabuf. The validation TX path also
   allows devices with NETMEM_TX_NO_DMA to pass, knowing these devices
   will not misuse netmem or run into IOMMU faults. After redirection or
   routing and the skb finally makes its way through the stack to a
   physical device's TX path, the above NETMEM_TX_DMA check is performed
   again to guarantee the device has the appropriate binding/mappings.

2. On TX bind, the bind handler recognizes NETMEM_TX_NO_DMA devices and
   finds the phys TX device and binds to that instead. For the netkit
   case, if it has been leased a queue from a DMA-capable device
   already, then the bind action is performed on the DMA-capable device
   instead and the dmabuf is mapped correctly.

---
Changes in v5:
- configure_nic(): register cleanup via defer() and drop the separate
  cleanup_nic() helper, this avoids leaking resources init'd during
  setup. (Sashiko)
- Use sys.byteorder when packing phys_ifindex into the BPF .bss map (Sashiko).
- fix unhandle KsftSkip in tests (Sashiko).
- see per-patch changes for more details
- Link to v4: https://lore.kernel.org/r/20260511-tcp-dm-netkit-v4-0-841b78b99d74@meta.com

Changes in v4:
- remove enum list from netmem_tx comment (Stan)
- fold NETMEM_TX_NO_DMA check in validate_xmit_unreadable_skb() into
  skb_frags_readable check (Stan, Jakub)
- change binding->vdev to void ptr cookie with comment (Jakub)
- Fixed the bad change list version number (Stan)
- Link to v3: https://lore.kernel.org/r/20260507-tcp-dm-netkit-v3-0-52821445867c@meta.com

Changes in v3:
- Fix validate_xmit_unreadable_skb() logic for non-devmem
  unreadable niovs (should not be dropped) (Sashiko)
- Simplify lock handling in bind_tx, no premature release (Jakub)
- split NO_DMA changes into separate patch (Jakub)
- fixed some pylint issues, one required an additional patch ("selftests:
  drv-net: make attr _nk_guest_ifname public") to rename a variable from
  private to public
- see per-patch changelist for more detailed changes
- Link to v2: https://lore.kernel.org/r/20260504-tcp-dm-netkit-v2-0-56d52ac72fd4@meta.com

Changes in v2:
- Squash driver conversion patches (2-5) into patch 1 (Jakub)
- In validate_xmit_unreadable_skb() to check netmem_tx mode before inspecting
  frags (Jakub)
- Lock bind_dev around netdev_queue_get_dma_dev() when bind_dev != netdev to
  fix lockdep (Sashiko)
- Move require_devmem() into individual test functions so KsftSkipEx goes up to
  ksft_run() (Sashiko)
- Add nk_devmem.py to TEST_PROGS in Makefile (Sashiko)
- Link to v1:
  https://lore.kernel.org/all/20260428-tcp-dm-netkit-v1-0-719280eba4d2@meta.com/

Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>

---
Bobby Eshleman (8):
      net: convert netmem_tx flag to enum
      net: netkit: declare NETMEM_TX_NO_DMA mode
      net: devmem: support TX over NETMEM_TX_NO_DMA devices
      selftests: drv-net: ncdevmem: add -n flag to skip NIC configuration
      selftests: drv-net: make attr _nk_guest_ifname public
      selftests: drv-net: refactor devmem command builders into lib module
      selftests: drv-net: add primary_rx_redirect support to NetDrvContEnv
      selftests: drv-net: add netkit devmem tests

 .../networking/net_cachelines/net_device.rst       |   2 +-
 Documentation/networking/netmem.rst                |   8 +-
 .../translations/zh_CN/networking/netmem.rst       |   7 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.c          |   2 +-
 drivers/net/ethernet/google/gve/gve_main.c         |   2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   2 +-
 drivers/net/ethernet/meta/fbnic/fbnic_netdev.c     |   2 +-
 drivers/net/netkit.c                               |   1 +
 include/linux/netdevice.h                          |  10 +-
 net/core/dev.c                                     |   5 +-
 net/core/devmem.c                                  |   6 +-
 net/core/devmem.h                                  |  10 +-
 net/core/netdev-genl.c                             |  65 +++++-
 tools/testing/selftests/drivers/net/hw/Makefile    |   2 +
 tools/testing/selftests/drivers/net/hw/devmem.py   |  77 ++-----
 .../testing/selftests/drivers/net/hw/devmem_lib.py | 222 +++++++++++++++++++++
 tools/testing/selftests/drivers/net/hw/ncdevmem.c  |  58 +++---
 .../testing/selftests/drivers/net/hw/nk_devmem.py  |  46 +++++
 .../drivers/net/hw/nk_primary_rx_redirect.bpf.c    |  39 ++++
 .../testing/selftests/drivers/net/hw/nk_qlease.py  |   8 +-
 tools/testing/selftests/drivers/net/lib/py/env.py  | 110 +++++++---
 21 files changed, 545 insertions(+), 139 deletions(-)
---
base-commit: 8ebd24a7822cbae25beeafba49b2159d6a68a5f2
change-id: 20260423-tcp-dm-netkit-2bd78b638d30

Best regards,
-- 
Bobby Eshleman <bobbyeshleman@meta.com>

^ permalink raw reply

* Re: [PATCH v12 10/16] KVM: guest_memfd: Add flag to remove from direct map
From: Ackerley Tng @ 2026-05-14 16:45 UTC (permalink / raw)
  To: Takahiro Itazuri, fvdl, seanjc, ljs
  Cc: Liam.Howlett, agordeev, ajones, akpm, alex, andrii, aou, ast,
	baolu.lu, borntraeger, bp, bpf, catalin.marinas, chenhuacai,
	corbet, coxu, daniel, dave.hansen, david, derekmn, dev.jain,
	eddyz87, gerald.schaefer, gor, haoluo, hca, hpa, itazur, jackabt,
	jackmanb, jannh, jgg, jgross, jhubbard, jiayuan.chen, jmattson,
	joey.gouly, john.fastabend, jolsa, jthoughton, kalyazin, kas,
	kernel, kpsingh, kvm, kvmarm, lenb, linux-arm-kernel, linux-doc,
	linux-fsdevel, linux-kernel, linux-kselftest, linux-mm, linux-pm,
	linux-riscv, linux-s390, loongarch, lorenzo.stoakes, luto,
	maobibo, martin.lau, maz, mhocko, mingo, mlevitsk,
	nikita.kalyazin, oupton, palmer, patrick.roy, pavel, pbonzini,
	peterx, peterz, pfalcato, pjw, prsampat, rafael, riel, rppt,
	ryan.roberts, sdf, shijie, skhan, song, surenb, suzuki.poulose,
	svens, tabba, tglx, thuth, urezki, vannapurve, vbabka, will,
	willy, wu.fei9, x86, yang, yangyicong, yonghong.song, yosry,
	yu-cheng.yu, yuzenghui, zhengqi.arch, zulinx86
In-Reply-To: <20260508081812.12345-1-itazur@amazon.com>

Takahiro Itazuri <itazur@amazon.com> writes:

>
> [...snip...]
>

Brought this topic up on the guest_memfd biweekly today!

>
> Agreed with both of you.  I'll adopt the filemap-level approach:
>
> - Move the zap/restore hooks from guest_memfd into filemap_add_folio()
>   / filemap_remove_folio().
> - Tighten AS_NO_DIRECT_MAP semantics so that, for folios in such a
>   mapping, the direct map is invalid for the entire time the folio
>   resides in the page cache.
> - Drop the per-folio KVM_GMEM_FOLIO_NO_DIRECT_MAP bookkeeping in
>   folio->private, since the existence of the folio in the mapping is
>   itself the state.
>
> On each guest memory population path,
>
> - memcpy-based population from userspace goes through the userspace
>   mapping of guest_memfd, not through the kernel direct map, so the
>   filemap-level invariant doesn't affect it.  But this is slow, which
>   is what motivated the write() syscall support.
>
> - write(): meant to speed up the userspace-memcpy case above by doing
>   the copy in the kernel.  I believe Brendan's __GFP_UNMAPPED/mermap
>   work [1] would give us a low-overhead way to get temporary kernel
>   access to an AS_NO_DIRECT_MAP.  Landing mermap may take a while, but
>   this series does not introduce the write() path, so mermap is not a
>   blocker for now.
>
> - kvm_gmem_populate(): this is a TDX/SNP-only path, and NO_DIRECT_MAP
>   is not available on those VM types —
>   kvm_arch_gmem_supports_no_direct_map() returns false for
>   KVM_X86_TDX_VM and KVM_X86_SNP_VM, which are its only callers
>   today.  So it doesn't interact with the filemap invariant IIUC.
>

I'm a little bit uncomfortable this statement since it seems to say TDX
and SNP aren't taken care of. Would just like to discuss (for
a line of sight to SNP and TDX support):

For non-in-place population where the source physical page is different
from the destination physical page,

+ TDX: the TDX module does the population and works with physical
  addresses, so no issue with populate? Other parts of TDX may have
  trouble though, but that can be handled later.
+ SNP: sev_gmem_post_populate() does a memcpy() after using
  kmap_local_page()

Would mermap be a drop in replacement for kmap_local_page() here? Would
guest_memfd need to force a TLB flush after mermap+memcpy?

> So, unless I'm missing any path, adopting the filemap-level approach in
> this series should be fine.
>
>
> I'd like to consult with you folks on how to proceed in advance.  In a
> separate reply on the cover letter thread [2], Lorenzo and Sean
> suggested that the mm pieces should go through the mm subsystem:
>
> On Tue, Apr 21, 2026 at 04:36:00PM +0000, Sean Christopherson wrote:
>> Yeah, when the time comes, the mm pieces definitely need to go through the mm
>> tree.  Ideally, I think this would be merged in two separate parts, with all mm
>> changes going through the mm tree, and then the KVM changes through the KVM tree
>> using a stable topic branch/tag from Andrew.
>
> I see two reasonable paths to get there, and would appreciate your
> input on which you prefer:
>
> Path A — validate on KVM side first, then split:
>   - Post v13 as a single series on the KVM list, gather feedback and
>     make sure the design is acceptable to KVM reviewers.
>   - Once v13 looks good ("the time comes"), do the MM/KVM split,
>     rebase the MM part onto the appropriate MM branch, and post the
>     MM part to linux-mm to build consensus with MM maintainers.
>
> Path B — split early and seek MM consensus in parallel:
>   - With the filemap rework already in place, do the MM/KVM split
>     now and post the MM part to linux-mm directly.  The KVM part follows
>     on top of a stable topic from MM.
>
> Which of the two would you rather see?  Happy to go either way.
>

Vlastimil pointed out that there's a temporary limitation now that the
mm-tree cannot do stable branches shared between trees now.

I think it depends on how quickly you plan to refresh this series, but
Path A wouldn't be blocked by the temporary limitation.

My opinion would be to go ahead with a new revision (Path A) to fully
address comments before splitting the series. Any Reviewed-bys can be
carried over to the split series anyway :)

Alternatively you could wait till conversion lands :P Either one of us
will need to do more work for conversion wrt direct map removal.

>
> [1] https://lore.kernel.org/all/20260320-page_alloc-unmapped-v2-0-28bf1bd54f41@google.com/
> [2] https://lore.kernel.org/all/20260506080753.14517-1-itazur@amazon.com/
>
> Takahiro

^ permalink raw reply

* Re: [PATCH net-next 1/2] net: ti: icssg: Derive stats array lengths from ARRAY_SIZE
From: Jacob Keller @ 2026-05-14 16:38 UTC (permalink / raw)
  To: MD Danish Anwar, David CARLIER
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Jonathan Corbet, Shuah Khan, Roger Quadros,
	Andrew Lunn, Meghana Malladi, Kevin Hao, Vadim Fedorenko, netdev,
	linux-doc, linux-kernel, linux-arm-kernel, Vignesh Raghavendra
In-Reply-To: <e28d2931-004e-4935-a0b5-d554ece76fb2@ti.com>

On 5/13/2026 9:56 PM, MD Danish Anwar wrote:
> Hi Jacob,
> 
> On 14/05/26 1:30 am, Jacob Keller wrote:
>> The way we solved this in the Intel drivers is to use a single array
>> which contains both the stat name as well as the offset from the
>> structure where the stat resides.
>>
>> The stat string code just iterates over the stat list for the strings,
>> while the stat value code iterates the array and computes the stat
>> address from the offset and size and base structure pointer. Each object
>> that has stats has its own stat array structure.
>>
>> This is probably overkill, but the advantage is that the strings and
>> their values are stored together and adding a new stat is as simple as
>> adding a new entry to that list.
>>
>> I.e.
>>
>> struct ice_stats {
>>         char stat_string[ETH_GSTRING_LEN];
>>         int sizeof_stat;
>>         int stat_offset;
>> };
>>
>> #define ICE_STAT(_type, _name, _stat) { \
>>         .stat_string = _name, \
>>         .sizeof_stat = sizeof_field(_type, _stat), \
>>         .stat_offset = offsetof(_type, _stat) \
>> }
>>
>> #define ICE_VSI_STAT(_name, _stat) \
>>                 ICE_STAT(struct ice_vsi, _name, _stat)
>> #define ICE_PF_STAT(_name, _stat) \
>>                 ICE_STAT(struct ice_pf, _name, _stat)
>>
>>
>> Then the stats for the individial arrays are defined like this:
>>
>> static const struct ice_stats ice_gstrings_vsi_stats[] = {
>>         ICE_VSI_STAT(ICE_RX_UNICAST, eth_stats.rx_unicast),
>>         ICE_VSI_STAT(ICE_TX_UNICAST, eth_stats.tx_unicast),
>>         ICE_VSI_STAT(ICE_RX_MULTICAST, eth_stats.rx_multicast),
>>         ICE_VSI_STAT(ICE_TX_MULTICAST, eth_stats.tx_multicast),
>>         ICE_VSI_STAT(ICE_RX_BROADCAST, eth_stats.rx_broadcast),
>>         ICE_VSI_STAT(ICE_TX_BROADCAST, eth_stats.tx_broadcast),
>> 	...
>> };
>>
>> (Note, ICE_RX_UNICAST is a macro that defines the string value.. I don't
>> recall who changed this to macros or why vs just having the strings be
>> directly in the definition...)
>>
> 
> Thanks for sharing the ice driver pattern — that's a clean design.
> 
>> This is probably a lot bigger refactor to make work, and may not be
>> exactly suitable for your driver. I've considered "upgrading" these data
> 
> Yes, I need to see if refactoring is applicable to ICSSG or not. I will
> look into this and send a separate patch / series in future if
> applicable. For this series I will stick with what David Carlier suggested.
> 
Makes sense. No need to bloat this by more work immediately. I just
wanted to point it out.

I'd like to uplift this infrastructure into core helpers so that other
drivers can more easily benefit without duplicating the scaffolding, but
its always been a low priority task especially since the private driver
stats are somewhat of a legacy/deprecated interface. (Maintainers don't
generally like driver developers making up their own statistics...)

^ permalink raw reply

* [PATCH v2 3/3] drivers: add deprecated remarks to strlcat()
From: Manuel Ebner @ 2026-05-14 16:30 UTC (permalink / raw)
  To: Andy Shevchenko, Kees Cook, Jonathan Corbet, Shuah Khan,
	Andy Whitcroft, Joe Perches, Dwaipayan Ray, Lukas Bulwahn,
	Geert Uytterhoeven, David Laight, Randy Dunlap, Jani Nikula,
	Heiko Carstens, open list:DOCUMENTATION PROCESS,
	open list:DOCUMENTATION, open list
  Cc: Manuel Ebner
In-Reply-To: <20260514160719.105084-3-manuelebner@mailbox.org>

add kernel-doc comment to strlcat() function definitions

Signed-off-by: Manuel Ebner <manuelebner@mailbox.org>
---
 lib/string.c                  | 11 +++++++++++
 tools/include/nolibc/string.h | 11 +++++++++++
 2 files changed, 22 insertions(+)

diff --git a/lib/string.c b/lib/string.c
index b632c71df1a5..0a44ca5ca7e6 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -249,6 +249,17 @@ EXPORT_SYMBOL(strncat);
 #endif
 
 #ifndef __HAVE_ARCH_STRLCAT
+/**
+ * strlcat - Append a string to an existing string
+ *
+ * @dest: pointer to %NUL-terminated string to append to
+ * @src: pointer to %NUL-terminated string to append from
+ * @count: Maximum bytes available in @dest
+ *
+ * Do not use this function. Prefer building the string with
+ * formatting, via scnprintf(), seq_buf, or similar.
+ *
+ */
 size_t strlcat(char *dest, const char *src, size_t count)
 {
 	size_t dsize = strlen(dest);
diff --git a/tools/include/nolibc/string.h b/tools/include/nolibc/string.h
index 4000926f44ac..1a4b51135705 100644
--- a/tools/include/nolibc/string.h
+++ b/tools/include/nolibc/string.h
@@ -208,6 +208,17 @@ char *strndup(const char *str, size_t maxlen)
 }
 
 static __attribute__((unused))
+/**
+ * strlcat - Append a string to an existing string
+ *
+ * @dst: pointer to %NUL-terminated string to append to
+ * @src: pointer to %NUL-terminated string to append from
+ * @size: Maximum bytes available in @dst
+ *
+ * Do not use this function. Prefer building the string with
+ * formatting, via scnprintf(), seq_buf, or similar.
+ *
+ */
 size_t strlcat(char *dst, const char *src, size_t size)
 {
 	size_t len = strnlen(dst, size);
-- 
2.54.0


^ permalink raw reply related

* Re: [PATCH v2 2/3] scripts: checkpatch.pl: add warning for strlcat()
From: Kees Cook @ 2026-05-14 16:32 UTC (permalink / raw)
  To: Manuel Ebner
  Cc: Andy Shevchenko, Jonathan Corbet, Shuah Khan, Andy Whitcroft,
	Joe Perches, Dwaipayan Ray, Lukas Bulwahn, Geert Uytterhoeven,
	David Laight, Randy Dunlap, Jani Nikula, Heiko Carstens,
	open list:DOCUMENTATION PROCESS, open list:DOCUMENTATION,
	open list
In-Reply-To: <20260514162858.107919-2-manuelebner@mailbox.org>

On Thu, May 14, 2026 at 06:28:59PM +0200, Manuel Ebner wrote:
> add a warning for strlcat()
> 
> Signed-off-by: Manuel Ebner <manuelebner@mailbox.org>

Reviewed-by: Kees Cook <kees@kernel.org>

-- 
Kees Cook

^ permalink raw reply

* Re: [PATCH v2 1/3] Doc: deprecated.rst: add strlcat()
From: Kees Cook @ 2026-05-14 16:31 UTC (permalink / raw)
  To: Manuel Ebner
  Cc: Andy Shevchenko, Jonathan Corbet, Shuah Khan, Andy Whitcroft,
	Joe Perches, Dwaipayan Ray, Lukas Bulwahn, Geert Uytterhoeven,
	David Laight, Randy Dunlap, Jani Nikula, Heiko Carstens,
	open list:DOCUMENTATION PROCESS, open list:DOCUMENTATION,
	open list
In-Reply-To: <20260514162652.107714-2-manuelebner@mailbox.org>

On Thu, May 14, 2026 at 06:26:53PM +0200, Manuel Ebner wrote:
> add strlcat and alternatives
> 
> Signed-off-by: Manuel Ebner <manuelebner@mailbox.org>
> ---
>  Documentation/process/deprecated.rst | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/Documentation/process/deprecated.rst b/Documentation/process/deprecated.rst
> index fed56864d036..06e802f4bbfd 100644
> --- a/Documentation/process/deprecated.rst
> +++ b/Documentation/process/deprecated.rst
> @@ -153,6 +153,13 @@ used, and the destinations should be marked with the `__nonstring
>  attribute to avoid future compiler warnings. For cases still needing
>  NUL-padding, strtomem_pad() can be used.
>  
> +strlcat()
> +---------
> +strlcat() must re-scan the destination string from the beginning on each
> +call (O(n^2) behavior). Alternatives are seq_buf_puts() and seq_buf_printf().
> +snprintf(), scnprintf() and sysfs_emit() are possible aswell, but the adoption
> +of the arguments needs to be taken care off.
> +

How about just:

strlcat() must re-scan the destination string from the beginning on each
call (O(n^2) behavior). Use the seq_buf API or similar instead.


>  strlcpy()
>  ---------
>  strlcpy() reads the entire source buffer first (since the return value
> -- 
> 2.54.0
> 

-- 
Kees Cook

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox