Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH bpf-next 00/13] bpf: adding map batch processing support
From: Yonghong Song @ 2019-08-30 22:38 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Alexei Starovoitov, bpf@vger.kernel.org, netdev@vger.kernel.org,
	Brian Vazquez, Daniel Borkmann, Kernel Team, Quentin Monnet
In-Reply-To: <20190830143508.73c30631@cakuba.netronome.com>



On 8/30/19 2:35 PM, Jakub Kicinski wrote:
> On Fri, 30 Aug 2019 07:25:54 +0000, Yonghong Song wrote:
>> On 8/29/19 11:39 AM, Jakub Kicinski wrote:
>>> On Wed, 28 Aug 2019 23:45:02 -0700, Yonghong Song wrote:
>>>> Brian Vazquez has proposed BPF_MAP_DUMP command to look up more than one
>>>> map entries per syscall.
>>>>     https://lore.kernel.org/bpf/CABCgpaU3xxX6CMMxD+1knApivtc2jLBHysDXw-0E9bQEL0qC3A@mail.gmail.com/T/#t
>>>>
>>>> During discussion, we found more use cases can be supported in a similar
>>>> map operation batching framework. For example, batched map lookup and delete,
>>>> which can be really helpful for bcc.
>>>>     https://github.com/iovisor/bcc/blob/master/tools/tcptop.py#L233-L243
>>>>     https://github.com/iovisor/bcc/blob/master/tools/slabratetop.py#L129-L138
>>>>       
>>>> Also, in bcc, we have API to delete all entries in a map.
>>>>     https://github.com/iovisor/bcc/blob/master/src/cc/api/BPFTable.h#L257-L264
>>>>
>>>> For map update, batched operations also useful as sometimes applications need
>>>> to populate initial maps with more than one entry. For example, the below
>>>> example is from kernel/samples/bpf/xdp_redirect_cpu_user.c:
>>>>     https://github.com/torvalds/linux/blob/master/samples/bpf/xdp_redirect_cpu_user.c#L543-L550
>>>>
>>>> This patch addresses all the above use cases. To make uapi stable, it also
>>>> covers other potential use cases. Four bpf syscall subcommands are introduced:
>>>>       BPF_MAP_LOOKUP_BATCH
>>>>       BPF_MAP_LOOKUP_AND_DELETE_BATCH
>>>>       BPF_MAP_UPDATE_BATCH
>>>>       BPF_MAP_DELETE_BATCH
>>>>
>>>> In userspace, application can iterate through the whole map one batch
>>>> as a time, e.g., bpf_map_lookup_batch() in the below:
>>>>       p_key = NULL;
>>>>       p_next_key = &key;
>>>>       while (true) {
>>>>          err = bpf_map_lookup_batch(fd, p_key, &p_next_key, keys, values,
>>>>                                     &batch_size, elem_flags, flags);
>>>>          if (err) ...
>>>>          if (p_next_key) break; // done
>>>>          if (!p_key) p_key = p_next_key;
>>>>       }
>>>> Please look at individual patches for details of new syscall subcommands
>>>> and examples of user codes.
>>>>
>>>> The testing is also done in a qemu VM environment:
>>>>         measure_lookup: max_entries 1000000, batch 10, time 342ms
>>>>         measure_lookup: max_entries 1000000, batch 1000, time 295ms
>>>>         measure_lookup: max_entries 1000000, batch 1000000, time 270ms
>>>>         measure_lookup: max_entries 1000000, no batching, time 1346ms
>>>>         measure_lookup_delete: max_entries 1000000, batch 10, time 433ms
>>>>         measure_lookup_delete: max_entries 1000000, batch 1000, time 363ms
>>>>         measure_lookup_delete: max_entries 1000000, batch 1000000, time 357ms
>>>>         measure_lookup_delete: max_entries 1000000, not batch, time 1894ms
>>>>         measure_delete: max_entries 1000000, batch, time 220ms
>>>>         measure_delete: max_entries 1000000, not batch, time 1289ms
>>>> For a 1M entry hash table, batch size of 10 can reduce cpu time
>>>> by 70%. Please see patch "tools/bpf: measure map batching perf"
>>>> for details of test codes.
>>>
>>> Hi Yonghong!
>>>
>>> great to see this, we have been looking at implementing some way to
>>> speed up map walks as well.
>>>
>>> The direction we were looking in, after previous discussions [1],
>>> however, was to provide a BPF program which can run the logic entirely
>>> within the kernel.
>>>
>>> We have a rough PoC on the FW side (we can offload the program which
>>> walks the map, which is pretty neat), but the kernel verifier side
>>> hasn't really progressed. It will soon.
>>>
>>> The rough idea is that the user space provides two programs, "filter"
>>> and "dumper":
>>>
>>> 	bpftool map exec id XYZ filter pinned /some/prog \
>>> 				dumper pinned /some/other_prog
>>>
>>> Both programs get this context:
>>>
>>> struct map_op_ctx {
>>> 	u64 key;
>>> 	u64 value;
>>> }
>>>
>>> We need a per-map implementation of the exec side, but roughly maps
>>> would do:
>>>
>>> 	LIST_HEAD(deleted);
>>>
>>> 	for entry in map {
>>> 		struct map_op_ctx {
>>> 			.key	= entry->key,
>>> 			.value	= entry->value,
>>> 		};
>>>
>>> 		act = BPF_PROG_RUN(filter, &map_op_ctx);
>>> 		if (act & ~ACT_BITS)
>>> 			return -EINVAL;
>>>
>>> 		if (act & DELETE) {
>>> 			map_unlink(entry);
>>> 			list_add(entry, &deleted);
>>> 		}
>>> 		if (act & STOP)
>>> 			break;
>>> 	}
>>>
>>> 	synchronize_rcu();
>>>
>>> 	for entry in deleted {
>>> 		struct map_op_ctx {
>>> 			.key	= entry->key,
>>> 			.value	= entry->value,
>>> 		};
>>> 		
>>> 		BPF_PROG_RUN(dumper, &map_op_ctx);
>>> 		map_free(entry);
>>> 	}
>>>
>>> The filter program can't perform any map operations other than lookup,
>>> otherwise we won't be able to guarantee that we'll walk the entire map
>>> (if the filter program deletes some entries in a unfortunate order).
>>
>> Looks like you will provide a new program type and per-map
>> implementation of above code. My patch set indeed avoided per-map
>> implementation for all of lookup/delete/get-next-key...
> 
> Indeed, the simple batched ops are undeniably lower LoC.
> 
>>> If user space just wants a pure dump it can simply load a program which
>>> dumps the entries into a perf ring.
>>
>> percpu perf ring is not really ideal for user space which simply just
>> want to get some key/value pairs back. Some kind of generate non-per-cpu
>> ring buffer might be better for such cases.
> 
> I don't think it had to be per-cpu, but I may be blissfully ignorant
> about the perf ring details :) bpf_perf_event_output() takes flags,
> which are effectively selecting the "output CPU", no?

Right, it does not need to be per-cpu. One particular cpu
can be selected. Binding to which cpu might be always
subject to debate like why this cpu, not another cpu.
This works, and I am just thinking a ring buffer without
binding to cpu is better and less confusion to user.
But this may need yet another ring buffer implementation
in the kernel and people might not like it.

^ permalink raw reply

* Re: [PATCH] net: stmmac: Variable "val" in function sun8i_dwmac_set_syscon() could be uninitialized
From: Yizhuo Zhai @ 2019-08-30 22:29 UTC (permalink / raw)
  To: David Miller
  Cc: Chengyu Song, Zhiyun Qian, Giuseppe Cavallaro, Alexandre Torgue,
	Maxime Ripard, Chen-Yu Tsai, netdev, linux-arm-kernel,
	linux-kernel
In-Reply-To: <20190208.230117.1867217574108955553.davem@davemloft.net>

Hi David:

Thanks for your feedback, this patch should work for v4.14.


On Fri, Feb 8, 2019 at 11:01 PM David Miller <davem@davemloft.net> wrote:
>
> From: Yizhuo <yzhai003@ucr.edu>
> Date: Thu,  7 Feb 2019 09:46:23 -0800
>
> > In function sun8i_dwmac_set_syscon(), local variable "val" could
> > be uninitialized if function regmap_read() returns -EINVAL.
> > However, it will be used directly in the if statement, which
> > is potentially unsafe.
> >
> > Signed-off-by: Yizhuo <yzhai003@ucr.edu>
>
> This doesn't apply to any of my trees.



--
Kind Regards,

Yizhuo Zhai

Computer Science, Graduate Student
University of California, Riverside

^ permalink raw reply

* Re: [PATCH 0/4 net-next] flow_offload: update mangle action representation
From: Jakub Kicinski @ 2019-08-30 22:33 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel, davem, netdev, vishal, saeedm, jiri
In-Reply-To: <20190830090710.g7q2chf3qulfs5e4@salvia>

On Fri, 30 Aug 2019 11:07:10 +0200, Pablo Neira Ayuso wrote:
> > > * The front-end coalesces consecutive pedit actions into one single
> > >   word, so drivers can mangle IPv6 and ethernet address fields in one
> > >   single go.  
> > 
> > You still only coalesce up to 16 bytes, no?  
> 
> You only have to rise FLOW_ACTION_MANGLE_MAXLEN coming in this patch
> if you need more. I don't know of any packet field larger than 16
> bytes. If there is a use-case for this, it should be easy to rise that
> definition.

Please see the definitions of:

struct nfp_fl_set_eth
struct nfp_fl_set_ip4_addrs
struct nfp_fl_set_ip4_ttl_tos
struct nfp_fl_set_ipv6_tc_hl_fl
struct nfp_fl_set_ipv6_addr
struct nfp_fl_set_tport

These are the programming primitives for header rewrites in the NFP.
Since each of those contains more than just one field, we'll have to
keep all the field coalescing logic in the driver, even if you coalesce
while fields (i.e. IPv6 addresses).

Perhaps it's not a serious blocker for the series, but it'd be nice if
rewrite action grouping was handled in the core. Since you're already
poking at that code..

^ permalink raw reply

* Re: [PATCH v2 net 1/3] taprio: Fix kernel panic in taprio_destroy
From: Vinicius Costa Gomes @ 2019-08-30 22:30 UTC (permalink / raw)
  To: Vladimir Oltean, xiyou.wangcong, jiri, davem, vedang.patel,
	leandro.maciel.dorileo
  Cc: netdev, Vladimir Oltean
In-Reply-To: <20190830010723.32096-2-olteanv@gmail.com>

Vladimir Oltean <olteanv@gmail.com> writes:

> taprio_init may fail earlier than this line:
>
> 	list_add(&q->taprio_list, &taprio_list);
>
> i.e. due to the net device not being multi queue.
>
> Attempting to remove q from the global taprio_list when it is not part
> of it will result in a kernel panic.
>
> Fix it by matching list_add and list_del better to one another in the
> order of operations. This way we can keep the deletion unconditional
> and with lower complexity - O(1).
>
> Cc: Leandro Dorileo <leandro.maciel.dorileo@intel.com>
> Fixes: 7b9eba7ba0c1 ("net/sched: taprio: fix picos_per_byte miscalculation")
> Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
> ---

Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>

^ permalink raw reply

* Re: [RFC PATCH v2 net-next 00/15] tc-taprio offload for SJA1105 DSA
From: Jakub Kicinski @ 2019-08-30 22:28 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Florian Fainelli, Vivien Didelot, Andrew Lunn, David S. Miller,
	Vinicius Costa Gomes, vedang.patel, Richard Cochran, weifeng.voon,
	jiri, m-karicheri2, Jose.Abreu, Ilias Apalodimas,
	Jamal Hadi Salim, xiyou.wangcong, netdev
In-Reply-To: <CA+h21hr==OStFfgaswzU7HtFg_bHZPoZD5JTQD+-e4jWwZYWHQ@mail.gmail.com>

On Fri, 30 Aug 2019 13:11:11 +0300, Vladimir Oltean wrote:
> On Fri, 30 Aug 2019 at 04:21, Jakub Kicinski wrote:
> > On Fri, 30 Aug 2019 03:46:20 +0300, Vladimir Oltean wrote:  
> > > - Configuring the switch over SPI cannot apparently be done from this
> > >   ndo_setup_tc callback because it runs in atomic context. I also have
> > >   some downstream patches to offload tc clsact matchall with mirred
> > >   action, but in that case it looks like the atomic context restriction
> > >   does not apply.  
> >
> > This sounds really surprising ndo_setup_tc should always be allowed to
> > sleep. Can the taprio limitation be lifted somehow?  
> 
> I need to get more familiar with the taprio internal data structures.
> I think you're suggesting to get those updated to a consistent state
> while under spin_lock_bh(qdisc_lock(sch)), then call ndo_setup_tc from
> outside that critical section?

I'm not 100% sure how taprio handles locking TBH, it just seems naive
that HW callback will not need to sleep, so the kernel should make sure
that callback can sleep. Otherwise we'll end up with 3/4 of drivers
implementing some async work routine...

Sorry, I know that's quite general and not that helpful.

> Also, I just noticed that I introduced a bug in taprio_disable_offload
> with my reference counting addition. The qdisc can't just pass stack
> memory to the driver, now that it's allowing it to keep it. So
> definitely the patch needs more refactoring.

Ah, a slight deja vu, I think someone else has done it in the past :)

^ permalink raw reply

* Re: [PATCH bpf-next 00/13] bpf: adding map batch processing support
From: Yonghong Song @ 2019-08-30 22:24 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Stanislav Fomichev, Brian Vazquez, Alexei Starovoitov,
	bpf@vger.kernel.org, netdev@vger.kernel.org, Daniel Borkmann,
	Kernel Team, Quentin Monnet
In-Reply-To: <20190830141024.743c8d02@cakuba.netronome.com>



On 8/30/19 2:10 PM, Jakub Kicinski wrote:
> On Fri, 30 Aug 2019 20:55:33 +0000, Yonghong Song wrote:
>>> I guess you can hold off this series a bit and discuss it at LPC,
>>> you have a talk dedicated to that :-) (and afaiu, you are all going)
>>
>> Absolutely. We will have a discussion on map batching and I signed
>> on with that :-)
> 
> FWIW unfortunately neither Quentin nor I will be able to make the LPC
> this year :( But the idea had been floated a few times (I'd certainly
> not claim more authorship here than Ed or Alexei), and is simple enough
> to move forward over email (I hope).

Yes, let us continue to discuss over emails. :-)

^ permalink raw reply

* Re: [PATCH v6 net-next 16/19] ionic: Add netdev-event handling
From: Jakub Kicinski @ 2019-08-30 22:24 UTC (permalink / raw)
  To: Shannon Nelson; +Cc: netdev, davem
In-Reply-To: <336eacda-27ae-d513-7888-429e34e29b76@pensando.io>

On Fri, 30 Aug 2019 14:36:23 -0700, Shannon Nelson wrote:
> On 8/29/19 4:37 PM, Jakub Kicinski wrote:
> > On Thu, 29 Aug 2019 11:27:17 -0700, Shannon Nelson wrote:  
> >> When the netdev gets a new name from userland, pass that name
> >> down to the NIC for internal tracking.
> >>
> >> Signed-off-by: Shannon Nelson <snelson@pensando.io>  
> > There is a precedent in ACPI for telling the FW what OS is running but
> > how is the interface name useful for the firmware I can't really tell.  
> It is so we can correlate the host's interface name with the internal 
> port data for internal logging.

Okay.. honestly brings back too many bad memories of arguing with
engineers who spent their lives building middle boxes..

I just hope this won't spread.

^ permalink raw reply

* Re: Proposal: r8152 firmware patching framework
From: Prashant Malani @ 2019-08-30 22:24 UTC (permalink / raw)
  To: Hayes Wang, David Miller, netdev, bambi.yeh, amber.chen, ryankao,
	jackc, albertk, marcochen
  Cc: nic_swsd, Grant Grundler
In-Reply-To: <CACeCKacOcg01NuCWgf2RRer3bdmW-CH7d90Y+iD2wQh5Ka6Mew@mail.gmail.com>

(Adding a few more Realtek folks)

Friendly ping. Any thoughts / feedback, Realtek folks (and others) ?

On Thu, Aug 29, 2019 at 11:40 AM Prashant Malani <pmalani@chromium.org> wrote:
>
> Hi,
>
> The r8152 driver source code distributed by Realtek (on
> www.realtek.com) contains firmware patches. This involves binary
> byte-arrays being written byte/word-wise to the hardware memory
> Example: grundler@chromium.org (cc-ed) has an experimental patch which
> includes the firmware patching code which was distributed with the
> Realtek source :
> https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/1417953
>
> It would be nice to have a way to incorporate these firmware fixes
> into the upstream code. Since having indecipherable byte-arrays is not
> possible upstream, I propose the following:
> - We use the assistance of Realtek to come up with a format which the
> firmware patch files can follow (this can be documented in the
> comments).
>        - A real simple format could look like this:
>                +
> <section1><size_in_bytes><address1><data1><address2><data2>...<addressN><dataN><section2>...
>                 + The driver would be able to understand how to parse
> each section (e.g is each data entry a byte or a word?)
>
> - We use request_firmware() to load the firmware, parse it and write
> the data to the relevant registers.
>
> I'm unfamiliar with what the preferred method of firmware patching is,
> so I hope the maintainers can help suggest the best path forward.
>
> As an aside: It would be great if Realtek could publish a list of
> fixes that the firmware patches implement (I think a list on the
> driver download page on the Realtek website would be an excellent
> starting point).
>
> Thanks and Best regards,
>
> -Prashant

^ permalink raw reply

* Re: [PATCH v6 net-next 15/19] ionic: Add Tx and Rx handling
From: Jakub Kicinski @ 2019-08-30 22:21 UTC (permalink / raw)
  To: Shannon Nelson; +Cc: netdev, davem
In-Reply-To: <31ea9c22-15e0-86db-a92d-76cee56fc738@pensando.io>

On Fri, 30 Aug 2019 14:44:24 -0700, Shannon Nelson wrote:
> On 8/29/19 4:33 PM, Jakub Kicinski wrote:
> > On Thu, 29 Aug 2019 11:27:16 -0700, Shannon Nelson wrote:  
>  [...]  
> > There's definitely a function for helping drivers which can't do full
> > TSO slice up the packet, but I can't find it now 😫😫
> >
> > Eric would definitely know.
> >
> > Did you have a look? Would it be useful here?  
> 
> Yes, obviously this could use some work for clarity and supportability, 
> and I think for performance as well.  But since it works, I've been 
> concentrating on getting other parts of the driver working before coming 
> back to this.  If there are some tools that can help clean this up, I 
> would be interested to see them.

Ha! I think I found it, can you take a look at net/core/tso.c and see
if any of those help?

I'm not opposed to merging as is, but at the same time I'm not
volunteering to review that function..

^ permalink raw reply

* Re: [PATCH v6 net-next 07/19] ionic: Add basic adminq support
From: David Miller @ 2019-08-30 22:17 UTC (permalink / raw)
  To: jakub.kicinski; +Cc: snelson, netdev
In-Reply-To: <20190830151604.1a7dd276@cakuba.netronome.com>

From: Jakub Kicinski <jakub.kicinski@netronome.com>
Date: Fri, 30 Aug 2019 15:16:04 -0700

> On Fri, 30 Aug 2019 12:31:07 -0700, Shannon Nelson wrote:
>> On 8/29/19 3:52 PM, Jakub Kicinski wrote:
>> > On Thu, 29 Aug 2019 11:27:08 -0700, Shannon Nelson wrote:  
>> >> +static void ionic_lif_qcq_deinit(struct ionic_lif *lif, struct ionic_qcq *qcq)
>> >> +{
>> >> +	struct ionic_dev *idev = &lif->ionic->idev;
>> >> +	struct device *dev = lif->ionic->dev;
>> >> +
>> >> +	if (!qcq)
>> >> +		return;
>> >> +
>> >> +	ionic_debugfs_del_qcq(qcq);
>> >> +
>> >> +	if (!(qcq->flags & IONIC_QCQ_F_INITED))
>> >> +		return;
>> >> +
>> >> +	if (qcq->flags & IONIC_QCQ_F_INTR) {
>> >> +		ionic_intr_mask(idev->intr_ctrl, qcq->intr.index,
>> >> +				IONIC_INTR_MASK_SET);
>> >> +		synchronize_irq(qcq->intr.vector);
>> >> +		devm_free_irq(dev, qcq->intr.vector, &qcq->napi);  
>> > Doesn't free_irq() basically imply synchronize_irq()?  
>> 
>> The synchronize_irq() waits for any threaded handlers to finish, while 
>> free_irq() only waits for HW handling.  This helps makes sure we don't 
>> have anything still running before we remove resources.
> 
> mm.. I'm no IRQ expert but it strikes me as surprising as that'd mean
> every single driver would always have to run synchronize_irq() on
> module exit, no?
> 
> I see there is a kthread_stop() in __free_irq(), you sure it doesn't
> wait for threaded IRQs?

I'm pretty sure it does.

^ permalink raw reply

* Re: [PATCH v6 net-next 14/19] ionic: Add initial ethtool support
From: Jakub Kicinski @ 2019-08-30 22:16 UTC (permalink / raw)
  To: Shannon Nelson; +Cc: netdev, davem
In-Reply-To: <4c140c92-38b7-7c81-2a82-d23df8d16252@pensando.io>

On Fri, 30 Aug 2019 14:25:12 -0700, Shannon Nelson wrote:
> On 8/29/19 4:10 PM, Jakub Kicinski wrote:
> > On Thu, 29 Aug 2019 11:27:15 -0700, Shannon Nelson wrote:  
> >> +static int ionic_get_module_eeprom(struct net_device *netdev,
> >> +				   struct ethtool_eeprom *ee,
> >> +				   u8 *data)
> >> +{
> >> +	struct ionic_lif *lif = netdev_priv(netdev);
> >> +	struct ionic_dev *idev = &lif->ionic->idev;
> >> +	struct ionic_xcvr_status *xcvr;
> >> +	char tbuf[sizeof(xcvr->sprom)];
> >> +	int count = 10;
> >> +	u32 len;
> >> +
> >> +	/* The NIC keeps the module prom up-to-date in the DMA space
> >> +	 * so we can simply copy the module bytes into the data buffer.
> >> +	 */
> >> +	xcvr = &idev->port_info->status.xcvr;
> >> +	len = min_t(u32, sizeof(xcvr->sprom), ee->len);
> >> +
> >> +	do {
> >> +		memcpy(data, xcvr->sprom, len);
> >> +		memcpy(tbuf, xcvr->sprom, len);
> >> +
> >> +		/* Let's make sure we got a consistent copy */
> >> +		if (!memcmp(data, tbuf, len))
> >> +			break;
> >> +
> >> +	} while (--count);  
> > Should this return an error if the image was never consistent?  
> 
> Sure, how about -EBUSY?

Or EAGAIN ? Not sure

^ permalink raw reply

* Re: [PATCH v6 net-next 07/19] ionic: Add basic adminq support
From: Jakub Kicinski @ 2019-08-30 22:16 UTC (permalink / raw)
  To: Shannon Nelson; +Cc: netdev, davem
In-Reply-To: <bad39320-8e67-e280-5e35-612cbdc49b6f@pensando.io>

On Fri, 30 Aug 2019 12:31:07 -0700, Shannon Nelson wrote:
> On 8/29/19 3:52 PM, Jakub Kicinski wrote:
> > On Thu, 29 Aug 2019 11:27:08 -0700, Shannon Nelson wrote:  
> >> +static void ionic_lif_qcq_deinit(struct ionic_lif *lif, struct ionic_qcq *qcq)
> >> +{
> >> +	struct ionic_dev *idev = &lif->ionic->idev;
> >> +	struct device *dev = lif->ionic->dev;
> >> +
> >> +	if (!qcq)
> >> +		return;
> >> +
> >> +	ionic_debugfs_del_qcq(qcq);
> >> +
> >> +	if (!(qcq->flags & IONIC_QCQ_F_INITED))
> >> +		return;
> >> +
> >> +	if (qcq->flags & IONIC_QCQ_F_INTR) {
> >> +		ionic_intr_mask(idev->intr_ctrl, qcq->intr.index,
> >> +				IONIC_INTR_MASK_SET);
> >> +		synchronize_irq(qcq->intr.vector);
> >> +		devm_free_irq(dev, qcq->intr.vector, &qcq->napi);  
> > Doesn't free_irq() basically imply synchronize_irq()?  
> 
> The synchronize_irq() waits for any threaded handlers to finish, while 
> free_irq() only waits for HW handling.  This helps makes sure we don't 
> have anything still running before we remove resources.

mm.. I'm no IRQ expert but it strikes me as surprising as that'd mean
every single driver would always have to run synchronize_irq() on
module exit, no?

I see there is a kthread_stop() in __free_irq(), you sure it doesn't
wait for threaded IRQs?

^ permalink raw reply

* Re: [PATCH net-next v2 0/2] Fixes for unlocked cls hardware offload API refactoring
From: David Miller @ 2019-08-30 22:12 UTC (permalink / raw)
  To: vladbu; +Cc: netdev, jhs, xiyou.wangcong, jiri, saeedm, idosch,
	sergei.shtylyov
In-Reply-To: <20190829161517.20935-1-vladbu@mellanox.com>

From: Vlad Buslov <vladbu@mellanox.com>
Date: Thu, 29 Aug 2019 19:15:15 +0300

> Two fixes for my "Refactor cls hardware offload API to support
> rtnl-independent drivers" series.

Series applied, thanks Vlad.

^ permalink raw reply

* Re: [PATCH net-next] tcp_bbr: clarify that bbr_bdp() rounds up in comments
From: David Miller @ 2019-08-30 22:10 UTC (permalink / raw)
  To: luke.w.hsiao; +Cc: netdev, lukehsiao, soheil, ncardwell, priyarjha
In-Reply-To: <20190829140244.195954-1-luke.w.hsiao@gmail.com>

From: Luke Hsiao <luke.w.hsiao@gmail.com>
Date: Thu, 29 Aug 2019 10:02:44 -0400

> From: Luke Hsiao <lukehsiao@google.com>
> 
> This explicitly clarifies that bbr_bdp() returns the rounded-up value of
> the bandwidth-delay product and why in the comments.
> 
> Signed-off-by: Luke Hsiao <lukehsiao@google.com>
> Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
> Acked-by: Neal Cardwell <ncardwell@google.com>
> Acked-by: Priyaranjan Jha <priyarjha@google.com>

Applied.

^ permalink raw reply

* Re: [patch net-next] sched: act_vlan: implement stats_update callback
From: David Miller @ 2019-08-30 22:10 UTC (permalink / raw)
  To: jiri; +Cc: netdev, mlxsw, pengfeil, jhs, xiyou.wangcong
In-Reply-To: <20190829133842.13958-1-jiri@resnulli.us>

From: Jiri Pirko <jiri@resnulli.us>
Date: Thu, 29 Aug 2019 15:38:42 +0200

> From: Jiri Pirko <jiri@mellanox.com>
> 
> Implement this callback in order to get the offloaded stats added to the
> kernel stats.
> 
> Reported-by: Pengfei Liu <pengfeil@mellanox.com>
> Signed-off-by: Jiri Pirko <jiri@mellanox.com>

Applied.

^ permalink raw reply

* Re: [PATCH net] rxrpc: Fix lack of conn cleanup when local endpoint is cleaned up [ver #2]
From: David Miller @ 2019-08-30 22:07 UTC (permalink / raw)
  To: dhowells; +Cc: netdev, marc.dionne, linux-afs, linux-kernel
In-Reply-To: <156708433176.26714.15112030261308314860.stgit@warthog.procyon.org.uk>

From: David Howells <dhowells@redhat.com>
Date: Thu, 29 Aug 2019 14:12:11 +0100

> When a local endpoint is ceases to be in use, such as when the kafs module
> is unloaded, the kernel will emit an assertion failure if there are any
> outstanding client connections:
> 
> 	rxrpc: Assertion failed
> 	------------[ cut here ]------------
> 	kernel BUG at net/rxrpc/local_object.c:433!
> 
> and even beyond that, will evince other oopses if there are service
> connections still present.
> 
> Fix this by:
 ...
> Only after destroying the connections can we close the socket lest we get
> an oops in a workqueue that's looking at a connection or a peer.
> 
> Fixes: 3d18cbb7fd0c ("rxrpc: Fix conn expiry timers")
> Signed-off-by: David Howells <dhowells@redhat.com>
> Tested-by: Marc Dionne <marc.dionne@auristor.com>

Applied.

^ permalink raw reply

* Re: [PATCH net 0/7] rxrpc: Fix use of skb_cow_data()
From: David Miller @ 2019-08-30 21:55 UTC (permalink / raw)
  To: dhowells; +Cc: netdev, linux-afs, linux-kernel
In-Reply-To: <156708405310.26102.7954021163316252673.stgit@warthog.procyon.org.uk>

From: David Howells <dhowells@redhat.com>
Date: Thu, 29 Aug 2019 14:07:33 +0100

> Here's a series of patches that replaces the use of skb_cow_data() in rxrpc
> with skb_unshare() early on in the input process.  The problem that is
> being seen is that skb_cow_data() indirectly requires that the maximum
> usage count on an sk_buff be 1, and it may generate an assertion failure in
> pskb_expand_head() if not.
> 
> This can occur because rxrpc_input_data() may be still holding a ref when
> it has just attached the sk_buff to the rx ring and given that attachment
> its own ref.  If recvmsg happens fast enough, skb_cow_data() can see the
> ref still held by the softirq handler.
> 
> Further, a packet may contain multiple subpackets, each of which gets its
> own attachment to the ring and its own ref - also making skb_cow_data() go
> bang.
> 
> Fix this by:
 ...
> There are also patches to improve the rxrpc_skb tracepoint to make sure
> that Tx-derived buffers are identified separately from Rx-derived buffers
> in the trace.

This looks great, thanks for reimplementing this using skb_unshare().

> The patches are tagged here:
> 
> 	git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
> 	rxrpc-fixes-20190827

Pulled, thanks again.

^ permalink raw reply

* Re: [PATCH 1/2] rtnetlink: gate MAC address with an LSM hook
From: Paul Moore @ 2019-08-30 21:46 UTC (permalink / raw)
  To: Michal Kubecek
  Cc: netdev, Jeffrey Vander Stoep, David Miller, LSM List, selinux
In-Reply-To: <20190829074516.GM29594@unicorn.suse.cz>

On Thu, Aug 29, 2019 at 3:45 AM Michal Kubecek <mkubecek@suse.cz> wrote:
> On Tue, Aug 27, 2019 at 04:47:04PM -0400, Paul Moore wrote:
> >
> > I'm also not a big fan of inserting the hook in rtnl_fill_ifinfo(); as
> > presented it is way too specific for a LSM hook for me to be happy.
> > However, I do agree that giving the LSMs some control over netlink
> > messages makes sense.  As others have pointed out, it's all a matter
> > of where to place the hook.
> >
> > If we only care about netlink messages which leverage nlattrs I
> > suppose one option that I haven't seen mentioned would be to place a
> > hook in nla_put().  While it is a bit of an odd place for a hook, it
> > would allow the LSM easy access to the skb and attribute type to make
> > decisions, and all of the callers should already be checking the
> > return code (although we would need to verify this).  One notable
> > drawback (not the only one) is that the hook is going to get hit
> > multiple times for each message.
>
> For most messages, "multiple times" would mean tens, for many even
> hundreds of calls. For each, you would have to check corresponding
> socket (and possibly also genetlink header) to see which netlink based
> protocol it is and often even parse existing part of the message to get
> the context (because the same numeric attribute type can mean something
> completely different if it appears in a nested attribute).
>
> Also, nla_put() (or rather __nla_put()) is not used for all attributes,
> one may also use nla_reserve() and then compose the attribute date in
> place.

I never said it was a great idea, just an idea ;)

Honestly I'm just trying to spur some discussion on this so we can
hopefully arrive at a solution which allows a LSM to control kernel
generated netlink messages that we can all accept.

-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* RE: [PATCH] ncsi-netlink: support sending NC-SI commands over Netlink interface
From: Ben Wei @ 2019-08-30 21:46 UTC (permalink / raw)
  To: Terry Duncan, sam@mendozajonas.com, davem@davemloft.net,
	netdev@vger.kernel.org, openbmc@lists.ozlabs.org,
	Justin.Lee1@Dell.com
In-Reply-To: <0da11d73-b3ab-53f6-f695-30857a743a7b@linux.intel.com>

> On 8/22/19 5:02 PM, Ben Wei wrote:
> > This patch extends ncsi-netlink command line utility to send NC-SI command to kernel driver
> > via NCSI_CMD_SEND_CMD command.
> > 
> > New command line option -o (opcode) is used to specify NC-SI command and optional payload.
> > 
>
> Thank you for posting this Ben.
> Something looks off on this next line but it looks fine in your pull 
> request in the github.com/sammj/ncsi-netlink repo.
>
> > +static int send_cb(struct nl_msg *msg, void *arg) { #define
> > +ETHERNET_HEADER_SIZE 16
> > +

Yes I think my email client was not configured correctly for plain text so it's removing certain line breaks.
Hopefully I have this figured out now so my future patches won't have this issue.

>
> Do you have plans to upstream your yocto recipe for this repo?

Yes I sure can upstream the recipe file. I had to make local changes to build ncsi-netlink for my BMC platform.
Is there a group I may submit my recipe to? 

Thanks,
-Ben

^ permalink raw reply

* Re: [PATCH v6 net-next 15/19] ionic: Add Tx and Rx handling
From: Shannon Nelson @ 2019-08-30 21:44 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: netdev, davem
In-Reply-To: <20190829163319.0f4e8707@cakuba.netronome.com>

On 8/29/19 4:33 PM, Jakub Kicinski wrote:
> On Thu, 29 Aug 2019 11:27:16 -0700, Shannon Nelson wrote:
>> +static int ionic_tx_tso(struct ionic_queue *q, struct sk_buff *skb)
>> +{
>> +	struct ionic_tx_stats *stats = q_to_tx_stats(q);
>> +	struct ionic_desc_info *abort = q->head;
>> +	struct device *dev = q->lif->ionic->dev;
>> +	struct ionic_desc_info *rewind = abort;
>> +	struct ionic_txq_sg_elem *elem;
>> +	struct ionic_txq_desc *desc;
>> +	unsigned int frag_left = 0;
>> +	unsigned int offset = 0;
>> +	unsigned int len_left;
>> +	dma_addr_t desc_addr;
>> +	unsigned int hdrlen;
>> +	unsigned int nfrags;
>> +	unsigned int seglen;
>> +	u64 total_bytes = 0;
>> +	u64 total_pkts = 0;
>> +	unsigned int left;
>> +	unsigned int len;
>> +	unsigned int mss;
>> +	skb_frag_t *frag;
>> +	bool start, done;
>> +	bool outer_csum;
>> +	bool has_vlan;
>> +	u16 desc_len;
>> +	u8 desc_nsge;
>> +	u16 vlan_tci;
>> +	bool encap;
>> +	int err;
>> +
>> +	mss = skb_shinfo(skb)->gso_size;
>> +	nfrags = skb_shinfo(skb)->nr_frags;
>> +	len_left = skb->len - skb_headlen(skb);
>> +	outer_csum = (skb_shinfo(skb)->gso_type & SKB_GSO_GRE_CSUM) ||
>> +		     (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_TUNNEL_CSUM);
>> +	has_vlan = !!skb_vlan_tag_present(skb);
>> +	vlan_tci = skb_vlan_tag_get(skb);
>> +	encap = skb->encapsulation;
>> +
>> +	/* Preload inner-most TCP csum field with IP pseudo hdr
>> +	 * calculated with IP length set to zero.  HW will later
>> +	 * add in length to each TCP segment resulting from the TSO.
>> +	 */
>> +
>> +	if (encap)
>> +		err = ionic_tx_tcp_inner_pseudo_csum(skb);
>> +	else
>> +		err = ionic_tx_tcp_pseudo_csum(skb);
>> +	if (err)
>> +		return err;
>> +
>> +	if (encap)
>> +		hdrlen = skb_inner_transport_header(skb) - skb->data +
>> +			 inner_tcp_hdrlen(skb);
>> +	else
>> +		hdrlen = skb_transport_offset(skb) + tcp_hdrlen(skb);
>> +
>> +	seglen = hdrlen + mss;
>> +	left = skb_headlen(skb);
>> +
>> +	desc = ionic_tx_tso_next(q, &elem);
>> +	start = true;
>> +
>> +	/* Chop skb->data up into desc segments */
>> +
>> +	while (left > 0) {
>> +		len = min(seglen, left);
>> +		frag_left = seglen - len;
>> +		desc_addr = ionic_tx_map_single(q, skb->data + offset, len);
>> +		if (dma_mapping_error(dev, desc_addr))
>> +			goto err_out_abort;
>> +		desc_len = len;
>> +		desc_nsge = 0;
>> +		left -= len;
>> +		offset += len;
>> +		if (nfrags > 0 && frag_left > 0)
>> +			continue;
>> +		done = (nfrags == 0 && left == 0);
>> +		ionic_tx_tso_post(q, desc, skb,
>> +				  desc_addr, desc_nsge, desc_len,
>> +				  hdrlen, mss,
>> +				  outer_csum,
>> +				  vlan_tci, has_vlan,
>> +				  start, done);
>> +		total_pkts++;
>> +		total_bytes += start ? len : len + hdrlen;
>> +		desc = ionic_tx_tso_next(q, &elem);
>> +		start = false;
>> +		seglen = mss;
>> +	}
>> +
>> +	/* Chop skb frags into desc segments */
>> +
>> +	for (frag = skb_shinfo(skb)->frags; len_left; frag++) {
>> +		offset = 0;
>> +		left = skb_frag_size(frag);
>> +		len_left -= left;
>> +		nfrags--;
>> +		stats->frags++;
>> +
>> +		while (left > 0) {
>> +			if (frag_left > 0) {
>> +				len = min(frag_left, left);
>> +				frag_left -= len;
>> +				elem->addr =
>> +				    cpu_to_le64(ionic_tx_map_frag(q, frag,
>> +								  offset, len));
>> +				if (dma_mapping_error(dev, elem->addr))
>> +					goto err_out_abort;
>> +				elem->len = cpu_to_le16(len);
>> +				elem++;
>> +				desc_nsge++;
>> +				left -= len;
>> +				offset += len;
>> +				if (nfrags > 0 && frag_left > 0)
>> +					continue;
>> +				done = (nfrags == 0 && left == 0);
>> +				ionic_tx_tso_post(q, desc, skb, desc_addr,
>> +						  desc_nsge, desc_len,
>> +						  hdrlen, mss, outer_csum,
>> +						  vlan_tci, has_vlan,
>> +						  start, done);
>> +				total_pkts++;
>> +				total_bytes += start ? len : len + hdrlen;
>> +				desc = ionic_tx_tso_next(q, &elem);
>> +				start = false;
>> +			} else {
>> +				len = min(mss, left);
>> +				frag_left = mss - len;
>> +				desc_addr = ionic_tx_map_frag(q, frag,
>> +							      offset, len);
>> +				if (dma_mapping_error(dev, desc_addr))
>> +					goto err_out_abort;
>> +				desc_len = len;
>> +				desc_nsge = 0;
>> +				left -= len;
>> +				offset += len;
>> +				if (nfrags > 0 && frag_left > 0)
>> +					continue;
>> +				done = (nfrags == 0 && left == 0);
>> +				ionic_tx_tso_post(q, desc, skb, desc_addr,
>> +						  desc_nsge, desc_len,
>> +						  hdrlen, mss, outer_csum,
>> +						  vlan_tci, has_vlan,
>> +						  start, done);
>> +				total_pkts++;
>> +				total_bytes += start ? len : len + hdrlen;
>> +				desc = ionic_tx_tso_next(q, &elem);
>> +				start = false;
>> +			}
>> +		}
>> +	}
>> +
>> +	stats->pkts += total_pkts;
>> +	stats->bytes += total_bytes;
>> +	stats->tso++;
>> +
>> +	return 0;
>> +
>> +err_out_abort:
>> +	while (rewind->desc != q->head->desc) {
>> +		ionic_tx_clean(q, rewind, NULL, NULL);
>> +		rewind = rewind->next;
>> +	}
>> +	q->head = abort;
>> +
>> +	return -ENOMEM;
>> +}
> There's definitely a function for helping drivers which can't do full
> TSO slice up the packet, but I can't find it now 😫😫
>
> Eric would definitely know.
>
> Did you have a look? Would it be useful here?

Yes, obviously this could use some work for clarity and supportability, 
and I think for performance as well.  But since it works, I've been 
concentrating on getting other parts of the driver working before coming 
back to this.  If there are some tools that can help clean this up, I 
would be interested to see them.

sln




^ permalink raw reply

* Re: [PATCH -next] net: mlx5: Kconfig: Fix MLX5_CORE_EN dependencies
From: Saeed Mahameed @ 2019-08-30 21:43 UTC (permalink / raw)
  To: davem@davemloft.net, maowenan@huawei.com, leon@kernel.org
  Cc: kernel-janitors@vger.kernel.org, netdev@vger.kernel.org,
	linux-rdma@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20190827031251.98881-1-maowenan@huawei.com>

On Tue, 2019-08-27 at 11:12 +0800, Mao Wenan wrote:
> When MLX5_CORE_EN=y and PCI_HYPERV_INTERFACE is not set, below errors

The issue happens when PCI_HYPERV_INTERFACE is a module and mlx5_core
is built-in.

> are found:
> drivers/net/ethernet/mellanox/mlx5/core/en_main.o: In function
> `mlx5e_nic_enable':
> en_main.c:(.text+0xb649): undefined reference to
> `mlx5e_hv_vhca_stats_create'
> drivers/net/ethernet/mellanox/mlx5/core/en_main.o: In function
> `mlx5e_nic_disable':
> en_main.c:(.text+0xb8c4): undefined reference to
> `mlx5e_hv_vhca_stats_destroy'
> 
> This because CONFIG_PCI_HYPERV_INTERFACE is newly introduced by
> 'commit 348dd93e40c1
> ("PCI: hv: Add a Hyper-V PCI interface driver for software
> backchannel interface"),
> Fix this by making MLX5_CORE_EN imply PCI_HYPERV_INTERFACE.
> 

the imply should be in MLX5_CORE not MLX5_CORE_EN since the
implementation also involves MLX5_CORE. 

I will prepare a patch with these fixups.

Thanks,
Saeed.


^ permalink raw reply

* Re: [PATCH v6 net-next 16/19] ionic: Add netdev-event handling
From: Shannon Nelson @ 2019-08-30 21:36 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: netdev, davem
In-Reply-To: <20190829163738.64e7fe42@cakuba.netronome.com>

On 8/29/19 4:37 PM, Jakub Kicinski wrote:
> On Thu, 29 Aug 2019 11:27:17 -0700, Shannon Nelson wrote:
>> When the netdev gets a new name from userland, pass that name
>> down to the NIC for internal tracking.
>>
>> Signed-off-by: Shannon Nelson <snelson@pensando.io>
> There is a precedent in ACPI for telling the FW what OS is running but
> how is the interface name useful for the firmware I can't really tell.
It is so we can correlate the host's interface name with the internal 
port data for internal logging.

sln


^ permalink raw reply

* Re: linux-next: build failure after merge of the net-next tree
From: David Miller @ 2019-08-30 21:35 UTC (permalink / raw)
  To: sfr; +Cc: netdev, linux-next, linux-kernel, weifeng.voon, boon.leong.ong
In-Reply-To: <20190829200546.7b9af296@canb.auug.org.au>

From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Thu, 29 Aug 2019 20:05:46 +1000

> From: Stephen Rothwell <sfr@canb.auug.org.au>
> Date: Thu, 29 Aug 2019 19:49:27 +1000
> Subject: [PATCH] net: stmmac: depend on COMMON_CLK
> 
> Fixes: 190f73ab4c43 ("net: stmmac: setup higher frequency clk support for EHL & TGL")
> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>

Applied.

^ permalink raw reply

* Re: [PATCH bpf-next 00/13] bpf: adding map batch processing support
From: Jakub Kicinski @ 2019-08-30 21:35 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Alexei Starovoitov, bpf@vger.kernel.org, netdev@vger.kernel.org,
	Brian Vazquez, Daniel Borkmann, Kernel Team, Quentin Monnet
In-Reply-To: <a3422ffd-e9f2-af77-a92d-81393a9f4fc7@fb.com>

On Fri, 30 Aug 2019 07:25:54 +0000, Yonghong Song wrote:
> On 8/29/19 11:39 AM, Jakub Kicinski wrote:
> > On Wed, 28 Aug 2019 23:45:02 -0700, Yonghong Song wrote:  
> >> Brian Vazquez has proposed BPF_MAP_DUMP command to look up more than one
> >> map entries per syscall.
> >>    https://lore.kernel.org/bpf/CABCgpaU3xxX6CMMxD+1knApivtc2jLBHysDXw-0E9bQEL0qC3A@mail.gmail.com/T/#t
> >>
> >> During discussion, we found more use cases can be supported in a similar
> >> map operation batching framework. For example, batched map lookup and delete,
> >> which can be really helpful for bcc.
> >>    https://github.com/iovisor/bcc/blob/master/tools/tcptop.py#L233-L243
> >>    https://github.com/iovisor/bcc/blob/master/tools/slabratetop.py#L129-L138
> >>      
> >> Also, in bcc, we have API to delete all entries in a map.
> >>    https://github.com/iovisor/bcc/blob/master/src/cc/api/BPFTable.h#L257-L264
> >>
> >> For map update, batched operations also useful as sometimes applications need
> >> to populate initial maps with more than one entry. For example, the below
> >> example is from kernel/samples/bpf/xdp_redirect_cpu_user.c:
> >>    https://github.com/torvalds/linux/blob/master/samples/bpf/xdp_redirect_cpu_user.c#L543-L550
> >>
> >> This patch addresses all the above use cases. To make uapi stable, it also
> >> covers other potential use cases. Four bpf syscall subcommands are introduced:
> >>      BPF_MAP_LOOKUP_BATCH
> >>      BPF_MAP_LOOKUP_AND_DELETE_BATCH
> >>      BPF_MAP_UPDATE_BATCH
> >>      BPF_MAP_DELETE_BATCH
> >>
> >> In userspace, application can iterate through the whole map one batch
> >> as a time, e.g., bpf_map_lookup_batch() in the below:
> >>      p_key = NULL;
> >>      p_next_key = &key;
> >>      while (true) {
> >>         err = bpf_map_lookup_batch(fd, p_key, &p_next_key, keys, values,
> >>                                    &batch_size, elem_flags, flags);
> >>         if (err) ...
> >>         if (p_next_key) break; // done
> >>         if (!p_key) p_key = p_next_key;
> >>      }
> >> Please look at individual patches for details of new syscall subcommands
> >> and examples of user codes.
> >>
> >> The testing is also done in a qemu VM environment:
> >>        measure_lookup: max_entries 1000000, batch 10, time 342ms
> >>        measure_lookup: max_entries 1000000, batch 1000, time 295ms
> >>        measure_lookup: max_entries 1000000, batch 1000000, time 270ms
> >>        measure_lookup: max_entries 1000000, no batching, time 1346ms
> >>        measure_lookup_delete: max_entries 1000000, batch 10, time 433ms
> >>        measure_lookup_delete: max_entries 1000000, batch 1000, time 363ms
> >>        measure_lookup_delete: max_entries 1000000, batch 1000000, time 357ms
> >>        measure_lookup_delete: max_entries 1000000, not batch, time 1894ms
> >>        measure_delete: max_entries 1000000, batch, time 220ms
> >>        measure_delete: max_entries 1000000, not batch, time 1289ms
> >> For a 1M entry hash table, batch size of 10 can reduce cpu time
> >> by 70%. Please see patch "tools/bpf: measure map batching perf"
> >> for details of test codes.  
> > 
> > Hi Yonghong!
> > 
> > great to see this, we have been looking at implementing some way to
> > speed up map walks as well.
> > 
> > The direction we were looking in, after previous discussions [1],
> > however, was to provide a BPF program which can run the logic entirely
> > within the kernel.
> > 
> > We have a rough PoC on the FW side (we can offload the program which
> > walks the map, which is pretty neat), but the kernel verifier side
> > hasn't really progressed. It will soon.
> > 
> > The rough idea is that the user space provides two programs, "filter"
> > and "dumper":
> > 
> > 	bpftool map exec id XYZ filter pinned /some/prog \
> > 				dumper pinned /some/other_prog
> > 
> > Both programs get this context:
> > 
> > struct map_op_ctx {
> > 	u64 key;
> > 	u64 value;
> > }
> > 
> > We need a per-map implementation of the exec side, but roughly maps
> > would do:
> > 
> > 	LIST_HEAD(deleted);
> > 
> > 	for entry in map {
> > 		struct map_op_ctx {
> > 			.key	= entry->key,
> > 			.value	= entry->value,
> > 		};
> > 
> > 		act = BPF_PROG_RUN(filter, &map_op_ctx);
> > 		if (act & ~ACT_BITS)
> > 			return -EINVAL;
> > 
> > 		if (act & DELETE) {
> > 			map_unlink(entry);
> > 			list_add(entry, &deleted);
> > 		}
> > 		if (act & STOP)
> > 			break;
> > 	}
> > 
> > 	synchronize_rcu();
> > 
> > 	for entry in deleted {
> > 		struct map_op_ctx {
> > 			.key	= entry->key,
> > 			.value	= entry->value,
> > 		};
> > 		
> > 		BPF_PROG_RUN(dumper, &map_op_ctx);
> > 		map_free(entry);
> > 	}
> > 
> > The filter program can't perform any map operations other than lookup,
> > otherwise we won't be able to guarantee that we'll walk the entire map
> > (if the filter program deletes some entries in a unfortunate order).  
> 
> Looks like you will provide a new program type and per-map 
> implementation of above code. My patch set indeed avoided per-map 
> implementation for all of lookup/delete/get-next-key...

Indeed, the simple batched ops are undeniably lower LoC.

> > If user space just wants a pure dump it can simply load a program which
> > dumps the entries into a perf ring.  
> 
> percpu perf ring is not really ideal for user space which simply just
> want to get some key/value pairs back. Some kind of generate non-per-cpu
> ring buffer might be better for such cases.

I don't think it had to be per-cpu, but I may be blissfully ignorant
about the perf ring details :) bpf_perf_event_output() takes flags,
which are effectively selecting the "output CPU", no?

> > I'm bringing this up because that mechanism should cover what is
> > achieved with this patch set and much more.  
> 
> The only case it did not cover is batched update. But that may not
> be super critical.

Right, my other concern (which admittedly is slightly pedantic) is the
potential loss of modifications. the lookup_and_delete() operation does
not guarantee that the result returned to user space is the "final"
value. We'd need to wait for a RCU grace period between lookup and dump.
The "map is definitely unused now"/RCU guarantee had came up in the
past.

> Your approach give each element an action choice through another bpf 
> program. This indeed powerful. My use case is simpler than your use case
> below, hence the implementation.

Agreed, the simplicity is tempting. 

> > In particular for networking workloads where old flows have to be
> > pruned from the map periodically it's far more efficient to communicate
> > to user space only the flows which timed out (the delete batching from
> > this set won't help at all).  
> 
> Maybe LRU map will help in this case? It is designed for such
> use cases.

LRU map would be perfect if it dumped the entry somewhere before it got
reused... Perhaps we need to attach a program to the LRU map that'd get
run when entries get reaped. That'd be cool 🤔 We could trivially reuse
the "dumper" prog type for this.

> > With a 2M entry map and this patch set we still won't be able to prune
> > once a second on one core.
> > 
> > [1]
> > https://lore.kernel.org/netdev/20190813130921.10704-4-quentin.monnet@netronome.com/

^ permalink raw reply

* Re: [PATCH] ncsi-netlink: support sending NC-SI commands over Netlink interface
From: Terry Duncan @ 2019-08-30 21:25 UTC (permalink / raw)
  To: Ben Wei, sam@mendozajonas.com, davem@davemloft.net,
	netdev@vger.kernel.org, openbmc@lists.ozlabs.org,
	Justin.Lee1@Dell.com
In-Reply-To: <CH2PR15MB36860EECD2EA6D63BEA70110A3A40@CH2PR15MB3686.namprd15.prod.outlook.com>

On 8/22/19 5:02 PM, Ben Wei wrote:
> This patch extends ncsi-netlink command line utility to send NC-SI command to kernel driver
> via NCSI_CMD_SEND_CMD command.
> 
> New command line option -o (opcode) is used to specify NC-SI command and optional payload.
> 

Thank you for posting this Ben.
Something looks off on this next line but it looks fine in your pull 
request in the github.com/sammj/ncsi-netlink repo.

> +static int send_cb(struct nl_msg *msg, void *arg) { #define
> +ETHERNET_HEADER_SIZE 16
> +

Do you have plans to upstream your yocto recipe for this repo?

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox