Netdev List


* Re: [EXT] INFO: trying to register non-static key in del_timer_sync (2)
From: Andrey Konovalov @ 2019-08-13 13:36 UTC (permalink / raw)
  To: Ganapathi Bhat
  Cc: Dmitry Vyukov, syzbot, amitkarwar@gmail.com, davem@davemloft.net,
	huxinming820@gmail.com, kvalo@codeaurora.org,
	linux-kernel@vger.kernel.org, linux-usb@vger.kernel.org,
	linux-wireless@vger.kernel.org, netdev@vger.kernel.org,
	nishants@marvell.com, syzkaller-bugs@googlegroups.com
In-Reply-To: <MN2PR18MB263710E8F1F8FFA06B2EDB3CA0EC0@MN2PR18MB2637.namprd18.prod.outlook.com>

On Wed, Jun 12, 2019 at 6:03 PM Ganapathi Bhat <gbhat@marvell.com> wrote:
>
> Hi Dmitry,
>
> We have a patch to fix this: https://patchwork.kernel.org/patch/10990275/

Hi Ganapathi,

Has this patch been accepted anywhere? This bug is still open on syzbot.

Thanks!

^ permalink raw reply

* Re: KASAN: slab-out-of-bounds Read in p54u_load_firmware_cb
From: Andrey Konovalov @ 2019-08-13 13:27 UTC (permalink / raw)
  Cc: syzbot, Christian Lamparter, David S. Miller, Kalle Valo,
	Kernel development list, USB list, linux-wireless, netdev,
	syzkaller-bugs, Alan Stern
In-Reply-To: <Pine.LNX.4.44L0.1906201544001.1346-100000@iolanthe.rowland.org>

On Thu, Jun 20, 2019 at 9:46 PM Alan Stern <stern@rowland.harvard.edu> wrote:
>
> On Wed, 19 Jun 2019, syzbot wrote:
>
> > syzbot has found a reproducer for the following crash on:
> >
> > HEAD commit:    9939f56e usb-fuzzer: main usb gadget fuzzer driver
> > git tree:       https://github.com/google/kasan.git usb-fuzzer
> > console output: https://syzkaller.appspot.com/x/log.txt?x=135e29faa00000
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=df134eda130bb43a
> > dashboard link: https://syzkaller.appspot.com/bug?extid=6d237e74cdc13f036473
> > compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
> > syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=175d946ea00000
> >
> > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > Reported-by: syzbot+6d237e74cdc13f036473@syzkaller.appspotmail.com
> >
> > usb 3-1: Direct firmware load for isl3887usb failed with error -2
> > usb 3-1: Firmware not found.
> > ==================================================================
> > BUG: KASAN: slab-out-of-bounds in p54u_load_firmware_cb.cold+0x97/0x13d
> > drivers/net/wireless/intersil/p54/p54usb.c:936
> > Read of size 8 at addr ffff8881c9cf7588 by task kworker/1:5/2759
> >
> > CPU: 1 PID: 2759 Comm: kworker/1:5 Not tainted 5.2.0-rc5+ #11
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> > Google 01/01/2011
> > Workqueue: events request_firmware_work_func
> > Call Trace:
> >   __dump_stack lib/dump_stack.c:77 [inline]
> >   dump_stack+0xca/0x13e lib/dump_stack.c:113
> >   print_address_description+0x67/0x231 mm/kasan/report.c:188
> >   __kasan_report.cold+0x1a/0x32 mm/kasan/report.c:317
> >   kasan_report+0xe/0x20 mm/kasan/common.c:614
> >   p54u_load_firmware_cb.cold+0x97/0x13d
> > drivers/net/wireless/intersil/p54/p54usb.c:936
> >   request_firmware_work_func+0x126/0x242
> > drivers/base/firmware_loader/main.c:785
> >   process_one_work+0x905/0x1570 kernel/workqueue.c:2269
> >   worker_thread+0x96/0xe20 kernel/workqueue.c:2415
> >   kthread+0x30b/0x410 kernel/kthread.c:255
> >   ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
> >
> > Allocated by task 1612:
> >   save_stack+0x1b/0x80 mm/kasan/common.c:71
> >   set_track mm/kasan/common.c:79 [inline]
> >   __kasan_kmalloc mm/kasan/common.c:489 [inline]
> >   __kasan_kmalloc.constprop.0+0xbf/0xd0 mm/kasan/common.c:462
> >   kmalloc include/linux/slab.h:547 [inline]
> >   syslog_print kernel/printk/printk.c:1346 [inline]
> >   do_syslog kernel/printk/printk.c:1519 [inline]
> >   do_syslog+0x4f4/0x12e0 kernel/printk/printk.c:1493
> >   kmsg_read+0x8a/0xb0 fs/proc/kmsg.c:40
> >   proc_reg_read+0x1c1/0x280 fs/proc/inode.c:221
> >   __vfs_read+0x76/0x100 fs/read_write.c:425
> >   vfs_read+0x18e/0x3d0 fs/read_write.c:461
> >   ksys_read+0x127/0x250 fs/read_write.c:587
> >   do_syscall_64+0xb7/0x560 arch/x86/entry/common.c:301
> >   entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >
> > Freed by task 1612:
> >   save_stack+0x1b/0x80 mm/kasan/common.c:71
> >   set_track mm/kasan/common.c:79 [inline]
> >   __kasan_slab_free+0x130/0x180 mm/kasan/common.c:451
> >   slab_free_hook mm/slub.c:1421 [inline]
> >   slab_free_freelist_hook mm/slub.c:1448 [inline]
> >   slab_free mm/slub.c:2994 [inline]
> >   kfree+0xd7/0x280 mm/slub.c:3949
> >   syslog_print kernel/printk/printk.c:1405 [inline]
> >   do_syslog kernel/printk/printk.c:1519 [inline]
> >   do_syslog+0xff3/0x12e0 kernel/printk/printk.c:1493
> >   kmsg_read+0x8a/0xb0 fs/proc/kmsg.c:40
> >   proc_reg_read+0x1c1/0x280 fs/proc/inode.c:221
> >   __vfs_read+0x76/0x100 fs/read_write.c:425
> >   vfs_read+0x18e/0x3d0 fs/read_write.c:461
> >   ksys_read+0x127/0x250 fs/read_write.c:587
> >   do_syscall_64+0xb7/0x560 arch/x86/entry/common.c:301
> >   entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >
> > The buggy address belongs to the object at ffff8881c9cf7180
> >   which belongs to the cache kmalloc-1k of size 1024
> > The buggy address is located 8 bytes to the right of
> >   1024-byte region [ffff8881c9cf7180, ffff8881c9cf7580)
> > The buggy address belongs to the page:
> > page:ffffea0007273d00 refcount:1 mapcount:0 mapping:ffff8881dac02a00
> > index:0x0 compound_mapcount: 0
> > flags: 0x200000000010200(slab|head)
> > raw: 0200000000010200 dead000000000100 dead000000000200 ffff8881dac02a00
> > raw: 0000000000000000 00000000000e000e 00000001ffffffff 0000000000000000
> > page dumped because: kasan: bad access detected
> >
> > Memory state around the buggy address:
> >   ffff8881c9cf7480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >   ffff8881c9cf7500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > > ffff8881c9cf7580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> >                        ^
> >   ffff8881c9cf7600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >   ffff8881c9cf7680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > ==================================================================
>
> Isn't this the same as syzkaller bug 200d4bb11b23d929335f ?  Doesn't
> the same patch fix it?

#syz dup: KASAN: use-after-free Read in p54u_load_firmware_cb

^ permalink raw reply

* Re: [PATCH 2/2] net: gmii2rgmii: Switch priv field in mdio device structure
From: Andrew Lunn @ 2019-08-13 13:23 UTC (permalink / raw)
  To: Harini Katakam
  Cc: Harini Katakam, Florian Fainelli, Heiner Kallweit, David Miller,
	Michal Simek, netdev, linux-arm-kernel, linux-kernel,
	radhey.shyam.pandey
In-Reply-To: <CAFcVEC+DyVhLzbMdSDsadivbnZJxSEg-0kUF5_Q+mtSbBnmhSA@mail.gmail.com>

On Tue, Aug 13, 2019 at 04:46:40PM +0530, Harini Katakam wrote:
> Hi Andrew,
> 
> On Thu, Aug 1, 2019 at 9:36 AM Andrew Lunn <andrew@lunn.ch> wrote:
> >
> > On Wed, Jul 31, 2019 at 03:06:19PM +0530, Harini Katakam wrote:
> > > Use the priv field in mdio device structure instead of the one in
> > > phy device structure. The phy device priv field may be used by the
> > > external phy driver and should not be overwritten.
> >
> > Hi Harini
> >
> > I _think_ you could use dev_set_drvdata(&mdiodev->dev) in xgmiitorgmii_probe() and
> > > dev_get_drvdata(&phydev->mdio.dev) in _read_status()
> 
> Thanks for the review. This works if I do:
> dev_set_drvdata(&priv->phy_dev->mdio.dev->dev) in probe
> and then
> dev_get_drvdata(&phydev->mdio.dev) in _read_status()
> 
> i.e mdiodev in gmii2rgmii probe and priv->phy_dev->mdio are not the same.
> 
> If this is acceptable, I can send a v2.

Hi Harini

I think this is better, making use of the central driver
infrastructure, rather than inventing something new.

The kernel does have a few helpers, spi_get_drvdata, pci_get_drvdata,
hci_get_drvdata. So maybe add phydev_get_drvdata(struct phy_device
*phydev)?

	Thanks
		Andrew
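
The helper suggested above could look roughly like the following sketch.
It is only an illustration: phydev_set_drvdata() and phydev_get_drvdata()
are hypothetical names that do not exist in the tree discussed here, and
the structures are simplified stand-ins for the kernel ones, kept just
detailed enough to show the phydev->mdio.dev indirection.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel structures, not the real definitions. */
struct device {
	void *driver_data;
};

struct mdio_device {
	struct device dev;
};

struct phy_device {
	struct mdio_device mdio;
};

/* Same shape as the kernel's dev_set_drvdata()/dev_get_drvdata(). */
static inline void dev_set_drvdata(struct device *dev, void *data)
{
	dev->driver_data = data;
}

static inline void *dev_get_drvdata(const struct device *dev)
{
	return dev->driver_data;
}

/* Hypothetical wrappers, modelled on spi_get_drvdata()/pci_get_drvdata(). */
static inline void phydev_set_drvdata(struct phy_device *phydev, void *data)
{
	dev_set_drvdata(&phydev->mdio.dev, data);
}

static inline void *phydev_get_drvdata(struct phy_device *phydev)
{
	return dev_get_drvdata(&phydev->mdio.dev);
}
```

With such wrappers, probe would store the private data once and
_read_status() would fetch it back without touching phydev->priv.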

^ permalink raw reply

* Re: [PATCH net-next v2 6/9] net: macsec: hardware offloading infrastructure
From: Andrew Lunn @ 2019-08-13 13:17 UTC (permalink / raw)
  To: Antoine Tenart
  Cc: Igor Russkikh, davem@davemloft.net, sd@queasysnail.net,
	f.fainelli@gmail.com, hkallweit1@gmail.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	thomas.petazzoni@bootlin.com, alexandre.belloni@bootlin.com,
	allan.nielsen@microchip.com, camelia.groza@nxp.com,
	Simon Edelhaus, Pavel Belous
In-Reply-To: <20190813085817.GA3200@kwain>

On Tue, Aug 13, 2019 at 10:58:17AM +0200, Antoine Tenart wrote:
> I think this question is linked to the use of a MACsec virtual interface
> when using h/w offloading. The starting point for me was that I wanted
> to reuse the data structures and the API exposed to the userspace by the
> s/w implementation of MACsec. I then had two choices: keeping the exact
> same interface for the user (having a virtual MACsec interface), or
> registering the MACsec genl ops onto the real net devices (and making
> the s/w implementation a virtual net dev and a provider of the MACsec
> "offloading" ops).
> 
> The advantages of the first option were that nearly all the logic of the
> s/w implementation could be kept and especially that it would be
> transparent for the user to use both implementations of MACsec.

Hi Antoine

We have always talked about offloading operations to the hardware,
accelerating what the linux stack can do by making use of hardware
accelerators. The basic user API should not change because of
acceleration. Those are the general guidelines.

It would however be interesting to get comments from those who did the
software implementation and what they think of this architecture. I've
no personal experience with MACsec, so it is hard for me to say if the
current architecture makes sense when using accelerators.

	Andrew

^ permalink raw reply

* [RFC bpf-next 3/3] tools: bpftool: add "bpftool map count" to count entries in map
From: Quentin Monnet @ 2019-08-13 13:09 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: bpf, netdev, oss-drivers, Quentin Monnet
In-Reply-To: <20190813130921.10704-1-quentin.monnet@netronome.com>

Add a "map count" subcommand for counting the number of entries in a
map.

Because of the variety of BPF map types, it is not entirely clear what
counts as an "entry" to dump. We could count all entries for which we
have keys; but then for all array derivatives, it would simply come down
to printing the maximum number of entries (already accessible through
"bpftool map show").

Several map types set errno to ENOENT when they consider there is no
value associated to a key, so we could maybe use that... But then there
are also some map types that simply reject lookup attempts with
EOPNOTSUPP (xskmap, sock_map, sock_hash): not being able to look up a
value in such maps does not mean they have no values.

Instead of trying to enforce a definition for a "map entry", the
selected approach in this patch consists in dumping several counters, and
letting the user decide how to interpret them. Values printed are:

  - Max number of entries
  - Number of keys found
  - Number of successful lookups
  - Number of failed lookups, broken down into the most frequent values
    for errno (ENOENT, EOPNOTSUPP, EINVAL, EPERM).

Not all possible values for errno are included (e.g. ENOMEM or EFAULT);
they can be added in the future if necessary.

Below are some sample outputs with different types of maps.

Array map:

        # bpftool map count id 11
        max entries:            2
        keys found:             2
        successful lookups:     2

Empty prog_array map:

        # bpftool map count id 13
        max entries:            5
        keys found:             5
        successful lookups:     0
        failed lookups: 5, of which:
          - errno set to ENOENT:        5

Empty xskmap:

        # bpftool map count id 14
        max entries:            5
        keys found:             5
        successful lookups:     0
        failed lookups: 5, of which:
          - errno set to EOPNOTSUPP:    5

JSON for the array map:

        # bpftool map count id 11
        {
            "max_entries": 2,
            "n_keys": 2,
            "n_lookup_success": 2,
            "lookup_failures": {
                "enoent": 0,
                "eopnotsupp": 0,
                "einval": 0,
                "eperm": 0
            }
        }

Queue map containing 3 items:

        # bpftool map count id 12
        failed to get next key, interrupting count: Invalid argument
        max entries:            5
        keys found:             0
        successful lookups:     0

Note that counting entries for queue and stack maps is not supported
(beyond max_entries), as these types do not support cycling over the
keys.

This commit also adds relevant documentation and bash completion.

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 .../bpf/bpftool/Documentation/bpftool-map.rst | 15 +++
 tools/bpf/bpftool/bash-completion/bpftool     |  4 +-
 tools/bpf/bpftool/map.c                       | 97 ++++++++++++++++++-
 3 files changed, 113 insertions(+), 3 deletions(-)

diff --git a/tools/bpf/bpftool/Documentation/bpftool-map.rst b/tools/bpf/bpftool/Documentation/bpftool-map.rst
index 61d1d270eb5e..ccc19bdd2ca3 100644
--- a/tools/bpf/bpftool/Documentation/bpftool-map.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool-map.rst
@@ -25,6 +25,7 @@ MAP COMMANDS
 |	**bpftool** **map create**     *FILE* **type** *TYPE* **key** *KEY_SIZE* **value** *VALUE_SIZE* \
 |		**entries** *MAX_ENTRIES* **name** *NAME* [**flags** *FLAGS*] [**dev** *NAME*]
 |	**bpftool** **map dump**       *MAP*
+|	**bpftool** **map count**      *MAP*
 |	**bpftool** **map update**     *MAP* [**key** *DATA*] [**value** *VALUE*] [*UPDATE_FLAGS*]
 |	**bpftool** **map lookup**     *MAP* [**key** *DATA*]
 |	**bpftool** **map getnext**    *MAP* [**key** *DATA*]
@@ -67,6 +68,20 @@ DESCRIPTION
 	**bpftool map dump**    *MAP*
 		  Dump all entries in a given *MAP*.
 
+	**bpftool map count**   *MAP*
+		  Count the number of entries in a given *MAP*. Several values
+		  are printed: the maximum number of entries, the number of
+		  keys found, the number of successful lookups with those keys.
+		  The report for failed lookups is broken down to give values
+		  for the most frequent **errno** values.
+
+		  Note that the counters may not be accurate if the map is
+		  being modified (for example by a running BPF program). For
+		  example, if an element gets removed while being dumped, and
+		  then passed in as the "previous key" while cycling over map
+		  keys, the dump will restart and bpftool will count the
+		  entries multiple times.
+
 	**bpftool map update**  *MAP* [**key** *DATA*] [**value** *VALUE*] [*UPDATE_FLAGS*]
 		  Update map entry for a given *KEY*.
 
diff --git a/tools/bpf/bpftool/bash-completion/bpftool b/tools/bpf/bpftool/bash-completion/bpftool
index df16c5415444..764c88bfe9da 100644
--- a/tools/bpf/bpftool/bash-completion/bpftool
+++ b/tools/bpf/bpftool/bash-completion/bpftool
@@ -449,7 +449,7 @@ _bpftool()
         map)
             local MAP_TYPE='id pinned'
             case $command in
-                show|list|dump|peek|pop|dequeue)
+                show|list|dump|count|peek|pop|dequeue)
                     case $prev in
                         $command)
                             COMPREPLY=( $( compgen -W "$MAP_TYPE" -- "$cur" ) )
@@ -642,7 +642,7 @@ _bpftool()
                     [[ $prev == $object ]] && \
                         COMPREPLY=( $( compgen -W 'delete dump getnext help \
                             lookup pin event_pipe show list update create \
-                            peek push enqueue pop dequeue' -- \
+                            peek push enqueue pop dequeue count' -- \
                             "$cur" ) )
                     ;;
             esac
diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
index cead639b3ab1..918d08d1676e 100644
--- a/tools/bpf/bpftool/map.c
+++ b/tools/bpf/bpftool/map.c
@@ -822,6 +822,98 @@ static int do_dump(int argc, char **argv)
 	return err;
 }
 
+static int do_count(int argc, char **argv)
+{
+	unsigned int num_keys = 0, num_lookups = 0;
+	unsigned int err_cnts[1024] = {};
+	struct bpf_map_info info = {};
+	void *key, *value, *prev_key;
+	__u32 len = sizeof(info);
+	int err, fd;
+
+	if (!REQ_ARGS(2))
+		return -1;
+
+	fd = map_parse_fd_and_info(&argc, &argv, &info, &len);
+	if (fd < 0)
+		return -1;
+
+	key = malloc(info.key_size);
+	value = alloc_value(&info);
+	if (!key || !value) {
+		p_err("mem alloc failed");
+		err = -1;
+		goto exit_free;
+	}
+
+	prev_key = NULL;
+	while (true) {
+		int res;
+
+		err = bpf_map_get_next_key(fd, prev_key, key);
+		if (err) {
+			if (errno == ENOENT)
+				err = 0;
+			else
+				p_info("failed to get next key, interrupting count: %s",
+				       strerror(errno));
+			break;
+		}
+
+		num_keys++;
+		res = bpf_map_lookup_elem(fd, key, value);
+		if (res) {
+			if (errno < (int)ARRAY_SIZE(err_cnts))
+				err_cnts[errno]++;
+		} else {
+			num_lookups++;
+		}
+		prev_key = key;
+	}
+
+	if (json_output) {
+		jsonw_start_object(json_wtr);	/* root */
+		jsonw_uint_field(json_wtr, "max_entries", info.max_entries);
+		jsonw_uint_field(json_wtr, "n_keys", num_keys);
+		jsonw_uint_field(json_wtr, "n_lookup_success", num_lookups);
+		jsonw_name(json_wtr, "lookup_failures");
+		jsonw_start_object(json_wtr);	/* lookup_failures */
+		jsonw_uint_field(json_wtr, "enoent", err_cnts[ENOENT]);
+		jsonw_uint_field(json_wtr, "eopnotsupp", err_cnts[EOPNOTSUPP]);
+		jsonw_uint_field(json_wtr, "einval", err_cnts[EINVAL]);
+		jsonw_uint_field(json_wtr, "eperm", err_cnts[EPERM]);
+		jsonw_end_object(json_wtr);	/* lookup_failures */
+		jsonw_end_object(json_wtr);	/* root */
+	} else {
+		printf("max entries:\t\t%u\n", info.max_entries);
+		printf("keys found:\t\t%u\n", num_keys);
+		printf("successful lookups:\t%u\n", num_lookups);
+		if (num_lookups != num_keys) {
+			printf("failed lookups:\t%u, of which:\n",
+			       num_keys - num_lookups);
+			if (err_cnts[ENOENT])
+				printf("  - errno set to ENOENT:\t%u\n",
+				       err_cnts[ENOENT]);
+			if (err_cnts[EOPNOTSUPP])
+				printf("  - errno set to EOPNOTSUPP:\t%u\n",
+				       err_cnts[EOPNOTSUPP]);
+			if (err_cnts[EINVAL])
+				printf("  - errno set to EINVAL:\t%u\n",
+				       err_cnts[EINVAL]);
+			if (err_cnts[EPERM])
+				printf("  - errno set to EPERM:\t\t%u\n",
+				       err_cnts[EPERM]);
+		}
+	}
+
+exit_free:
+	free(key);
+	free(value);
+	close(fd);
+
+	return err;
+}
+
 static int alloc_key_value(struct bpf_map_info *info, void **key, void **value)
 {
 	*key = NULL;
@@ -1250,6 +1342,7 @@ static int do_help(int argc, char **argv)
 		"                              entries MAX_ENTRIES name NAME [flags FLAGS] \\\n"
 		"                              [dev NAME]\n"
 		"       %s %s dump       MAP\n"
+		"       %s %s count      MAP\n"
 		"       %s %s update     MAP [key DATA] [value VALUE] [UPDATE_FLAGS]\n"
 		"       %s %s lookup     MAP [key DATA]\n"
 		"       %s %s getnext    MAP [key DATA]\n"
@@ -1279,7 +1372,8 @@ static int do_help(int argc, char **argv)
 		bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2],
 		bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2],
 		bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2],
-		bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2]);
+		bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2],
+		bin_name, argv[-2]);
 
 	return 0;
 }
@@ -1301,6 +1395,7 @@ static const struct cmd cmds[] = {
 	{ "enqueue",	do_update },
 	{ "pop",	do_pop_dequeue },
 	{ "dequeue",	do_pop_dequeue },
+	{ "count",	do_count },
 	{ 0 }
 };
 
-- 
2.17.1


^ permalink raw reply related

* [RFC bpf-next 1/3] tools: bpftool: clean up dump_map_elem() return value
From: Quentin Monnet @ 2019-08-13 13:09 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: bpf, netdev, oss-drivers, Quentin Monnet
In-Reply-To: <20190813130921.10704-1-quentin.monnet@netronome.com>

The code for dumping a map entry (as part of a full map dump) was moved
to a specific function dump_map_elem() in commit 18a781daa93e
("tools/bpf: bpftool, split the function do_dump()"). The "num_elems"
variable was moved into that function, incremented on success, and
returned to be immediately added to the counter in do_dump().

Returning the count of elements dumped, which is either 0 or 1, is not
really consistent with the rest of the function, especially because
the "dump_map_elem()" name is not explicit about returning a counter.
Furthermore, the counter is not incremented when the entry is dumped in
JSON. This has no visible effect, because the number of elements
successfully dumped is not printed for JSON output.

Still, let's remove "num_elems" from the function and make it return 0
or -1 in case of success or failure, respectively. This is more correct,
and more consistent with the rest of the code.

It is unclear if an error value should indeed be returned for maps of
maps or maps of progs, but this has no effect on the output either, so
we just leave the current behaviour unchanged.

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 tools/bpf/bpftool/map.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
index bfbbc6b4cb83..206ee46189d9 100644
--- a/tools/bpf/bpftool/map.c
+++ b/tools/bpf/bpftool/map.c
@@ -686,7 +686,6 @@ static int dump_map_elem(int fd, void *key, void *value,
 			 struct bpf_map_info *map_info, struct btf *btf,
 			 json_writer_t *btf_wtr)
 {
-	int num_elems = 0;
 	int lookup_errno;

 	if (!bpf_map_lookup_elem(fd, key, value)) {
@@ -704,9 +703,8 @@ static int dump_map_elem(int fd, void *key, void *value,
 			} else {
 				print_entry_plain(map_info, key, value);
 			}
-			num_elems++;
 		}
-		return num_elems;
+		return 0;
 	}

 	/* lookup error handling */
@@ -714,7 +712,7 @@ static int dump_map_elem(int fd, void *key, void *value,

 	if (map_is_map_of_maps(map_info->type) ||
 	    map_is_map_of_progs(map_info->type))
-		return 0;
+		return -1;

 	if (json_output) {
 		jsonw_start_object(json_wtr);
@@ -738,7 +736,7 @@ static int dump_map_elem(int fd, void *key, void *value,
 				  msg ? : strerror(lookup_errno));
 	}

-	return 0;
+	return -1;
 }

 static int do_dump(int argc, char **argv)
@@ -800,7 +798,8 @@ static int do_dump(int argc, char **argv)
 				err = 0;
 			break;
 		}
-		num_elems += dump_map_elem(fd, key, value, &info, btf, btf_wtr);
+		if (!dump_map_elem(fd, key, value, &info, btf, btf_wtr))
+			num_elems++;
 		prev_key = key;
 	}

-- 
2.17.1

^ permalink raw reply related

* [RFC bpf-next 2/3] tools: bpftool: make comment more explicit for count of dumped entries
From: Quentin Monnet @ 2019-08-13 13:09 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: bpf, netdev, oss-drivers, Quentin Monnet
In-Reply-To: <20190813130921.10704-1-quentin.monnet@netronome.com>

The counter printed at the end of plain map dump does not reflect the
exact number of entries in the map, but the number of entries bpftool
managed to dump (some of them could not be read, or made no sense to
dump (map-in-map...)).

Slightly edit the message to make this more explicit.

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 tools/bpf/bpftool/map.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
index 206ee46189d9..cead639b3ab1 100644
--- a/tools/bpf/bpftool/map.c
+++ b/tools/bpf/bpftool/map.c
@@ -809,7 +809,7 @@ static int do_dump(int argc, char **argv)
 		jsonw_end_array(btf_wtr);
 		jsonw_destroy(&btf_wtr);
 	} else {
-		printf("Found %u element%s\n", num_elems,
+		printf("Found %u element%s to dump\n", num_elems,
 		       num_elems != 1 ? "s" : "");
 	}
 
-- 
2.17.1


^ permalink raw reply related

* [RFC bpf-next 0/3] tools: bpftool: add subcommand to count map entries
From: Quentin Monnet @ 2019-08-13 13:09 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: bpf, netdev, oss-drivers, Quentin Monnet

This series adds a "bpftool map count" subcommand to count the number of
entries present in a BPF map. This results from a customer request for a
tool to count the number of entries in BPF maps used in production (for
example, to know how many free entries are left in a given map).

The first two commits actually contain some clean-up in preparation for the
new subcommand.

The third commit adds the new subcommand. Because what data should count as
an entry is not entirely clear for all map types, we actually dump several
counters, and leave it to the users to interpret the values.

Sending as an RFC because I'm looking for feedback on the approach. Is
printing several values the right thing to do? Also, note that some map
types such as queue/stack maps do not support any type of counting; this
would need to be implemented in the kernel I believe.

More generally, we have a use case where (hash) maps are under pressure
(many additions/deletions from the BPF program), and counting the entries
by iterating over the different keys is not at all reliable. Would it
make sense to add a new bpf() subcommand to count the entries on the kernel
side instead of cycling over the entries in bpftool? If so, we would need
to agree on what makes an entry for each kind of map.
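
To make the reliability problem concrete, here is a toy model of the
iteration, a sketch under stated assumptions rather than kernel or bpftool
code. It assumes the restart rule described in the documentation added by
patch 3/3: when the "previous key" is no longer in the map, iteration starts
over from the first entry. All model_* names are hypothetical, and the
delete_after parameter merely stands in for a BPF program removing an entry
while user space is counting.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_SLOTS 8

/* Toy map: a fixed array of slots, each either empty or holding a key. */
struct model_map {
	bool used[MODEL_SLOTS];
	int key[MODEL_SLOTS];
};

static int model_insert(struct model_map *m, int key)
{
	for (size_t i = 0; i < MODEL_SLOTS; i++) {
		if (!m->used[i]) {
			m->used[i] = true;
			m->key[i] = key;
			return 0;
		}
	}
	return -1;	/* map full */
}

static void model_delete(struct model_map *m, int key)
{
	for (size_t i = 0; i < MODEL_SLOTS; i++)
		if (m->used[i] && m->key[i] == key)
			m->used[i] = false;
}

/*
 * Returns 0 and fills *next, or -1 when there are no more keys.
 * If prev is NULL or no longer present, start again from the first slot.
 */
static int model_get_next_key(const struct model_map *m, const int *prev,
			      int *next)
{
	size_t start = 0;

	if (prev) {
		for (size_t i = 0; i < MODEL_SLOTS; i++) {
			if (m->used[i] && m->key[i] == *prev) {
				start = i + 1;
				break;
			}
		}
	}
	for (size_t i = start; i < MODEL_SLOTS; i++) {
		if (m->used[i]) {
			*next = m->key[i];
			return 0;
		}
	}
	return -1;
}

/*
 * Count keys by cycling over them. After delete_after keys have been seen
 * (0 means never), the current key is deleted to mimic a concurrent writer.
 */
static unsigned int model_count(struct model_map *m, unsigned int delete_after)
{
	unsigned int num_keys = 0;
	int key, *prev_key = NULL;

	while (model_get_next_key(m, prev_key, &key) == 0) {
		num_keys++;
		if (delete_after && num_keys == delete_after)
			model_delete(m, key);
		prev_key = &key;
	}
	return num_keys;
}
```

In this model, a map holding three keys reports four keys if the second one
is deleted right after being visited, although the map never held more than
three entries and holds only two afterwards.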

Note that we are also facing similar issues for purging maps of their
entries (deleting all entries at once). We can iterate on the keys and
delete elements one by one, but this is very inefficient when entries are
being added/removed in parallel from the BPF program, and having another
dedicated command accessible from the bpf() system call might help here as
well.

Quentin Monnet (3):
  tools: bpftool: clean up dump_map_elem() return value
  tools: bpftool: make comment more explicit for count of dumped entries
  tools: bpftool: add "bpftool map count" to count entries in map

 .../bpf/bpftool/Documentation/bpftool-map.rst |  15 +++
 tools/bpf/bpftool/bash-completion/bpftool     |   4 +-
 tools/bpf/bpftool/map.c                       | 110 ++++++++++++++++--
 3 files changed, 119 insertions(+), 10 deletions(-)

-- 
2.17.1

^ permalink raw reply

* Re: [PATCH] net: ethernet: mediatek: Add MT7628/88 SoC support
From: Stefan Roese @ 2019-08-13 13:09 UTC (permalink / raw)
  To: Daniel Golle
  Cc: netdev, René van Dorst, Felix Fietkau, Sean Wang,
	linux-mediatek, John Crispin
In-Reply-To: <20190717121506.GD18996@makrotopia.org>

On 17.07.19 14:15, Daniel Golle wrote:
> On Wed, Jul 17, 2019 at 01:02:43PM +0200, Stefan Roese wrote:
>> This patch adds support for the MediaTek MT7628/88 SoCs to the common
>> MediaTek ethernet driver. Some minor changes are needed for this and
>> a bigger change, as the MT7628 does not support QDMA (only PDMA).
> 
> The Ethernet core found in MT7628/88 is identical to that found in
> Ralink Rt5350F SoC. Wouldn't it hence make sense to indicate that
> in the compatible string of this driver as well? In OpenWrt we are
> using "ralink,rt5350-eth".

Okay. I'll use this ralink compatible instead in the next version.

Thanks,
Stefan

^ permalink raw reply

* [patch net-next] selftests: netdevsim: add devlink params tests
From: Jiri Pirko @ 2019-08-13 13:04 UTC (permalink / raw)
  To: netdev; +Cc: davem, jakub.kicinski, mlxsw

From: Jiri Pirko <jiri@mellanox.com>

Test the recently added netdevsim devlink param implementation.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 .../drivers/net/netdevsim/devlink.sh          | 62 ++++++++++++++++++-
 1 file changed, 61 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/drivers/net/netdevsim/devlink.sh b/tools/testing/selftests/drivers/net/netdevsim/devlink.sh
index 9d8baf5d14b3..858ebdc8d8a3 100755
--- a/tools/testing/selftests/drivers/net/netdevsim/devlink.sh
+++ b/tools/testing/selftests/drivers/net/netdevsim/devlink.sh
@@ -3,7 +3,7 @@
 
 lib_dir=$(dirname $0)/../../../net/forwarding
 
-ALL_TESTS="fw_flash_test"
+ALL_TESTS="fw_flash_test params_test"
 NUM_NETIFS=0
 source $lib_dir/lib.sh
 
@@ -30,6 +30,66 @@ fw_flash_test()
 	log_test "fw flash test"
 }
 
+param_get()
+{
+	local name=$1
+
+	devlink dev param show $DL_HANDLE name $name -j | \
+		jq -e -r '.[][][].values[] | select(.cmode == "driverinit").value'
+}
+
+param_set()
+{
+	local name=$1
+	local value=$2
+
+	devlink dev param set $DL_HANDLE name $name cmode driverinit value $value
+}
+
+check_value()
+{
+	local name=$1
+	local phase_name=$2
+	local expected_param_value=$3
+	local expected_debugfs_value=$4
+	local value
+
+	value=$(param_get $name)
+	check_err $? "Failed to get $name param value"
+	[ "$value" == "$expected_param_value" ]
+	check_err $? "Unexpected $phase_name $name param value"
+	value=$(<$DEBUGFS_DIR/$name)
+	check_err $? "Failed to get $name debugfs value"
+	[ "$value" == "$expected_debugfs_value" ]
+	check_err $? "Unexpected $phase_name $name debugfs value"
+}
+
+params_test()
+{
+	RET=0
+
+	local max_macs
+	local test1
+
+	check_value max_macs initial 32 32
+	check_value test1 initial true Y
+
+	param_set max_macs 16
+	check_err $? "Failed to set max_macs param value"
+	param_set test1 false
+	check_err $? "Failed to set test1 param value"
+
+	check_value max_macs post-set 16 32
+	check_value test1 post-set false Y
+
+	devlink dev reload $DL_HANDLE
+
+	check_value max_macs post-reload 16 16
+	check_value test1 post-reload false N
+
+	log_test "params test"
+}
+
 setup_prepare()
 {
 	modprobe netdevsim
-- 
2.21.0


^ permalink raw reply related

* Re: [PATCH net-next] net: can: Fix compiling warning
From: Dan Carpenter @ 2019-08-13 12:48 UTC (permalink / raw)
  To: Kees Cook, Nicolai Stange
  Cc: Oliver Hartkopp, Patrick Bellasi, linux-sparse, Mao Wenan, davem,
	netdev, linux-kernel, kernel-janitors, Ingo Molnar
In-Reply-To: <201908121001.0AC0A90@keescook>

On Mon, Aug 12, 2019 at 10:19:27AM -0700, Kees Cook wrote:
> On Wed, Aug 07, 2019 at 01:50:42PM +0300, Dan Carpenter wrote:
> > On Tue, Aug 06, 2019 at 06:41:44PM +0200, Oliver Hartkopp wrote:
> > > I compiled the code (the original version), but I do not get that "Should it
> > > be static?" warning:
> > > 
> > > user@box:~/net-next$ make C=1
> > >   CALL    scripts/checksyscalls.sh
> > >   CALL    scripts/atomic/check-atomics.sh
> > >   DESCEND  objtool
> > >   CHK     include/generated/compile.h
> > >   CHECK   net/can/af_can.c
> > > ./include/linux/sched.h:609:43: error: bad integer constant expression
> > > ./include/linux/sched.h:609:73: error: invalid named zero-width bitfield
> > > `value'
> > > ./include/linux/sched.h:610:43: error: bad integer constant expression
> > > ./include/linux/sched.h:610:67: error: invalid named zero-width bitfield
> > > `bucket_id'
> > >   CC [M]  net/can/af_can.o
> > 
> > The sched.h errors suppress Sparse warnings so it's broken/useless now.
> > The code looks like this:
> > 
> > include/linux/sched.h
> >    613  struct uclamp_se {
> >    614          unsigned int value              : bits_per(SCHED_CAPACITY_SCALE);
> >    615          unsigned int bucket_id          : bits_per(UCLAMP_BUCKETS);
> >    616          unsigned int active             : 1;
> >    617          unsigned int user_defined       : 1;
> >    618  };
> > 
> > bits_per() is zero and Sparse doesn't like zero sized bitfields.
> 
> I just noticed these sparse warnings too -- what's happening here? Are
> they _supposed_ to be 0-width fields? It doesn't look like it to me:

I'm sorry, I don't even know what code I was looking at before.  I think
my cscope database was stale?  You're right.  Sparse doesn't think it's
zero, it knows that it is 11 and 3.

What's happening is that it's failing the test in
bad_integer_constant_expression():

	if (!(expr->flags & CEF_ICE))

The ICE in CEF_ICE stands for Integer Constant Expression.  The rule
here is that enums are not constant expressions in c99.  See the
explanation in commit 274c154704db ("constexpr: introduce additional
expression constness tracking flags").

I don't think the CEF_ICE is set properly in evaluate_conditional_expression().
If the conditional is constant and true, and the ->cond_true expression
is constant, then the result should be constant as well.  It shouldn't
matter whether cond_false is constant.  But instead it is ANDing all
three sub-expressions:

	expr->flags = (expr->conditional->flags & (*true)->flags &
			expr->cond_false->flags & ~CEF_CONST_MASK);

Or actually in this case it's doing:

	if (expr->conditional->flags & (CEF_ACE | CEF_ADDR))
		expr->flags = (*true)->flags & expr->cond_false->flags & ~CEF_CONST_MASK;

But it's the same problem because it should ignore cond_false.
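
To illustrate the point, here is a minimal standalone sketch of the two
flag-propagation rules.  The SK_* bits, struct sk_expr and both helper
names are hypothetical and only model the idea; they are not sparse's
real CEF_* definitions, and the masking with ~CEF_CONST_MASK is left out.
Treating the false arm symmetrically when the condition is constant and
false is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits for this sketch; sparse's real CEF_* values differ. */
enum {
	SK_ICE = 1 << 0,	/* integer constant expression */
	SK_ACE = 1 << 1,	/* arithmetic constant expression */
};

struct sk_expr {
	unsigned int flags;
	long value;		/* only meaningful when the expression is constant */
};

/* Behaviour as described above: AND the flags of both arms. */
static unsigned int cond_flags_and(const struct sk_expr *cond,
				   const struct sk_expr *t,
				   const struct sk_expr *f)
{
	assert(cond != NULL && t != NULL && f != NULL);
	if (!(cond->flags & SK_ACE))
		return 0;
	return t->flags & f->flags;
}

/* Suggested behaviour: with a constant condition only the selected arm counts. */
static unsigned int cond_flags_selected(const struct sk_expr *cond,
					const struct sk_expr *t,
					const struct sk_expr *f)
{
	assert(cond != NULL && t != NULL && f != NULL);
	if (!(cond->flags & SK_ACE))
		return 0;
	return cond->value ? t->flags : f->flags;
}
```

With a constant true condition, a constant true arm and a non-constant
false arm, the first helper drops SK_ICE while the second keeps it.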

regards,
dan carpenter

^ permalink raw reply

* Re: KMSAN: uninit-value in smsc75xx_bind
From: Oliver Neukum @ 2019-08-13 12:43 UTC (permalink / raw)
  To: syzbot, davem, glider, syzkaller-bugs, steve.glendinning,
	linux-kernel, linux-usb, netdev
In-Reply-To: <0000000000009f4316058fab3bd7@google.com>

Am Freitag, den 09.08.2019, 01:48 -0700 schrieb syzbot:
> Hello,
> 
> syzbot found the following crash on:
> 
> HEAD commit:    beaab8a3 fix KASAN build
> git tree:       kmsan

[..]
> Call Trace:
>   __dump_stack lib/dump_stack.c:77 [inline]
>   dump_stack+0x191/0x1f0 lib/dump_stack.c:113
>   kmsan_report+0x162/0x2d0 mm/kmsan/kmsan_report.c:109
>   __msan_warning+0x75/0xe0 mm/kmsan/kmsan_instr.c:294
>   smsc75xx_wait_ready drivers/net/usb/smsc75xx.c:976 [inline]
>   smsc75xx_bind+0x541/0x12d0 drivers/net/usb/smsc75xx.c:1483

> 
> Local variable description: ----buf.i93@smsc75xx_bind
> Variable was created at:
>   __smsc75xx_read_reg drivers/net/usb/smsc75xx.c:83 [inline]
>   smsc75xx_wait_ready drivers/net/usb/smsc75xx.c:969 [inline]
>   smsc75xx_bind+0x44c/0x12d0 drivers/net/usb/smsc75xx.c:1483
>   usbnet_probe+0x10d3/0x3950 drivers/net/usb/usbnet.c:1722

Hi,

this looks like a false positive to me.
The offending code is likely this:

        if (size) {
                buf = kmalloc(size, GFP_KERNEL);
                if (!buf)
                        goto out;
        }

        err = usb_control_msg(dev->udev, usb_rcvctrlpipe(dev->udev, 0),
                              cmd, reqtype, value, index, buf, size,
                              USB_CTRL_GET_TIMEOUT);

which uses 'buf' uninitialized. But it is used for input.
What is happening here?
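
One possibility, offered only as an assumption and not as a diagnosis: if
the transfer completes with fewer bytes than requested, the tail of 'buf'
stays uninitialized even though the call itself did not fail.  A minimal
userspace sketch of guarding against that follows.  fake_control_read()
and read_reg_checked() are hypothetical stand-ins, not the usbnet or USB
core API, and the choice of -EIO is arbitrary.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a control transfer: copies at most `avail`
 * bytes of device data and returns the number of bytes actually written. */
static int fake_control_read(void *buf, size_t size,
			     const void *dev_data, size_t avail)
{
	size_t n = avail < size ? avail : size;

	memcpy(buf, dev_data, n);
	return (int)n;
}

/* Read a 32-bit register and reject short transfers, so that bytes the
 * device never filled in are not consumed by the caller. */
static int read_reg_checked(const void *dev_data, size_t avail, uint32_t *out)
{
	uint32_t buf;
	int ret;

	assert(out != NULL);

	ret = fake_control_read(&buf, sizeof(buf), dev_data, avail);
	if (ret < 0)
		return ret;
	if ((size_t)ret < sizeof(buf))
		return -EIO;	/* short read: part of buf is uninitialized */

	*out = buf;
	return 0;
}
```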

	Regards
		Oliver




^ permalink raw reply

* Re: [PATCH 12/16] arm64: prefer __section from compiler_attributes.h
From: Miguel Ojeda @ 2019-08-13 12:36 UTC (permalink / raw)
  To: Will Deacon
  Cc: Nick Desaulniers, Andrew Morton, Sedat Dilek, Josh Poimboeuf, yhs,
	clang-built-linux, Catalin Marinas, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Andrey Konovalov,
	Greg Kroah-Hartman, Enrico Weigelt, Suzuki K Poulose,
	Thomas Gleixner, Masayoshi Mizuma, Shaokun Zhang, Alexios Zavras,
	Allison Randal, Linux ARM, linux-kernel, Network Development, bpf
In-Reply-To: <20190813082744.xmzmm4j675rqiz47@willie-the-truck>

On Tue, Aug 13, 2019 at 10:27 AM Will Deacon <will@kernel.org> wrote:
>
> Hi Nick,
>
> On Mon, Aug 12, 2019 at 02:50:45PM -0700, Nick Desaulniers wrote:
> > GCC unescapes escaped string section names while Clang does not. Because
> > __section uses the `#` stringification operator for the section name, it
> > doesn't need to be escaped.
> >
> > This antipattern was found with:
> > $ grep -e __section\(\" -e __section__\(\" -r
> >
> > Reported-by: Sedat Dilek <sedat.dilek@gmail.com>
> > Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
> > Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
> > ---
> >  arch/arm64/include/asm/cache.h     | 2 +-
> >  arch/arm64/kernel/smp_spin_table.c | 2 +-
> >  2 files changed, 2 insertions(+), 2 deletions(-)
>
> Does this fix a build issue, or is it just cosmetic or do we end up with
> duplicate sections or something else?

This should be cosmetic -- basically we are trying to move all users
of the currently available __attribute__s in compiler_attributes.h to the
__attr forms. I am also (slowly) adding attributes that are already in
use but do not yet have an __attr form.
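
For reference, a small sketch of why the quotes are unnecessary when a
macro stringifies its argument, as the patch description says __section
does.  SKETCH_STRINGIFY and the helper names are made up for this
illustration, and ".data.demo" is just an example section name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: stringify the argument the way `#` does. */
#define SKETCH_STRINGIFY(S) #S

/* Bare argument: the result is the section name itself. */
static const char *bare_name(void)
{
	return SKETCH_STRINGIFY(.data.demo);
}

/* Quoted argument: `#` escapes the quotes, so they become part of the name. */
static const char *quoted_name(void)
{
	return SKETCH_STRINGIFY(".data.demo");
}

/* Number of extra characters the quoted form carries. */
static size_t extra_chars(void)
{
	size_t bare = strlen(bare_name());
	size_t quoted = strlen(quoted_name());

	assert(quoted >= bare);
	return quoted - bare;
}
```

The quoted form ends up with the two quote characters embedded in the
string, which is the name a compiler sees if it does not unescape it.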

> Happy to route it via arm64, just having trouble working out whether it's
> 5.3 material!

As you prefer! Those that are not taken by a maintainer I will pick up
and send via compiler-attributes.

I would go for 5.4, since there is no particular rush anyway.

Cheers,
Miguel

^ permalink raw reply

* Re: libbpf distro packaging
From: Jiri Olsa @ 2019-08-13 12:24 UTC (permalink / raw)
  To: Julia Kartseva
  Cc: labbott@redhat.com, acme@kernel.org,
	debian-kernel@lists.debian.org, netdev@vger.kernel.org,
	Andrii Nakryiko, Andrey Ignatov, Alexei Starovoitov,
	Yonghong Song, jolsa@kernel.org
In-Reply-To: <3FBEC3F8-5C3C-40F9-AF6E-C355D8F62722@fb.com>

On Mon, Aug 12, 2019 at 07:04:12PM +0000, Julia Kartseva wrote:
> I would like to bring up libbpf publishing discussion started at [1].
> The present state of things is that libbpf is built from kernel tree, e.g. [2]
> For Debian and [3] for Fedora whereas the better way would be having a
> package built from github mirror. The advantages of the latter:
> - Consistent, ABI matching versioning across distros
> - The mirror has integration tests
> - No need in kernel tree to build a package
> - Changes can be merged directly to github w/o waiting them to be merged
> through bpf-next -> net-next -> main
> There is a PR introducing a libbpf.spec which can be used as a starting point: [4]
> Any comments regarding the spec itself can be posted there.
> In the future it may be used as a source of truth.
> Please consider switching libbpf packaging to the github mirror instead
> of the kernel tree.
> Thanks
> 
> [1] https://lists.iovisor.org/g/iovisor-dev/message/1521
> [2] https://packages.debian.org/sid/libbpf4.19
> [3] http://rpmfind.net/linux/RPM/fedora/devel/rawhide/x86_64/l/libbpf-5.3.0-0.rc2.git0.1.fc31.x86_64.html
> [4] https://github.com/libbpf/libbpf/pull/64

hi,
Fedora ships libbpf as a kernel-tools subpackage, so I think
we'd need to create a new package and deprecate the current one.

But I like the ABI stability that comes with using the github mirror.
How does the sync (in both directions) with the kernel sources actually work?

thanks,
jirka

^ permalink raw reply

* Re: [PATCH 00/16] treewide: prefer __section from compiler_attributes.h
From: Miguel Ojeda @ 2019-08-13 12:18 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: Andrew Morton, Sedat Dilek, Josh Poimboeuf, yhs,
	clang-built-linux, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Network Development, bpf
In-Reply-To: <20190812215052.71840-17-ndesaulniers@google.com>

On Mon, Aug 12, 2019 at 11:53 PM Nick Desaulniers
<ndesaulniers@google.com> wrote:
>
> GCC unescapes escaped string section names while Clang does not. Because
> __section uses the `#` stringification operator for the section name, it
> doesn't need to be escaped.

Thanks a lot Nick, this takes a weight off my mind. One __attribute__
less to go.

I guess I can take the series myself as long as I get Acks, since the
changes to other parts of the kernel are not that big; and anyway I
plan to do other attributes over time.

Cheers,
Miguel

^ permalink raw reply

* [RFC PATCH bpf-next 14/14] bpf, hashtab: Compare keys in long
From: Toshiaki Makita @ 2019-08-13 12:05 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko
  Cc: Toshiaki Makita, netdev, bpf, William Tu
In-Reply-To: <20190813120558.6151-1-toshiaki.makita1@gmail.com>

memcmp() is generally slow. Compare keys one long at a time when possible.
This improves xdp_flow performance.
This is included in this series just to demonstrate to what extent
xdp_flow performance can increase.

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
---
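A minimal userspace sketch of the comparison strategy, for experimenting
outside the kernel.  sketch_key_equal() is a hypothetical name; unlike the
patch, which guarantees the alignment of key1 at build time, the sketch
checks both pointers at run time.  The word-wise loop assumes the keys may
be read as unsigned long, which the kernel gets from -fno-strict-aliasing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static bool sketch_key_equal(const void *key1, const void *key2, size_t size)
{
	assert(key1 != NULL && key2 != NULL);

	/* Fast path: both pointers and the size are multiples of sizeof(long). */
	if ((((uintptr_t)key1 | (uintptr_t)key2 | size) % sizeof(long)) == 0) {
		const unsigned long *l1 = key1, *l2 = key2;

		for (; size > 0; l1++, l2++, size -= sizeof(long)) {
			if (*l1 != *l2)
				return false;
		}

		return true;
	}

	/* Fallback for unaligned pointers or odd sizes. */
	return !memcmp(key1, key2, size);
}
```
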
 kernel/bpf/hashtab.c | 27 +++++++++++++++++++++++++--
 1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 22066a6..8b5ffd4 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -417,6 +417,29 @@ static inline struct hlist_nulls_head *select_bucket(struct bpf_htab *htab, u32
 	return &__select_bucket(htab, hash)->head;
 }
 
+/* key1 must be aligned to sizeof long */
+static bool key_equal(void *key1, void *key2, u32 size)
+{
+	/* Check for key1 */
+	BUILD_BUG_ON(!IS_ALIGNED(offsetof(struct htab_elem, key),
+				 sizeof(long)));
+
+	if (IS_ALIGNED((unsigned long)key2 | (unsigned long)size,
+		       sizeof(long))) {
+		unsigned long *lkey1, *lkey2;
+
+		for (lkey1 = key1, lkey2 = key2; size > 0;
+		     lkey1++, lkey2++, size -= sizeof(long)) {
+			if (*lkey1 != *lkey2)
+				return false;
+		}
+
+		return true;
+	}
+
+	return !memcmp(key1, key2, size);
+}
+
 /* this lookup function can only be called with bucket lock taken */
 static struct htab_elem *lookup_elem_raw(struct hlist_nulls_head *head, u32 hash,
 					 void *key, u32 key_size)
@@ -425,7 +448,7 @@ static struct htab_elem *lookup_elem_raw(struct hlist_nulls_head *head, u32 hash
 	struct htab_elem *l;
 
 	hlist_nulls_for_each_entry_rcu(l, n, head, hash_node)
-		if (l->hash == hash && !memcmp(&l->key, key, key_size))
+		if (l->hash == hash && key_equal(&l->key, key, key_size))
 			return l;
 
 	return NULL;
@@ -444,7 +467,7 @@ static struct htab_elem *lookup_nulls_elem_raw(struct hlist_nulls_head *head,
 
 again:
 	hlist_nulls_for_each_entry_rcu(l, n, head, hash_node)
-		if (l->hash == hash && !memcmp(&l->key, key, key_size))
+		if (l->hash == hash && key_equal(&l->key, key, key_size))
 			return l;
 
 	if (unlikely(get_nulls_value(n) != (hash & (n_buckets - 1))))
-- 
1.8.3.1


^ permalink raw reply related

* [RFC PATCH bpf-next 13/14] i40e: prefetch xdp->data before running XDP prog
From: Toshiaki Makita @ 2019-08-13 12:05 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko
  Cc: Toshiaki Makita, netdev, bpf, William Tu
In-Reply-To: <20190813120558.6151-1-toshiaki.makita1@gmail.com>

XDP progs are likely to read/write xdp->data.
This improves the performance of xdp_flow.
This is included in this series just to demonstrate to what extent
xdp_flow performance can increase.

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index f162252..ea775ae 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2207,6 +2207,7 @@ static struct sk_buff *i40e_run_xdp(struct i40e_ring *rx_ring,
 	if (!xdp_prog)
 		goto xdp_out;
 
+	prefetchw(xdp->data);
 	prefetchw(xdp->data_hard_start); /* xdp_frame write */
 
 	act = bpf_prog_run_xdp(xdp_prog, xdp);
-- 
1.8.3.1


^ permalink raw reply related

* [RFC PATCH bpf-next 12/14] bpf, selftest: Add test for xdp_flow
From: Toshiaki Makita @ 2019-08-13 12:05 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko
  Cc: Toshiaki Makita, netdev, bpf, William Tu
In-Reply-To: <20190813120558.6151-1-toshiaki.makita1@gmail.com>

Check if TC flower offloading to XDP works.

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
---
 tools/testing/selftests/bpf/Makefile         |   1 +
 tools/testing/selftests/bpf/test_xdp_flow.sh | 103 +++++++++++++++++++++++++++
 2 files changed, 104 insertions(+)
 create mode 100755 tools/testing/selftests/bpf/test_xdp_flow.sh

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 3bd0f4a..886702a 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -50,6 +50,7 @@ TEST_PROGS := test_kmod.sh \
 	test_xdp_redirect.sh \
 	test_xdp_meta.sh \
 	test_xdp_veth.sh \
+	test_xdp_flow.sh \
 	test_offload.py \
 	test_sock_addr.sh \
 	test_tunnel.sh \
diff --git a/tools/testing/selftests/bpf/test_xdp_flow.sh b/tools/testing/selftests/bpf/test_xdp_flow.sh
new file mode 100755
index 0000000..cb06f3e
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_xdp_flow.sh
@@ -0,0 +1,103 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+#
+# Create 2 namespaces with 2 veth peers, and
+# forward packets in-between using xdp_flow
+#
+# NS1(veth11)        NS2(veth22)
+#      |                  |
+#      |                  |
+#   (veth1)            (veth2)
+#      ^                  ^
+#      |     xdp_flow     |
+#      --------------------
+
+# Kselftest framework requirement - SKIP code is 4.
+ksft_skip=4
+
+TESTNAME=xdp_flow
+
+_cleanup()
+{
+	set +e
+	ip link del veth1 2> /dev/null
+	ip link del veth2 2> /dev/null
+	ip netns del ns1 2> /dev/null
+	ip netns del ns2 2> /dev/null
+}
+
+cleanup_skip()
+{
+	echo "selftests: $TESTNAME [SKIP]"
+	_cleanup
+
+	exit $ksft_skip
+}
+
+cleanup()
+{
+	if [ "$?" = 0 ]; then
+		echo "selftests: $TESTNAME [PASS]"
+	else
+		echo "selftests: $TESTNAME [FAILED]"
+	fi
+	_cleanup
+}
+
+if [ $(id -u) -ne 0 ]; then
+	echo "selftests: $TESTNAME [SKIP] Need root privileges"
+	exit $ksft_skip
+fi
+
+if ! ip link set dev lo xdp off > /dev/null 2>&1; then
+	echo "selftests: $TESTNAME [SKIP] Could not run test without the ip xdp support"
+	exit $ksft_skip
+fi
+
+set -e
+
+trap cleanup_skip EXIT
+
+ip netns add ns1
+ip netns add ns2
+
+ip link add veth1 type veth peer name veth11 netns ns1
+ip link add veth2 type veth peer name veth22 netns ns2
+
+ip link set veth1 up
+ip link set veth2 up
+
+ip -n ns1 addr add 10.1.1.11/24 dev veth11
+ip -n ns2 addr add 10.1.1.22/24 dev veth22
+
+ip -n ns1 link set dev veth11 up
+ip -n ns2 link set dev veth22 up
+
+ip -n ns1 link set dev veth11 xdp obj xdp_dummy.o sec xdp_dummy
+ip -n ns2 link set dev veth22 xdp obj xdp_dummy.o sec xdp_dummy
+
+ethtool -K veth1 tc-offload-xdp on
+ethtool -K veth2 tc-offload-xdp on
+
+trap cleanup EXIT
+
+# Adding clsact or ingress will trigger loading bpf prog in UMH
+tc qdisc add dev veth1 clsact
+tc qdisc add dev veth2 clsact
+
+# Adding filter will have UMH populate flow table map
+# 'skip_sw' can be accepted only when 'tc-offload-xdp' is enabled on veth
+tc filter add dev veth1 ingress protocol ip flower skip_sw \
+	dst_ip 10.1.1.0/24 action mirred egress redirect dev veth2
+tc filter add dev veth2 ingress protocol ip flower skip_sw \
+	dst_ip 10.1.1.0/24 action mirred egress redirect dev veth1
+
+# ARP is not supported so don't add 'skip_sw'
+tc filter add dev veth1 ingress protocol arp flower \
+	arp_tip 10.1.1.0/24 action mirred egress redirect dev veth2
+tc filter add dev veth2 ingress protocol arp flower \
+	arp_sip 10.1.1.0/24 action mirred egress redirect dev veth1
+
+ip netns exec ns1 ping -c 1 -W 1 10.1.1.22
+
+exit 0
-- 
1.8.3.1


^ permalink raw reply related

* [RFC PATCH bpf-next 11/14] xdp_flow: Implement vlan_push action
From: Toshiaki Makita @ 2019-08-13 12:05 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko
  Cc: Toshiaki Makita, netdev, bpf, William Tu
In-Reply-To: <20190813120558.6151-1-toshiaki.makita1@gmail.com>

This is another example action.

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
---
 net/xdp_flow/xdp_flow_kern_bpf.c | 23 +++++++++++++++++++++--
 net/xdp_flow/xdp_flow_kern_mod.c |  5 +++++
 2 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/net/xdp_flow/xdp_flow_kern_bpf.c b/net/xdp_flow/xdp_flow_kern_bpf.c
index 8f3d359..51e181b 100644
--- a/net/xdp_flow/xdp_flow_kern_bpf.c
+++ b/net/xdp_flow/xdp_flow_kern_bpf.c
@@ -90,10 +90,29 @@ static inline int action_redirect(struct xdp_flow_action *action)
 static inline int action_vlan_push(struct xdp_md *ctx,
 				   struct xdp_flow_action *action)
 {
+	struct vlan_ethhdr *vehdr;
+	void *data, *data_end;
+	__be16 proto, tci;
+
 	account_action(XDP_FLOW_ACTION_VLAN_PUSH);
 
-	// TODO: implement this
-	return XDP_ABORTED;
+	proto = action->vlan.proto;
+	tci = action->vlan.tci;
+
+	if (bpf_xdp_adjust_head(ctx, -VLAN_HLEN))
+		return XDP_DROP;
+
+	data_end = (void *)(long)ctx->data_end;
+	data = (void *)(long)ctx->data;
+	if (data + VLAN_ETH_HLEN > data_end)
+		return XDP_DROP;
+
+	__builtin_memmove(data, data + VLAN_HLEN, ETH_ALEN * 2);
+	vehdr = data;
+	vehdr->h_vlan_proto = proto;
+	vehdr->h_vlan_TCI = tci;
+
+	return _XDP_CONTINUE;
 }
 
 static inline int action_vlan_pop(struct xdp_md *ctx,
diff --git a/net/xdp_flow/xdp_flow_kern_mod.c b/net/xdp_flow/xdp_flow_kern_mod.c
index caa4968..52dc64e 100644
--- a/net/xdp_flow/xdp_flow_kern_mod.c
+++ b/net/xdp_flow/xdp_flow_kern_mod.c
@@ -55,6 +55,11 @@ static int xdp_flow_parse_actions(struct xdp_flow_actions *actions,
 			action->ifindex = act->dev->ifindex;
 			break;
 		case FLOW_ACTION_VLAN_PUSH:
+			action->id = XDP_FLOW_ACTION_VLAN_PUSH;
+			action->vlan.tci = act->vlan.vid |
+					   (act->vlan.prio << VLAN_PRIO_SHIFT);
+			action->vlan.proto = act->vlan.proto;
+			break;
 		case FLOW_ACTION_VLAN_POP:
 		case FLOW_ACTION_VLAN_MANGLE:
 		case FLOW_ACTION_MANGLE:
-- 
1.8.3.1


^ permalink raw reply related

* [RFC PATCH bpf-next 10/14] xdp_flow: Implement redirect action
From: Toshiaki Makita @ 2019-08-13 12:05 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko
  Cc: Toshiaki Makita, netdev, bpf, William Tu
In-Reply-To: <20190813120558.6151-1-toshiaki.makita1@gmail.com>

Add a devmap for XDP_REDIRECT and use it for redirect action.

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
---
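A minimal userspace sketch of the round-robin index allocation that
get_new_devmap_idx() performs in this patch.  The sketch_* names are
hypothetical, a plain boolean array replaces the hashtable used in the
UMH, and the table is shrunk to 8 slots so the wrap-around is easy to
exercise.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define SKETCH_MAX_PORTS 8	/* the patch uses MAX_PORTS = 65536 */

static bool sketch_used[SKETCH_MAX_PORTS];
static int sketch_next_idx;

/* Return the next free index, scanning at most one full round. */
static int sketch_get_new_idx(void)
{
	int offset;

	for (offset = 0; offset < SKETCH_MAX_PORTS; offset++) {
		int idx = sketch_next_idx++;

		if (sketch_next_idx >= SKETCH_MAX_PORTS)
			sketch_next_idx -= SKETCH_MAX_PORTS;

		if (!sketch_used[idx]) {
			sketch_used[idx] = true;
			return idx;
		}
	}

	return -ENOSPC;
}

/* Release an index so that a later allocation can reuse it. */
static void sketch_put_idx(int idx)
{
	assert(idx >= 0 && idx < SKETCH_MAX_PORTS);
	sketch_used[idx] = false;
}
```
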
 net/xdp_flow/umh_bpf.h           |   1 +
 net/xdp_flow/xdp_flow_kern_bpf.c |  14 +++-
 net/xdp_flow/xdp_flow_kern_mod.c |   3 +
 net/xdp_flow/xdp_flow_umh.c      | 164 +++++++++++++++++++++++++++++++++++++--
 4 files changed, 175 insertions(+), 7 deletions(-)

diff --git a/net/xdp_flow/umh_bpf.h b/net/xdp_flow/umh_bpf.h
index 4e4633f..a279d0a1 100644
--- a/net/xdp_flow/umh_bpf.h
+++ b/net/xdp_flow/umh_bpf.h
@@ -4,6 +4,7 @@
 
 #include "msgfmt.h"
 
+#define MAX_PORTS 65536
 #define MAX_FLOWS 1024
 #define MAX_FLOW_MASKS 255
 #define FLOW_MASKS_TAIL 255
diff --git a/net/xdp_flow/xdp_flow_kern_bpf.c b/net/xdp_flow/xdp_flow_kern_bpf.c
index ceb8a92..8f3d359 100644
--- a/net/xdp_flow/xdp_flow_kern_bpf.c
+++ b/net/xdp_flow/xdp_flow_kern_bpf.c
@@ -22,6 +22,13 @@ struct bpf_map_def SEC("maps") debug_stats = {
 	.max_entries = 256,
 };
 
+struct bpf_map_def SEC("maps") output_map = {
+	.type = BPF_MAP_TYPE_DEVMAP,
+	.key_size = sizeof(int),
+	.value_size = sizeof(int),
+	.max_entries = MAX_PORTS,
+};
+
 struct bpf_map_def SEC("maps") flow_masks_head = {
 	.type = BPF_MAP_TYPE_ARRAY,
 	.key_size = sizeof(u32),
@@ -71,10 +78,13 @@ static inline int action_drop(void)
 
 static inline int action_redirect(struct xdp_flow_action *action)
 {
+	int tx_port;
+
 	account_action(XDP_FLOW_ACTION_REDIRECT);
 
-	// TODO: implement this
-	return XDP_ABORTED;
+	tx_port = action->ifindex;
+
+	return bpf_redirect_map(&output_map, tx_port, 0);
 }
 
 static inline int action_vlan_push(struct xdp_md *ctx,
diff --git a/net/xdp_flow/xdp_flow_kern_mod.c b/net/xdp_flow/xdp_flow_kern_mod.c
index 891b18c..caa4968 100644
--- a/net/xdp_flow/xdp_flow_kern_mod.c
+++ b/net/xdp_flow/xdp_flow_kern_mod.c
@@ -51,6 +51,9 @@ static int xdp_flow_parse_actions(struct xdp_flow_actions *actions,
 			action->id = XDP_FLOW_ACTION_DROP;
 			break;
 		case FLOW_ACTION_REDIRECT:
+			action->id = XDP_FLOW_ACTION_REDIRECT;
+			action->ifindex = act->dev->ifindex;
+			break;
 		case FLOW_ACTION_VLAN_PUSH:
 		case FLOW_ACTION_VLAN_POP:
 		case FLOW_ACTION_VLAN_MANGLE:
diff --git a/net/xdp_flow/xdp_flow_umh.c b/net/xdp_flow/xdp_flow_umh.c
index 9a4769b..cbb766a 100644
--- a/net/xdp_flow/xdp_flow_umh.c
+++ b/net/xdp_flow/xdp_flow_umh.c
@@ -18,6 +18,7 @@
 extern char xdp_flow_bpf_start;
 extern char xdp_flow_bpf_end;
 int progfile_fd;
+int output_map_fd;
 
 #define zalloc(size) calloc(1, (size))
 
@@ -40,12 +41,22 @@ struct netdev_info {
 	struct netdev_info_key key;
 	struct hlist_node node;
 	struct bpf_object *obj;
+	int devmap_idx;
 	int free_slot_top;
 	int free_slots[MAX_FLOW_MASKS];
 };
 
 DEFINE_HASHTABLE(netdev_info_table, 16);
 
+struct devmap_idx_node {
+	int devmap_idx;
+	struct hlist_node node;
+};
+
+DEFINE_HASHTABLE(devmap_idx_table, 16);
+
+int max_devmap_idx;
+
 static int libbpf_err(int err, char *errbuf)
 {
 	libbpf_strerror(err, errbuf, ERRBUF_SIZE);
@@ -90,6 +101,15 @@ static int setup(void)
 		goto err;
 	}
 
+	output_map_fd = bpf_create_map(BPF_MAP_TYPE_DEVMAP, sizeof(int),
+				       sizeof(int), MAX_PORTS, 0);
+	if (output_map_fd < 0) {
+		err = -errno;
+		pr_err("map creation for output_map failed: %s\n",
+		       strerror(errno));
+		goto err;
+	}
+
 	return 0;
 err:
 	close(progfile_fd);
@@ -97,10 +117,23 @@ static int setup(void)
 	return err;
 }
 
-static int load_bpf(int ifindex, struct bpf_object **objp)
+static void delete_output_map_elem(int idx)
+{
+	char errbuf[ERRBUF_SIZE];
+	int err;
+
+	err = bpf_map_delete_elem(output_map_fd, &idx);
+	if (err) {
+		libbpf_err(err, errbuf);
+		pr_warn("Failed to delete idx %d from output_map: %s\n",
+			idx, errbuf);
+	}
+}
+
+static int load_bpf(int ifindex, int devmap_idx, struct bpf_object **objp)
 {
 	int prog_fd, flow_tables_fd, flow_meta_fd, flow_masks_head_fd, err;
-	struct bpf_map *flow_tables, *flow_masks_head;
+	struct bpf_map *output_map, *flow_tables, *flow_masks_head;
 	int zero = 0, flow_masks_tail = FLOW_MASKS_TAIL;
 	struct bpf_object_open_attr attr = {};
 	char path[256], errbuf[ERRBUF_SIZE];
@@ -133,6 +166,27 @@ static int load_bpf(int ifindex, struct bpf_object **objp)
 	bpf_object__for_each_program(prog, obj)
 		bpf_program__set_type(prog, attr.prog_type);
 
+	output_map = bpf_object__find_map_by_name(obj, "output_map");
+	if (!output_map) {
+		pr_err("Cannot find output_map\n");
+		err = -ENOENT;
+		goto err_obj;
+	}
+
+	err = bpf_map__reuse_fd(output_map, output_map_fd);
+	if (err) {
+		err = libbpf_err(err, errbuf);
+		pr_err("Failed to reuse output_map fd: %s\n", errbuf);
+		goto err_obj;
+	}
+
+	if (bpf_map_update_elem(output_map_fd, &devmap_idx, &ifindex, 0)) {
+		err = -errno;
+		pr_err("Failed to insert idx %d if %d into output_map: %s\n",
+		       devmap_idx, ifindex, strerror(errno));
+		goto err_obj;
+	}
+
 	flow_meta_fd = bpf_create_map(BPF_MAP_TYPE_HASH,
 				      sizeof(struct xdp_flow_key),
 				      sizeof(struct xdp_flow_actions),
@@ -222,6 +276,8 @@ static int load_bpf(int ifindex, struct bpf_object **objp)
 
 	return prog_fd;
 err:
+	delete_output_map_elem(devmap_idx);
+err_obj:
 	bpf_object__close(obj);
 	return err;
 }
@@ -272,6 +328,56 @@ static struct netdev_info *get_netdev_info(const struct mbox_request *req)
 	return netdev_info;
 }
 
+static struct devmap_idx_node *find_devmap_idx(int devmap_idx)
+{
+	struct devmap_idx_node *node;
+
+	hash_for_each_possible(devmap_idx_table, node, node, devmap_idx) {
+		if (node->devmap_idx == devmap_idx)
+			return node;
+	}
+
+	return NULL;
+}
+
+static int get_new_devmap_idx(void)
+{
+	int offset;
+
+	for (offset = 0; offset < MAX_PORTS; offset++) {
+		int devmap_idx = max_devmap_idx++;
+
+		if (max_devmap_idx >= MAX_PORTS)
+			max_devmap_idx -= MAX_PORTS;
+
+		if (!find_devmap_idx(devmap_idx)) {
+			struct devmap_idx_node *node;
+
+			node = malloc(sizeof(*node));
+			if (!node) {
+				pr_err("malloc for devmap_idx failed\n");
+				return -ENOMEM;
+			}
+			node->devmap_idx = devmap_idx;
+			hash_add(devmap_idx_table, &node->node, devmap_idx);
+
+			return devmap_idx;
+		}
+	}
+
+	return -ENOSPC;
+}
+
+static void delete_devmap_idx(int devmap_idx)
+{
+	struct devmap_idx_node *node = find_devmap_idx(devmap_idx);
+
+	if (node) {
+		hash_del(&node->node);
+		free(node);
+	}
+}
+
 static void init_flow_masks_free_slot(struct netdev_info *netdev_info)
 {
 	int i;
@@ -325,11 +431,11 @@ static void delete_flow_masks_free_slot(struct netdev_info *netdev_info,
 
 static int handle_load(const struct mbox_request *req, __u32 *prog_id)
 {
+	int err, prog_fd, devmap_idx = -1;
 	struct netdev_info *netdev_info;
 	struct bpf_prog_info info = {};
 	struct netdev_info_key key;
 	__u32 len = sizeof(info);
-	int err, prog_fd;
 
 	err = get_netdev_info_key(req, &key);
 	if (err)
@@ -346,12 +452,19 @@ static int handle_load(const struct mbox_request *req, __u32 *prog_id)
 	}
 	netdev_info->key.ifindex = key.ifindex;
 
+	devmap_idx = get_new_devmap_idx();
+	if (devmap_idx < 0) {
+		err = devmap_idx;
+		goto err_netdev_info;
+	}
+	netdev_info->devmap_idx = devmap_idx;
+
 	init_flow_masks_free_slot(netdev_info);
 
-	prog_fd = load_bpf(req->ifindex, &netdev_info->obj);
+	prog_fd = load_bpf(req->ifindex, devmap_idx, &netdev_info->obj);
 	if (prog_fd < 0) {
 		err = prog_fd;
-		goto err_netdev_info;
+		goto err_devmap_idx;
 	}
 
 	err = bpf_obj_get_info_by_fd(prog_fd, &info, &len);
@@ -366,6 +479,8 @@ static int handle_load(const struct mbox_request *req, __u32 *prog_id)
 	return 0;
 err_obj:
 	bpf_object__close(netdev_info->obj);
+err_devmap_idx:
+	delete_devmap_idx(devmap_idx);
 err_netdev_info:
 	free(netdev_info);
 
@@ -382,12 +497,45 @@ static int handle_unload(const struct mbox_request *req)
 
 	hash_del(&netdev_info->node);
 	bpf_object__close(netdev_info->obj);
+	delete_output_map_elem(netdev_info->devmap_idx);
+	delete_devmap_idx(netdev_info->devmap_idx);
 	free(netdev_info);
 	pr_debug("XDP program for if %d was closed\n", req->ifindex);
 
 	return 0;
 }
 
+static int convert_ifindex_to_devmap_idx(struct mbox_request *req)
+{
+	int i;
+
+	for (i = 0; i < req->flow.actions.num_actions; i++) {
+		struct xdp_flow_action *action = &req->flow.actions.actions[i];
+
+		if (action->id == XDP_FLOW_ACTION_REDIRECT) {
+			struct netdev_info *netdev_info;
+			struct netdev_info_key key;
+			int err;
+
+			err = get_netdev_info_key(req, &key);
+			if (err)
+				return err;
+			key.ifindex = action->ifindex;
+
+			netdev_info = find_netdev_info(&key);
+			if (!netdev_info) {
+				pr_err("Cannot redirect to ifindex %d. Please setup xdp_flow on ifindex %d in advance.\n",
+				       key.ifindex, key.ifindex);
+				return -ENOENT;
+			}
+
+			action->ifindex = netdev_info->devmap_idx;
+		}
+	}
+
+	return 0;
+}
+
 static int get_table_fd(const struct netdev_info *netdev_info,
 			const char *table_name)
 {
@@ -784,6 +932,11 @@ static int handle_replace(struct mbox_request *req)
 	if (IS_ERR(netdev_info))
 		return PTR_ERR(netdev_info);
 
+	/* TODO: Use XDP_TX for redirect action when possible */
+	err = convert_ifindex_to_devmap_idx(req);
+	if (err)
+		return err;
+
 	err = flow_table_insert_elem(netdev_info, &req->flow);
 	if (err)
 		return err;
@@ -875,6 +1028,7 @@ int main(void)
 		return -1;
 	loop();
 	close(progfile_fd);
+	close(output_map_fd);
 
 	return 0;
 }
-- 
1.8.3.1


^ permalink raw reply related

* [RFC PATCH bpf-next 09/14] xdp_flow: Add netdev feature for enabling TC flower offload to XDP
From: Toshiaki Makita @ 2019-08-13 12:05 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko
  Cc: Toshiaki Makita, netdev, bpf, William Tu
In-Reply-To: <20190813120558.6151-1-toshiaki.makita1@gmail.com>

The usage would be like this:

 $ ethtool -K eth0 tc-offload-xdp on
 $ tc qdisc add dev eth0 clsact
 $ tc filter add dev eth0 ingress protocol ip flower skip_sw ...

Then the filters offloaded to XDP are marked as "in_hw".

If the tc flow block is created when tc-offload-xdp is enabled on the
device, the block is internally marked as xdp and can only be offloaded
to XDP.
The reason not to allow HW-offload and XDP-offload at the same time is
to avoid the situation where offloading to only one of them succeeds.
If we allow offloading to both, users cannot know which offload
succeeded.

NOTE: This makes flows offloaded to XDP look as if they are HW
offloaded, since they will be marked as "in_hw". This could be confusing.
Maybe we can add another status "in_xdp"? Then we could allow both HW-
and XDP-offload at the same time.

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
---
 include/linux/netdev_features.h  |  2 ++
 include/net/pkt_cls.h            |  5 +++
 include/net/sch_generic.h        |  1 +
 net/core/dev.c                   |  2 ++
 net/core/ethtool.c               |  1 +
 net/sched/cls_api.c              | 67 +++++++++++++++++++++++++++++++++++++---
 net/xdp_flow/xdp_flow_kern_mod.c |  6 ++++
 7 files changed, 80 insertions(+), 4 deletions(-)

diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index 4b19c54..ddd201e 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -80,6 +80,7 @@ enum {
 
 	NETIF_F_GRO_HW_BIT,		/* Hardware Generic receive offload */
 	NETIF_F_HW_TLS_RECORD_BIT,	/* Offload TLS record */
+	NETIF_F_XDP_TC_BIT,		/* Offload TC to XDP */
 
 	/*
 	 * Add your fresh new feature above and remember to update
@@ -150,6 +151,7 @@ enum {
 #define NETIF_F_GSO_UDP_L4	__NETIF_F(GSO_UDP_L4)
 #define NETIF_F_HW_TLS_TX	__NETIF_F(HW_TLS_TX)
 #define NETIF_F_HW_TLS_RX	__NETIF_F(HW_TLS_RX)
+#define NETIF_F_XDP_TC		__NETIF_F(XDP_TC)
 
 /* Finds the next feature with the highest number of the range of start till 0.
  */
diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index e429809..d190aae 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -610,6 +610,11 @@ static inline bool tc_can_offload_extack(const struct net_device *dev,
 	return true;
 }
 
+static inline bool tc_xdp_offload_enabled(const struct net_device *dev)
+{
+	return dev->features & NETIF_F_XDP_TC;
+}
+
 static inline bool tc_skip_hw(u32 flags)
 {
 	return (flags & TCA_CLS_FLAGS_SKIP_HW) ? true : false;
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 6b6b012..a4d90b5 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -402,6 +402,7 @@ struct tcf_block {
 	struct flow_block flow_block;
 	struct list_head owner_list;
 	bool keep_dst;
+	bool xdp;
 	unsigned int offloadcnt; /* Number of oddloaded filters */
 	unsigned int nooffloaddevcnt; /* Number of devs unable to do offload */
 	struct {
diff --git a/net/core/dev.c b/net/core/dev.c
index a45d2e4..d1f980d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -8680,6 +8680,8 @@ int register_netdevice(struct net_device *dev)
 	 * software offloads (GSO and GRO).
 	 */
 	dev->hw_features |= NETIF_F_SOFT_FEATURES;
+	if (IS_ENABLED(CONFIG_XDP_FLOW))
+		dev->hw_features |= NETIF_F_XDP_TC;
 	dev->features |= NETIF_F_SOFT_FEATURES;
 
 	if (dev->netdev_ops->ndo_udp_tunnel_add) {
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 6288e69..c7e61cf 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -111,6 +111,7 @@ int ethtool_op_get_ts_info(struct net_device *dev, struct ethtool_ts_info *info)
 	[NETIF_F_HW_TLS_RECORD_BIT] =	"tls-hw-record",
 	[NETIF_F_HW_TLS_TX_BIT] =	 "tls-hw-tx-offload",
 	[NETIF_F_HW_TLS_RX_BIT] =	 "tls-hw-rx-offload",
+	[NETIF_F_XDP_TC_BIT] =		 "tc-offload-xdp",
 };
 
 static const char
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 3565d9a..4c89bab 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -37,6 +37,7 @@
 #include <net/tc_act/tc_skbedit.h>
 #include <net/tc_act/tc_ct.h>
 #include <net/tc_act/tc_mpls.h>
+#include <net/flow_offload_xdp.h>
 
 extern const struct nla_policy rtm_tca_policy[TCA_MAX + 1];
 
@@ -806,7 +807,7 @@ static int tcf_block_offload_cmd(struct tcf_block *block,
 				 struct net_device *dev,
 				 struct tcf_block_ext_info *ei,
 				 enum flow_block_command command,
-				 struct netlink_ext_ack *extack)
+				 bool xdp, struct netlink_ext_ack *extack)
 {
 	struct flow_block_offload bo = {};
 	int err;
@@ -819,13 +820,39 @@ static int tcf_block_offload_cmd(struct tcf_block *block,
 	bo.extack = extack;
 	INIT_LIST_HEAD(&bo.cb_list);
 
-	err = dev->netdev_ops->ndo_setup_tc(dev, TC_SETUP_BLOCK, &bo);
+	if (xdp)
+		err = xdp_flow_setup_block(dev, &bo);
+	else
+		err = dev->netdev_ops->ndo_setup_tc(dev, TC_SETUP_BLOCK, &bo);
 	if (err < 0)
 		return err;
 
 	return tcf_block_setup(block, &bo);
 }
 
+static int tcf_block_offload_bind_xdp(struct tcf_block *block, struct Qdisc *q,
+				      struct tcf_block_ext_info *ei,
+				      struct netlink_ext_ack *extack)
+{
+	struct net_device *dev = q->dev_queue->dev;
+	int err;
+
+	if (!tc_xdp_offload_enabled(dev) && tcf_block_offload_in_use(block)) {
+		NL_SET_ERR_MSG(extack,
+			       "Bind to offloaded block failed as dev has tc-offload-xdp disabled");
+		return -EOPNOTSUPP;
+	}
+
+	err = tcf_block_offload_cmd(block, dev, ei, FLOW_BLOCK_BIND, true,
+				    extack);
+	if (err == -EOPNOTSUPP) {
+		block->nooffloaddevcnt++;
+		err = 0;
+	}
+
+	return err;
+}
+
 static int tcf_block_offload_bind(struct tcf_block *block, struct Qdisc *q,
 				  struct tcf_block_ext_info *ei,
 				  struct netlink_ext_ack *extack)
@@ -833,6 +860,15 @@ static int tcf_block_offload_bind(struct tcf_block *block, struct Qdisc *q,
 	struct net_device *dev = q->dev_queue->dev;
 	int err;
 
+	if (block->xdp)
+		return tcf_block_offload_bind_xdp(block, q, ei, extack);
+
+	if (tc_xdp_offload_enabled(dev)) {
+		NL_SET_ERR_MSG(extack,
+			       "Cannot bind to block created with tc-offload-xdp disabled");
+		return -EOPNOTSUPP;
+	}
+
 	if (!dev->netdev_ops->ndo_setup_tc)
 		goto no_offload_dev_inc;
 
@@ -844,7 +880,8 @@ static int tcf_block_offload_bind(struct tcf_block *block, struct Qdisc *q,
 		return -EOPNOTSUPP;
 	}
 
-	err = tcf_block_offload_cmd(block, dev, ei, FLOW_BLOCK_BIND, extack);
+	err = tcf_block_offload_cmd(block, dev, ei, FLOW_BLOCK_BIND, false,
+				    extack);
 	if (err == -EOPNOTSUPP)
 		goto no_offload_dev_inc;
 	if (err)
@@ -861,17 +898,35 @@ static int tcf_block_offload_bind(struct tcf_block *block, struct Qdisc *q,
 	return 0;
 }
 
+static void tcf_block_offload_unbind_xdp(struct tcf_block *block,
+					 struct net_device *dev,
+					 struct tcf_block_ext_info *ei)
+{
+	int err;
+
+	err = tcf_block_offload_cmd(block, dev, ei, FLOW_BLOCK_UNBIND, true,
+				    NULL);
+	if (err == -EOPNOTSUPP)
+		WARN_ON(block->nooffloaddevcnt-- == 0);
+}
+
 static void tcf_block_offload_unbind(struct tcf_block *block, struct Qdisc *q,
 				     struct tcf_block_ext_info *ei)
 {
 	struct net_device *dev = q->dev_queue->dev;
 	int err;
 
+	if (block->xdp) {
+		tcf_block_offload_unbind_xdp(block, dev, ei);
+		return;
+	}
+
 	tc_indr_block_call(block, dev, ei, FLOW_BLOCK_UNBIND, NULL);
 
 	if (!dev->netdev_ops->ndo_setup_tc)
 		goto no_offload_dev_dec;
-	err = tcf_block_offload_cmd(block, dev, ei, FLOW_BLOCK_UNBIND, NULL);
+	err = tcf_block_offload_cmd(block, dev, ei, FLOW_BLOCK_UNBIND, false,
+				    NULL);
 	if (err == -EOPNOTSUPP)
 		goto no_offload_dev_dec;
 	return;
@@ -1004,6 +1059,10 @@ static struct tcf_block *tcf_block_create(struct net *net, struct Qdisc *q,
 	/* Don't store q pointer for blocks which are shared */
 	if (!tcf_block_shared(block))
 		block->q = q;
+
+	if (tc_xdp_offload_enabled(q->dev_queue->dev))
+		block->xdp = true;
+
 	return block;
 }
 
diff --git a/net/xdp_flow/xdp_flow_kern_mod.c b/net/xdp_flow/xdp_flow_kern_mod.c
index fe925db..891b18c 100644
--- a/net/xdp_flow/xdp_flow_kern_mod.c
+++ b/net/xdp_flow/xdp_flow_kern_mod.c
@@ -410,6 +410,12 @@ static int xdp_flow_setup_block_cb(enum tc_setup_type type, void *type_data,
 	struct net_device *dev = cb_priv;
 	int err = 0;
 
+	if (!tc_xdp_offload_enabled(dev)) {
+		NL_SET_ERR_MSG(common->extack,
+			       "tc-offload-xdp is disabled on net device");
+		return -EOPNOTSUPP;
+	}
+
 	if (common->chain_index) {
 		NL_SET_ERR_MSG(common->extack,
 			       "xdp_flow supports only offload of chain 0");
-- 
1.8.3.1



* [RFC PATCH bpf-next 08/14] xdp_flow: Implement flow replacement/deletion logic in xdp_flow kmod
From: Toshiaki Makita @ 2019-08-13 12:05 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko
  Cc: Toshiaki Makita, netdev, bpf, William Tu
In-Reply-To: <20190813120558.6151-1-toshiaki.makita1@gmail.com>

struct flow_rule keeps the flow_dissector and the key/mask containers in
separate storage, so they must be serialized before they can be passed
to the UMH.

Convert flow_rule into the flow key form used by the xdp_flow bpf prog
and pass that to the UMH.

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
---
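Note (not part of the commit): the serialization described above amounts to
flattening separately stored match fields into one fixed-layout key, once for
the key and once for the mask. A minimal userspace sketch of the VLAN part
follows, assuming hypothetical names (sketch_vlan_match, sketch_vlan_key,
sketch_pack_tci) that do not appear in the patch; only the TCI packing mirrors
what xdp_flow_parse() does.

```c
#include <arpa/inet.h>  /* htons */
#include <assert.h>
#include <stdint.h>

#define SKETCH_VLAN_PRIO_SHIFT 13      /* same value as the kernel's VLAN_PRIO_SHIFT */
#define SKETCH_VLAN_VID_MASK   0x0fff

/* Hypothetical stand-in for the separately stored match fields. */
struct sketch_vlan_match {
	uint16_t vlan_id;
	uint8_t  vlan_priority;
	uint16_t vlan_tpid;     /* network byte order */
};

/* Hypothetical stand-in for the vlan part of the flat flow key. */
struct sketch_vlan_key {
	uint16_t tpid;          /* network byte order */
	uint16_t tci;           /* network byte order */
};

/* Fold the VLAN ID and priority into a single on-wire TCI value. */
static uint16_t sketch_pack_tci(uint16_t vlan_id, uint8_t vlan_priority)
{
	assert(vlan_id <= SKETCH_VLAN_VID_MASK && vlan_priority <= 7);
	return htons((uint16_t)(vlan_id |
				(vlan_priority << SKETCH_VLAN_PRIO_SHIFT)));
}

/* Called once with the key fields and once with the mask fields. */
static void sketch_serialize_vlan(struct sketch_vlan_key *out,
				  const struct sketch_vlan_match *in)
{
	out->tpid = in->vlan_tpid;
	out->tci = sketch_pack_tci(in->vlan_id, in->vlan_priority);
}
```

Serializing the key and the mask through the same routine keeps both in the
same byte layout, which is what lets the consumer apply the mask and compare
keys bytewise.
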
 net/xdp_flow/xdp_flow_kern_mod.c | 334 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 331 insertions(+), 3 deletions(-)

diff --git a/net/xdp_flow/xdp_flow_kern_mod.c b/net/xdp_flow/xdp_flow_kern_mod.c
index 9cf527d..fe925db 100644
--- a/net/xdp_flow/xdp_flow_kern_mod.c
+++ b/net/xdp_flow/xdp_flow_kern_mod.c
@@ -3,13 +3,266 @@
 #include <linux/module.h>
 #include <linux/umh.h>
 #include <linux/sched/signal.h>
+#include <linux/rhashtable.h>
 #include <net/pkt_cls.h>
 #include <net/flow_offload_xdp.h>
 #include "msgfmt.h"
 
+struct xdp_flow_rule {
+	struct rhash_head ht_node;
+	unsigned long cookie;
+	struct xdp_flow_key key;
+	struct xdp_flow_key mask;
+};
+
+static const struct rhashtable_params rules_params = {
+	.key_len = sizeof(unsigned long),
+	.key_offset = offsetof(struct xdp_flow_rule, cookie),
+	.head_offset = offsetof(struct xdp_flow_rule, ht_node),
+	.automatic_shrinking = true,
+};
+
+static struct rhashtable rules;
+
 extern char xdp_flow_umh_start;
 extern char xdp_flow_umh_end;
 
+static int xdp_flow_parse_actions(struct xdp_flow_actions *actions,
+				  struct flow_action *flow_action,
+				  struct netlink_ext_ack *extack)
+{
+	const struct flow_action_entry *act;
+	int i;
+
+	if (!flow_action_has_entries(flow_action))
+		return 0;
+
+	if (flow_action->num_entries > MAX_XDP_FLOW_ACTIONS)
+		return -ENOBUFS;
+
+	flow_action_for_each(i, act, flow_action) {
+		struct xdp_flow_action *action = &actions->actions[i];
+
+		switch (act->id) {
+		case FLOW_ACTION_ACCEPT:
+			action->id = XDP_FLOW_ACTION_ACCEPT;
+			break;
+		case FLOW_ACTION_DROP:
+			action->id = XDP_FLOW_ACTION_DROP;
+			break;
+		case FLOW_ACTION_REDIRECT:
+		case FLOW_ACTION_VLAN_PUSH:
+		case FLOW_ACTION_VLAN_POP:
+		case FLOW_ACTION_VLAN_MANGLE:
+		case FLOW_ACTION_MANGLE:
+		case FLOW_ACTION_CSUM:
+			/* TODO: implement these */
+			/* fall through */
+		default:
+			NL_SET_ERR_MSG_MOD(extack, "Unsupported action");
+			return -EOPNOTSUPP;
+		}
+	}
+	actions->num_actions = flow_action->num_entries;
+
+	return 0;
+}
+
+static int xdp_flow_parse_ports(struct xdp_flow_key *key,
+				struct xdp_flow_key *mask,
+				struct flow_cls_offload *f, u8 ip_proto)
+{
+	const struct flow_rule *rule = flow_cls_offload_flow_rule(f);
+	struct flow_match_ports match;
+
+	if (!flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_PORTS))
+		return 0;
+
+	if (ip_proto != IPPROTO_TCP && ip_proto != IPPROTO_UDP) {
+		NL_SET_ERR_MSG_MOD(f->common.extack,
+				   "Only UDP and TCP keys are supported");
+		return -EINVAL;
+	}
+
+	flow_rule_match_ports(rule, &match);
+
+	key->l4port.src = match.key->src;
+	mask->l4port.src = match.mask->src;
+	key->l4port.dst = match.key->dst;
+	mask->l4port.dst = match.mask->dst;
+
+	return 0;
+}
+
+static int xdp_flow_parse_tcp(struct xdp_flow_key *key,
+			      struct xdp_flow_key *mask,
+			      struct flow_cls_offload *f, u8 ip_proto)
+{
+	const struct flow_rule *rule = flow_cls_offload_flow_rule(f);
+	struct flow_match_tcp match;
+
+	if (!flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_TCP))
+		return 0;
+
+	if (ip_proto != IPPROTO_TCP) {
+		NL_SET_ERR_MSG_MOD(f->common.extack,
+				   "TCP keys supported only for TCP");
+		return -EINVAL;
+	}
+
+	flow_rule_match_tcp(rule, &match);
+
+	key->tcp.flags = match.key->flags;
+	mask->tcp.flags = match.mask->flags;
+
+	return 0;
+}
+
+static int xdp_flow_parse_ip(struct xdp_flow_key *key,
+			     struct xdp_flow_key *mask,
+			     struct flow_cls_offload *f, __be16 n_proto)
+{
+	const struct flow_rule *rule = flow_cls_offload_flow_rule(f);
+	struct flow_match_ip match;
+
+	if (!flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_IP))
+		return 0;
+
+	if (n_proto != htons(ETH_P_IP) && n_proto != htons(ETH_P_IPV6)) {
+		NL_SET_ERR_MSG_MOD(f->common.extack,
+				   "IP keys supported only for IPv4/6");
+		return -EINVAL;
+	}
+
+	flow_rule_match_ip(rule, &match);
+
+	key->ip.ttl = match.key->ttl;
+	mask->ip.ttl = match.mask->ttl;
+	key->ip.tos = match.key->tos;
+	mask->ip.tos = match.mask->tos;
+
+	return 0;
+}
+
+static int xdp_flow_parse(struct xdp_flow_key *key, struct xdp_flow_key *mask,
+			  struct xdp_flow_actions *actions,
+			  struct flow_cls_offload *f)
+{
+	struct flow_rule *rule = flow_cls_offload_flow_rule(f);
+	struct flow_dissector *dissector = rule->match.dissector;
+	__be16 n_proto = 0, n_proto_mask = 0;
+	u16 addr_type = 0;
+	u8 ip_proto = 0;
+	int err;
+
+	if (dissector->used_keys &
+	    ~(BIT(FLOW_DISSECTOR_KEY_CONTROL) |
+	      BIT(FLOW_DISSECTOR_KEY_BASIC) |
+	      BIT(FLOW_DISSECTOR_KEY_ETH_ADDRS) |
+	      BIT(FLOW_DISSECTOR_KEY_IPV4_ADDRS) |
+	      BIT(FLOW_DISSECTOR_KEY_IPV6_ADDRS) |
+	      BIT(FLOW_DISSECTOR_KEY_PORTS) |
+	      BIT(FLOW_DISSECTOR_KEY_TCP) |
+	      BIT(FLOW_DISSECTOR_KEY_IP) |
+	      BIT(FLOW_DISSECTOR_KEY_VLAN))) {
+		NL_SET_ERR_MSG_MOD(f->common.extack, "Unsupported key");
+		return -EOPNOTSUPP;
+	}
+
+	if (flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_CONTROL)) {
+		struct flow_match_control match;
+
+		flow_rule_match_control(rule, &match);
+		addr_type = match.key->addr_type;
+	}
+
+	if (flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_BASIC)) {
+		struct flow_match_basic match;
+
+		flow_rule_match_basic(rule, &match);
+
+		n_proto = match.key->n_proto;
+		n_proto_mask = match.mask->n_proto;
+		if (n_proto == htons(ETH_P_ALL)) {
+			n_proto = 0;
+			n_proto_mask = 0;
+		}
+
+		key->eth.type = n_proto;
+		mask->eth.type = n_proto_mask;
+
+		if (match.mask->ip_proto) {
+			ip_proto = match.key->ip_proto;
+			key->ip.proto = ip_proto;
+			mask->ip.proto = match.mask->ip_proto;
+		}
+	}
+
+	if (flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_ETH_ADDRS)) {
+		struct flow_match_eth_addrs match;
+
+		flow_rule_match_eth_addrs(rule, &match);
+
+		ether_addr_copy(key->eth.dst, match.key->dst);
+		ether_addr_copy(mask->eth.dst, match.mask->dst);
+		ether_addr_copy(key->eth.src, match.key->src);
+		ether_addr_copy(mask->eth.src, match.mask->src);
+	}
+
+	if (flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_VLAN)) {
+		struct flow_match_vlan match;
+
+		flow_rule_match_vlan(rule, &match);
+
+		key->vlan.tpid = match.key->vlan_tpid;
+		mask->vlan.tpid = match.mask->vlan_tpid;
+		key->vlan.tci = htons(match.key->vlan_id |
+				      (match.key->vlan_priority <<
+				       VLAN_PRIO_SHIFT));
+		mask->vlan.tci = htons(match.mask->vlan_id |
+				       (match.mask->vlan_priority <<
+					VLAN_PRIO_SHIFT));
+	}
+
+	if (addr_type == FLOW_DISSECTOR_KEY_IPV4_ADDRS) {
+		struct flow_match_ipv4_addrs match;
+
+		flow_rule_match_ipv4_addrs(rule, &match);
+
+		key->ipv4.src = match.key->src;
+		mask->ipv4.src = match.mask->src;
+		key->ipv4.dst = match.key->dst;
+		mask->ipv4.dst = match.mask->dst;
+	}
+
+	if (addr_type == FLOW_DISSECTOR_KEY_IPV6_ADDRS) {
+		struct flow_match_ipv6_addrs match;
+
+		flow_rule_match_ipv6_addrs(rule, &match);
+
+		key->ipv6.src = match.key->src;
+		mask->ipv6.src = match.mask->src;
+		key->ipv6.dst = match.key->dst;
+		mask->ipv6.dst = match.mask->dst;
+	}
+
+	err = xdp_flow_parse_ports(key, mask, f, ip_proto);
+	if (err)
+		return err;
+	err = xdp_flow_parse_tcp(key, mask, f, ip_proto);
+	if (err)
+		return err;
+
+	err = xdp_flow_parse_ip(key, mask, f, n_proto);
+	if (err)
+		return err;
+
+	// TODO: encapsulation related tasks
+
+	return xdp_flow_parse_actions(actions, &rule->action,
+					   f->common.extack);
+}
+
 static void shutdown_umh(void)
 {
 	struct task_struct *tsk;
@@ -60,12 +313,78 @@ static int transact_umh(struct mbox_request *req, u32 *id)
 
 static int xdp_flow_replace(struct net_device *dev, struct flow_cls_offload *f)
 {
-	return -EOPNOTSUPP;
+	struct xdp_flow_rule *rule;
+	struct mbox_request *req;
+	int err;
+
+	req = kzalloc(sizeof(*req), GFP_KERNEL);
+	if (!req)
+		return -ENOMEM;
+
+	rule = kzalloc(sizeof(*rule), GFP_KERNEL);
+	if (!rule) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	req->flow.priority = f->common.prio >> 16;
+	err = xdp_flow_parse(&req->flow.key, &req->flow.mask,
+			     &req->flow.actions, f);
+	if (err)
+		goto err_parse;
+
+	rule->cookie = f->cookie;
+	rule->key = req->flow.key;
+	rule->mask = req->flow.mask;
+	err = rhashtable_insert_fast(&rules, &rule->ht_node, rules_params);
+	if (err)
+		goto err_parse;
+
+	req->cmd = XDP_FLOW_CMD_REPLACE;
+	req->ifindex = dev->ifindex;
+	err = transact_umh(req, NULL);
+	if (err)
+		goto err_umh;
+out:
+	kfree(req);
+
+	return err;
+err_umh:
+	rhashtable_remove_fast(&rules, &rule->ht_node, rules_params);
+err_parse:
+	kfree(rule);
+	goto out;
 }
 
 int xdp_flow_destroy(struct net_device *dev, struct flow_cls_offload *f)
 {
-	return -EOPNOTSUPP;
+	struct mbox_request *req;
+	struct xdp_flow_rule *rule;
+	int err;
+
+	rule = rhashtable_lookup_fast(&rules, &f->cookie, rules_params);
+	if (!rule)
+		return 0;
+
+	req = kzalloc(sizeof(*req), GFP_KERNEL);
+	if (!req)
+		return -ENOMEM;
+
+	req->flow.priority = f->common.prio >> 16;
+	req->flow.key = rule->key;
+	req->flow.mask = rule->mask;
+	req->cmd = XDP_FLOW_CMD_DELETE;
+	req->ifindex = dev->ifindex;
+	err = transact_umh(req, NULL);
+
+	kfree(req);
+
+	if (!err) {
+		rhashtable_remove_fast(&rules, &rule->ht_node, rules_params);
+		kfree(rule);
+	}
+
+	return err;
 }
 
 static int xdp_flow_setup_flower(struct net_device *dev,
@@ -267,7 +586,11 @@ static int start_umh(void)
 
 static int __init load_umh(void)
 {
-	int err = 0;
+	int err;
+
+	err = rhashtable_init(&rules, &rules_params);
+	if (err)
+		return err;
 
 	mutex_lock(&xdp_flow_ops.lock);
 	if (!xdp_flow_ops.stop) {
@@ -283,8 +606,12 @@ static int __init load_umh(void)
 	xdp_flow_ops.setup = &xdp_flow_setup;
 	xdp_flow_ops.start = &start_umh;
 	xdp_flow_ops.module = THIS_MODULE;
+
+	mutex_unlock(&xdp_flow_ops.lock);
+	return 0;
 err:
 	mutex_unlock(&xdp_flow_ops.lock);
+	rhashtable_destroy(&rules);
 	return err;
 }
 
@@ -297,6 +624,7 @@ static void __exit fini_umh(void)
 	xdp_flow_ops.setup = NULL;
 	xdp_flow_ops.setup_cb = NULL;
 	mutex_unlock(&xdp_flow_ops.lock);
+	rhashtable_destroy(&rules);
 }
 module_init(load_umh);
 module_exit(fini_umh);
-- 
1.8.3.1



* [RFC PATCH bpf-next 07/14] xdp_flow: Add flow handling and basic actions in bpf prog
From: Toshiaki Makita @ 2019-08-13 12:05 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko
  Cc: Toshiaki Makita, netdev, bpf, William Tu
In-Reply-To: <20190813120558.6151-1-toshiaki.makita1@gmail.com>

The BPF prog for XDP parses the packet and extracts the flow key, then
looks up a matching entry in the flow tables.
Only the "accept" and "drop" actions are implemented at this point.

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
---
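Note (not part of the commit): the lookup described above is a masked-key
search. For each mask, the packet key is ANDed with the mask and the result is
looked up in that mask's table. A minimal userspace sketch follows, assuming a
much smaller hypothetical key (sketch_key) and a linear scan in place of the
BPF hash map used by the patch; none of these names come from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, much smaller stand-in for struct xdp_flow_key. */
struct sketch_key {
	uint32_t ipv4_src;
	uint32_t ipv4_dst;
	uint16_t l4_src;
	uint16_t l4_dst;
	uint8_t  ip_proto;
	uint8_t  pad[3];        /* explicit padding keeps bytewise compares reliable */
};

struct sketch_flow {
	struct sketch_key mkey; /* key as stored, i.e. already masked */
	int action;
};

/* One table per mask; a linear array stands in for the BPF hash map. */
struct sketch_table {
	struct sketch_key mask;
	const struct sketch_flow *flows;
	size_t n_flows;
};

static void sketch_apply_mask(struct sketch_key *out,
			      const struct sketch_key *key,
			      const struct sketch_key *mask)
{
	const uint8_t *k = (const uint8_t *)key;
	const uint8_t *m = (const uint8_t *)mask;
	uint8_t *o = (uint8_t *)out;

	for (size_t i = 0; i < sizeof(*out); i++)
		o[i] = k[i] & m[i];
}

/* Returns the action of the first matching flow, or -1 if nothing matches. */
static int sketch_lookup(const struct sketch_table *tables, size_t n_tables,
			 const struct sketch_key *key)
{
	assert(tables != NULL || n_tables == 0);

	for (size_t t = 0; t < n_tables; t++) {
		struct sketch_key mkey;

		sketch_apply_mask(&mkey, key, &tables[t].mask);
		for (size_t f = 0; f < tables[t].n_flows; f++) {
			if (!memcmp(&mkey, &tables[t].flows[f].mkey,
				    sizeof(mkey)))
				return tables[t].flows[f].action;
		}
	}

	return -1;
}
```

A miss in every table corresponds to the "no actions found" case, for which
the prog in this patch returns XDP_PASS.
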
 net/xdp_flow/xdp_flow_kern_bpf.c | 297 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 296 insertions(+), 1 deletion(-)

diff --git a/net/xdp_flow/xdp_flow_kern_bpf.c b/net/xdp_flow/xdp_flow_kern_bpf.c
index c101156..ceb8a92 100644
--- a/net/xdp_flow/xdp_flow_kern_bpf.c
+++ b/net/xdp_flow/xdp_flow_kern_bpf.c
@@ -1,9 +1,27 @@
 // SPDX-License-Identifier: GPL-2.0
 #define KBUILD_MODNAME "foo"
 #include <uapi/linux/bpf.h>
+#include <linux/in.h>
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+#include <linux/if_vlan.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include <net/ipv6.h>
+#include <net/dsfield.h>
 #include <bpf_helpers.h>
 #include "umh_bpf.h"
 
+/* Used when the action only modifies the packet */
+#define _XDP_CONTINUE -1
+
+struct bpf_map_def SEC("maps") debug_stats = {
+	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(long),
+	.max_entries = 256,
+};
+
 struct bpf_map_def SEC("maps") flow_masks_head = {
 	.type = BPF_MAP_TYPE_ARRAY,
 	.key_size = sizeof(u32),
@@ -25,10 +43,287 @@ struct bpf_map_def SEC("maps") flow_tables = {
 	.max_entries = MAX_FLOW_MASKS,
 };
 
+static inline void account_debug(int idx)
+{
+	long *cnt;
+
+	cnt = bpf_map_lookup_elem(&debug_stats, &idx);
+	if (cnt)
+		*cnt += 1;
+}
+
+static inline void account_action(int act)
+{
+	account_debug(act + 1);
+}
+
+static inline int action_accept(void)
+{
+	account_action(XDP_FLOW_ACTION_ACCEPT);
+	return XDP_PASS;
+}
+
+static inline int action_drop(void)
+{
+	account_action(XDP_FLOW_ACTION_DROP);
+	return XDP_DROP;
+}
+
+static inline int action_redirect(struct xdp_flow_action *action)
+{
+	account_action(XDP_FLOW_ACTION_REDIRECT);
+
+	// TODO: implement this
+	return XDP_ABORTED;
+}
+
+static inline int action_vlan_push(struct xdp_md *ctx,
+				   struct xdp_flow_action *action)
+{
+	account_action(XDP_FLOW_ACTION_VLAN_PUSH);
+
+	// TODO: implement this
+	return XDP_ABORTED;
+}
+
+static inline int action_vlan_pop(struct xdp_md *ctx,
+				  struct xdp_flow_action *action)
+{
+	account_action(XDP_FLOW_ACTION_VLAN_POP);
+
+	// TODO: implement this
+	return XDP_ABORTED;
+}
+
+static inline int action_vlan_mangle(struct xdp_md *ctx,
+				     struct xdp_flow_action *action)
+{
+	account_action(XDP_FLOW_ACTION_VLAN_MANGLE);
+
+	// TODO: implement this
+	return XDP_ABORTED;
+}
+
+static inline int action_mangle(struct xdp_md *ctx,
+				struct xdp_flow_action *action)
+{
+	account_action(XDP_FLOW_ACTION_MANGLE);
+
+	// TODO: implement this
+	return XDP_ABORTED;
+}
+
+static inline int action_csum(struct xdp_md *ctx,
+			      struct xdp_flow_action *action)
+{
+	account_action(XDP_FLOW_ACTION_CSUM);
+
+	// TODO: implement this
+	return XDP_ABORTED;
+}
+
+static inline void __ether_addr_copy(u8 *dst, const u8 *src)
+{
+	u16 *a = (u16 *)dst;
+	const u16 *b = (const u16 *)src;
+
+	a[0] = b[0];
+	a[1] = b[1];
+	a[2] = b[2];
+}
+
+static inline int parse_ipv4(void *data, u64 *nh_off, void *data_end,
+			     struct xdp_flow_key *key)
+{
+	struct iphdr *iph = data + *nh_off;
+
+	if (iph + 1 > data_end)
+		return -1;
+
+	key->ipv4.src = iph->saddr;
+	key->ipv4.dst = iph->daddr;
+	key->ip.ttl = iph->ttl;
+	key->ip.tos = iph->tos;
+	*nh_off += iph->ihl * 4;
+
+	return iph->protocol;
+}
+
+static inline int parse_ipv6(void *data, u64 *nh_off, void *data_end,
+			     struct xdp_flow_key *key)
+{
+	struct ipv6hdr *ip6h = data + *nh_off;
+
+	if (ip6h + 1 > data_end)
+		return -1;
+
+	key->ipv6.src = ip6h->saddr;
+	key->ipv6.dst = ip6h->daddr;
+	key->ip.ttl = ip6h->hop_limit;
+	key->ip.tos = ipv6_get_dsfield(ip6h);
+	*nh_off += sizeof(*ip6h);
+
+	if (ip6h->nexthdr == NEXTHDR_HOP ||
+	    ip6h->nexthdr == NEXTHDR_ROUTING ||
+	    ip6h->nexthdr == NEXTHDR_FRAGMENT ||
+	    ip6h->nexthdr == NEXTHDR_AUTH ||
+	    ip6h->nexthdr == NEXTHDR_NONE ||
+	    ip6h->nexthdr == NEXTHDR_DEST)
+		return 0;
+
+	return ip6h->nexthdr;
+}
+
+#define for_each_flow_mask(entry, head, idx, cnt) \
+	for (entry = bpf_map_lookup_elem(&flow_masks, (head)), \
+	     idx = *(head), cnt = 0; \
+	     entry != NULL && cnt < MAX_FLOW_MASKS; \
+	     idx = entry->next, \
+	     entry = bpf_map_lookup_elem(&flow_masks, &idx), cnt++)
+
+static inline void flow_mask(struct xdp_flow_key *mkey,
+			     const struct xdp_flow_key *key,
+			     const struct xdp_flow_key *mask)
+{
+	long *lmkey = (long *)mkey;
+	long *lmask = (long *)mask;
+	long *lkey = (long *)key;
+	int i;
+
+	for (i = 0; i < sizeof(*mkey); i += sizeof(long))
+		*lmkey++ = *lkey++ & *lmask++;
+}
+
 SEC("xdp_flow")
 int xdp_flow_prog(struct xdp_md *ctx)
 {
-	return XDP_PASS;
+	void *data_end = (void *)(long)ctx->data_end;
+	struct xdp_flow_actions *actions = NULL;
+	void *data = (void *)(long)ctx->data;
+	int cnt, idx, action_idx, zero = 0;
+	struct xdp_flow_mask_entry *entry;
+	struct ethhdr *eth = data;
+	struct xdp_flow_key key;
+	int rc = XDP_DROP;
+	long *value;
+	u16 h_proto;
+	int ipproto;
+	u64 nh_off;
+	int *head;
+
+	account_debug(0);
+
+	nh_off = sizeof(*eth);
+	if (data + nh_off > data_end)
+		return XDP_DROP;
+
+	__builtin_memset(&key, 0, sizeof(key));
+	h_proto = eth->h_proto;
+	__ether_addr_copy(key.eth.dst, eth->h_dest);
+	__ether_addr_copy(key.eth.src, eth->h_source);
+
+	if (eth_type_vlan(h_proto)) {
+		struct vlan_hdr *vhdr;
+
+		vhdr = data + nh_off;
+		nh_off += sizeof(*vhdr);
+		if (data + nh_off > data_end)
+			return XDP_DROP;
+		key.vlan.tpid = h_proto;
+		key.vlan.tci = vhdr->h_vlan_TCI;
+		h_proto = vhdr->h_vlan_encapsulated_proto;
+	}
+	key.eth.type = h_proto;
+
+	if (h_proto == htons(ETH_P_IP))
+		ipproto = parse_ipv4(data, &nh_off, data_end, &key);
+	else if (h_proto == htons(ETH_P_IPV6))
+		ipproto = parse_ipv6(data, &nh_off, data_end, &key);
+	else
+		ipproto = 0;
+	if (ipproto < 0)
+		return XDP_DROP;
+	key.ip.proto = ipproto;
+
+	if (ipproto == IPPROTO_TCP) {
+		struct tcphdr *th = data + nh_off;
+
+		if (th + 1 > data_end)
+			return XDP_DROP;
+
+		key.l4port.src = th->source;
+		key.l4port.dst = th->dest;
+		key.tcp.flags = (*(__be16 *)&tcp_flag_word(th) & htons(0x0FFF));
+	} else if (ipproto == IPPROTO_UDP) {
+		struct udphdr *uh = data + nh_off;
+
+		if (uh + 1 > data_end)
+			return XDP_DROP;
+
+		key.l4port.src = uh->source;
+		key.l4port.dst = uh->dest;
+	}
+
+	head = bpf_map_lookup_elem(&flow_masks_head, &zero);
+	if (!head)
+		return XDP_PASS;
+
+	for_each_flow_mask(entry, head, idx, cnt) {
+		struct xdp_flow_key mkey;
+		void *flow_table;
+
+		flow_table = bpf_map_lookup_elem(&flow_tables, &idx);
+		if (!flow_table)
+			return XDP_ABORTED;
+
+		flow_mask(&mkey, &key, &entry->mask);
+		actions = bpf_map_lookup_elem(flow_table, &mkey);
+		if (actions)
+			break;
+	}
+
+	if (!actions)
+		return XDP_PASS;
+
+	for (action_idx = 0;
+	     action_idx < actions->num_actions &&
+	     action_idx < MAX_XDP_FLOW_ACTIONS;
+	     action_idx++) {
+		struct xdp_flow_action *action;
+		int act;
+
+		action = &actions->actions[action_idx];
+
+		switch (action->id) {
+		case XDP_FLOW_ACTION_ACCEPT:
+			return action_accept();
+		case XDP_FLOW_ACTION_DROP:
+			return action_drop();
+		case XDP_FLOW_ACTION_REDIRECT:
+			return action_redirect(action);
+		case XDP_FLOW_ACTION_VLAN_PUSH:
+			act = action_vlan_push(ctx, action);
+			break;
+		case XDP_FLOW_ACTION_VLAN_POP:
+			act = action_vlan_pop(ctx, action);
+			break;
+		case XDP_FLOW_ACTION_VLAN_MANGLE:
+			act = action_vlan_mangle(ctx, action);
+			break;
+		case XDP_FLOW_ACTION_MANGLE:
+			act = action_mangle(ctx, action);
+			break;
+		case XDP_FLOW_ACTION_CSUM:
+			act = action_csum(ctx, action);
+			break;
+		default:
+			return XDP_ABORTED;
+		}
+		if (act != _XDP_CONTINUE)
+			return act;
+	}
+
+	return XDP_ABORTED;
 }
 
 char _license[] SEC("license") = "GPL";
-- 
1.8.3.1



* [RFC PATCH bpf-next 06/14] xdp_flow: Add flow entry insertion/deletion logic in UMH
From: Toshiaki Makita @ 2019-08-13 12:05 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko
  Cc: Toshiaki Makita, netdev, bpf, William Tu
In-Reply-To: <20190813120558.6151-1-toshiaki.makita1@gmail.com>

This logic will be used when the xdp_flow kmod requests flow
insertion/deletion.

On insertion, find a free entry and populate it, then update the next
index pointer of its previous entry. On deletion, set the next index
pointer of the previous entry to the next index of the entry being
deleted.

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
---
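Note (not part of the commit): the insertion/deletion scheme described above is
a singly linked list threaded through array indices, with a stack of free
slots. A minimal userspace sketch follows, using plain arrays in place of the
BPF maps. All names, the list size and the tail sentinel value are
hypothetical; only the free-slot initialisation order and the priority ordering
follow the patch.

```c
#include <assert.h>

#define SKETCH_MAX_MASKS 8     /* hypothetical; stands in for MAX_FLOW_MASKS */
#define SKETCH_TAIL      (-1)  /* hypothetical; stands in for FLOW_MASKS_TAIL */

struct sketch_entry {
	int priority;
	int next;               /* index of the next entry, or SKETCH_TAIL */
};

struct sketch_list {
	struct sketch_entry entries[SKETCH_MAX_MASKS];
	int head;
	int free_slots[SKETCH_MAX_MASKS];
	int free_top;           /* index of the top of the free-slot stack */
};

static void sketch_init(struct sketch_list *l)
{
	/* Slot 0 ends up on top of the stack, as in the patch. */
	for (int i = 0; i < SKETCH_MAX_MASKS; i++)
		l->free_slots[SKETCH_MAX_MASKS - 1 - i] = i;
	l->free_top = SKETCH_MAX_MASKS - 1;
	l->head = SKETCH_TAIL;
}

/* Insert in ascending priority order; returns the slot used, or -1 if full. */
static int sketch_insert(struct sketch_list *l, int priority)
{
	int idx = l->head, pidx = SKETCH_TAIL, slot;

	if (l->free_top < 0)
		return -1;

	while (idx != SKETCH_TAIL && l->entries[idx].priority <= priority) {
		pidx = idx;
		idx = l->entries[idx].next;
	}

	/* Populate the new entry first, then link it in from its predecessor. */
	slot = l->free_slots[l->free_top--];
	l->entries[slot].priority = priority;
	l->entries[slot].next = idx;
	if (pidx == SKETCH_TAIL)
		l->head = slot;
	else
		l->entries[pidx].next = slot;

	return slot;
}

/* Unlink a slot and return it to the free stack; -1 if it is not in the list. */
static int sketch_delete(struct sketch_list *l, int slot)
{
	int idx = l->head, pidx = SKETCH_TAIL;

	while (idx != SKETCH_TAIL && idx != slot) {
		pidx = idx;
		idx = l->entries[idx].next;
	}
	if (idx == SKETCH_TAIL)
		return -1;

	if (pidx == SKETCH_TAIL)
		l->head = l->entries[slot].next;
	else
		l->entries[pidx].next = l->entries[slot].next;

	assert(l->free_top < SKETCH_MAX_MASKS - 1);
	l->free_slots[++l->free_top] = slot;

	return 0;
}
```

Populating the new entry before linking it in means a reader walking the list
from the head never follows an index to a half-initialised entry. In the patch
the reader is the XDP prog, and the flow_masks update precedes the head or
previous-entry update.
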
 net/xdp_flow/umh_bpf.h      |  15 ++
 net/xdp_flow/xdp_flow_umh.c | 470 +++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 483 insertions(+), 2 deletions(-)

diff --git a/net/xdp_flow/umh_bpf.h b/net/xdp_flow/umh_bpf.h
index b4fe0c6..4e4633f 100644
--- a/net/xdp_flow/umh_bpf.h
+++ b/net/xdp_flow/umh_bpf.h
@@ -15,4 +15,19 @@ struct xdp_flow_mask_entry {
 	int next;
 };
 
+static inline bool flow_equal(const struct xdp_flow_key *key1,
+			      const struct xdp_flow_key *key2)
+{
+	long *lkey1 = (long *)key1;
+	long *lkey2 = (long *)key2;
+	int i;
+
+	for (i = 0; i < sizeof(*key1); i += sizeof(long)) {
+		if (*lkey1++ != *lkey2++)
+			return false;
+	}
+
+	return true;
+}
+
 #endif
diff --git a/net/xdp_flow/xdp_flow_umh.c b/net/xdp_flow/xdp_flow_umh.c
index e35666a..9a4769b 100644
--- a/net/xdp_flow/xdp_flow_umh.c
+++ b/net/xdp_flow/xdp_flow_umh.c
@@ -19,6 +19,8 @@
 extern char xdp_flow_bpf_end;
 int progfile_fd;
 
+#define zalloc(size) calloc(1, (size))
+
 /* FIXME: syslog is used for easy debugging. As writing /dev/log can be stuck
  * due to reader side, should use another log mechanism like kmsg.
  */
@@ -38,6 +40,8 @@ struct netdev_info {
 	struct netdev_info_key key;
 	struct hlist_node node;
 	struct bpf_object *obj;
+	int free_slot_top;
+	int free_slots[MAX_FLOW_MASKS];
 };
 
 DEFINE_HASHTABLE(netdev_info_table, 16);
@@ -268,6 +272,57 @@ static struct netdev_info *get_netdev_info(const struct mbox_request *req)
 	return netdev_info;
 }
 
+static void init_flow_masks_free_slot(struct netdev_info *netdev_info)
+{
+	int i;
+
+	for (i = 0; i < MAX_FLOW_MASKS; i++)
+		netdev_info->free_slots[MAX_FLOW_MASKS - 1 - i] = i;
+	netdev_info->free_slot_top = MAX_FLOW_MASKS - 1;
+}
+
+static int get_flow_masks_free_slot(const struct netdev_info *netdev_info)
+{
+	if (netdev_info->free_slot_top < 0)
+		return -ENOBUFS;
+
+	return netdev_info->free_slots[netdev_info->free_slot_top];
+}
+
+static int add_flow_masks_free_slot(struct netdev_info *netdev_info, int slot)
+{
+	if (unlikely(netdev_info->free_slot_top >= MAX_FLOW_MASKS - 1)) {
+		pr_warn("BUG: free_slot overflow: top=%d, slot=%d\n",
+			netdev_info->free_slot_top, slot);
+		return -EOVERFLOW;
+	}
+
+	netdev_info->free_slots[++netdev_info->free_slot_top] = slot;
+
+	return 0;
+}
+
+static void delete_flow_masks_free_slot(struct netdev_info *netdev_info,
+					int slot)
+{
+	int top_slot;
+
+	if (unlikely(netdev_info->free_slot_top < 0)) {
+		pr_warn("BUG: free_slot underflow: top=%d, slot=%d\n",
+			netdev_info->free_slot_top, slot);
+		return;
+	}
+
+	top_slot = netdev_info->free_slots[netdev_info->free_slot_top];
+	if (unlikely(top_slot != slot)) {
+		pr_warn("BUG: inconsistent free_slot top: top_slot=%d, slot=%d\n",
+			top_slot, slot);
+		return;
+	}
+
+	netdev_info->free_slot_top--;
+}
+
 static int handle_load(const struct mbox_request *req, __u32 *prog_id)
 {
 	struct netdev_info *netdev_info;
@@ -291,6 +346,8 @@ static int handle_load(const struct mbox_request *req, __u32 *prog_id)
 	}
 	netdev_info->key.ifindex = key.ifindex;
 
+	init_flow_masks_free_slot(netdev_info);
+
 	prog_fd = load_bpf(req->ifindex, &netdev_info->obj);
 	if (prog_fd < 0) {
 		err = prog_fd;
@@ -331,14 +388,423 @@ static int handle_unload(const struct mbox_request *req)
 	return 0;
 }
 
+static int get_table_fd(const struct netdev_info *netdev_info,
+			const char *table_name)
+{
+	char errbuf[ERRBUF_SIZE];
+	struct bpf_map *map;
+	int map_fd;
+	int err;
+
+	map = bpf_object__find_map_by_name(netdev_info->obj, table_name);
+	if (!map) {
+		pr_err("BUG: %s map not found.\n", table_name);
+		return -ENOENT;
+	}
+
+	map_fd = bpf_map__fd(map);
+	if (map_fd < 0) {
+		err = libbpf_err(map_fd, errbuf);
+		pr_err("Invalid map fd: %s\n", errbuf);
+		return err;
+	}
+
+	return map_fd;
+}
+
+static int get_flow_masks_head_fd(const struct netdev_info *netdev_info)
+{
+	return get_table_fd(netdev_info, "flow_masks_head");
+}
+
+static int get_flow_masks_head(int head_fd, int *head)
+{
+	int err, zero = 0;
+
+	if (bpf_map_lookup_elem(head_fd, &zero, head)) {
+		err = -errno;
+		pr_err("Cannot get flow_masks_head: %s\n", strerror(errno));
+		return err;
+	}
+
+	return 0;
+}
+
+static int update_flow_masks_head(int head_fd, int head)
+{
+	int err, zero = 0;
+
+	if (bpf_map_update_elem(head_fd, &zero, &head, 0)) {
+		err = -errno;
+		pr_err("Cannot update flow_masks_head: %s\n", strerror(errno));
+		return err;
+	}
+
+	return 0;
+}
+
+static int get_flow_masks_fd(const struct netdev_info *netdev_info)
+{
+	return get_table_fd(netdev_info, "flow_masks");
+}
+
+static int get_flow_tables_fd(const struct netdev_info *netdev_info)
+{
+	return get_table_fd(netdev_info, "flow_tables");
+}
+
+static int __flow_table_insert_elem(int flow_table_fd,
+				    const struct xdp_flow *flow)
+{
+	int err = 0;
+
+	if (bpf_map_update_elem(flow_table_fd, &flow->key, &flow->actions, 0)) {
+		err = -errno;
+		pr_err("Cannot insert flow entry: %s\n",
+		       strerror(errno));
+	}
+
+	return err;
+}
+
+static void __flow_table_delete_elem(int flow_table_fd,
+				     const struct xdp_flow *flow)
+{
+	bpf_map_delete_elem(flow_table_fd, &flow->key);
+}
+
+static int flow_table_insert_elem(struct netdev_info *netdev_info,
+				  const struct xdp_flow *flow)
+{
+	int masks_fd, head_fd, flow_tables_fd, flow_table_fd, free_slot, head;
+	struct xdp_flow_mask_entry *entry, *pentry;
+	int err, cnt, idx, pidx;
+
+	masks_fd = get_flow_masks_fd(netdev_info);
+	if (masks_fd < 0)
+		return masks_fd;
+
+	head_fd = get_flow_masks_head_fd(netdev_info);
+	if (head_fd < 0)
+		return head_fd;
+
+	err = get_flow_masks_head(head_fd, &head);
+	if (err)
+		return err;
+
+	flow_tables_fd = get_flow_tables_fd(netdev_info);
+	if (flow_tables_fd < 0)
+		return flow_tables_fd;
+
+	entry = zalloc(sizeof(*entry));
+	if (!entry) {
+		pr_err("Memory allocation for flow_masks entry failed\n");
+		return -ENOMEM;
+	}
+
+	pentry = zalloc(sizeof(*pentry));
+	if (!pentry) {
+		err = -ENOMEM;
+		pr_err("Memory allocation for flow_masks prev entry failed\n");
+		goto err_entry;
+	}
+
+	idx = head;
+	for (cnt = 0; cnt < MAX_FLOW_MASKS; cnt++) {
+		if (idx == FLOW_MASKS_TAIL)
+			break;
+
+		if (bpf_map_lookup_elem(masks_fd, &idx, entry)) {
+			err = -errno;
+			pr_err("Cannot lookup flow_masks: %s\n",
+			       strerror(errno));
+			goto err;
+		}
+
+		if (entry->priority == flow->priority &&
+		    flow_equal(&entry->mask, &flow->mask)) {
+			__u32 id;
+
+			if (bpf_map_lookup_elem(flow_tables_fd, &idx, &id)) {
+				err = -errno;
+				pr_err("Cannot lookup flow_tables: %s\n",
+				       strerror(errno));
+				goto err;
+			}
+
+			flow_table_fd = bpf_map_get_fd_by_id(id);
+			if (flow_table_fd < 0) {
+				err = -errno;
+				pr_err("Cannot get flow_table fd by id: %s\n",
+				       strerror(errno));
+				goto err;
+			}
+
+			err = __flow_table_insert_elem(flow_table_fd, flow);
+			if (err)
+				goto out;
+
+			entry->count++;
+			if (bpf_map_update_elem(masks_fd, &idx, entry, 0)) {
+				err = -errno;
+				pr_err("Cannot update flow_masks count: %s\n",
+				       strerror(errno));
+				__flow_table_delete_elem(flow_table_fd, flow);
+				goto out;
+			}
+
+			goto out;
+		}
+
+		if (entry->priority > flow->priority)
+			break;
+
+		*pentry = *entry;
+		pidx = idx;
+		idx = entry->next;
+	}
+
+	if (unlikely(cnt == MAX_FLOW_MASKS && idx != FLOW_MASKS_TAIL)) {
+		err = -EINVAL;
+		pr_err("Cannot lookup flow_masks: Broken flow_masks list\n");
+		goto err;
+	}
+
+	/* Flow mask was not found. Create a new one */
+
+	free_slot = get_flow_masks_free_slot(netdev_info);
+	if (free_slot < 0) {
+		err = free_slot;
+		goto err;
+	}
+
+	entry->mask = flow->mask;
+	entry->priority = flow->priority;
+	entry->count = 1;
+	entry->next = idx;
+	if (bpf_map_update_elem(masks_fd, &free_slot, entry, 0)) {
+		err = -errno;
+		pr_err("Cannot update flow_masks: %s\n", strerror(errno));
+		goto err;
+	}
+
+	flow_table_fd = bpf_create_map(BPF_MAP_TYPE_HASH,
+				       sizeof(struct xdp_flow_key),
+				       sizeof(struct xdp_flow_actions),
+				       MAX_FLOWS, 0);
+	if (flow_table_fd < 0) {
+		err = -errno;
+		pr_err("map creation for flow_table failed: %s\n",
+		       strerror(errno));
+		goto err;
+	}
+
+	err = __flow_table_insert_elem(flow_table_fd, flow);
+	if (err)
+		goto out;
+
+	if (bpf_map_update_elem(flow_tables_fd, &free_slot, &flow_table_fd, 0)) {
+		err = -errno;
+		pr_err("Failed to insert flow_table into flow_tables: %s\n",
+		       strerror(errno));
+		goto out;
+	}
+
+	if (cnt == 0) {
+		err = update_flow_masks_head(head_fd, free_slot);
+		if (err)
+			goto err_flow_table;
+	} else {
+		pentry->next = free_slot;
+		/* This effectively only updates one byte of entry->next */
+		if (bpf_map_update_elem(masks_fd, &pidx, pentry, 0)) {
+			err = -errno;
+			pr_err("Cannot update flow_masks prev entry: %s\n",
+			       strerror(errno));
+			goto err_flow_table;
+		}
+	}
+	delete_flow_masks_free_slot(netdev_info, free_slot);
+out:
+	close(flow_table_fd);
+err:
+	free(pentry);
+err_entry:
+	free(entry);
+
+	return err;
+
+err_flow_table:
+	bpf_map_delete_elem(flow_tables_fd, &free_slot);
+
+	goto out;
+}
+
+static int flow_table_delete_elem(struct netdev_info *netdev_info,
+				  const struct xdp_flow *flow)
+{
+	int masks_fd, head_fd, flow_tables_fd, flow_table_fd, head;
+	struct xdp_flow_mask_entry *entry, *pentry;
+	int err, cnt, idx, pidx;
+	__u32 id;
+
+	masks_fd = get_flow_masks_fd(netdev_info);
+	if (masks_fd < 0)
+		return masks_fd;
+
+	head_fd = get_flow_masks_head_fd(netdev_info);
+	if (head_fd < 0)
+		return head_fd;
+
+	err = get_flow_masks_head(head_fd, &head);
+	if (err)
+		return err;
+
+	flow_tables_fd = get_flow_tables_fd(netdev_info);
+	if (flow_tables_fd < 0)
+		return flow_tables_fd;
+
+	entry = zalloc(sizeof(*entry));
+	if (!entry) {
+		pr_err("Memory allocation for flow_masks entry failed\n");
+		return -ENOMEM;
+	}
+
+	pentry = zalloc(sizeof(*pentry));
+	if (!pentry) {
+		err = -ENOMEM;
+		pr_err("Memory allocation for flow_masks prev entry failed\n");
+		goto err_pentry;
+	}
+
+	idx = head;
+	for (cnt = 0; cnt < MAX_FLOW_MASKS; cnt++) {
+		if (idx == FLOW_MASKS_TAIL) {
+			err = -ENOENT;
+			pr_err("Cannot lookup flow_masks: %s\n",
+			       strerror(-err));
+			goto out;
+		}
+
+		if (bpf_map_lookup_elem(masks_fd, &idx, entry)) {
+			err = -errno;
+			pr_err("Cannot lookup flow_masks: %s\n",
+			       strerror(errno));
+			goto out;
+		}
+
+		if (entry->priority > flow->priority) {
+			err = -ENOENT;
+			pr_err("Cannot lookup flow_masks: %s\n",
+			       strerror(-err));
+			goto out;
+		}
+
+		if (entry->priority == flow->priority &&
+		    flow_equal(&entry->mask, &flow->mask))
+			break;
+
+		*pentry = *entry;
+		pidx = idx;
+		idx = entry->next;
+	}
+
+	if (unlikely(cnt == MAX_FLOW_MASKS)) {
+		err = -ENOENT;
+		pr_err("Cannot lookup flow_masks: Broken flow_masks list\n");
+		goto out;
+	}
+
+	if (bpf_map_lookup_elem(flow_tables_fd, &idx, &id)) {
+		err = -errno;
+		pr_err("Cannot lookup flow_tables: %s\n",
+		       strerror(errno));
+		goto out;
+	}
+
+	flow_table_fd = bpf_map_get_fd_by_id(id);
+	if (flow_table_fd < 0) {
+		err = -errno;
+		pr_err("Cannot get flow_table fd by id: %s\n",
+		       strerror(errno));
+		goto out;
+	}
+
+	__flow_table_delete_elem(flow_table_fd, flow);
+	close(flow_table_fd);
+
+	if (--entry->count > 0) {
+		if (bpf_map_update_elem(masks_fd, &idx, entry, 0)) {
+			err = -errno;
+			pr_err("Cannot update flow_masks count: %s\n",
+			       strerror(errno));
+		}
+
+		goto out;
+	}
+
+	if (unlikely(entry->count < 0)) {
+		pr_warn("flow_masks has negative count: %d\n",
+			entry->count);
+	}
+
+	if (cnt == 0) {
+		err = update_flow_masks_head(head_fd, entry->next);
+		if (err)
+			goto out;
+	} else {
+		pentry->next = entry->next;
+		/* This effectively only updates one byte of entry->next */
+		if (bpf_map_update_elem(masks_fd, &pidx, pentry, 0)) {
+			err = -errno;
+			pr_err("Cannot update flow_masks prev entry: %s\n",
+			       strerror(errno));
+			goto out;
+		}
+	}
+
+	bpf_map_delete_elem(flow_tables_fd, &idx);
+	err = add_flow_masks_free_slot(netdev_info, idx);
+	if (err)
+		pr_err("Cannot add flow_masks free slot: %s\n", strerror(-err));
+out:
+	free(pentry);
+err_pentry:
+	free(entry);
+
+	return err;
+}
+
 static int handle_replace(struct mbox_request *req)
 {
-	return -EOPNOTSUPP;
+	struct netdev_info *netdev_info;
+	int err;
+
+	netdev_info = get_netdev_info(req);
+	if (IS_ERR(netdev_info))
+		return PTR_ERR(netdev_info);
+
+	err = flow_table_insert_elem(netdev_info, &req->flow);
+	if (err)
+		return err;
+
+	return 0;
 }
 
 static int handle_delete(const struct mbox_request *req)
 {
-	return -EOPNOTSUPP;
+	struct netdev_info *netdev_info;
+	int err;
+
+	netdev_info = get_netdev_info(req);
+	if (IS_ERR(netdev_info))
+		return PTR_ERR(netdev_info);
+
+	err = flow_table_delete_elem(netdev_info, &req->flow);
+	if (err)
+		return err;
+
+	return 0;
 }
 
 static void loop(void)
-- 
1.8.3.1


^ permalink raw reply related

* [RFC PATCH bpf-next 05/14] xdp_flow: Prepare flow tables in bpf
From: Toshiaki Makita @ 2019-08-13 12:05 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko
  Cc: Toshiaki Makita, netdev, bpf, William Tu
In-Reply-To: <20190813120558.6151-1-toshiaki.makita1@gmail.com>

Add maps for flow tables in bpf. TC flower has a hash table for each
flow mask, ordered by priority. To do the same thing, prepare a
hashmap-in-arraymap. As bpf does not provide an ordered list, we emulate
it with an array. Each array entry has a one-byte next index field to
implement a list. Also prepare a one-element array that holds the head
index of the list.
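
As an illustration only, the list emulation described above might look
like the following plain userspace C sketch. It does not use bpf maps,
and the names mask_slot, mask_list, mask_list_insert and mask_list_walk
are hypothetical, not taken from this series. The sketch uses a one-byte
next field as described here, while umh_bpf.h in this patch declares
next as int.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_MASKS 255	/* usable slots, like MAX_FLOW_MASKS in the patch */
#define LIST_TAIL 255	/* sentinel index meaning "end of list" */

/* Hypothetical, reduced entry: only what the list itself needs. */
struct mask_slot {
	uint16_t priority;
	uint8_t next;		/* index of the next slot, or LIST_TAIL */
};

struct mask_list {
	struct mask_slot slots[MAX_MASKS];
	uint8_t head;		/* index of the first slot, LIST_TAIL if empty */
};

static void mask_list_init(struct mask_list *l)
{
	l->head = LIST_TAIL;
}

/* Link an already chosen free slot into the list in ascending priority
 * order. The new slot goes before the first entry with a larger priority.
 */
static void mask_list_insert(struct mask_list *l, uint8_t slot,
			     uint16_t priority)
{
	uint8_t idx = l->head, prev = LIST_TAIL;

	assert(slot < MAX_MASKS);

	while (idx != LIST_TAIL && l->slots[idx].priority <= priority) {
		prev = idx;
		idx = l->slots[idx].next;
	}

	l->slots[slot].priority = priority;
	l->slots[slot].next = idx;
	if (prev == LIST_TAIL)
		l->head = slot;			/* new first element */
	else
		l->slots[prev].next = slot;	/* relink the predecessor */
}

/* Copy priorities in list order into out[]; returns entries visited. */
static int mask_list_walk(const struct mask_list *l, uint16_t *out, int max)
{
	uint8_t idx;
	int n = 0;

	for (idx = l->head; idx != LIST_TAIL && n < max;
	     idx = l->slots[idx].next)
		out[n++] = l->slots[idx].priority;

	return n;
}
```

Inserting priorities 20, 10 and 30 into slots 0, 1 and 2 leaves the
head at slot 1, and a walk visits the slots in the order 1, 0, 2. The
physical slot order never changes; only the next indices and the head
are rewritten.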

Because of limitations of bpf maps, the outer array is implemented
using two array maps. "flow_masks" is the array that emulates the list,
and its entries hold the priority and mask of each flow table. For each
priority/mask, the entry at the same index in another map,
"flow_tables", which is the hashmap-in-arraymap, points to the actual
flow table.
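
A minimal sketch of this two-array layout, again in plain userspace C
rather than with bpf maps: one array carries the list (priority, mask,
next), and a second array indexed by the same slot carries a handle to
the per-mask table. The types flow_key_sketch, mask_entry and flow_db
and the function find_table are assumptions made for illustration, and
an int handle stands in for the inner hash map.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_MASKS 255
#define LIST_TAIL 255

/* Hypothetical key layout with explicit padding so memcmp() is safe. */
struct flow_key_sketch {
	uint32_t src_ip;
	uint16_t dst_port;
	uint16_t pad;
};

struct mask_entry {
	struct flow_key_sketch mask;
	uint16_t priority;
	uint8_t next;
};

struct flow_db {
	uint8_t head;
	struct mask_entry masks[MAX_MASKS];	/* role of "flow_masks" */
	int tables[MAX_MASKS];			/* role of "flow_tables" */
};

/* Walk the list in masks[] and return the table handle stored at the
 * same index in tables[], or -1 if no entry matches priority and mask.
 */
static int find_table(const struct flow_db *db, uint16_t priority,
		      const struct flow_key_sketch *mask)
{
	uint8_t idx = db->head;
	int hops;

	/* The hop limit guards against a corrupted (cyclic) list. */
	for (hops = 0; idx != LIST_TAIL && hops < MAX_MASKS; hops++) {
		const struct mask_entry *e;

		assert(idx < MAX_MASKS);
		e = &db->masks[idx];

		if (e->priority > priority)
			break;	/* sorted list: no match further on */

		if (e->priority == priority &&
		    !memcmp(&e->mask, mask, sizeof(*mask)))
			return db->tables[idx];

		idx = e->next;
	}

	return -1;
}
```

The list order lives only in masks[], so tables[] never has to be
reordered when a mask is added or removed; a slot index found in one
array is used directly in the other.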

The flow insertion logic in UMH and lookup logic in BPF will be
implemented in the following commits.

NOTE: This array-based list emulation could be replaced by adding an
ordered-list map type. In that case we would also need a map iteration
API for bpf progs.

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
---
 net/xdp_flow/umh_bpf.h           | 18 +++++++++++
 net/xdp_flow/xdp_flow_kern_bpf.c | 22 +++++++++++++
 net/xdp_flow/xdp_flow_umh.c      | 70 ++++++++++++++++++++++++++++++++++++++--
 3 files changed, 108 insertions(+), 2 deletions(-)
 create mode 100644 net/xdp_flow/umh_bpf.h

diff --git a/net/xdp_flow/umh_bpf.h b/net/xdp_flow/umh_bpf.h
new file mode 100644
index 0000000..b4fe0c6
--- /dev/null
+++ b/net/xdp_flow/umh_bpf.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _NET_XDP_FLOW_UMH_BPF_H
+#define _NET_XDP_FLOW_UMH_BPF_H
+
+#include "msgfmt.h"
+
+#define MAX_FLOWS 1024
+#define MAX_FLOW_MASKS 255
+#define FLOW_MASKS_TAIL 255
+
+struct xdp_flow_mask_entry {
+	struct xdp_flow_key mask;
+	__u16 priority;
+	short count;
+	int next;
+};
+
+#endif
diff --git a/net/xdp_flow/xdp_flow_kern_bpf.c b/net/xdp_flow/xdp_flow_kern_bpf.c
index 74cdb1d..c101156 100644
--- a/net/xdp_flow/xdp_flow_kern_bpf.c
+++ b/net/xdp_flow/xdp_flow_kern_bpf.c
@@ -2,6 +2,28 @@
 #define KBUILD_MODNAME "foo"
 #include <uapi/linux/bpf.h>
 #include <bpf_helpers.h>
+#include "umh_bpf.h"
+
+struct bpf_map_def SEC("maps") flow_masks_head = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(int),
+	.max_entries = 1,
+};
+
+struct bpf_map_def SEC("maps") flow_masks = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(struct xdp_flow_mask_entry),
+	.max_entries = MAX_FLOW_MASKS,
+};
+
+struct bpf_map_def SEC("maps") flow_tables = {
+	.type = BPF_MAP_TYPE_ARRAY_OF_MAPS,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(u32),
+	.max_entries = MAX_FLOW_MASKS,
+};
 
 SEC("xdp_flow")
 int xdp_flow_prog(struct xdp_md *ctx)
diff --git a/net/xdp_flow/xdp_flow_umh.c b/net/xdp_flow/xdp_flow_umh.c
index 734db00..e35666a 100644
--- a/net/xdp_flow/xdp_flow_umh.c
+++ b/net/xdp_flow/xdp_flow_umh.c
@@ -13,7 +13,7 @@
 #include <sys/resource.h>
 #include <linux/hashtable.h>
 #include <linux/err.h>
-#include "msgfmt.h"
+#include "umh_bpf.h"
 
 extern char xdp_flow_bpf_start;
 extern char xdp_flow_bpf_end;
@@ -95,11 +95,13 @@ static int setup(void)
 
 static int load_bpf(int ifindex, struct bpf_object **objp)
 {
+	int prog_fd, flow_tables_fd, flow_meta_fd, flow_masks_head_fd, err;
+	struct bpf_map *flow_tables, *flow_masks_head;
+	int zero = 0, flow_masks_tail = FLOW_MASKS_TAIL;
 	struct bpf_object_open_attr attr = {};
 	char path[256], errbuf[ERRBUF_SIZE];
 	struct bpf_program *prog;
 	struct bpf_object *obj;
-	int prog_fd, err;
 	ssize_t len;
 
 	len = snprintf(path, 256, "/proc/self/fd/%d", progfile_fd);
@@ -127,6 +129,48 @@ static int load_bpf(int ifindex, struct bpf_object **objp)
 	bpf_object__for_each_program(prog, obj)
 		bpf_program__set_type(prog, attr.prog_type);
 
+	flow_meta_fd = bpf_create_map(BPF_MAP_TYPE_HASH,
+				      sizeof(struct xdp_flow_key),
+				      sizeof(struct xdp_flow_actions),
+				      MAX_FLOWS, 0);
+	if (flow_meta_fd < 0) {
+		err = -errno;
+		pr_err("map creation for flow_tables meta failed: %s\n",
+		       strerror(errno));
+		goto err;
+	}
+
+	flow_tables_fd = bpf_create_map_in_map(BPF_MAP_TYPE_ARRAY_OF_MAPS,
+					       "flow_tables", sizeof(__u32),
+					       flow_meta_fd, MAX_FLOW_MASKS, 0);
+	if (flow_tables_fd < 0) {
+		err = -errno;
+		pr_err("map creation for flow_tables failed: %s\n",
+		       strerror(errno));
+		close(flow_meta_fd);
+		goto err;
+	}
+
+	close(flow_meta_fd);
+
+	flow_tables = bpf_object__find_map_by_name(obj, "flow_tables");
+	if (!flow_tables) {
+		pr_err("Cannot find flow_tables\n");
+		err = -ENOENT;
+		close(flow_tables_fd);
+		goto err;
+	}
+
+	err = bpf_map__reuse_fd(flow_tables, flow_tables_fd);
+	if (err) {
+		err = libbpf_err(err, errbuf);
+		pr_err("Failed to reuse flow_tables fd: %s\n", errbuf);
+		close(flow_tables_fd);
+		goto err;
+	}
+
+	close(flow_tables_fd);
+
 	err = bpf_object__load(obj);
 	if (err) {
 		err = libbpf_err(err, errbuf);
@@ -134,6 +178,28 @@ static int load_bpf(int ifindex, struct bpf_object **objp)
 		goto err;
 	}
 
+	flow_masks_head = bpf_object__find_map_by_name(obj, "flow_masks_head");
+	if (!flow_masks_head) {
+		pr_err("Cannot find flow_masks_head map\n");
+		err = -ENOENT;
+		goto err;
+	}
+
+	flow_masks_head_fd = bpf_map__fd(flow_masks_head);
+	if (flow_masks_head_fd < 0) {
+		err = libbpf_err(flow_masks_head_fd, errbuf);
+		pr_err("Invalid flow_masks_head fd: %s\n", errbuf);
+		goto err;
+	}
+
+	if (bpf_map_update_elem(flow_masks_head_fd, &zero, &flow_masks_tail,
+				0)) {
+		err = -errno;
+		pr_err("Failed to initialize flow_masks_head: %s\n",
+		       strerror(errno));
+		goto err;
+	}
+
 	prog = bpf_object__find_program_by_title(obj, "xdp_flow");
 	if (!prog) {
 		pr_err("Cannot find xdp_flow program\n");
-- 
1.8.3.1


^ permalink raw reply related
