* Re: [PATCH RFC net-next] net: ipvs: Adjust gso_size for IPPROTO_TCP
From: Martin KaFai Lau @ 2018-05-03 7:01 UTC
To: Julian Anastasov
Cc: netdev, David Ahern, Tom Herbert, Eric Dumazet, Nikita Shirokov,
kernel-team, lvs-devel
In-Reply-To: <alpine.LFD.2.20.1805022143360.3301@ja.home.ssi.bg>
On Wed, May 02, 2018 at 10:30:32PM +0300, Julian Anastasov wrote:
>
> Hello,
>
> On Wed, 2 May 2018, Martin KaFai Lau wrote:
>
> > On Wed, May 02, 2018 at 09:38:43AM +0300, Julian Anastasov wrote:
> > >
> > > - initial traffic for port 21 does not use GSO. But after
> > > every packet IPVS calls maybe_update_pmtu (rt->dst.ops->update_pmtu)
> > > to report the reduced MTU. These updates are stored in fnhe_pmtu
> > > but they do not go to any route, even if we try to get fresh
> > > output route. Why? Because the local routes are not cached, so
> > > they can not use the fnhe. This is what my patch for route.c
> > > will fix. With this fix FTP-DATA gets route with reduced PMTU.
> > For IPv6, the 'if (rt6->rt6i_flags & RTF_LOCAL)' gate in
> > __ip6_rt_update_pmtu() may need to be lifted also.
>
> Probably. I completely forgot the IPv6 part
> but as I don't know the IPv6 code enough, it may take
> some time to understand what can be the problem there...
> I'm not sure whether everything started with commit 0a6b2a1dc2a2,
> so that in some configurations before that commit things
> worked and problem was not noticed.
>
> I think, we should focus on such direction for IPv6:
>
> - do we remember per-VIP PMTU for the local routes
IPv6 used not to create a cache route for a DST_HOST route, i.e.
a /128 route (that includes the local /128 route).
Because of this, it had a bug: a PMTU update for the DST_HOST
route would trigger dst.ops->update_pmtu(), which then set
an expiry on the permanent /128 route instead of on a cache
route. The permanent route then unexpectedly expired and was
removed later.
The fix was to allow creating a /128 cache route as long as
it is not RTF_LOCAL, in 653437d02f1f and 7035870d1219. The
first post spelled out the problem better:
https://patchwork.ozlabs.org/patch/456050/
Later, when we started creating the cache route only after seeing
a PMTU, in 45e4fd26683c, this RTF_LOCAL check was carried over
to __ip6_rt_update_pmtu().
Off the top of my head, I don't see an issue with removing the
RTF_LOCAL check from __ip6_rt_update_pmtu().
DavidA, what do you think?
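To make the effect of that gate concrete, here is a toy userspace model, not kernel code. The flag values, the enum and the function name are all made up for illustration; only the idea (an RTF_LOCAL route is skipped by the PMTU update, any other non-cache route gets a cache clone) is taken from the description above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy flag bits: the names echo the kernel's RTF_* flags, the values are invented. */
#define TOY_RTF_LOCAL 0x1u
#define TOY_RTF_CACHE 0x2u

enum toy_pmtu_action {
	TOY_PMTU_IGNORE,         /* update dropped by the RTF_LOCAL gate */
	TOY_PMTU_UPDATE_INPLACE, /* route is already a cache clone */
	TOY_PMTU_CLONE           /* create a cache clone carrying the PMTU */
};

/* gate_local models whether the RTF_LOCAL check is still in place. */
static enum toy_pmtu_action toy_pmtu_action(uint32_t flags, bool gate_local)
{
	if (gate_local && (flags & TOY_RTF_LOCAL))
		return TOY_PMTU_IGNORE;
	if (flags & TOY_RTF_CACHE)
		return TOY_PMTU_UPDATE_INPLACE;
	return TOY_PMTU_CLONE;
}
```

With gate_local set, a local route never learns the reduced PMTU; with it cleared, the local route is treated like any other /128 route in this model.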
>
> - when exactly we start to use the new PMTU, eg. what happens
> in case socket caches the route, whether route is killed via
> dst->obsolete. Or may be while the PMTU expiration is handled
> per-packet, the PMTU change is noticed only on ICMP...
Before the sk can reuse its dst cache, it will notice that
the cached dst is no longer valid by calling dst_check().
dst_check() should return NULL, which is one of the side
effects of the earlier update_pmtu(). This dst_check()
is usually only called when the sk needs to do output,
so the new PMTU route (i.e. the RTF_CACHE IPv6 route)
only takes effect for later packets.
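A much simplified userspace sketch of that revalidation step follows. The toy_* types and functions are hypothetical; the real dst_check() also involves a cookie and a per-family check callback, which are left out here.

```c
#include <stddef.h>

struct toy_dst {
	int obsolete;        /* non-zero once the entry must not be reused */
	unsigned int pmtu;
};

struct toy_sock {
	struct toy_dst *dst_cache;
};

/* Returns the cached entry if it is still valid, else NULL. */
static struct toy_dst *toy_dst_check(struct toy_dst *dst)
{
	return (dst && !dst->obsolete) ? dst : NULL;
}

/* Output path: revalidate the cached route, fall back to a fresh one. */
static unsigned int toy_output_pmtu(struct toy_sock *sk, struct toy_dst *fresh)
{
	struct toy_dst *dst = toy_dst_check(sk->dst_cache);

	if (!dst) {
		sk->dst_cache = fresh;  /* stands in for a new route lookup */
		dst = fresh;
	}
	return dst->pmtu;
}
```

The point of the sketch is the timing: nothing changes for the socket until its next output call runs the check.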
>
> - as IPVS reports the PMTU via dst.ops->update_pmtu() long
> before any large packets are sent, do we propagate the
> PMTU. Also, for IPv4 __ip_rt_update_pmtu() has some protection
> from such per-packet updates that do not change the PMTU.
>
> - if IPVS starts to send ICMP when gso_size exceeds PMTU,
> like in my draft patch, whether the PMTU is propagated
> to route and then to socket. As for the gso_size decrease,
> playing in IPVS is not very safe, at least, we need help
> from GSO experts to know how we should use it.
>
> Regards
>
> --
> Julian Anastasov <ja@ssi.bg>
* Re: DSA switch
From: Ran Shalit @ 2018-05-03 6:50 UTC
To: Andrew Lunn; +Cc: netdev
In-Reply-To: <20180502205620.GE24748@lunn.ch>
On Wed, May 2, 2018 at 11:56 PM, Andrew Lunn <andrew@lunn.ch> wrote:
> On Wed, May 02, 2018 at 11:20:05PM +0300, Ran Shalit wrote:
>> Hello,
>>
>> Is it possible to use switch just like external real switch,
>> connecting all ports to the same subnet ?
>
> Yes. Just bridge all ports/interfaces together and put your host IP
> address on the bridge.
>
> Andrew
Hi,
I get an error when trying to add a bridge.
I am trying to understand which configuration is probably missing in my kernel.
I ran strace, but I am not sure: does it point to any missing configuration?
root@dm814x-evm:~# ip link add br0 type bridge
RTNETLINK answers: Operation not supported
root@dm814x-evm:~# ./strace ip link add br0 type bridge
execve("/bin/ip", ["ip", "link", "add", "br0", "type", "bridge"], [/*
11 vars */]) = 0
brk(0) = 0x44000
uname({sys="Linux", node="dm814x-evm", ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0x400c1000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/lib/tls/v7l/fast-mult/half/libresolv.so.2", O_RDONLY) = -1
ENOENT (No such file or directory)
stat64("/lib/tls/v7l/fast-mult/half", 0xbe8bb3c0) = -1 ENOENT (No such
file or directory)
open("/lib/tls/v7l/fast-mult/libresolv.so.2", O_RDONLY) = -1 ENOENT
(No such file or directory)
stat64("/lib/tls/v7l/fast-mult", 0xbe8bb3c0) = -1 ENOENT (No such file
or directory)
open("/lib/tls/v7l/half/libresolv.so.2", O_RDONLY) = -1 ENOENT (No
such file or directory)
stat64("/lib/tls/v7l/half", 0xbe8bb3c0) = -1 ENOENT (No such file or directory)
open("/lib/tls/v7l/libresolv.so.2", O_RDONLY) = -1 ENOENT (No such
file or directory)
stat64("/lib/tls/v7l", 0xbe8bb3c0) = -1 ENOENT (No such file or directory)
open("/lib/tls/fast-mult/half/libresolv.so.2", O_RDONLY) = -1 ENOENT
(No such file or directory)
stat64("/lib/tls/fast-mult/half", 0xbe8bb3c0) = -1 ENOENT (No such
file or directory)
open("/lib/tls/fast-mult/libresolv.so.2", O_RDONLY) = -1 ENOENT (No
such file or directory)
stat64("/lib/tls/fast-mult", 0xbe8bb3c0) = -1 ENOENT (No such file or directory)
open("/lib/tls/half/libresolv.so.2", O_RDONLY) = -1 ENOENT (No such
file or directory)
stat64("/lib/tls/half", 0xbe8bb3c0) = -1 ENOENT (No such file or directory)
open("/lib/tls/libresolv.so.2", O_RDONLY) = -1 ENOENT (No such file or
directory)
stat64("/lib/tls", 0xbe8bb3c0) = -1 ENOENT (No such file or directory)
open("/lib/v7l/fast-mult/half/libresolv.so.2", O_RDONLY) = -1 ENOENT
(No such file or directory)
stat64("/lib/v7l/fast-mult/half", 0xbe8bb3c0) = -1 ENOENT (No such
file or directory)
open("/lib/v7l/fast-mult/libresolv.so.2", O_RDONLY) = -1 ENOENT (No
such file or directory)
stat64("/lib/v7l/fast-mult", 0xbe8bb3c0) = -1 ENOENT (No such file or directory)
open("/lib/v7l/half/libresolv.so.2", O_RDONLY) = -1 ENOENT (No such
file or directory)
stat64("/lib/v7l/half", 0xbe8bb3c0) = -1 ENOENT (No such file or directory)
open("/lib/v7l/libresolv.so.2", O_RDONLY) = -1 ENOENT (No such file or
directory)
stat64("/lib/v7l", 0xbe8bb3c0) = -1 ENOENT (No such file or directory)
open("/lib/fast-mult/half/libresolv.so.2", O_RDONLY) = -1 ENOENT (No
such file or directory)
stat64("/lib/fast-mult/half", 0xbe8bb3c0) = -1 ENOENT (No such file or
directory)
open("/lib/fast-mult/libresolv.so.2", O_RDONLY) = -1 ENOENT (No such
file or directory)
stat64("/lib/fast-mult", 0xbe8bb3c0) = -1 ENOENT (No such file or directory)
open("/lib/half/libresolv.so.2", O_RDONLY) = -1 ENOENT (No such file
or directory)
stat64("/lib/half", 0xbe8bb3c0) = -1 ENOENT (No such file or directory)
open("/lib/libresolv.so.2", O_RDONLY) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0(\0\1\0\0\0\234
\0\0004\0\0\0"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=67624, ...}) = 0
mmap2(NULL, 108588, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3,
0) = 0x40164000
mprotect(0x40174000, 28672, PROT_NONE) = 0
mmap2(0x4017b000, 8192, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xf) = 0x4017b000
mmap2(0x4017d000, 6188, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x4017d000
close(3) = 0
open("/lib/libdl.so.2", O_RDONLY) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0(\0\1\0\0\0l\n\0\0004\0\0\0"...,
512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=18080, ...}) = 0
mmap2(NULL, 49364, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3,
0) = 0x400b2000
mprotect(0x400b6000, 28672, PROT_NONE) = 0
mmap2(0x400bd000, 8192, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3) = 0x400bd000
close(3) = 0
open("/lib/libgcc_s.so.1", O_RDONLY) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0(\0\1\0\0\0x'\0\0004\0\0\0"...,
512) = 512
fstat64(3, {st_mode=S_IFREG|0644, st_size=70650, ...}) = 0
mmap2(NULL, 79984, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3,
0) = 0x400d1000
mprotect(0x400dd000, 28672, PROT_NONE) = 0
mmap2(0x400e4000, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xb) = 0x400e4000
close(3) = 0
open("/lib/libc.so.6", O_RDONLY) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0(\0\1\0\0\0\240Q\1\0004\0\0\0"...,
512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=1181160, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0x400c2000
mmap2(NULL, 1217096, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE,
3, 0) = 0x4017f000
mprotect(0x4029c000, 28672, PROT_NONE) = 0
mmap2(0x402a3000, 12288, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x11c) = 0x402a3000
mmap2(0x402a6000, 8776, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x402a6000
close(3) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0x400e5000
set_tls(0x400e54a0, 0x400e5b70, 0x4002867c, 0x400e5b78, 0x40028050) = 0
mprotect(0x402a3000, 8192, PROT_READ) = 0
mprotect(0x400bd000, 4096, PROT_READ) = 0
mprotect(0x4017b000, 4096, PROT_READ) = 0
mprotect(0x40027000, 4096, PROT_READ) = 0
socket(PF_NETLINK, SOCK_RAW, 0) = 3
setsockopt(3, SOL_SOCKET, SO_SNDBUF, [32768], 4) = 0
setsockopt(3, SOL_SOCKET, SO_RCVBUF, [1048576], 4) = 0
bind(3, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 0
getsockname(3, {sa_family=AF_NETLINK, pid=1274, groups=00000000}, [12]) = 0
gettimeofday({1356950670, 688093}, NULL) = 0
send(3, " \0\0\0\20\0\5\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0",
32, 0) = 32
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0,
groups=00000000},
msg_iov(1)=[{"4\0\0\0\2\0\0\0\0\0\0\0\372\4\0\0\355\377\377\377
\0\0\0\20\0\5\0\0\0\0\0"..., 8192}], msg_controllen=0, msg_flags=0},
0) = 52
send(3, "\24\0\0\0\22\0\1\3\217l\341P\0\0\0\0\0\0\0\0", 20, 0) = 20
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0,
groups=00000000},
msg_iov(1)=[{"\254\1\0\0\20\0\2\0\217l\341P\372\4\0\0\0\0\4\3\1\0\0\0I\0\1\0\0\0\0\0"...,
16384}], msg_controllen=0, msg_flags=0}, 0) = 2664
brk(0) = 0x44000
brk(0x65000) = 0x65000
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0,
groups=00000000},
msg_iov(1)=[{"\24\0\0\0\3\0\2\0\217l\341P\372\4\0\0\0\0\0\0\1\0\0\0I\0\1\0\0\0\0\0"...,
16384}], msg_controllen=0, msg_flags=0}, 0) = 20
open("/usr/lib//ip/link_bridge.so", O_RDONLY) = -1 ENOENT (No such
file or directory)
sendmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0,
groups=00000000},
msg_iov(1)=[{"8\0\0\0\20\0\5\6\220l\341P\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
56}], msg_controllen=0, msg_flags=0}, 0) = 56
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0,
groups=00000000},
msg_iov(1)=[{"L\0\0\0\2\0\0\0\220l\341P\372\4\0\0\241\377\377\3778\0\0\0\20\0\5\6\220l\341P"...,
16384}], msg_controllen=0, msg_flags=0}, 0) = 76
dup(2) = 4
fcntl64(4, F_GETFL) = 0x20002 (flags O_RDWR|O_LARGEFILE)
fstat64(4, {st_mode=S_IFCHR|0600, st_rdev=makedev(252, 0), ...}) = 0
ioctl(4, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or
TCGETS, {B115200 opost isig icanon echo ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0x4011b000
_llseek(4, 0, 0xbe8b7510, SEEK_CUR) = -1 ESPIPE (Illegal seek)
write(4, "RTNETLINK answers: Operation not"..., 43RTNETLINK answers:
Operation not supported
) = 43
close(4) = 0
munmap(0x4011b000, 4096) = 0
exit_group(2) = ?
root@dm814x-evm:~#
Thank you,
ran
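As a reading aid for the trace above: in an NLMSG_ERROR reply the 32-bit error code sits right after the 16-byte netlink header, so it can be decoded from the escaped bytes strace prints. The sketch below does only that; the helper name is made up, and the byte values are copied from the two recvmsg() replies in the trace.

```c
#include <stdint.h>

/* Decode a little-endian signed 32-bit value from four raw bytes. */
static int32_t decode_le32(const unsigned char b[4])
{
	uint32_t v = (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
		     ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);

	return (int32_t)v;
}

/* Error fields as shown by strace (octal escapes). */
static const unsigned char probe_reply_err[4] = { 0355, 0377, 0377, 0377 };
static const unsigned char add_reply_err[4]   = { 0241, 0377, 0377, 0377 };
```

The reply to the final sendmsg() decodes to -95, which on Linux is EOPNOTSUPP and matches the "Operation not supported" message; the earlier probe decodes to -19 (ENODEV). The decoding does not by itself say which kernel option is missing.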
* Re: [PATCH net-next v3 2/2] openvswitch: Support conntrack zone limit
From: Pravin Shelar @ 2018-05-03 6:49 UTC
To: Yi-Hung Wei; +Cc: Linux Kernel Network Developers
In-Reply-To: <1525123713-38891-3-git-send-email-yihung.wei@gmail.com>
On Mon, Apr 30, 2018 at 2:28 PM, Yi-Hung Wei <yihung.wei@gmail.com> wrote:
> Currently, nf_conntrack_max is used to limit the maximum number of
> conntrack entries in the conntrack table for every network namespace.
> For the VMs and containers that reside in the same namespace,
> they share the same conntrack table, and the total # of conntrack entries
> for all the VMs and containers are limited by nf_conntrack_max. In this
> case, if one of the VMs/containers abuses the usage of the conntrack entries,
> it blocks the others from committing valid conntrack entries into the
> conntrack table. Even if we can possibly put the VM in different network
> namespace, the current nf_conntrack_max configuration is kind of rigid
> that we cannot limit different VM/container to have different # conntrack
> entries.
>
> To address the aforementioned issue, this patch proposes to have a
> fine-grained mechanism that could further limit the # of conntrack entries
> per-zone. For example, we can designate different zone to different VM,
> and set conntrack limit to each zone. By providing this isolation, a
> mis-behaved VM only consumes the conntrack entries in its own zone, and
> it will not influence other well-behaved VMs. Moreover, the users can
> set various conntrack limit to different zone based on their preference.
>
> The proposed implementation utilizes Netfilter's nf_conncount backend
> to count the number of connections in a particular zone. If the number of
> connection is above a configured limitation, ovs will return ENOMEM to the
> userspace. If userspace does not configure the zone limit, the limit
> defaults to zero that is no limitation, which is backward compatible to
> the behavior without this patch.
>
> The following high-level APIs are provided to userspace:
> - OVS_CT_LIMIT_CMD_SET:
> * set default connection limit for all zones
> * set the connection limit for a particular zone
> - OVS_CT_LIMIT_CMD_DEL:
> * remove the connection limit for a particular zone
> - OVS_CT_LIMIT_CMD_GET:
> * get the default connection limit for all zones
> * get the connection limit for a particular zone
>
> Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
> ---
> net/openvswitch/Kconfig | 3 +-
> net/openvswitch/conntrack.c | 508 +++++++++++++++++++++++++++++++++++++++++++-
> net/openvswitch/conntrack.h | 9 +-
> net/openvswitch/datapath.c | 7 +-
> net/openvswitch/datapath.h | 1 +
> 5 files changed, 522 insertions(+), 6 deletions(-)
>
..
> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
> index c5904f629091..8234964889d9 100644
> --- a/net/openvswitch/conntrack.c
> +++ b/net/openvswitch/conntrack.c
...
> +/* Call with ovs_mutex */
> +static void ct_limit_del(const struct ovs_ct_limit_info *info, u16 zone)
> +{
> + struct ovs_ct_limit *ct_limit;
> + struct hlist_head *head;
> +
> + head = ct_limit_hash_bucket(info, zone);
> + hlist_for_each_entry_rcu(ct_limit, head, hlist_node) {
better to use hlist_for_each_entry_safe()
> + if (ct_limit->zone == zone) {
> + hlist_del_rcu(&ct_limit->hlist_node);
> + kfree_rcu(ct_limit, rcu);
> + return;
> + }
> + }
> +}
> +
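For readers unfamiliar with the _safe iterators, the pattern is to remember the next node before the current one may be unlinked and freed. The sketch below shows that pattern on a plain userspace singly linked list; the toy_* names are hypothetical and there is no RCU here, unlike the hlist_del_rcu()/kfree_rcu() pair in the quoted code.

```c
#include <stdlib.h>

struct toy_limit {
	unsigned short zone;
	struct toy_limit *next;
};

/* Prepend a node; on allocation failure the list is returned unchanged. */
static struct toy_limit *toy_push(struct toy_limit *head, unsigned short zone)
{
	struct toy_limit *n = malloc(sizeof(*n));

	if (!n)
		return head;
	n->zone = zone;
	n->next = head;
	return n;
}

/* Remove every node for 'zone'; 'next' is saved before 'cur' is freed. */
static struct toy_limit *toy_del_zone(struct toy_limit *head, unsigned short zone)
{
	struct toy_limit **link = &head;
	struct toy_limit *cur, *next;

	for (cur = head; cur; cur = next) {
		next = cur->next;
		if (cur->zone == zone) {
			*link = next;
			free(cur);
		} else {
			link = &cur->next;
		}
	}
	return head;
}

static int toy_count(const struct toy_limit *head)
{
	int n = 0;

	for (; head; head = head->next)
		n++;
	return n;
}
```

Because the loop never reads cur->next after the free, it keeps working even when several entries are removed in one pass.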
....
> +static int ovs_ct_check_limit(struct net *net,
> + const struct ovs_conntrack_info *info,
> + const struct nf_conntrack_tuple *tuple)
> +{
> + struct ovs_net *ovs_net = net_generic(net, ovs_net_id);
> + const struct ovs_ct_limit_info *ct_limit_info = ovs_net->ct_limit_info;
> + u32 per_zone_limit, connections;
> + u32 conncount_key[5];
> +
> + conncount_key[0] = info->zone.id;
> +
> + rcu_read_lock();
This function is called with rcu_read_lock() held in the datapath, so there
is no need to take it again.
> + per_zone_limit = ct_limit_get(ct_limit_info, info->zone.id);
> + if (per_zone_limit == OVS_CT_LIMIT_UNLIMITED) {
> + rcu_read_unlock();
> + return 0;
> + }
> +
> + connections = nf_conncount_count(net, ct_limit_info->data,
> + conncount_key, tuple, &info->zone);
> + if (connections > per_zone_limit) {
> + rcu_read_unlock();
> + return -ENOMEM;
> + }
> +
> + rcu_read_unlock();
> + return 0;
> +}
> +#endif
> +
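With the locking removed, the decision in the quoted function reduces to a pure comparison. A minimal userspace sketch of that decision, assuming (as the commit message states) that a limit of zero means "no limit"; the TOY_ constant and the function name are made up:

```c
#include <errno.h>
#include <stdint.h>

/* Assumption taken from the commit message: zero means unlimited. */
#define TOY_CT_LIMIT_UNLIMITED 0u

/* Returns 0 if a new connection is allowed, -ENOMEM if the zone is over its limit. */
static int toy_ct_check_limit(uint32_t per_zone_limit, uint32_t connections)
{
	if (per_zone_limit == TOY_CT_LIMIT_UNLIMITED)
		return 0;
	return connections > per_zone_limit ? -ENOMEM : 0;
}
```

Note that, as in the quoted code, a count equal to the limit still passes; only a count above it is rejected.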
....
>
> static void __net_exit list_vports_from_net(struct net *net, struct net *dnet,
> @@ -2469,3 +2471,4 @@ MODULE_ALIAS_GENL_FAMILY(OVS_VPORT_FAMILY);
> MODULE_ALIAS_GENL_FAMILY(OVS_FLOW_FAMILY);
> MODULE_ALIAS_GENL_FAMILY(OVS_PACKET_FAMILY);
> MODULE_ALIAS_GENL_FAMILY(OVS_METER_FAMILY);
> +MODULE_ALIAS_GENL_FAMILY(OVS_CT_LIMIT_FAMILY);
> diff --git a/net/openvswitch/datapath.h b/net/openvswitch/datapath.h
> index 523d65526766..51bd4dcb6c8b 100644
> --- a/net/openvswitch/datapath.h
> +++ b/net/openvswitch/datapath.h
> @@ -144,6 +144,7 @@ struct dp_upcall_info {
> struct ovs_net {
> struct list_head dps;
> struct work_struct dp_notify_work;
> + struct ovs_ct_limit_info *ct_limit_info;
>
Let's keep this struct and the hash table inside ovs_net to avoid
indirections in accessing the hash table. We also need to check for
IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT).
* [PATCH net-next] cxgb4: update latest firmware version supported
From: Ganesh Goudar @ 2018-05-03 6:24 UTC
To: netdev, davem; +Cc: nirranjan, indranil, venkatesh, Ganesh Goudar
Change t4fw_version.h to update latest firmware version
number to 1.19.1.0.
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
---
drivers/net/ethernet/chelsio/cxgb4/t4fw_version.h | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4fw_version.h b/drivers/net/ethernet/chelsio/cxgb4/t4fw_version.h
index 123e2c1..4eb15ce 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4fw_version.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4fw_version.h
@@ -36,8 +36,8 @@
#define __T4FW_VERSION_H__
#define T4FW_VERSION_MAJOR 0x01
-#define T4FW_VERSION_MINOR 0x10
-#define T4FW_VERSION_MICRO 0x3F
+#define T4FW_VERSION_MINOR 0x13
+#define T4FW_VERSION_MICRO 0x01
#define T4FW_VERSION_BUILD 0x00
#define T4FW_MIN_VERSION_MAJOR 0x01
@@ -45,8 +45,8 @@
#define T4FW_MIN_VERSION_MICRO 0x00
#define T5FW_VERSION_MAJOR 0x01
-#define T5FW_VERSION_MINOR 0x10
-#define T5FW_VERSION_MICRO 0x3F
+#define T5FW_VERSION_MINOR 0x13
+#define T5FW_VERSION_MICRO 0x01
#define T5FW_VERSION_BUILD 0x00
#define T5FW_MIN_VERSION_MAJOR 0x00
@@ -54,8 +54,8 @@
#define T5FW_MIN_VERSION_MICRO 0x00
#define T6FW_VERSION_MAJOR 0x01
-#define T6FW_VERSION_MINOR 0x10
-#define T6FW_VERSION_MICRO 0x3F
+#define T6FW_VERSION_MINOR 0x13
+#define T6FW_VERSION_MICRO 0x01
#define T6FW_VERSION_BUILD 0x00
#define T6FW_MIN_VERSION_MAJOR 0x00
--
2.1.0
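The version macros in the diff are hexadecimal, so the link to the dotted version in the commit message is easy to miss. A small sketch that formats the four bytes as dotted decimal (the function name is hypothetical, not part of the driver):

```c
#include <stddef.h>
#include <stdio.h>

/* Format four version components as "major.minor.micro.build". */
static int fw_version_str(char *buf, size_t len, unsigned int major,
			  unsigned int minor, unsigned int micro,
			  unsigned int build)
{
	return snprintf(buf, len, "%u.%u.%u.%u", major, minor, micro, build);
}
```

Called with the new values (0x01, 0x13, 0x01, 0x00) this gives 1.19.1.0, and with the removed ones (0x01, 0x10, 0x3F, 0x00) it gives 1.16.63.0.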
* [PATCH v6] bpf, x86_32: add eBPF JIT compiler for ia32
From: Wang YanQing @ 2018-05-03 6:10 UTC
To: daniel
Cc: ast, illusionist.neo, tglx, mingo, hpa, davem, x86, netdev,
linux-kernel
The JIT compiler emits ia32 instructions. Currently, it supports eBPF
only. Classic BPF is supported through the conversion done by the BPF core.
Almost all instructions of the eBPF ISA are supported, except the following:
BPF_ALU64 | BPF_DIV | BPF_K
BPF_ALU64 | BPF_DIV | BPF_X
BPF_ALU64 | BPF_MOD | BPF_K
BPF_ALU64 | BPF_MOD | BPF_X
BPF_STX | BPF_XADD | BPF_W
BPF_STX | BPF_XADD | BPF_DW
It doesn't support BPF_JMP|BPF_CALL with BPF_PSEUDO_CALL at the moment.
IA32 has few general purpose registers: EAX|EDX|ECX|EBX|ESI|EDI. I use
EAX|EDX|ECX|EBX as temporary registers to simulate instructions in the eBPF
ISA, and allocate ESI|EDI to BPF_REG_AX for constant blinding; all other
eBPF registers, R0-R10, are simulated through scratch space on the stack.
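To illustrate what simulating a 64-bit eBPF register with 32-bit halves means, here is a userspace sketch of a 64-bit add done the way an add/adc instruction pair does it. The struct and function names are made up; this is not code from the patch.

```c
#include <stdint.h>

/* One eBPF register kept as two 32-bit halves, like a lo/hi stack slot pair. */
struct reg64 {
	uint32_t lo;
	uint32_t hi;
};

/* dst += src: add the low halves, then the high halves plus the carry. */
static void reg64_add(struct reg64 *dst, const struct reg64 *src)
{
	uint32_t lo = dst->lo + src->lo;
	uint32_t carry = lo < dst->lo;   /* unsigned wrap-around means carry out */

	dst->lo = lo;
	dst->hi = dst->hi + src->hi + carry;
}

static uint64_t reg64_value(const struct reg64 *r)
{
	return ((uint64_t)r->hi << 32) | r->lo;
}
```

Every 64-bit ALU operation needs this kind of two-step handling, which is why several temporary hardware registers are wanted.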
The reasons behind the hardware register allocation policy are:
1:MUL needs EAX:EDX and shift operations need ECX, so they are not fit
for general eBPF 64bit register simulation.
2:We need at least 4 registers to simulate most eBPF ISA operations
on register operands instead of on register&memory operands.
3:We need to put BPF_REG_AX in hardware registers, or constant blinding
will degrade jit performance heavily.
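As background for point 1: a 32-bit MUL yields a 64-bit product in EDX:EAX, and the low 64 bits of a 64x64 multiply can be assembled from three such products. The following is a sketch of the arithmetic only, with hypothetical names; it is not claimed to mirror the instruction sequence the patch emits.

```c
#include <stdint.h>

/* 32x32 -> 64 multiply; on IA32 one MUL leaves this result in EDX:EAX. */
static uint64_t mul32x32(uint32_t a, uint32_t b)
{
	return (uint64_t)a * b;
}

/* Low 64 bits of a 64x64 product, built from 32-bit halves. */
static uint64_t mul64_from_halves(uint32_t a_lo, uint32_t a_hi,
				  uint32_t b_lo, uint32_t b_hi)
{
	uint64_t low = mul32x32(a_lo, b_lo);
	uint32_t cross = a_lo * b_hi + a_hi * b_lo; /* only the low 32 bits matter */
	uint32_t hi = (uint32_t)(low >> 32) + cross;

	return ((uint64_t)hi << 32) | (uint32_t)low;
}
```

Since each partial product passes through the same fixed register pair, those registers cannot also hold long-lived eBPF state.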
Tested on PC (Intel(R) Core(TM) i5-5200U CPU).
Testing results on i5-5200U:
1) test_bpf: Summary: 349 PASSED, 0 FAILED, [319/341 JIT'ed]
2) test_progs: Summary: 83 PASSED, 0 FAILED.
3) test_lpm: OK
4) test_lru_map: OK
5) test_verifier: Summary: 828 PASSED, 0 FAILED.
The above tests were all run under the following two conditions separately:
1:bpf_jit_enable=1 and bpf_jit_harden=0
2:bpf_jit_enable=1 and bpf_jit_harden=2
Below are some numbers for this jit implementation:
Note:
I ran test_progs in kselftest 100 times in a row for every condition;
the numbers are in the format total/times=avg.
The numbers that test_bpf reports show almost the same relation.
a:jit_enable=0 and jit_harden=0            b:jit_enable=1 and jit_harden=0
test_pkt_access:PASS:ipv4:15622/100=156    test_pkt_access:PASS:ipv4:10674/100=106
test_pkt_access:PASS:ipv6:9130/100=91      test_pkt_access:PASS:ipv6:4855/100=48
test_xdp:PASS:ipv4:240198/100=2401         test_xdp:PASS:ipv4:138912/100=1389
test_xdp:PASS:ipv6:137326/100=1373         test_xdp:PASS:ipv6:68542/100=685
test_l4lb:PASS:ipv4:61100/100=611          test_l4lb:PASS:ipv4:37302/100=373
test_l4lb:PASS:ipv6:101000/100=1010        test_l4lb:PASS:ipv6:55030/100=550
c:jit_enable=1 and jit_harden=2
test_pkt_access:PASS:ipv4:10558/100=105
test_pkt_access:PASS:ipv6:5092/100=50
test_xdp:PASS:ipv4:131902/100=1319
test_xdp:PASS:ipv6:77932/100=779
test_l4lb:PASS:ipv4:38924/100=389
test_l4lb:PASS:ipv6:57520/100=575
The numbers show we get a 30%~50% improvement.
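The percentage can be rechecked from the averaged values in the table. A tiny helper for that, with a made-up name and integer rounding down:

```c
/* Percentage reduction from 'before' to 'after', rounded down.
 * 'before' must be non-zero and not smaller than 'after'. */
static unsigned int improvement_pct(unsigned int before, unsigned int after)
{
	return (before - after) * 100u / before;
}
```

Comparing column a against column b, this gives 32 for test_pkt_access ipv4 (156 vs 106) and 50 for test_xdp ipv6 (1373 vs 685); the other rows fall in between.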
See Documentation/networking/filter.txt for more information.
Signed-off-by: Wang YanQing <udknight@gmail.com>
---
Changes v5-v6:
1:Add do {} while (0) to RETPOLINE_RAX_BPF_JIT for
consistency.
2:Clean up non-standard comments, reported by Daniel Borkmann.
3:Fix a memory leak issue, reported by Daniel Borkmann.
Changes v4-v5:
1:Delete is_on_stack, BPF_REG_AX is the only one
on real hardware registers, so just check with
it.
2:Apply commit 1612a981b766 ("bpf, x64: fix JIT emission
for dead code"), suggested by Daniel Borkmann.
Changes v3-v4:
1:Fix changelog in commit.
I installed llvm-6.0, and test_progs no longer reports errors.
I submitted another patch:
"bpf: fix misaligned access for BPF_PROG_TYPE_PERF_EVENT program type on x86_32 platform"
to fix another problem; with that patch, test_verifier no longer reports errors either.
2:Fix r0[1] being cleared twice unnecessarily in *BPF_IND|BPF_ABS* simulation.
Changes v2-v3:
1:Move BPF_REG_AX to real hardware registers for performance reason.
3:Using bpf_load_pointer instead of bpf_jit32.S, suggested by Daniel Borkmann.
4:Delete partial codes in 1c2a088a6626, suggested by Daniel Borkmann.
5:Some bug fixes and comments improvement.
Changes v1-v2:
1:Fix bug in emit_ia32_neg64.
2:Fix bug in emit_ia32_arsh_r64.
3:Delete filename in top level comment, suggested by Thomas Gleixner.
4:Delete unnecessary boiler plate text, suggested by Thomas Gleixner.
5:Rewrite some words in changelog.
6:CodingStyle improvements and a few more comments.
arch/x86/Kconfig | 2 +-
arch/x86/include/asm/nospec-branch.h | 30 +-
arch/x86/net/Makefile | 9 +-
arch/x86/net/bpf_jit_comp32.c | 2553 ++++++++++++++++++++++++++++++++++
4 files changed, 2588 insertions(+), 6 deletions(-)
create mode 100644 arch/x86/net/bpf_jit_comp32.c
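As a reading aid for the EMIT helpers and the RETPOLINE_EDX_BPF_JIT macro in the diff below, here is a userspace sketch that appends opcode bytes least significant byte first and rebuilds the 32-bit thunk's byte sequence. The toy_* names and the fixed 32-byte buffer are assumptions for illustration; the byte values are copied from the macro.

```c
#include <stddef.h>
#include <stdint.h>

struct toy_emitter {
	uint8_t buf[32];
	size_t len;
};

/* Append the low 'n' bytes of 'bytes', least significant byte first,
 * which is the order a little-endian store produces. */
static void toy_emit(struct toy_emitter *e, uint32_t bytes, unsigned int n)
{
	unsigned int i;

	for (i = 0; i < n && e->len < sizeof(e->buf); i++)
		e->buf[e->len++] = (uint8_t)(bytes >> (8 * i));
}

/* Byte sequence of the 32-bit retpoline thunk. */
static void toy_emit_retpoline_edx(struct toy_emitter *e)
{
	toy_emit(e, 0xE8, 1);                               /* call do_rop ... */
	toy_emit(e, 7, 4);                                  /* ... rel32 = 7 */
	toy_emit(e, 0xF3 | (0x90 << 8), 2);                 /* spec_trap: pause */
	toy_emit(e, 0x0F | (0xAE << 8) | (0xE8 << 16), 3);  /* lfence */
	toy_emit(e, 0xEB | (0xF9 << 8), 2);                 /* jmp spec_trap */
	toy_emit(e, 0x89 | (0x14 << 8) | (0x24 << 16), 3);  /* do_rop: mov %edx,(%esp) */
	toy_emit(e, 0xC3, 1);                               /* ret */
}
```

The sequence comes to 16 bytes, one less than RETPOLINE_RAX_BPF_JIT_SIZE on x86_64, because the 32-bit mov is three bytes where the 64-bit one is four. The call offset of 7 skips the pause, lfence and jmp that make up spec_trap.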
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c07f492..d51a71d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -138,7 +138,7 @@ config X86
select HAVE_DMA_CONTIGUOUS
select HAVE_DYNAMIC_FTRACE
select HAVE_DYNAMIC_FTRACE_WITH_REGS
- select HAVE_EBPF_JIT if X86_64
+ select HAVE_EBPF_JIT
select HAVE_EFFICIENT_UNALIGNED_ACCESS
select HAVE_EXIT_THREAD
select HAVE_FENTRY if X86_64 || DYNAMIC_FTRACE
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index f928ad9..2cd344d 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -291,16 +291,20 @@ static inline void indirect_branch_prediction_barrier(void)
* lfence
* jmp spec_trap
* do_rop:
- * mov %rax,(%rsp)
+ * mov %rax,(%rsp) for x86_64
+ * mov %edx,(%esp) for x86_32
* retq
*
* Without retpolines configured:
*
- * jmp *%rax
+ * jmp *%rax for x86_64
+ * jmp *%edx for x86_32
*/
#ifdef CONFIG_RETPOLINE
+#ifdef CONFIG_X86_64
# define RETPOLINE_RAX_BPF_JIT_SIZE 17
# define RETPOLINE_RAX_BPF_JIT() \
+do { \
EMIT1_off32(0xE8, 7); /* callq do_rop */ \
/* spec_trap: */ \
EMIT2(0xF3, 0x90); /* pause */ \
@@ -308,11 +312,31 @@ static inline void indirect_branch_prediction_barrier(void)
EMIT2(0xEB, 0xF9); /* jmp spec_trap */ \
/* do_rop: */ \
EMIT4(0x48, 0x89, 0x04, 0x24); /* mov %rax,(%rsp) */ \
- EMIT1(0xC3); /* retq */
+ EMIT1(0xC3); /* retq */ \
+} while (0)
#else
+# define RETPOLINE_EDX_BPF_JIT() \
+do { \
+ EMIT1_off32(0xE8, 7); /* call do_rop */ \
+ /* spec_trap: */ \
+ EMIT2(0xF3, 0x90); /* pause */ \
+ EMIT3(0x0F, 0xAE, 0xE8); /* lfence */ \
+ EMIT2(0xEB, 0xF9); /* jmp spec_trap */ \
+ /* do_rop: */ \
+ EMIT3(0x89, 0x14, 0x24); /* mov %edx,(%esp) */ \
+ EMIT1(0xC3); /* ret */ \
+} while (0)
+#endif
+#else /* !CONFIG_RETPOLINE */
+
+#ifdef CONFIG_X86_64
# define RETPOLINE_RAX_BPF_JIT_SIZE 2
# define RETPOLINE_RAX_BPF_JIT() \
EMIT2(0xFF, 0xE0); /* jmp *%rax */
+#else
+# define RETPOLINE_EDX_BPF_JIT() \
+ EMIT2(0xFF, 0xE2) /* jmp *%edx */
+#endif
#endif
#endif /* _ASM_X86_NOSPEC_BRANCH_H_ */
diff --git a/arch/x86/net/Makefile b/arch/x86/net/Makefile
index fefb4b6..f54c9d4 100644
--- a/arch/x86/net/Makefile
+++ b/arch/x86/net/Makefile
@@ -1,6 +1,11 @@
#
# Arch-specific network modules
#
-OBJECT_FILES_NON_STANDARD_bpf_jit.o += y
-obj-$(CONFIG_BPF_JIT) += bpf_jit.o bpf_jit_comp.o
+
+ifeq ($(CONFIG_X86_32),y)
+ obj-$(CONFIG_BPF_JIT) += bpf_jit_comp32.o
+else
+ OBJECT_FILES_NON_STANDARD_bpf_jit.o += y
+ obj-$(CONFIG_BPF_JIT) += bpf_jit.o bpf_jit_comp.o
+endif
diff --git a/arch/x86/net/bpf_jit_comp32.c b/arch/x86/net/bpf_jit_comp32.c
new file mode 100644
index 0000000..61e6134
--- /dev/null
+++ b/arch/x86/net/bpf_jit_comp32.c
@@ -0,0 +1,2553 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Just-In-Time compiler for eBPF filters on IA32 (32bit x86)
+ *
+ * Author: Wang YanQing (udknight@gmail.com)
+ * The code based on code and ideas from:
+ * Eric Dumazet (eric.dumazet@gmail.com)
+ * and from:
+ * Shubham Bansal <illusionist.neo@gmail.com>
+ */
+
+#include <linux/netdevice.h>
+#include <linux/filter.h>
+#include <linux/if_vlan.h>
+#include <asm/cacheflush.h>
+#include <asm/set_memory.h>
+#include <asm/nospec-branch.h>
+#include <linux/bpf.h>
+
+/*
+ * eBPF prog stack layout:
+ *
+ * high
+ * original ESP => +-----+
+ * | | callee saved registers
+ * +-----+
+ * | ... | eBPF JIT scratch space
+ * BPF_FP,IA32_EBP => +-----+
+ * | ... | eBPF prog stack
+ * +-----+
+ * |RSVD | JIT scratchpad
+ * current ESP => +-----+
+ * | |
+ * | ... | Function call stack
+ * | |
+ * +-----+
+ * low
+ *
+ * The callee saved registers:
+ *
+ * high
+ * original ESP => +------------------+ \
+ * | ebp | |
+ * current EBP => +------------------+ } callee saved registers
+ * | ebx,esi,edi | |
+ * +------------------+ /
+ * low
+ */
+
+static u8 *emit_code(u8 *ptr, u32 bytes, unsigned int len)
+{
+ if (len == 1)
+ *ptr = bytes;
+ else if (len == 2)
+ *(u16 *)ptr = bytes;
+ else {
+ *(u32 *)ptr = bytes;
+ barrier();
+ }
+ return ptr + len;
+}
+
+#define EMIT(bytes, len) \
+ do { prog = emit_code(prog, bytes, len); cnt += len; } while (0)
+
+#define EMIT1(b1) EMIT(b1, 1)
+#define EMIT2(b1, b2) EMIT((b1) + ((b2) << 8), 2)
+#define EMIT3(b1, b2, b3) EMIT((b1) + ((b2) << 8) + ((b3) << 16), 3)
+#define EMIT4(b1, b2, b3, b4) \
+ EMIT((b1) + ((b2) << 8) + ((b3) << 16) + ((b4) << 24), 4)
+
+#define EMIT1_off32(b1, off) \
+ do { EMIT1(b1); EMIT(off, 4); } while (0)
+#define EMIT2_off32(b1, b2, off) \
+ do { EMIT2(b1, b2); EMIT(off, 4); } while (0)
+#define EMIT3_off32(b1, b2, b3, off) \
+ do { EMIT3(b1, b2, b3); EMIT(off, 4); } while (0)
+#define EMIT4_off32(b1, b2, b3, b4, off) \
+ do { EMIT4(b1, b2, b3, b4); EMIT(off, 4); } while (0)
+
+#define jmp_label(label, jmp_insn_len) (label - cnt - jmp_insn_len)
+
+static bool is_imm8(int value)
+{
+ return value <= 127 && value >= -128;
+}
+
+static bool is_simm32(s64 value)
+{
+ return value == (s64) (s32) value;
+}
+
+#define STACK_OFFSET(k) (k)
+#define TCALL_CNT (MAX_BPF_JIT_REG + 0) /* Tail Call Count */
+
+#define IA32_EAX (0x0)
+#define IA32_EBX (0x3)
+#define IA32_ECX (0x1)
+#define IA32_EDX (0x2)
+#define IA32_ESI (0x6)
+#define IA32_EDI (0x7)
+#define IA32_EBP (0x5)
+#define IA32_ESP (0x4)
+
+/*
+ * List of x86 cond jumps opcodes (. + s8)
+ * Add 0x10 (and an extra 0x0f) to generate far jumps (. + s32)
+ */
+#define IA32_JB 0x72
+#define IA32_JAE 0x73
+#define IA32_JE 0x74
+#define IA32_JNE 0x75
+#define IA32_JBE 0x76
+#define IA32_JA 0x77
+#define IA32_JL 0x7C
+#define IA32_JGE 0x7D
+#define IA32_JLE 0x7E
+#define IA32_JG 0x7F
+
+/*
+ * Map eBPF registers to IA32 32bit registers or stack scratch space.
+ *
+ * 1. All the registers, R0-R10, are mapped to scratch space on stack.
+ * 2. We need two 64 bit temp registers to do complex operations on eBPF
+ * registers.
+ * 3. For performance reason, the BPF_REG_AX for blinding constant, is
+ * mapped to real hardware register pair, IA32_ESI and IA32_EDI.
+ *
+ * As the eBPF registers are all 64 bit registers and IA32 has only 32 bit
+ * registers, we have to map each eBPF registers with two IA32 32 bit regs
+ * or scratch memory space and we have to build eBPF 64 bit register from those.
+ *
+ * We use IA32_EAX, IA32_EDX, IA32_ECX, IA32_EBX as temporary registers.
+ */
+static const u8 bpf2ia32[][2] = {
+ /* Return value from in-kernel function, and exit value from eBPF */
+ [BPF_REG_0] = {STACK_OFFSET(0), STACK_OFFSET(4)},
+
+ /* The arguments from eBPF program to in-kernel function */
+ /* Stored on stack scratch space */
+ [BPF_REG_1] = {STACK_OFFSET(8), STACK_OFFSET(12)},
+ [BPF_REG_2] = {STACK_OFFSET(16), STACK_OFFSET(20)},
+ [BPF_REG_3] = {STACK_OFFSET(24), STACK_OFFSET(28)},
+ [BPF_REG_4] = {STACK_OFFSET(32), STACK_OFFSET(36)},
+ [BPF_REG_5] = {STACK_OFFSET(40), STACK_OFFSET(44)},
+
+ /* Callee saved registers that in-kernel function will preserve */
+ /* Stored on stack scratch space */
+ [BPF_REG_6] = {STACK_OFFSET(48), STACK_OFFSET(52)},
+ [BPF_REG_7] = {STACK_OFFSET(56), STACK_OFFSET(60)},
+ [BPF_REG_8] = {STACK_OFFSET(64), STACK_OFFSET(68)},
+ [BPF_REG_9] = {STACK_OFFSET(72), STACK_OFFSET(76)},
+
+ /* Read only Frame Pointer to access Stack */
+ [BPF_REG_FP] = {STACK_OFFSET(80), STACK_OFFSET(84)},
+
+ /* Temporary register for blinding constants. */
+ [BPF_REG_AX] = {IA32_ESI, IA32_EDI},
+
+ /* Tail call count. Stored on stack scratch space. */
+ [TCALL_CNT] = {STACK_OFFSET(88), STACK_OFFSET(92)},
+};
+
+#define dst_lo dst[0]
+#define dst_hi dst[1]
+#define src_lo src[0]
+#define src_hi src[1]
+
+#define STACK_ALIGNMENT 8
+/*
+ * Stack space for BPF_REG_0, BPF_REG_1, BPF_REG_2, BPF_REG_3, BPF_REG_4,
+ * BPF_REG_5, BPF_REG_6, BPF_REG_7, BPF_REG_8, BPF_REG_9,
+ * BPF_REG_FP and the tail call count (12 slots of 8 bytes each).
+ */
+#define SCRATCH_SIZE 96
+
+/* Total stack size used in JITed code */
+#define _STACK_SIZE \
+ (stack_depth + \
+ + SCRATCH_SIZE + \
+ + 4 /* Extra space for skb_copy_bits buffer */)
+
+#define STACK_SIZE ALIGN(_STACK_SIZE, STACK_ALIGNMENT)
+
+/* Get the offset of eBPF REGISTERs stored on scratch space. */
+#define STACK_VAR(off) (off)
+
+/* Offset of skb_copy_bits buffer */
+#define SKB_BUFFER STACK_VAR(SCRATCH_SIZE)
+
+/* Encode 'dst_reg' register into IA32 opcode 'byte' */
+static u8 add_1reg(u8 byte, u32 dst_reg)
+{
+ return byte + dst_reg;
+}
+
+/* Encode 'dst_reg' and 'src_reg' registers into IA32 opcode 'byte' */
+static u8 add_2reg(u8 byte, u32 dst_reg, u32 src_reg)
+{
+ return byte + dst_reg + (src_reg << 3);
+}
+
+static void jit_fill_hole(void *area, unsigned int size)
+{
+ /* Fill whole space with int3 instructions */
+ memset(area, 0xcc, size);
+}
+
+static inline void emit_ia32_mov_i(const u8 dst, const u32 val, bool dstk,
+ u8 **pprog)
+{
+ u8 *prog = *pprog;
+ int cnt = 0;
+
+ if (dstk) {
+ if (val == 0) {
+ /* xor eax,eax */
+ EMIT2(0x33, add_2reg(0xC0, IA32_EAX, IA32_EAX));
+ /* mov dword ptr [ebp+off],eax */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, IA32_EAX),
+ STACK_VAR(dst));
+ } else {
+ EMIT3_off32(0xC7, add_1reg(0x40, IA32_EBP),
+ STACK_VAR(dst), val);
+ }
+ } else {
+ if (val == 0)
+ EMIT2(0x33, add_2reg(0xC0, dst, dst));
+ else
+ EMIT2_off32(0xC7, add_1reg(0xC0, dst),
+ val);
+ }
+ *pprog = prog;
+}
+
+/* dst = src (4 bytes) */
+static inline void emit_ia32_mov_r(const u8 dst, const u8 src, bool dstk,
+ bool sstk, u8 **pprog)
+{
+ u8 *prog = *pprog;
+ int cnt = 0;
+ u8 sreg = sstk ? IA32_EAX : src;
+
+ if (sstk)
+ /* mov eax,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX), STACK_VAR(src));
+ if (dstk)
+ /* mov dword ptr [ebp+off],sreg */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, sreg), STACK_VAR(dst));
+ else
+ /* mov dst,sreg */
+ EMIT2(0x89, add_2reg(0xC0, dst, sreg));
+
+ *pprog = prog;
+}
+
+/* dst = src */
+static inline void emit_ia32_mov_r64(const bool is64, const u8 dst[],
+ const u8 src[], bool dstk,
+ bool sstk, u8 **pprog)
+{
+ emit_ia32_mov_r(dst_lo, src_lo, dstk, sstk, pprog);
+ if (is64)
+ /* complete 8 byte move */
+ emit_ia32_mov_r(dst_hi, src_hi, dstk, sstk, pprog);
+ else
+ /* zero out high 4 bytes */
+ emit_ia32_mov_i(dst_hi, 0, dstk, pprog);
+}
+
+/* Sign extended move */
+static inline void emit_ia32_mov_i64(const bool is64, const u8 dst[],
+ const u32 val, bool dstk, u8 **pprog)
+{
+ u32 hi = 0;
+
+ if (is64 && (val & (1<<31)))
+ hi = (u32)~0;
+ emit_ia32_mov_i(dst_lo, val, dstk, pprog);
+ emit_ia32_mov_i(dst_hi, hi, dstk, pprog);
+}
+
+/*
+ * ALU operation (32 bit)
+ * dst = dst * src
+ */
+static inline void emit_ia32_mul_r(const u8 dst, const u8 src, bool dstk,
+ bool sstk, u8 **pprog)
+{
+ u8 *prog = *pprog;
+ int cnt = 0;
+ u8 sreg = sstk ? IA32_ECX : src;
+
+ if (sstk)
+ /* mov ecx,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_ECX), STACK_VAR(src));
+
+ if (dstk)
+ /* mov eax,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX), STACK_VAR(dst));
+ else
+ /* mov eax,dst */
+ EMIT2(0x8B, add_2reg(0xC0, dst, IA32_EAX));
+
+ /* mul sreg */
+ EMIT2(0xF7, add_1reg(0xE0, sreg));
+
+ if (dstk)
+ /* mov dword ptr [ebp+off],eax */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, IA32_EAX),
+ STACK_VAR(dst));
+ else
+ /* mov dst,eax */
+ EMIT2(0x89, add_2reg(0xC0, dst, IA32_EAX));
+
+ *pprog = prog;
+}
+
+static inline void emit_ia32_to_le_r64(const u8 dst[], s32 val,
+ bool dstk, u8 **pprog)
+{
+ u8 *prog = *pprog;
+ int cnt = 0;
+ u8 dreg_lo = dstk ? IA32_EAX : dst_lo;
+ u8 dreg_hi = dstk ? IA32_EDX : dst_hi;
+
+ if (dstk && val != 64) {
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX),
+ STACK_VAR(dst_lo));
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EDX),
+ STACK_VAR(dst_hi));
+ }
+ switch (val) {
+ case 16:
+ /*
+ * Emit 'movzwl eax,ax' to zero extend 16-bit
+ * into 64 bit
+ */
+ EMIT2(0x0F, 0xB7);
+ EMIT1(add_2reg(0xC0, dreg_lo, dreg_lo));
+ /* xor dreg_hi,dreg_hi */
+ EMIT2(0x33, add_2reg(0xC0, dreg_hi, dreg_hi));
+ break;
+ case 32:
+ /* xor dreg_hi,dreg_hi */
+ EMIT2(0x33, add_2reg(0xC0, dreg_hi, dreg_hi));
+ break;
+ case 64:
+ /* nop */
+ break;
+ }
+
+ if (dstk && val != 64) {
+ /* mov dword ptr [ebp+off],dreg_lo */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, dreg_lo),
+ STACK_VAR(dst_lo));
+ /* mov dword ptr [ebp+off],dreg_hi */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, dreg_hi),
+ STACK_VAR(dst_hi));
+ }
+ *pprog = prog;
+}
+
+static inline void emit_ia32_to_be_r64(const u8 dst[], s32 val,
+ bool dstk, u8 **pprog)
+{
+ u8 *prog = *pprog;
+ int cnt = 0;
+ u8 dreg_lo = dstk ? IA32_EAX : dst_lo;
+ u8 dreg_hi = dstk ? IA32_EDX : dst_hi;
+
+ if (dstk) {
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX),
+ STACK_VAR(dst_lo));
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EDX),
+ STACK_VAR(dst_hi));
+ }
+ switch (val) {
+ case 16:
+ /* Emit 'ror %ax, 8' to swap lower 2 bytes */
+ EMIT1(0x66);
+ EMIT3(0xC1, add_1reg(0xC8, dreg_lo), 8);
+
+ EMIT2(0x0F, 0xB7);
+ EMIT1(add_2reg(0xC0, dreg_lo, dreg_lo));
+
+ /* xor dreg_hi,dreg_hi */
+ EMIT2(0x33, add_2reg(0xC0, dreg_hi, dreg_hi));
+ break;
+ case 32:
+ /* Emit 'bswap eax' to swap lower 4 bytes */
+ EMIT1(0x0F);
+ EMIT1(add_1reg(0xC8, dreg_lo));
+
+ /* xor dreg_hi,dreg_hi */
+ EMIT2(0x33, add_2reg(0xC0, dreg_hi, dreg_hi));
+ break;
+ case 64:
+ /* Emit 'bswap eax' to swap lower 4 bytes */
+ EMIT1(0x0F);
+ EMIT1(add_1reg(0xC8, dreg_lo));
+
+ /* Emit 'bswap edx' to swap lower 4 bytes */
+ EMIT1(0x0F);
+ EMIT1(add_1reg(0xC8, dreg_hi));
+
+ /* mov ecx,dreg_hi */
+ EMIT2(0x89, add_2reg(0xC0, IA32_ECX, dreg_hi));
+ /* mov dreg_hi,dreg_lo */
+ EMIT2(0x89, add_2reg(0xC0, dreg_hi, dreg_lo));
+ /* mov dreg_lo,ecx */
+ EMIT2(0x89, add_2reg(0xC0, dreg_lo, IA32_ECX));
+
+ break;
+ }
+ if (dstk) {
+ /* mov dword ptr [ebp+off],dreg_lo */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, dreg_lo),
+ STACK_VAR(dst_lo));
+ /* mov dword ptr [ebp+off],dreg_hi */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, dreg_hi),
+ STACK_VAR(dst_hi));
+ }
+ *pprog = prog;
+}
+
+/*
+ * ALU operation (32 bit)
+ * dst = dst (div|mod) src
+ */
+static inline void emit_ia32_div_mod_r(const u8 op, const u8 dst, const u8 src,
+ bool dstk, bool sstk, u8 **pprog)
+{
+ u8 *prog = *pprog;
+ int cnt = 0;
+
+ if (sstk)
+ /* mov ecx,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_ECX),
+ STACK_VAR(src));
+ else if (src != IA32_ECX)
+ /* mov ecx,src */
+ EMIT2(0x8B, add_2reg(0xC0, src, IA32_ECX));
+
+ if (dstk)
+ /* mov eax,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX),
+ STACK_VAR(dst));
+ else
+ /* mov eax,dst */
+ EMIT2(0x8B, add_2reg(0xC0, dst, IA32_EAX));
+
+ /* xor edx,edx */
+ EMIT2(0x31, add_2reg(0xC0, IA32_EDX, IA32_EDX));
+ /* div ecx */
+ EMIT2(0xF7, add_1reg(0xF0, IA32_ECX));
+
+ if (op == BPF_MOD) {
+ if (dstk)
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, IA32_EDX),
+ STACK_VAR(dst));
+ else
+ EMIT2(0x89, add_2reg(0xC0, dst, IA32_EDX));
+ } else {
+ if (dstk)
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, IA32_EAX),
+ STACK_VAR(dst));
+ else
+ EMIT2(0x89, add_2reg(0xC0, dst, IA32_EAX));
+ }
+ *pprog = prog;
+}
+
+/*
+ * ALU operation (32 bit)
+ * dst = dst (shift) src
+ */
+static inline void emit_ia32_shift_r(const u8 op, const u8 dst, const u8 src,
+ bool dstk, bool sstk, u8 **pprog)
+{
+ u8 *prog = *pprog;
+ int cnt = 0;
+ u8 dreg = dstk ? IA32_EAX : dst;
+ u8 b2;
+
+ if (dstk)
+ /* mov eax,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX), STACK_VAR(dst));
+
+ if (sstk)
+ /* mov ecx,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_ECX), STACK_VAR(src));
+ else if (src != IA32_ECX)
+ /* mov ecx,src */
+ EMIT2(0x8B, add_2reg(0xC0, src, IA32_ECX));
+
+ switch (op) {
+ case BPF_LSH:
+ b2 = 0xE0; break;
+ case BPF_RSH:
+ b2 = 0xE8; break;
+ case BPF_ARSH:
+ b2 = 0xF8; break;
+ default:
+ return;
+ }
+ EMIT2(0xD3, add_1reg(b2, dreg));
+
+ if (dstk)
+ /* mov dword ptr [ebp+off],dreg */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, dreg), STACK_VAR(dst));
+ *pprog = prog;
+}
+
+/*
+ * ALU operation (32 bit)
+ * dst = dst (op) src
+ */
+static inline void emit_ia32_alu_r(const bool is64, const bool hi, const u8 op,
+ const u8 dst, const u8 src, bool dstk,
+ bool sstk, u8 **pprog)
+{
+ u8 *prog = *pprog;
+ int cnt = 0;
+ u8 sreg = sstk ? IA32_EAX : src;
+ u8 dreg = dstk ? IA32_EDX : dst;
+
+ if (sstk)
+ /* mov eax,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX), STACK_VAR(src));
+
+ if (dstk)
+ /* mov edx,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EDX), STACK_VAR(dst));
+
+ switch (BPF_OP(op)) {
+ /* dst = dst + src */
+ case BPF_ADD:
+ if (hi && is64)
+ EMIT2(0x11, add_2reg(0xC0, dreg, sreg));
+ else
+ EMIT2(0x01, add_2reg(0xC0, dreg, sreg));
+ break;
+ /* dst = dst - src */
+ case BPF_SUB:
+ if (hi && is64)
+ EMIT2(0x19, add_2reg(0xC0, dreg, sreg));
+ else
+ EMIT2(0x29, add_2reg(0xC0, dreg, sreg));
+ break;
+ /* dst = dst | src */
+ case BPF_OR:
+ EMIT2(0x09, add_2reg(0xC0, dreg, sreg));
+ break;
+ /* dst = dst & src */
+ case BPF_AND:
+ EMIT2(0x21, add_2reg(0xC0, dreg, sreg));
+ break;
+ /* dst = dst ^ src */
+ case BPF_XOR:
+ EMIT2(0x31, add_2reg(0xC0, dreg, sreg));
+ break;
+ }
+
+ if (dstk)
+ /* mov dword ptr [ebp+off],dreg */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, dreg),
+ STACK_VAR(dst));
+ *pprog = prog;
+}
+
+/* ALU operation (64 bit) */
+static inline void emit_ia32_alu_r64(const bool is64, const u8 op,
+ const u8 dst[], const u8 src[],
+ bool dstk, bool sstk,
+ u8 **pprog)
+{
+ u8 *prog = *pprog;
+
+ emit_ia32_alu_r(is64, false, op, dst_lo, src_lo, dstk, sstk, &prog);
+ if (is64)
+ emit_ia32_alu_r(is64, true, op, dst_hi, src_hi, dstk, sstk,
+ &prog);
+ else
+ emit_ia32_mov_i(dst_hi, 0, dstk, &prog);
+ *pprog = prog;
+}
+
+/*
+ * ALU operation (32 bit)
+ * dst = dst (op) val
+ */
+static inline void emit_ia32_alu_i(const bool is64, const bool hi, const u8 op,
+ const u8 dst, const s32 val, bool dstk,
+ u8 **pprog)
+{
+ u8 *prog = *pprog;
+ int cnt = 0;
+ u8 dreg = dstk ? IA32_EAX : dst;
+ u8 sreg = IA32_EDX;
+
+ if (dstk)
+ /* mov eax,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX), STACK_VAR(dst));
+
+ if (!is_imm8(val))
+ /* mov edx,imm32*/
+ EMIT2_off32(0xC7, add_1reg(0xC0, IA32_EDX), val);
+
+ switch (op) {
+ /* dst = dst + val */
+ case BPF_ADD:
+ if (hi && is64) {
+ if (is_imm8(val))
+ EMIT3(0x83, add_1reg(0xD0, dreg), val);
+ else
+ EMIT2(0x11, add_2reg(0xC0, dreg, sreg));
+ } else {
+ if (is_imm8(val))
+ EMIT3(0x83, add_1reg(0xC0, dreg), val);
+ else
+ EMIT2(0x01, add_2reg(0xC0, dreg, sreg));
+ }
+ break;
+ /* dst = dst - val */
+ case BPF_SUB:
+ if (hi && is64) {
+ if (is_imm8(val))
+ EMIT3(0x83, add_1reg(0xD8, dreg), val);
+ else
+ EMIT2(0x19, add_2reg(0xC0, dreg, sreg));
+ } else {
+ if (is_imm8(val))
+ EMIT3(0x83, add_1reg(0xE8, dreg), val);
+ else
+ EMIT2(0x29, add_2reg(0xC0, dreg, sreg));
+ }
+ break;
+ /* dst = dst | val */
+ case BPF_OR:
+ if (is_imm8(val))
+ EMIT3(0x83, add_1reg(0xC8, dreg), val);
+ else
+ EMIT2(0x09, add_2reg(0xC0, dreg, sreg));
+ break;
+ /* dst = dst & val */
+ case BPF_AND:
+ if (is_imm8(val))
+ EMIT3(0x83, add_1reg(0xE0, dreg), val);
+ else
+ EMIT2(0x21, add_2reg(0xC0, dreg, sreg));
+ break;
+ /* dst = dst ^ val */
+ case BPF_XOR:
+ if (is_imm8(val))
+ EMIT3(0x83, add_1reg(0xF0, dreg), val);
+ else
+ EMIT2(0x31, add_2reg(0xC0, dreg, sreg));
+ break;
+ case BPF_NEG:
+ EMIT2(0xF7, add_1reg(0xD8, dreg));
+ break;
+ }
+
+ if (dstk)
+ /* mov dword ptr [ebp+off],dreg */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, dreg),
+ STACK_VAR(dst));
+ *pprog = prog;
+}
+
+/* ALU operation (64 bit) */
+static inline void emit_ia32_alu_i64(const bool is64, const u8 op,
+ const u8 dst[], const u32 val,
+ bool dstk, u8 **pprog)
+{
+ u8 *prog = *pprog;
+ u32 hi = 0;
+
+ if (is64 && (val & (1<<31)))
+ hi = (u32)~0;
+
+ emit_ia32_alu_i(is64, false, op, dst_lo, val, dstk, &prog);
+ if (is64)
+ emit_ia32_alu_i(is64, true, op, dst_hi, hi, dstk, &prog);
+ else
+ emit_ia32_mov_i(dst_hi, 0, dstk, &prog);
+
+ *pprog = prog;
+}
+
+/* dst = -dst (64 bit) */
+static inline void emit_ia32_neg64(const u8 dst[], bool dstk, u8 **pprog)
+{
+ u8 *prog = *pprog;
+ int cnt = 0;
+ u8 dreg_lo = dstk ? IA32_EAX : dst_lo;
+ u8 dreg_hi = dstk ? IA32_EDX : dst_hi;
+
+ if (dstk) {
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX),
+ STACK_VAR(dst_lo));
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EDX),
+ STACK_VAR(dst_hi));
+ }
+
+ /*
+ * Two's complement negate of the 64-bit pair: 'neg dreg_lo' sets CF
+ * when dreg_lo != 0, and that borrow is folded into dreg_hi before
+ * dreg_hi is negated as well.
+ */
+ /* neg dreg_lo */
+ EMIT2(0xF7, add_1reg(0xD8, dreg_lo));
+
+ /* adc dreg_hi,0x0 */
+ EMIT3(0x83, add_1reg(0xD0, dreg_hi), 0x00);
+
+ /* neg dreg_hi */
+ EMIT2(0xF7, add_1reg(0xD8, dreg_hi));
+
+ if (dstk) {
+ /* mov dword ptr [ebp+off],dreg_lo */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, dreg_lo),
+ STACK_VAR(dst_lo));
+ /* mov dword ptr [ebp+off],dreg_hi */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, dreg_hi),
+ STACK_VAR(dst_hi));
+ }
+ *pprog = prog;
+}
+
+/* dst = dst << src */
+static inline void emit_ia32_lsh_r64(const u8 dst[], const u8 src[],
+ bool dstk, bool sstk, u8 **pprog)
+{
+ u8 *prog = *pprog;
+ int cnt = 0;
+ static int jmp_label1 = -1;
+ static int jmp_label2 = -1;
+ static int jmp_label3 = -1;
+ u8 dreg_lo = dstk ? IA32_EAX : dst_lo;
+ u8 dreg_hi = dstk ? IA32_EDX : dst_hi;
+
+ if (dstk) {
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX),
+ STACK_VAR(dst_lo));
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EDX),
+ STACK_VAR(dst_hi));
+ }
+
+ if (sstk)
+ /* mov ecx,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_ECX),
+ STACK_VAR(src_lo));
+ else
+ /* mov ecx,src_lo */
+ EMIT2(0x8B, add_2reg(0xC0, src_lo, IA32_ECX));
+
+ /* cmp ecx,32 */
+ EMIT3(0x83, add_1reg(0xF8, IA32_ECX), 32);
+ /* Jumps when >= 32 */
+ if (is_imm8(jmp_label(jmp_label1, 2)))
+ EMIT2(IA32_JAE, jmp_label(jmp_label1, 2));
+ else
+ EMIT2_off32(0x0F, IA32_JAE + 0x10, jmp_label(jmp_label1, 6));
+
+ /* < 32 */
+ /* shl dreg_hi,cl */
+ EMIT2(0xD3, add_1reg(0xE0, dreg_hi));
+ /* mov ebx,dreg_lo */
+ EMIT2(0x8B, add_2reg(0xC0, dreg_lo, IA32_EBX));
+ /* shl dreg_lo,cl */
+ EMIT2(0xD3, add_1reg(0xE0, dreg_lo));
+
+ /* IA32_ECX = -IA32_ECX + 32 */
+ /* neg ecx */
+ EMIT2(0xF7, add_1reg(0xD8, IA32_ECX));
+ /* add ecx,32 */
+ EMIT3(0x83, add_1reg(0xC0, IA32_ECX), 32);
+
+ /* shr ebx,cl */
+ EMIT2(0xD3, add_1reg(0xE8, IA32_EBX));
+ /* or dreg_hi,ebx */
+ EMIT2(0x09, add_2reg(0xC0, dreg_hi, IA32_EBX));
+
+ /* goto out; */
+ if (is_imm8(jmp_label(jmp_label3, 2)))
+ EMIT2(0xEB, jmp_label(jmp_label3, 2));
+ else
+ EMIT1_off32(0xE9, jmp_label(jmp_label3, 5));
+
+ /* >= 32 */
+ if (jmp_label1 == -1)
+ jmp_label1 = cnt;
+
+ /* cmp ecx,64 */
+ EMIT3(0x83, add_1reg(0xF8, IA32_ECX), 64);
+ /* Jumps when >= 64 */
+ if (is_imm8(jmp_label(jmp_label2, 2)))
+ EMIT2(IA32_JAE, jmp_label(jmp_label2, 2));
+ else
+ EMIT2_off32(0x0F, IA32_JAE + 0x10, jmp_label(jmp_label2, 6));
+
+ /* >= 32 && < 64 */
+ /* sub ecx,32 */
+ EMIT3(0x83, add_1reg(0xE8, IA32_ECX), 32);
+ /* shl dreg_lo,cl */
+ EMIT2(0xD3, add_1reg(0xE0, dreg_lo));
+ /* mov dreg_hi,dreg_lo */
+ EMIT2(0x89, add_2reg(0xC0, dreg_hi, dreg_lo));
+
+ /* xor dreg_lo,dreg_lo */
+ EMIT2(0x33, add_2reg(0xC0, dreg_lo, dreg_lo));
+
+ /* goto out; */
+ if (is_imm8(jmp_label(jmp_label3, 2)))
+ EMIT2(0xEB, jmp_label(jmp_label3, 2));
+ else
+ EMIT1_off32(0xE9, jmp_label(jmp_label3, 5));
+
+ /* >= 64 */
+ if (jmp_label2 == -1)
+ jmp_label2 = cnt;
+ /* xor dreg_lo,dreg_lo */
+ EMIT2(0x33, add_2reg(0xC0, dreg_lo, dreg_lo));
+ /* xor dreg_hi,dreg_hi */
+ EMIT2(0x33, add_2reg(0xC0, dreg_hi, dreg_hi));
+
+ /* out: */
+ if (jmp_label3 == -1)
+ jmp_label3 = cnt;
+
+ if (dstk) {
+ /* mov dword ptr [ebp+off],dreg_lo */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, dreg_lo),
+ STACK_VAR(dst_lo));
+ /* mov dword ptr [ebp+off],dreg_hi */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, dreg_hi),
+ STACK_VAR(dst_hi));
+ }
+ *pprog = prog;
+}
+
+/* dst = dst >> src (signed)*/
+static inline void emit_ia32_arsh_r64(const u8 dst[], const u8 src[],
+ bool dstk, bool sstk, u8 **pprog)
+{
+ u8 *prog = *pprog;
+ int cnt = 0;
+ static int jmp_label1 = -1;
+ static int jmp_label2 = -1;
+ static int jmp_label3 = -1;
+ u8 dreg_lo = dstk ? IA32_EAX : dst_lo;
+ u8 dreg_hi = dstk ? IA32_EDX : dst_hi;
+
+ if (dstk) {
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX),
+ STACK_VAR(dst_lo));
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EDX),
+ STACK_VAR(dst_hi));
+ }
+
+ if (sstk)
+ /* mov ecx,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_ECX),
+ STACK_VAR(src_lo));
+ else
+ /* mov ecx,src_lo */
+ EMIT2(0x8B, add_2reg(0xC0, src_lo, IA32_ECX));
+
+ /* cmp ecx,32 */
+ EMIT3(0x83, add_1reg(0xF8, IA32_ECX), 32);
+ /* Jumps when >= 32 */
+ if (is_imm8(jmp_label(jmp_label1, 2)))
+ EMIT2(IA32_JAE, jmp_label(jmp_label1, 2));
+ else
+ EMIT2_off32(0x0F, IA32_JAE + 0x10, jmp_label(jmp_label1, 6));
+
+ /* < 32 */
+ /* shr dreg_lo,cl */
+ EMIT2(0xD3, add_1reg(0xE8, dreg_lo));
+ /* mov ebx,dreg_hi */
+ EMIT2(0x8B, add_2reg(0xC0, dreg_hi, IA32_EBX));
+ /* ashr dreg_hi,cl */
+ EMIT2(0xD3, add_1reg(0xF8, dreg_hi));
+
+ /* IA32_ECX = -IA32_ECX + 32 */
+ /* neg ecx */
+ EMIT2(0xF7, add_1reg(0xD8, IA32_ECX));
+ /* add ecx,32 */
+ EMIT3(0x83, add_1reg(0xC0, IA32_ECX), 32);
+
+ /* shl ebx,cl */
+ EMIT2(0xD3, add_1reg(0xE0, IA32_EBX));
+ /* or dreg_lo,ebx */
+ EMIT2(0x09, add_2reg(0xC0, dreg_lo, IA32_EBX));
+
+ /* goto out; */
+ if (is_imm8(jmp_label(jmp_label3, 2)))
+ EMIT2(0xEB, jmp_label(jmp_label3, 2));
+ else
+ EMIT1_off32(0xE9, jmp_label(jmp_label3, 5));
+
+ /* >= 32 */
+ if (jmp_label1 == -1)
+ jmp_label1 = cnt;
+
+ /* cmp ecx,64 */
+ EMIT3(0x83, add_1reg(0xF8, IA32_ECX), 64);
+ /* Jumps when >= 64 */
+ if (is_imm8(jmp_label(jmp_label2, 2)))
+ EMIT2(IA32_JAE, jmp_label(jmp_label2, 2));
+ else
+ EMIT2_off32(0x0F, IA32_JAE + 0x10, jmp_label(jmp_label2, 6));
+
+ /* >= 32 && < 64 */
+ /* sub ecx,32 */
+ EMIT3(0x83, add_1reg(0xE8, IA32_ECX), 32);
+ /* ashr dreg_hi,cl */
+ EMIT2(0xD3, add_1reg(0xF8, dreg_hi));
+ /* mov dreg_lo,dreg_hi */
+ EMIT2(0x89, add_2reg(0xC0, dreg_lo, dreg_hi));
+
+ /* ashr dreg_hi,imm8 */
+ EMIT3(0xC1, add_1reg(0xF8, dreg_hi), 31);
+
+ /* goto out; */
+ if (is_imm8(jmp_label(jmp_label3, 2)))
+ EMIT2(0xEB, jmp_label(jmp_label3, 2));
+ else
+ EMIT1_off32(0xE9, jmp_label(jmp_label3, 5));
+
+ /* >= 64 */
+ if (jmp_label2 == -1)
+ jmp_label2 = cnt;
+ /* ashr dreg_hi,imm8 */
+ EMIT3(0xC1, add_1reg(0xF8, dreg_hi), 31);
+ /* mov dreg_lo,dreg_hi */
+ EMIT2(0x89, add_2reg(0xC0, dreg_lo, dreg_hi));
+
+ /* out: */
+ if (jmp_label3 == -1)
+ jmp_label3 = cnt;
+
+ if (dstk) {
+ /* mov dword ptr [ebp+off],dreg_lo */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, dreg_lo),
+ STACK_VAR(dst_lo));
+ /* mov dword ptr [ebp+off],dreg_hi */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, dreg_hi),
+ STACK_VAR(dst_hi));
+ }
+ *pprog = prog;
+}
+
+/* dst = dst >> src */
+static inline void emit_ia32_rsh_r64(const u8 dst[], const u8 src[], bool dstk,
+ bool sstk, u8 **pprog)
+{
+ u8 *prog = *pprog;
+ int cnt = 0;
+ static int jmp_label1 = -1;
+ static int jmp_label2 = -1;
+ static int jmp_label3 = -1;
+ u8 dreg_lo = dstk ? IA32_EAX : dst_lo;
+ u8 dreg_hi = dstk ? IA32_EDX : dst_hi;
+
+ if (dstk) {
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX),
+ STACK_VAR(dst_lo));
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EDX),
+ STACK_VAR(dst_hi));
+ }
+
+ if (sstk)
+ /* mov ecx,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_ECX),
+ STACK_VAR(src_lo));
+ else
+ /* mov ecx,src_lo */
+ EMIT2(0x8B, add_2reg(0xC0, src_lo, IA32_ECX));
+
+ /* cmp ecx,32 */
+ EMIT3(0x83, add_1reg(0xF8, IA32_ECX), 32);
+ /* Jumps when >= 32 */
+ if (is_imm8(jmp_label(jmp_label1, 2)))
+ EMIT2(IA32_JAE, jmp_label(jmp_label1, 2));
+ else
+ EMIT2_off32(0x0F, IA32_JAE + 0x10, jmp_label(jmp_label1, 6));
+
+ /* < 32 */
+ /* shr dreg_lo,cl */
+ EMIT2(0xD3, add_1reg(0xE8, dreg_lo));
+ /* mov ebx,dreg_hi */
+ EMIT2(0x8B, add_2reg(0xC0, dreg_hi, IA32_EBX));
+ /* shr dreg_hi,cl */
+ EMIT2(0xD3, add_1reg(0xE8, dreg_hi));
+
+ /* IA32_ECX = -IA32_ECX + 32 */
+ /* neg ecx */
+ EMIT2(0xF7, add_1reg(0xD8, IA32_ECX));
+ /* add ecx,32 */
+ EMIT3(0x83, add_1reg(0xC0, IA32_ECX), 32);
+
+ /* shl ebx,cl */
+ EMIT2(0xD3, add_1reg(0xE0, IA32_EBX));
+ /* or dreg_lo,ebx */
+ EMIT2(0x09, add_2reg(0xC0, dreg_lo, IA32_EBX));
+
+ /* goto out; */
+ if (is_imm8(jmp_label(jmp_label3, 2)))
+ EMIT2(0xEB, jmp_label(jmp_label3, 2));
+ else
+ EMIT1_off32(0xE9, jmp_label(jmp_label3, 5));
+
+ /* >= 32 */
+ if (jmp_label1 == -1)
+ jmp_label1 = cnt;
+ /* cmp ecx,64 */
+ EMIT3(0x83, add_1reg(0xF8, IA32_ECX), 64);
+ /* Jumps when >= 64 */
+ if (is_imm8(jmp_label(jmp_label2, 2)))
+ EMIT2(IA32_JAE, jmp_label(jmp_label2, 2));
+ else
+ EMIT2_off32(0x0F, IA32_JAE + 0x10, jmp_label(jmp_label2, 6));
+
+ /* >= 32 && < 64 */
+ /* sub ecx,32 */
+ EMIT3(0x83, add_1reg(0xE8, IA32_ECX), 32);
+ /* shr dreg_hi,cl */
+ EMIT2(0xD3, add_1reg(0xE8, dreg_hi));
+ /* mov dreg_lo,dreg_hi */
+ EMIT2(0x89, add_2reg(0xC0, dreg_lo, dreg_hi));
+ /* xor dreg_hi,dreg_hi */
+ EMIT2(0x33, add_2reg(0xC0, dreg_hi, dreg_hi));
+
+ /* goto out; */
+ if (is_imm8(jmp_label(jmp_label3, 2)))
+ EMIT2(0xEB, jmp_label(jmp_label3, 2));
+ else
+ EMIT1_off32(0xE9, jmp_label(jmp_label3, 5));
+
+ /* >= 64 */
+ if (jmp_label2 == -1)
+ jmp_label2 = cnt;
+ /* xor dreg_lo,dreg_lo */
+ EMIT2(0x33, add_2reg(0xC0, dreg_lo, dreg_lo));
+ /* xor dreg_hi,dreg_hi */
+ EMIT2(0x33, add_2reg(0xC0, dreg_hi, dreg_hi));
+
+ /* out: */
+ if (jmp_label3 == -1)
+ jmp_label3 = cnt;
+
+ if (dstk) {
+ /* mov dword ptr [ebp+off],dreg_lo */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, dreg_lo),
+ STACK_VAR(dst_lo));
+ /* mov dword ptr [ebp+off],dreg_hi */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, dreg_hi),
+ STACK_VAR(dst_hi));
+ }
+ *pprog = prog;
+}
+
+/* dst = dst << val */
+static inline void emit_ia32_lsh_i64(const u8 dst[], const u32 val,
+ bool dstk, u8 **pprog)
+{
+ u8 *prog = *pprog;
+ int cnt = 0;
+ u8 dreg_lo = dstk ? IA32_EAX : dst_lo;
+ u8 dreg_hi = dstk ? IA32_EDX : dst_hi;
+
+ if (dstk) {
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX),
+ STACK_VAR(dst_lo));
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EDX),
+ STACK_VAR(dst_hi));
+ }
+ /* Do LSH operation */
+ if (val < 32) {
+ /* shl dreg_hi,imm8 */
+ EMIT3(0xC1, add_1reg(0xE0, dreg_hi), val);
+ /* mov ebx,dreg_lo */
+ EMIT2(0x8B, add_2reg(0xC0, dreg_lo, IA32_EBX));
+ /* shl dreg_lo,imm8 */
+ EMIT3(0xC1, add_1reg(0xE0, dreg_lo), val);
+
+ /* IA32_ECX = 32 - val */
+ /* mov cl,val */
+ EMIT2(0xB1, val);
+ /* movzx ecx,cl */
+ EMIT3(0x0F, 0xB6, add_2reg(0xC0, IA32_ECX, IA32_ECX));
+ /* neg ecx */
+ EMIT2(0xF7, add_1reg(0xD8, IA32_ECX));
+ /* add ecx,32 */
+ EMIT3(0x83, add_1reg(0xC0, IA32_ECX), 32);
+
+ /* shr ebx,cl */
+ EMIT2(0xD3, add_1reg(0xE8, IA32_EBX));
+ /* or dreg_hi,ebx */
+ EMIT2(0x09, add_2reg(0xC0, dreg_hi, IA32_EBX));
+ } else if (val >= 32 && val < 64) {
+ u32 value = val - 32;
+
+ /* shl dreg_lo,imm8 */
+ EMIT3(0xC1, add_1reg(0xE0, dreg_lo), value);
+ /* mov dreg_hi,dreg_lo */
+ EMIT2(0x89, add_2reg(0xC0, dreg_hi, dreg_lo));
+ /* xor dreg_lo,dreg_lo */
+ EMIT2(0x33, add_2reg(0xC0, dreg_lo, dreg_lo));
+ } else {
+ /* xor dreg_lo,dreg_lo */
+ EMIT2(0x33, add_2reg(0xC0, dreg_lo, dreg_lo));
+ /* xor dreg_hi,dreg_hi */
+ EMIT2(0x33, add_2reg(0xC0, dreg_hi, dreg_hi));
+ }
+
+ if (dstk) {
+ /* mov dword ptr [ebp+off],dreg_lo */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, dreg_lo),
+ STACK_VAR(dst_lo));
+ /* mov dword ptr [ebp+off],dreg_hi */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, dreg_hi),
+ STACK_VAR(dst_hi));
+ }
+ *pprog = prog;
+}
+
+/* dst = dst >> val */
+static inline void emit_ia32_rsh_i64(const u8 dst[], const u32 val,
+ bool dstk, u8 **pprog)
+{
+ u8 *prog = *pprog;
+ int cnt = 0;
+ u8 dreg_lo = dstk ? IA32_EAX : dst_lo;
+ u8 dreg_hi = dstk ? IA32_EDX : dst_hi;
+
+ if (dstk) {
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX),
+ STACK_VAR(dst_lo));
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EDX),
+ STACK_VAR(dst_hi));
+ }
+
+ /* Do RSH operation */
+ if (val < 32) {
+ /* shr dreg_lo,imm8 */
+ EMIT3(0xC1, add_1reg(0xE8, dreg_lo), val);
+ /* mov ebx,dreg_hi */
+ EMIT2(0x8B, add_2reg(0xC0, dreg_hi, IA32_EBX));
+ /* shr dreg_hi,imm8 */
+ EMIT3(0xC1, add_1reg(0xE8, dreg_hi), val);
+
+ /* IA32_ECX = 32 - val */
+ /* mov cl,val */
+ EMIT2(0xB1, val);
+ /* movzx ecx,cl */
+ EMIT3(0x0F, 0xB6, add_2reg(0xC0, IA32_ECX, IA32_ECX));
+ /* neg ecx */
+ EMIT2(0xF7, add_1reg(0xD8, IA32_ECX));
+ /* add ecx,32 */
+ EMIT3(0x83, add_1reg(0xC0, IA32_ECX), 32);
+
+ /* shl ebx,cl */
+ EMIT2(0xD3, add_1reg(0xE0, IA32_EBX));
+ /* or dreg_lo,ebx */
+ EMIT2(0x09, add_2reg(0xC0, dreg_lo, IA32_EBX));
+ } else if (val >= 32 && val < 64) {
+ u32 value = val - 32;
+
+ /* shr dreg_hi,imm8 */
+ EMIT3(0xC1, add_1reg(0xE8, dreg_hi), value);
+ /* mov dreg_lo,dreg_hi */
+ EMIT2(0x89, add_2reg(0xC0, dreg_lo, dreg_hi));
+ /* xor dreg_hi,dreg_hi */
+ EMIT2(0x33, add_2reg(0xC0, dreg_hi, dreg_hi));
+ } else {
+ /* xor dreg_lo,dreg_lo */
+ EMIT2(0x33, add_2reg(0xC0, dreg_lo, dreg_lo));
+ /* xor dreg_hi,dreg_hi */
+ EMIT2(0x33, add_2reg(0xC0, dreg_hi, dreg_hi));
+ }
+
+ if (dstk) {
+ /* mov dword ptr [ebp+off],dreg_lo */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, dreg_lo),
+ STACK_VAR(dst_lo));
+ /* mov dword ptr [ebp+off],dreg_hi */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, dreg_hi),
+ STACK_VAR(dst_hi));
+ }
+ *pprog = prog;
+}
+
+/* dst = dst >> val (signed) */
+static inline void emit_ia32_arsh_i64(const u8 dst[], const u32 val,
+ bool dstk, u8 **pprog)
+{
+ u8 *prog = *pprog;
+ int cnt = 0;
+ u8 dreg_lo = dstk ? IA32_EAX : dst_lo;
+ u8 dreg_hi = dstk ? IA32_EDX : dst_hi;
+
+ if (dstk) {
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX),
+ STACK_VAR(dst_lo));
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EDX),
+ STACK_VAR(dst_hi));
+ }
+ /* Do ARSH operation */
+ if (val < 32) {
+ /* shr dreg_lo,imm8 */
+ EMIT3(0xC1, add_1reg(0xE8, dreg_lo), val);
+ /* mov ebx,dreg_hi */
+ EMIT2(0x8B, add_2reg(0xC0, dreg_hi, IA32_EBX));
+ /* ashr dreg_hi,imm8 */
+ EMIT3(0xC1, add_1reg(0xF8, dreg_hi), val);
+
+ /* IA32_ECX = 32 - val */
+ /* mov cl,val */
+ EMIT2(0xB1, val);
+ /* movzx ecx,cl */
+ EMIT3(0x0F, 0xB6, add_2reg(0xC0, IA32_ECX, IA32_ECX));
+ /* neg ecx */
+ EMIT2(0xF7, add_1reg(0xD8, IA32_ECX));
+ /* add ecx,32 */
+ EMIT3(0x83, add_1reg(0xC0, IA32_ECX), 32);
+
+ /* shl ebx,cl */
+ EMIT2(0xD3, add_1reg(0xE0, IA32_EBX));
+ /* or dreg_lo,ebx */
+ EMIT2(0x09, add_2reg(0xC0, dreg_lo, IA32_EBX));
+ } else if (val >= 32 && val < 64) {
+ u32 value = val - 32;
+
+ /* ashr dreg_hi,imm8 */
+ EMIT3(0xC1, add_1reg(0xF8, dreg_hi), value);
+ /* mov dreg_lo,dreg_hi */
+ EMIT2(0x89, add_2reg(0xC0, dreg_lo, dreg_hi));
+
+ /* ashr dreg_hi,imm8 */
+ EMIT3(0xC1, add_1reg(0xF8, dreg_hi), 31);
+ } else {
+ /* ashr dreg_hi,imm8 */
+ EMIT3(0xC1, add_1reg(0xF8, dreg_hi), 31);
+ /* mov dreg_lo,dreg_hi */
+ EMIT2(0x89, add_2reg(0xC0, dreg_lo, dreg_hi));
+ }
+
+ if (dstk) {
+ /* mov dword ptr [ebp+off],dreg_lo */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, dreg_lo),
+ STACK_VAR(dst_lo));
+ /* mov dword ptr [ebp+off],dreg_hi */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, dreg_hi),
+ STACK_VAR(dst_hi));
+ }
+ *pprog = prog;
+}
+
+static inline void emit_ia32_mul_r64(const u8 dst[], const u8 src[], bool dstk,
+ bool sstk, u8 **pprog)
+{
+ u8 *prog = *pprog;
+ int cnt = 0;
+
+ if (dstk)
+ /* mov eax,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX),
+ STACK_VAR(dst_hi));
+ else
+ /* mov eax,dst_hi */
+ EMIT2(0x8B, add_2reg(0xC0, dst_hi, IA32_EAX));
+
+ if (sstk)
+ /* mul dword ptr [ebp+off] */
+ EMIT3(0xF7, add_1reg(0x60, IA32_EBP), STACK_VAR(src_lo));
+ else
+ /* mul src_lo */
+ EMIT2(0xF7, add_1reg(0xE0, src_lo));
+
+ /* mov ecx,eax */
+ EMIT2(0x89, add_2reg(0xC0, IA32_ECX, IA32_EAX));
+
+ if (dstk)
+ /* mov eax,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX),
+ STACK_VAR(dst_lo));
+ else
+ /* mov eax,dst_lo */
+ EMIT2(0x8B, add_2reg(0xC0, dst_lo, IA32_EAX));
+
+ if (sstk)
+ /* mul dword ptr [ebp+off] */
+ EMIT3(0xF7, add_1reg(0x60, IA32_EBP), STACK_VAR(src_hi));
+ else
+ /* mul src_hi */
+ EMIT2(0xF7, add_1reg(0xE0, src_hi));
+
+ /* add ecx,eax */
+ EMIT2(0x01, add_2reg(0xC0, IA32_ECX, IA32_EAX));
+
+ if (dstk)
+ /* mov eax,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX),
+ STACK_VAR(dst_lo));
+ else
+ /* mov eax,dst_lo */
+ EMIT2(0x8B, add_2reg(0xC0, dst_lo, IA32_EAX));
+
+ if (sstk)
+ /* mul dword ptr [ebp+off] */
+ EMIT3(0xF7, add_1reg(0x60, IA32_EBP), STACK_VAR(src_lo));
+ else
+ /* mul src_lo */
+ EMIT2(0xF7, add_1reg(0xE0, src_lo));
+
+ /* add ecx,edx */
+ EMIT2(0x01, add_2reg(0xC0, IA32_ECX, IA32_EDX));
+
+ if (dstk) {
+ /* mov dword ptr [ebp+off],eax */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, IA32_EAX),
+ STACK_VAR(dst_lo));
+ /* mov dword ptr [ebp+off],ecx */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, IA32_ECX),
+ STACK_VAR(dst_hi));
+ } else {
+ /* mov dst_lo,eax */
+ EMIT2(0x89, add_2reg(0xC0, dst_lo, IA32_EAX));
+ /* mov dst_hi,ecx */
+ EMIT2(0x89, add_2reg(0xC0, dst_hi, IA32_ECX));
+ }
+
+ *pprog = prog;
+}
+
+static inline void emit_ia32_mul_i64(const u8 dst[], const u32 val,
+ bool dstk, u8 **pprog)
+{
+ u8 *prog = *pprog;
+ int cnt = 0;
+ u32 hi;
+
+ hi = val & (1<<31) ? (u32)~0 : 0;
+ /* movl eax,imm32 */
+ EMIT2_off32(0xC7, add_1reg(0xC0, IA32_EAX), val);
+ if (dstk)
+ /* mul dword ptr [ebp+off] */
+ EMIT3(0xF7, add_1reg(0x60, IA32_EBP), STACK_VAR(dst_hi));
+ else
+ /* mul dst_hi */
+ EMIT2(0xF7, add_1reg(0xE0, dst_hi));
+
+ /* mov ecx,eax */
+ EMIT2(0x89, add_2reg(0xC0, IA32_ECX, IA32_EAX));
+
+ /* movl eax,imm32 */
+ EMIT2_off32(0xC7, add_1reg(0xC0, IA32_EAX), hi);
+ if (dstk)
+ /* mul dword ptr [ebp+off] */
+ EMIT3(0xF7, add_1reg(0x60, IA32_EBP), STACK_VAR(dst_lo));
+ else
+ /* mul dst_lo */
+ EMIT2(0xF7, add_1reg(0xE0, dst_lo));
+ /* add ecx,eax */
+ EMIT2(0x01, add_2reg(0xC0, IA32_ECX, IA32_EAX));
+
+ /* movl eax,imm32 */
+ EMIT2_off32(0xC7, add_1reg(0xC0, IA32_EAX), val);
+ if (dstk)
+ /* mul dword ptr [ebp+off] */
+ EMIT3(0xF7, add_1reg(0x60, IA32_EBP), STACK_VAR(dst_lo));
+ else
+ /* mul dst_lo */
+ EMIT2(0xF7, add_1reg(0xE0, dst_lo));
+
+ /* add ecx,edx */
+ EMIT2(0x01, add_2reg(0xC0, IA32_ECX, IA32_EDX));
+
+ if (dstk) {
+ /* mov dword ptr [ebp+off],eax */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, IA32_EAX),
+ STACK_VAR(dst_lo));
+ /* mov dword ptr [ebp+off],ecx */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, IA32_ECX),
+ STACK_VAR(dst_hi));
+ } else {
+ /* mov dst_lo,eax */
+ EMIT2(0x89, add_2reg(0xC0, dst_lo, IA32_EAX));
+ /* mov dst_hi,ecx */
+ EMIT2(0x89, add_2reg(0xC0, dst_hi, IA32_ECX));
+ }
+
+ *pprog = prog;
+}
+
+static int bpf_size_to_x86_bytes(int bpf_size)
+{
+ if (bpf_size == BPF_W)
+ return 4;
+ else if (bpf_size == BPF_H)
+ return 2;
+ else if (bpf_size == BPF_B)
+ return 1;
+ else if (bpf_size == BPF_DW)
+ return 4; /* imm32 */
+ else
+ return 0;
+}
+
+struct jit_context {
+ int cleanup_addr; /* Epilogue code offset */
+};
+
+/* Maximum number of bytes emitted while JITing one eBPF insn */
+#define BPF_MAX_INSN_SIZE 128
+#define BPF_INSN_SAFETY 64
+
+#define PROLOGUE_SIZE 35
+
+/*
+ * Emit prologue code for BPF program and check its size.
+ * bpf_tail_call helper will skip it while jumping into another program.
+ */
+static void emit_prologue(u8 **pprog, u32 stack_depth)
+{
+ u8 *prog = *pprog;
+ int cnt = 0;
+ const u8 *r1 = bpf2ia32[BPF_REG_1];
+ const u8 fplo = bpf2ia32[BPF_REG_FP][0];
+ const u8 fphi = bpf2ia32[BPF_REG_FP][1];
+ const u8 *tcc = bpf2ia32[TCALL_CNT];
+
+ /* push ebp */
+ EMIT1(0x55);
+ /* mov ebp,esp */
+ EMIT2(0x89, 0xE5);
+ /* push edi */
+ EMIT1(0x57);
+ /* push esi */
+ EMIT1(0x56);
+ /* push ebx */
+ EMIT1(0x53);
+
+ /* sub esp,STACK_SIZE */
+ EMIT2_off32(0x81, 0xEC, STACK_SIZE);
+ /* sub ebp,SCRATCH_SIZE+4+12*/
+ EMIT3(0x83, add_1reg(0xE8, IA32_EBP), SCRATCH_SIZE + 16);
+ /* xor ebx,ebx */
+ EMIT2(0x31, add_2reg(0xC0, IA32_EBX, IA32_EBX));
+
+ /* Set up BPF prog stack base register */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, IA32_EBP), STACK_VAR(fplo));
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, IA32_EBX), STACK_VAR(fphi));
+
+ /* Move BPF_CTX (EAX) to BPF_REG_R1 */
+ /* mov dword ptr [ebp+off],eax */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, IA32_EAX), STACK_VAR(r1[0]));
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, IA32_EBX), STACK_VAR(r1[1]));
+
+ /* Initialize Tail Count */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, IA32_EBX), STACK_VAR(tcc[0]));
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, IA32_EBX), STACK_VAR(tcc[1]));
+
+ BUILD_BUG_ON(cnt != PROLOGUE_SIZE);
+ *pprog = prog;
+}
+
+/* Emit epilogue code for BPF program */
+static void emit_epilogue(u8 **pprog, u32 stack_depth)
+{
+ u8 *prog = *pprog;
+ const u8 *r0 = bpf2ia32[BPF_REG_0];
+ int cnt = 0;
+
+ /* mov eax,dword ptr [ebp+off]*/
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX), STACK_VAR(r0[0]));
+ /* mov edx,dword ptr [ebp+off]*/
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EDX), STACK_VAR(r0[1]));
+
+ /* add ebp,SCRATCH_SIZE+4+12*/
+ EMIT3(0x83, add_1reg(0xC0, IA32_EBP), SCRATCH_SIZE + 16);
+
+ /* mov ebx,dword ptr [ebp-12]*/
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EBX), -12);
+ /* mov esi,dword ptr [ebp-8]*/
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_ESI), -8);
+ /* mov edi,dword ptr [ebp-4]*/
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EDI), -4);
+
+ EMIT1(0xC9); /* leave */
+ EMIT1(0xC3); /* ret */
+ *pprog = prog;
+}
+
+/*
+ * Generate the following code:
+ * ... bpf_tail_call(void *ctx, struct bpf_array *array, u64 index) ...
+ * if (index >= array->map.max_entries)
+ * goto out;
+ * if (++tail_call_cnt > MAX_TAIL_CALL_CNT)
+ * goto out;
+ * prog = array->ptrs[index];
+ * if (prog == NULL)
+ * goto out;
+ * goto *(prog->bpf_func + prologue_size);
+ * out:
+ */
+static void emit_bpf_tail_call(u8 **pprog)
+{
+ u8 *prog = *pprog;
+ int cnt = 0;
+ const u8 *r1 = bpf2ia32[BPF_REG_1];
+ const u8 *r2 = bpf2ia32[BPF_REG_2];
+ const u8 *r3 = bpf2ia32[BPF_REG_3];
+ const u8 *tcc = bpf2ia32[TCALL_CNT];
+ u32 lo, hi;
+ static int jmp_label1 = -1;
+
+ /*
+ * if (index >= array->map.max_entries)
+ * goto out;
+ */
+ /* mov eax,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX), STACK_VAR(r2[0]));
+ /* mov edx,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EDX), STACK_VAR(r3[0]));
+
+ /* cmp dword ptr [eax+off],edx */
+ EMIT3(0x39, add_2reg(0x40, IA32_EAX, IA32_EDX),
+ offsetof(struct bpf_array, map.max_entries));
+ /* jbe out */
+ EMIT2(IA32_JBE, jmp_label(jmp_label1, 2));
+
+ /*
+ * if (++tail_call_cnt > MAX_TAIL_CALL_CNT)
+ * goto out;
+ */
+ lo = (u32)MAX_TAIL_CALL_CNT;
+ hi = (u32)((u64)MAX_TAIL_CALL_CNT >> 32);
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_ECX), STACK_VAR(tcc[0]));
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EBX), STACK_VAR(tcc[1]));
+
+ /* cmp ebx,hi */
+ EMIT3(0x83, add_1reg(0xF8, IA32_EBX), hi);
+ EMIT2(IA32_JNE, 3);
+ /* cmp ecx,lo */
+ EMIT3(0x83, add_1reg(0xF8, IA32_ECX), lo);
+
+ /* jae out */
+ EMIT2(IA32_JAE, jmp_label(jmp_label1, 2));
+
+ /* add ecx,0x1 */
+ EMIT3(0x83, add_1reg(0xC0, IA32_ECX), 0x01);
+ /* adc ebx,0x0 */
+ EMIT3(0x83, add_1reg(0xD0, IA32_EBX), 0x00);
+
+ /* mov dword ptr [ebp+off],ecx */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, IA32_ECX), STACK_VAR(tcc[0]));
+ /* mov dword ptr [ebp+off],ebx */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, IA32_EBX), STACK_VAR(tcc[1]));
+
+ /* prog = array->ptrs[index]; */
+ /* mov edx, [eax + edx * 4 + offsetof(...)] */
+ EMIT3_off32(0x8B, 0x94, 0x90, offsetof(struct bpf_array, ptrs));
+
+ /*
+ * if (prog == NULL)
+ * goto out;
+ */
+ /* test edx,edx */
+ EMIT2(0x85, add_2reg(0xC0, IA32_EDX, IA32_EDX));
+ /* je out */
+ EMIT2(IA32_JE, jmp_label(jmp_label1, 2));
+
+ /* goto *(prog->bpf_func + prologue_size); */
+ /* mov edx, dword ptr [edx + 32] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EDX, IA32_EDX),
+ offsetof(struct bpf_prog, bpf_func));
+ /* add edx,prologue_size */
+ EMIT3(0x83, add_1reg(0xC0, IA32_EDX), PROLOGUE_SIZE);
+
+ /* mov eax,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX), STACK_VAR(r1[0]));
+
+ /*
+ * Now we're ready to jump into next BPF program:
+ * eax == ctx (1st arg)
+ * edx == prog->bpf_func + prologue_size
+ */
+ RETPOLINE_EDX_BPF_JIT();
+
+ if (jmp_label1 == -1)
+ jmp_label1 = cnt;
+
+ /* out: */
+ *pprog = prog;
+}
+
+/* Push the scratch stack register on top of the stack. */
+static inline void emit_push_r64(const u8 src[], u8 **pprog)
+{
+ u8 *prog = *pprog;
+ int cnt = 0;
+
+ /* mov ecx,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_ECX), STACK_VAR(src_hi));
+ /* push ecx */
+ EMIT1(0x51);
+
+ /* mov ecx,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_ECX), STACK_VAR(src_lo));
+ /* push ecx */
+ EMIT1(0x51);
+
+ *pprog = prog;
+}
+
+static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
+ int oldproglen, struct jit_context *ctx)
+{
+ struct bpf_insn *insn = bpf_prog->insnsi;
+ int insn_cnt = bpf_prog->len;
+ bool seen_exit = false;
+ u8 temp[BPF_MAX_INSN_SIZE + BPF_INSN_SAFETY];
+ int i, cnt = 0;
+ int proglen = 0;
+ u8 *prog = temp;
+
+ emit_prologue(&prog, bpf_prog->aux->stack_depth);
+
+ for (i = 0; i < insn_cnt; i++, insn++) {
+ const s32 imm32 = insn->imm;
+ const bool is64 = BPF_CLASS(insn->code) == BPF_ALU64;
+ const bool dstk = insn->dst_reg == BPF_REG_AX ? false : true;
+ const bool sstk = insn->src_reg == BPF_REG_AX ? false : true;
+ const u8 code = insn->code;
+ const u8 *dst = bpf2ia32[insn->dst_reg];
+ const u8 *src = bpf2ia32[insn->src_reg];
+ const u8 *r0 = bpf2ia32[BPF_REG_0];
+ s64 jmp_offset;
+ u8 jmp_cond;
+ int ilen;
+ u8 *func;
+
+ switch (code) {
+ /* ALU operations */
+ /* dst = src */
+ case BPF_ALU | BPF_MOV | BPF_K:
+ case BPF_ALU | BPF_MOV | BPF_X:
+ case BPF_ALU64 | BPF_MOV | BPF_K:
+ case BPF_ALU64 | BPF_MOV | BPF_X:
+ switch (BPF_SRC(code)) {
+ case BPF_X:
+ emit_ia32_mov_r64(is64, dst, src, dstk,
+ sstk, &prog);
+ break;
+ case BPF_K:
+ /* Sign-extend immediate value to dst reg */
+ emit_ia32_mov_i64(is64, dst, imm32,
+ dstk, &prog);
+ break;
+ }
+ break;
+ /* dst = dst + src/imm */
+ /* dst = dst - src/imm */
+ /* dst = dst | src/imm */
+ /* dst = dst & src/imm */
+ /* dst = dst ^ src/imm */
+ case BPF_ALU | BPF_ADD | BPF_K:
+ case BPF_ALU | BPF_ADD | BPF_X:
+ case BPF_ALU | BPF_SUB | BPF_K:
+ case BPF_ALU | BPF_SUB | BPF_X:
+ case BPF_ALU | BPF_OR | BPF_K:
+ case BPF_ALU | BPF_OR | BPF_X:
+ case BPF_ALU | BPF_AND | BPF_K:
+ case BPF_ALU | BPF_AND | BPF_X:
+ case BPF_ALU | BPF_XOR | BPF_K:
+ case BPF_ALU | BPF_XOR | BPF_X:
+ case BPF_ALU64 | BPF_ADD | BPF_K:
+ case BPF_ALU64 | BPF_ADD | BPF_X:
+ case BPF_ALU64 | BPF_SUB | BPF_K:
+ case BPF_ALU64 | BPF_SUB | BPF_X:
+ case BPF_ALU64 | BPF_OR | BPF_K:
+ case BPF_ALU64 | BPF_OR | BPF_X:
+ case BPF_ALU64 | BPF_AND | BPF_K:
+ case BPF_ALU64 | BPF_AND | BPF_X:
+ case BPF_ALU64 | BPF_XOR | BPF_K:
+ case BPF_ALU64 | BPF_XOR | BPF_X:
+ switch (BPF_SRC(code)) {
+ case BPF_X:
+ emit_ia32_alu_r64(is64, BPF_OP(code), dst,
+ src, dstk, sstk, &prog);
+ break;
+ case BPF_K:
+ emit_ia32_alu_i64(is64, BPF_OP(code), dst,
+ imm32, dstk, &prog);
+ break;
+ }
+ break;
+ case BPF_ALU | BPF_MUL | BPF_K:
+ case BPF_ALU | BPF_MUL | BPF_X:
+ switch (BPF_SRC(code)) {
+ case BPF_X:
+ emit_ia32_mul_r(dst_lo, src_lo, dstk,
+ sstk, &prog);
+ break;
+ case BPF_K:
+ /* mov ecx,imm32*/
+ EMIT2_off32(0xC7, add_1reg(0xC0, IA32_ECX),
+ imm32);
+ emit_ia32_mul_r(dst_lo, IA32_ECX, dstk,
+ false, &prog);
+ break;
+ }
+ emit_ia32_mov_i(dst_hi, 0, dstk, &prog);
+ break;
+ case BPF_ALU | BPF_LSH | BPF_X:
+ case BPF_ALU | BPF_RSH | BPF_X:
+ case BPF_ALU | BPF_ARSH | BPF_K:
+ case BPF_ALU | BPF_ARSH | BPF_X:
+ switch (BPF_SRC(code)) {
+ case BPF_X:
+ emit_ia32_shift_r(BPF_OP(code), dst_lo, src_lo,
+ dstk, sstk, &prog);
+ break;
+ case BPF_K:
+ /* mov ecx,imm32*/
+ EMIT2_off32(0xC7, add_1reg(0xC0, IA32_ECX),
+ imm32);
+ emit_ia32_shift_r(BPF_OP(code), dst_lo,
+ IA32_ECX, dstk, false,
+ &prog);
+ break;
+ }
+ emit_ia32_mov_i(dst_hi, 0, dstk, &prog);
+ break;
+ /* dst = dst / src(imm) */
+ /* dst = dst % src(imm) */
+ case BPF_ALU | BPF_DIV | BPF_K:
+ case BPF_ALU | BPF_DIV | BPF_X:
+ case BPF_ALU | BPF_MOD | BPF_K:
+ case BPF_ALU | BPF_MOD | BPF_X:
+ switch (BPF_SRC(code)) {
+ case BPF_X:
+ emit_ia32_div_mod_r(BPF_OP(code), dst_lo,
+ src_lo, dstk, sstk, &prog);
+ break;
+ case BPF_K:
+ /* mov ecx,imm32*/
+ EMIT2_off32(0xC7, add_1reg(0xC0, IA32_ECX),
+ imm32);
+ emit_ia32_div_mod_r(BPF_OP(code), dst_lo,
+ IA32_ECX, dstk, false,
+ &prog);
+ break;
+ }
+ emit_ia32_mov_i(dst_hi, 0, dstk, &prog);
+ break;
+ case BPF_ALU64 | BPF_DIV | BPF_K:
+ case BPF_ALU64 | BPF_DIV | BPF_X:
+ case BPF_ALU64 | BPF_MOD | BPF_K:
+ case BPF_ALU64 | BPF_MOD | BPF_X:
+ goto notyet;
+ /* dst = dst >> imm */
+ /* dst = dst << imm */
+ case BPF_ALU | BPF_RSH | BPF_K:
+ case BPF_ALU | BPF_LSH | BPF_K:
+ if (unlikely(imm32 > 31))
+ return -EINVAL;
+ /* mov ecx,imm32*/
+ EMIT2_off32(0xC7, add_1reg(0xC0, IA32_ECX), imm32);
+ emit_ia32_shift_r(BPF_OP(code), dst_lo, IA32_ECX, dstk,
+ false, &prog);
+ emit_ia32_mov_i(dst_hi, 0, dstk, &prog);
+ break;
+ /* dst = dst << imm */
+ case BPF_ALU64 | BPF_LSH | BPF_K:
+ if (unlikely(imm32 > 63))
+ return -EINVAL;
+ emit_ia32_lsh_i64(dst, imm32, dstk, &prog);
+ break;
+ /* dst = dst >> imm */
+ case BPF_ALU64 | BPF_RSH | BPF_K:
+ if (unlikely(imm32 > 63))
+ return -EINVAL;
+ emit_ia32_rsh_i64(dst, imm32, dstk, &prog);
+ break;
+ /* dst = dst << src */
+ case BPF_ALU64 | BPF_LSH | BPF_X:
+ emit_ia32_lsh_r64(dst, src, dstk, sstk, &prog);
+ break;
+ /* dst = dst >> src */
+ case BPF_ALU64 | BPF_RSH | BPF_X:
+ emit_ia32_rsh_r64(dst, src, dstk, sstk, &prog);
+ break;
+ /* dst = dst >> src (signed) */
+ case BPF_ALU64 | BPF_ARSH | BPF_X:
+ emit_ia32_arsh_r64(dst, src, dstk, sstk, &prog);
+ break;
+ /* dst = dst >> imm (signed) */
+ case BPF_ALU64 | BPF_ARSH | BPF_K:
+ if (unlikely(imm32 > 63))
+ return -EINVAL;
+ emit_ia32_arsh_i64(dst, imm32, dstk, &prog);
+ break;
+ /* dst = -dst */
+ case BPF_ALU | BPF_NEG:
+ emit_ia32_alu_i(is64, false, BPF_OP(code),
+ dst_lo, 0, dstk, &prog);
+ emit_ia32_mov_i(dst_hi, 0, dstk, &prog);
+ break;
+ /* dst = -dst (64 bit) */
+ case BPF_ALU64 | BPF_NEG:
+ emit_ia32_neg64(dst, dstk, &prog);
+ break;
+ /* dst = dst * src/imm */
+ case BPF_ALU64 | BPF_MUL | BPF_X:
+ case BPF_ALU64 | BPF_MUL | BPF_K:
+ switch (BPF_SRC(code)) {
+ case BPF_X:
+ emit_ia32_mul_r64(dst, src, dstk, sstk, &prog);
+ break;
+ case BPF_K:
+ emit_ia32_mul_i64(dst, imm32, dstk, &prog);
+ break;
+ }
+ break;
+ /* dst = htole(dst) */
+ case BPF_ALU | BPF_END | BPF_FROM_LE:
+ emit_ia32_to_le_r64(dst, imm32, dstk, &prog);
+ break;
+ /* dst = htobe(dst) */
+ case BPF_ALU | BPF_END | BPF_FROM_BE:
+ emit_ia32_to_be_r64(dst, imm32, dstk, &prog);
+ break;
+ /* dst = imm64 */
+ case BPF_LD | BPF_IMM | BPF_DW: {
+ s32 hi, lo = imm32;
+
+ hi = insn[1].imm;
+ emit_ia32_mov_i(dst_lo, lo, dstk, &prog);
+ emit_ia32_mov_i(dst_hi, hi, dstk, &prog);
+ insn++;
+ i++;
+ break;
+ }
+ /* ST: *(u8*)(dst_reg + off) = imm */
+ case BPF_ST | BPF_MEM | BPF_H:
+ case BPF_ST | BPF_MEM | BPF_B:
+ case BPF_ST | BPF_MEM | BPF_W:
+ case BPF_ST | BPF_MEM | BPF_DW:
+ if (dstk)
+ /* mov eax,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX),
+ STACK_VAR(dst_lo));
+ else
+ /* mov eax,dst_lo */
+ EMIT2(0x8B, add_2reg(0xC0, dst_lo, IA32_EAX));
+
+ switch (BPF_SIZE(code)) {
+ case BPF_B:
+ EMIT(0xC6, 1); break;
+ case BPF_H:
+ EMIT2(0x66, 0xC7); break;
+ case BPF_W:
+ case BPF_DW:
+ EMIT(0xC7, 1); break;
+ }
+
+ if (is_imm8(insn->off))
+ EMIT2(add_1reg(0x40, IA32_EAX), insn->off);
+ else
+ EMIT1_off32(add_1reg(0x80, IA32_EAX),
+ insn->off);
+ EMIT(imm32, bpf_size_to_x86_bytes(BPF_SIZE(code)));
+
+ if (BPF_SIZE(code) == BPF_DW) {
+ u32 hi;
+
+ hi = imm32 & (1<<31) ? (u32)~0 : 0;
+ EMIT2_off32(0xC7, add_1reg(0x80, IA32_EAX),
+ insn->off + 4);
+ EMIT(hi, 4);
+ }
+ break;
+
+ /* STX: *(u8*)(dst_reg + off) = src_reg */
+ case BPF_STX | BPF_MEM | BPF_B:
+ case BPF_STX | BPF_MEM | BPF_H:
+ case BPF_STX | BPF_MEM | BPF_W:
+ case BPF_STX | BPF_MEM | BPF_DW:
+ if (dstk)
+ /* mov eax,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX),
+ STACK_VAR(dst_lo));
+ else
+ /* mov eax,dst_lo */
+ EMIT2(0x8B, add_2reg(0xC0, dst_lo, IA32_EAX));
+
+ if (sstk)
+ /* mov edx,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EDX),
+ STACK_VAR(src_lo));
+ else
+ /* mov edx,src_lo */
+ EMIT2(0x8B, add_2reg(0xC0, src_lo, IA32_EDX));
+
+ switch (BPF_SIZE(code)) {
+ case BPF_B:
+ EMIT(0x88, 1); break;
+ case BPF_H:
+ EMIT2(0x66, 0x89); break;
+ case BPF_W:
+ case BPF_DW:
+ EMIT(0x89, 1); break;
+ }
+
+ if (is_imm8(insn->off))
+ EMIT2(add_2reg(0x40, IA32_EAX, IA32_EDX),
+ insn->off);
+ else
+ EMIT1_off32(add_2reg(0x80, IA32_EAX, IA32_EDX),
+ insn->off);
+
+ if (BPF_SIZE(code) == BPF_DW) {
+ if (sstk)
+ /* mov edx,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP,
+ IA32_EDX),
+ STACK_VAR(src_hi));
+ else
+ /* mov edx,src_hi */
+ EMIT2(0x8B, add_2reg(0xC0, src_hi,
+ IA32_EDX));
+ EMIT1(0x89);
+ if (is_imm8(insn->off + 4)) {
+ EMIT2(add_2reg(0x40, IA32_EAX,
+ IA32_EDX),
+ insn->off + 4);
+ } else {
+ EMIT1(add_2reg(0x80, IA32_EAX,
+ IA32_EDX));
+ EMIT(insn->off + 4, 4);
+ }
+ }
+ break;
+
+ /* LDX: dst_reg = *(u8*)(src_reg + off) */
+ case BPF_LDX | BPF_MEM | BPF_B:
+ case BPF_LDX | BPF_MEM | BPF_H:
+ case BPF_LDX | BPF_MEM | BPF_W:
+ case BPF_LDX | BPF_MEM | BPF_DW:
+ if (sstk)
+ /* mov eax,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX),
+ STACK_VAR(src_lo));
+ else
+ /* mov eax,src_lo */
+ EMIT2(0x8B, add_2reg(0xC0, src_lo, IA32_EAX));
+
+ switch (BPF_SIZE(code)) {
+ case BPF_B:
+ EMIT2(0x0F, 0xB6); break;
+ case BPF_H:
+ EMIT2(0x0F, 0xB7); break;
+ case BPF_W:
+ case BPF_DW:
+ EMIT(0x8B, 1); break;
+ }
+
+ if (is_imm8(insn->off))
+ EMIT2(add_2reg(0x40, IA32_EAX, IA32_EDX),
+ insn->off);
+ else
+ EMIT1_off32(add_2reg(0x80, IA32_EAX, IA32_EDX),
+ insn->off);
+
+ if (dstk)
+ /* mov dword ptr [ebp+off],edx */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, IA32_EDX),
+ STACK_VAR(dst_lo));
+ else
+ /* mov dst_lo,edx */
+ EMIT2(0x89, add_2reg(0xC0, dst_lo, IA32_EDX));
+ switch (BPF_SIZE(code)) {
+ case BPF_B:
+ case BPF_H:
+ case BPF_W:
+ if (dstk) {
+ EMIT3(0xC7, add_1reg(0x40, IA32_EBP),
+ STACK_VAR(dst_hi));
+ EMIT(0x0, 4);
+ } else {
+ EMIT3(0xC7, add_1reg(0xC0, dst_hi), 0);
+ }
+ break;
+ case BPF_DW:
+ EMIT2_off32(0x8B,
+ add_2reg(0x80, IA32_EAX, IA32_EDX),
+ insn->off + 4);
+ if (dstk)
+ EMIT3(0x89,
+ add_2reg(0x40, IA32_EBP,
+ IA32_EDX),
+ STACK_VAR(dst_hi));
+ else
+ EMIT2(0x89,
+ add_2reg(0xC0, dst_hi, IA32_EDX));
+ break;
+ default:
+ break;
+ }
+ break;
+ /* call */
+ case BPF_JMP | BPF_CALL:
+ {
+ const u8 *r1 = bpf2ia32[BPF_REG_1];
+ const u8 *r2 = bpf2ia32[BPF_REG_2];
+ const u8 *r3 = bpf2ia32[BPF_REG_3];
+ const u8 *r4 = bpf2ia32[BPF_REG_4];
+ const u8 *r5 = bpf2ia32[BPF_REG_5];
+
+ if (insn->src_reg == BPF_PSEUDO_CALL)
+ goto notyet;
+
+ func = (u8 *) __bpf_call_base + imm32;
+ jmp_offset = func - (image + addrs[i]);
+
+ if (!imm32 || !is_simm32(jmp_offset)) {
+ pr_err("unsupported BPF func %d addr %p image %p\n",
+ imm32, func, image);
+ return -EINVAL;
+ }
+
+ /* mov eax,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX),
+ STACK_VAR(r1[0]));
+ /* mov edx,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EDX),
+ STACK_VAR(r1[1]));
+
+ emit_push_r64(r5, &prog);
+ emit_push_r64(r4, &prog);
+ emit_push_r64(r3, &prog);
+ emit_push_r64(r2, &prog);
+
+ EMIT1_off32(0xE8, jmp_offset + 9);
+
+ /* mov dword ptr [ebp+off],eax */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, IA32_EAX),
+ STACK_VAR(r0[0]));
+ /* mov dword ptr [ebp+off],edx */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, IA32_EDX),
+ STACK_VAR(r0[1]));
+
+ /* add esp,32 */
+ EMIT3(0x83, add_1reg(0xC0, IA32_ESP), 32);
+ break;
+ }
+ case BPF_JMP | BPF_TAIL_CALL:
+ emit_bpf_tail_call(&prog);
+ break;
+
+ /* cond jump */
+ case BPF_JMP | BPF_JEQ | BPF_X:
+ case BPF_JMP | BPF_JNE | BPF_X:
+ case BPF_JMP | BPF_JGT | BPF_X:
+ case BPF_JMP | BPF_JLT | BPF_X:
+ case BPF_JMP | BPF_JGE | BPF_X:
+ case BPF_JMP | BPF_JLE | BPF_X:
+ case BPF_JMP | BPF_JSGT | BPF_X:
+ case BPF_JMP | BPF_JSLE | BPF_X:
+ case BPF_JMP | BPF_JSLT | BPF_X:
+ case BPF_JMP | BPF_JSGE | BPF_X: {
+ u8 dreg_lo = dstk ? IA32_EAX : dst_lo;
+ u8 dreg_hi = dstk ? IA32_EDX : dst_hi;
+ u8 sreg_lo = sstk ? IA32_ECX : src_lo;
+ u8 sreg_hi = sstk ? IA32_EBX : src_hi;
+
+ if (dstk) {
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX),
+ STACK_VAR(dst_lo));
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EDX),
+ STACK_VAR(dst_hi));
+ }
+
+ if (sstk) {
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_ECX),
+ STACK_VAR(src_lo));
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EBX),
+ STACK_VAR(src_hi));
+ }
+
+ /* cmp dreg_hi,sreg_hi */
+ EMIT2(0x39, add_2reg(0xC0, dreg_hi, sreg_hi));
+ EMIT2(IA32_JNE, 2);
+ /* cmp dreg_lo,sreg_lo */
+ EMIT2(0x39, add_2reg(0xC0, dreg_lo, sreg_lo));
+ goto emit_cond_jmp;
+ }
+ case BPF_JMP | BPF_JSET | BPF_X: {
+ u8 dreg_lo = dstk ? IA32_EAX : dst_lo;
+ u8 dreg_hi = dstk ? IA32_EDX : dst_hi;
+ u8 sreg_lo = sstk ? IA32_ECX : src_lo;
+ u8 sreg_hi = sstk ? IA32_EBX : src_hi;
+
+ if (dstk) {
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX),
+ STACK_VAR(dst_lo));
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EDX),
+ STACK_VAR(dst_hi));
+ }
+
+ if (sstk) {
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_ECX),
+ STACK_VAR(src_lo));
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EBX),
+ STACK_VAR(src_hi));
+ }
+ /* and dreg_lo,sreg_lo */
+ EMIT2(0x23, add_2reg(0xC0, sreg_lo, dreg_lo));
+ /* and dreg_hi,sreg_hi */
+ EMIT2(0x23, add_2reg(0xC0, sreg_hi, dreg_hi));
+ /* or dreg_lo,dreg_hi */
+ EMIT2(0x09, add_2reg(0xC0, dreg_lo, dreg_hi));
+ goto emit_cond_jmp;
+ }
+ case BPF_JMP | BPF_JSET | BPF_K: {
+ u32 hi;
+ u8 dreg_lo = dstk ? IA32_EAX : dst_lo;
+ u8 dreg_hi = dstk ? IA32_EDX : dst_hi;
+ u8 sreg_lo = IA32_ECX;
+ u8 sreg_hi = IA32_EBX;
+
+ if (dstk) {
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX),
+ STACK_VAR(dst_lo));
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EDX),
+ STACK_VAR(dst_hi));
+ }
+ hi = imm32 & (1<<31) ? (u32)~0 : 0;
+
+ /* mov ecx,imm32 */
+ EMIT2_off32(0xC7, add_1reg(0xC0, IA32_ECX), imm32);
+ /* mov ebx,imm32 */
+ EMIT2_off32(0xC7, add_1reg(0xC0, IA32_EBX), hi);
+
+ /* and dreg_lo,sreg_lo */
+ EMIT2(0x23, add_2reg(0xC0, sreg_lo, dreg_lo));
+ /* and dreg_hi,sreg_hi */
+ EMIT2(0x23, add_2reg(0xC0, sreg_hi, dreg_hi));
+ /* or dreg_lo,dreg_hi */
+ EMIT2(0x09, add_2reg(0xC0, dreg_lo, dreg_hi));
+ goto emit_cond_jmp;
+ }
+ case BPF_JMP | BPF_JEQ | BPF_K:
+ case BPF_JMP | BPF_JNE | BPF_K:
+ case BPF_JMP | BPF_JGT | BPF_K:
+ case BPF_JMP | BPF_JLT | BPF_K:
+ case BPF_JMP | BPF_JGE | BPF_K:
+ case BPF_JMP | BPF_JLE | BPF_K:
+ case BPF_JMP | BPF_JSGT | BPF_K:
+ case BPF_JMP | BPF_JSLE | BPF_K:
+ case BPF_JMP | BPF_JSLT | BPF_K:
+ case BPF_JMP | BPF_JSGE | BPF_K: {
+ u32 hi;
+ u8 dreg_lo = dstk ? IA32_EAX : dst_lo;
+ u8 dreg_hi = dstk ? IA32_EDX : dst_hi;
+ u8 sreg_lo = IA32_ECX;
+ u8 sreg_hi = IA32_EBX;
+
+ if (dstk) {
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX),
+ STACK_VAR(dst_lo));
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EDX),
+ STACK_VAR(dst_hi));
+ }
+
+ hi = imm32 & (1<<31) ? (u32)~0 : 0;
+ /* mov ecx,imm32 */
+ EMIT2_off32(0xC7, add_1reg(0xC0, IA32_ECX), imm32);
+ /* mov ebx,imm32 */
+ EMIT2_off32(0xC7, add_1reg(0xC0, IA32_EBX), hi);
+
+ /* cmp dreg_hi,sreg_hi */
+ EMIT2(0x39, add_2reg(0xC0, dreg_hi, sreg_hi));
+ EMIT2(IA32_JNE, 2);
+ /* cmp dreg_lo,sreg_lo */
+ EMIT2(0x39, add_2reg(0xC0, dreg_lo, sreg_lo));
+
+emit_cond_jmp: /* Convert BPF opcode to x86 */
+ switch (BPF_OP(code)) {
+ case BPF_JEQ:
+ jmp_cond = IA32_JE;
+ break;
+ case BPF_JSET:
+ case BPF_JNE:
+ jmp_cond = IA32_JNE;
+ break;
+ case BPF_JGT:
+ /* GT is unsigned '>', JA in x86 */
+ jmp_cond = IA32_JA;
+ break;
+ case BPF_JLT:
+ /* LT is unsigned '<', JB in x86 */
+ jmp_cond = IA32_JB;
+ break;
+ case BPF_JGE:
+ /* GE is unsigned '>=', JAE in x86 */
+ jmp_cond = IA32_JAE;
+ break;
+ case BPF_JLE:
+ /* LE is unsigned '<=', JBE in x86 */
+ jmp_cond = IA32_JBE;
+ break;
+ case BPF_JSGT:
+ /* Signed '>', GT in x86 */
+ jmp_cond = IA32_JG;
+ break;
+ case BPF_JSLT:
+ /* Signed '<', LT in x86 */
+ jmp_cond = IA32_JL;
+ break;
+ case BPF_JSGE:
+ /* Signed '>=', GE in x86 */
+ jmp_cond = IA32_JGE;
+ break;
+ case BPF_JSLE:
+ /* Signed '<=', LE in x86 */
+ jmp_cond = IA32_JLE;
+ break;
+ default: /* to silence GCC warning */
+ return -EFAULT;
+ }
+ jmp_offset = addrs[i + insn->off] - addrs[i];
+ if (is_imm8(jmp_offset)) {
+ EMIT2(jmp_cond, jmp_offset);
+ } else if (is_simm32(jmp_offset)) {
+ EMIT2_off32(0x0F, jmp_cond + 0x10, jmp_offset);
+ } else {
+ pr_err("cond_jmp gen bug %llx\n", jmp_offset);
+ return -EFAULT;
+ }
+
+ break;
+ }
+ case BPF_JMP | BPF_JA:
+ if (insn->off == -1)
+ /* -1 jmp instructions will always jump
+ * backwards two bytes. Explicitly handling
+ * this case avoids wasting too many passes
+ * when there are long sequences of replaced
+ * dead code.
+ */
+ jmp_offset = -2;
+ else
+ jmp_offset = addrs[i + insn->off] - addrs[i];
+
+ if (!jmp_offset)
+ /* Optimize out nop jumps */
+ break;
+emit_jmp:
+ if (is_imm8(jmp_offset)) {
+ EMIT2(0xEB, jmp_offset);
+ } else if (is_simm32(jmp_offset)) {
+ EMIT1_off32(0xE9, jmp_offset);
+ } else {
+ pr_err("jmp gen bug %llx\n", jmp_offset);
+ return -EFAULT;
+ }
+ break;
+
+ case BPF_LD | BPF_ABS | BPF_W:
+ case BPF_LD | BPF_ABS | BPF_H:
+ case BPF_LD | BPF_ABS | BPF_B:
+ case BPF_LD | BPF_IND | BPF_W:
+ case BPF_LD | BPF_IND | BPF_H:
+ case BPF_LD | BPF_IND | BPF_B:
+ {
+ int size;
+ const u8 *r6 = bpf2ia32[BPF_REG_6];
+
+ /* Setting up first argument */
+ /* mov eax,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP, IA32_EAX),
+ STACK_VAR(r6[0]));
+
+ /* Setting up second argument */
+ if (BPF_MODE(code) == BPF_ABS) {
+ /* mov %edx, imm32 */
+ EMIT1_off32(0xBA, imm32);
+ } else {
+ if (sstk)
+ /* mov edx,dword ptr [ebp+off] */
+ EMIT3(0x8B, add_2reg(0x40, IA32_EBP,
+ IA32_EDX),
+ STACK_VAR(src_lo));
+ else
+ /* mov edx,src_lo */
+ EMIT2(0x8B, add_2reg(0xC0, src_lo,
+ IA32_EDX));
+ if (imm32) {
+ if (is_imm8(imm32))
+ /* add %edx,imm8 */
+ EMIT3(0x83, 0xC2, imm32);
+ else
+ /* add %edx,imm32 */
+ EMIT2_off32(0x81, 0xC2, imm32);
+ }
+ }
+
+ /* Setting up third argument */
+ switch (BPF_SIZE(code)) {
+ case BPF_W:
+ size = 4;
+ break;
+ case BPF_H:
+ size = 2;
+ break;
+ case BPF_B:
+ size = 1;
+ break;
+ default:
+ return -EINVAL;
+ }
+ /* mov cl,size */
+ EMIT2(0xB1, size);
+ /* movzx ecx,cl */
+ EMIT3(0x0F, 0xB6, add_2reg(0xC0, IA32_ECX, IA32_ECX));
+
+ /* mov ebx,ebp */
+ EMIT2(0x8B, add_2reg(0xC0, IA32_EBP, IA32_EBX));
+ /* add %ebx,imm8 */
+ EMIT3(0x83, add_1reg(0xC0, IA32_EBX), SKB_BUFFER);
+ /* push ebx */
+ EMIT1(0x53);
+
+ /* Setting up function pointer to call */
+ /* mov ebx,imm32*/
+ EMIT2_off32(0xC7, add_1reg(0xC0, IA32_EBX),
+ (unsigned int)bpf_load_pointer);
+
+ EMIT2(0xFF, add_1reg(0xD0, IA32_EBX));
+ /* add %esp,4 */
+ EMIT3(0x83, add_1reg(0xC0, IA32_ESP), 4);
+ /* xor edx,edx */
+ EMIT2(0x33, add_2reg(0xC0, IA32_EDX, IA32_EDX));
+
+ /* mov dword ptr [ebp+off],edx */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, IA32_EDX),
+ STACK_VAR(r0[0]));
+ /* mov dword ptr [ebp+off],edx */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, IA32_EDX),
+ STACK_VAR(r0[1]));
+
+ /*
+ * Check whether the returned address is NULL.
+ * If it is NULL, jump to the epilogue; otherwise continue
+ * and load the value from the returned address.
+ */
+ EMIT3(0x83, add_1reg(0xF8, IA32_EAX), 0);
+ jmp_offset = ctx->cleanup_addr - addrs[i];
+
+ switch (BPF_SIZE(code)) {
+ case BPF_W:
+ jmp_offset += 7;
+ break;
+ case BPF_H:
+ jmp_offset += 10;
+ break;
+ case BPF_B:
+ jmp_offset += 6;
+ break;
+ }
+
+ EMIT2_off32(0x0F, IA32_JE + 0x10, jmp_offset);
+ /* Load value from the address */
+ switch (BPF_SIZE(code)) {
+ case BPF_W:
+ /* mov eax,[eax] */
+ EMIT2(0x8B, 0x0);
+ /* Emit 'bswap eax' */
+ EMIT2(0x0F, add_1reg(0xC8, IA32_EAX));
+ break;
+ case BPF_H:
+ EMIT3(0x0F, 0xB7, 0x0);
+ EMIT1(0x66);
+ EMIT3(0xC1, add_1reg(0xC8, IA32_EAX), 8);
+ break;
+ case BPF_B:
+ EMIT3(0x0F, 0xB6, 0x0);
+ break;
+ }
+
+ /* mov dword ptr [ebp+off],eax */
+ EMIT3(0x89, add_2reg(0x40, IA32_EBP, IA32_EAX),
+ STACK_VAR(r0[0]));
+ break;
+ }
+ /* STX XADD: lock *(u32 *)(dst + off) += src */
+ case BPF_STX | BPF_XADD | BPF_W:
+ /* STX XADD: lock *(u64 *)(dst + off) += src */
+ case BPF_STX | BPF_XADD | BPF_DW:
+ goto notyet;
+ case BPF_JMP | BPF_EXIT:
+ if (seen_exit) {
+ jmp_offset = ctx->cleanup_addr - addrs[i];
+ goto emit_jmp;
+ }
+ seen_exit = true;
+ /* Update cleanup_addr */
+ ctx->cleanup_addr = proglen;
+ emit_epilogue(&prog, bpf_prog->aux->stack_depth);
+ break;
+notyet:
+ pr_info_once("*** NOT YET: opcode %02x ***\n", code);
+ return -EFAULT;
+ default:
+ /*
+ * This error will be seen if a new instruction was added
+ * to the interpreter but not to the JIT, or if there is junk in
+ * bpf_prog.
+ */
+ pr_err("bpf_jit: unknown opcode %02x\n", code);
+ return -EINVAL;
+ }
+
+ ilen = prog - temp;
+ if (ilen > BPF_MAX_INSN_SIZE) {
+ pr_err("bpf_jit: fatal insn size error\n");
+ return -EFAULT;
+ }
+
+ if (image) {
+ if (unlikely(proglen + ilen > oldproglen)) {
+ pr_err("bpf_jit: fatal error\n");
+ return -EFAULT;
+ }
+ memcpy(image + proglen, temp, ilen);
+ }
+ proglen += ilen;
+ addrs[i] = proglen;
+ prog = temp;
+ }
+ return proglen;
+}
+
+struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
+{
+ struct bpf_binary_header *header = NULL;
+ struct bpf_prog *tmp, *orig_prog = prog;
+ int proglen, oldproglen = 0;
+ struct jit_context ctx = {};
+ bool tmp_blinded = false;
+ u8 *image = NULL;
+ int *addrs;
+ int pass;
+ int i;
+
+ if (!prog->jit_requested)
+ return orig_prog;
+
+ tmp = bpf_jit_blind_constants(prog);
+ /*
+ * If blinding was requested and we failed during blinding,
+ * we must fall back to the interpreter.
+ */
+ if (IS_ERR(tmp))
+ return orig_prog;
+ if (tmp != prog) {
+ tmp_blinded = true;
+ prog = tmp;
+ }
+
+ addrs = kmalloc(prog->len * sizeof(*addrs), GFP_KERNEL);
+ if (!addrs) {
+ prog = orig_prog;
+ goto out;
+ }
+
+ /*
+ * Before the first pass, make a rough estimate of addrs[]:
+ * each BPF instruction is translated to less than 64 bytes.
+ */
+ for (proglen = 0, i = 0; i < prog->len; i++) {
+ proglen += 64;
+ addrs[i] = proglen;
+ }
+ ctx.cleanup_addr = proglen;
+
+ /*
+ * The JITed image shrinks with every pass and the loop iterates
+ * until the image stops shrinking. Very large BPF programs
+ * may converge on the last pass. In such a case, do one more
+ * pass to emit the final image.
+ */
+ for (pass = 0; pass < 20 || image; pass++) {
+ proglen = do_jit(prog, addrs, image, oldproglen, &ctx);
+ if (proglen <= 0) {
+out_image:
+ image = NULL;
+ if (header)
+ bpf_jit_binary_free(header);
+ prog = orig_prog;
+ goto out_addrs;
+ }
+ if (image) {
+ if (proglen != oldproglen) {
+ pr_err("bpf_jit: proglen=%d != oldproglen=%d\n",
+ proglen, oldproglen);
+ goto out_image;
+ }
+ break;
+ }
+ if (proglen == oldproglen) {
+ header = bpf_jit_binary_alloc(proglen, &image,
+ 1, jit_fill_hole);
+ if (!header) {
+ prog = orig_prog;
+ goto out_addrs;
+ }
+ }
+ oldproglen = proglen;
+ cond_resched();
+ }
+
+ if (bpf_jit_enable > 1)
+ bpf_jit_dump(prog->len, proglen, pass + 1, image);
+
+ if (image) {
+ bpf_jit_binary_lock_ro(header);
+ prog->bpf_func = (void *)image;
+ prog->jited = 1;
+ prog->jited_len = proglen;
+ } else {
+ prog = orig_prog;
+ }
+
+out_addrs:
+ kfree(addrs);
+out:
+ if (tmp_blinded)
+ bpf_jit_prog_release_other(prog, prog == orig_prog ?
+ tmp : orig_prog);
+ return prog;
+}
--
1.8.5.6.2.g3d8a54e.dirty
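
The pass loop in bpf_int_jit_compile() above relies on jump encodings
getting shorter as the address estimates in addrs[] tighten. The sketch
below is only a toy model of that idea, not code from the patch:
struct toy_insn, toy_pass() and toy_converge() are hypothetical names,
jumps are assumed to cost 2 bytes with an imm8 offset and 5 bytes
otherwise, and every jump target is assumed to lie inside the program.

```c
#include <assert.h>

/* Hypothetical toy instruction, not a structure from the patch. */
struct toy_insn {
	int is_jmp;	/* non-zero: relative jump */
	int off;	/* BPF-style offset, relative to the next instruction */
	int size;	/* encoded size in bytes when !is_jmp */
};

static int fits_imm8(int v)
{
	return v >= -128 && v <= 127;
}

/*
 * One sizing pass. addrs[i] holds the end offset of instruction i,
 * so addrs[i + off] is the start of the jump target.
 * Assumes 0 <= i + off < n for every jump.
 */
static int toy_pass(const struct toy_insn *insn, int n, int *addrs)
{
	int i, proglen = 0;

	for (i = 0; i < n; i++) {
		int ilen = insn[i].size;

		if (insn[i].is_jmp) {
			int jmp_offset = addrs[i + insn[i].off] - addrs[i];

			/* short jump: opcode + imm8; near jump: opcode + imm32 */
			ilen = fits_imm8(jmp_offset) ? 2 : 5;
		}
		proglen += ilen;
		addrs[i] = proglen;
	}
	return proglen;
}

/* Repeat passes until the length stops changing; -1 if it never does. */
static int toy_converge(const struct toy_insn *insn, int n, int *addrs)
{
	int i, pass, proglen = 0, oldproglen = 0;

	assert(n > 0);

	/* Pessimistic first guess: 64 bytes per instruction. */
	for (i = 0; i < n; i++) {
		proglen += 64;
		addrs[i] = proglen;
	}

	for (pass = 0; pass < 20; pass++) {
		proglen = toy_pass(insn, n, addrs);
		if (proglen == oldproglen)
			return proglen;
		oldproglen = proglen;
	}
	return -1;
}
```

For a four-instruction toy program (a jump over two 10-byte
instructions, followed by a 1-byte instruction), the first pass sees an
estimated offset of 128 and picks the 5-byte jump (26 bytes in total),
the second pass sees an offset of 20 and shrinks the jump to 2 bytes
(23 bytes), and the third pass confirms 23 bytes.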
* Re: INFO: rcu detected stall in __schedule
From: Dmitry Vyukov @ 2018-05-03 6:07 UTC (permalink / raw)
To: Tetsuo Handa; +Cc: syzbot, syzkaller-bugs, LKML, linux-ppp, netdev, paulus
In-Reply-To: <1356afb7-80cf-d9d9-c282-e4e819807376@I-love.SAKURA.ne.jp>
On Thu, May 3, 2018 at 7:45 AM, Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
> I'm not sure whether this is a PPP bug.
>
> As of uptime = 484, RCU says that it stalled for 125 seconds.
>
> ----------
> [ 484.407032] INFO: rcu_sched self-detected stall on CPU
> [ 484.412488] 0-...!: (125000 ticks this GP) idle=f3e/1/4611686018427387906 softirq=112858/112858 fqs=0
> [ 484.422300] (t=125000 jiffies g=61626 c=61625 q=1534)
> [ 484.427663] rcu_sched kthread starved for 125000 jiffies! g61626 c61625 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=0
> ----------
>
> 484 - 125 = 359, which was about to start SND related fuzzing in that log.
>
> ----------
> 2033/05/18 03:36:31 executing program 1:
> r0 = socket(0x40000a, 0x5, 0x7)
> setsockopt$inet_int(r0, 0x0, 0x18, &(0x7f0000000000)=0x200, 0x4)
> bind$inet6(r0, &(0x7f00000000c0)={0xa, 0x0, 0x0, @loopback={0x0, 0x1}}, 0x1c)
> perf_event_open(&(0x7f0000000040)={0x2, 0x70, 0x3e5}, 0x0, 0xffffffffffffffff, 0xffffffffffffffff, 0x0)
> timer_create(0x0, &(0x7f00000001c0)={0x0, 0x15, 0x0, @thr={&(0x7f0000000440), &(0x7f0000000540)}}, &(0x7f0000000200))
> timer_getoverrun(0x0)
> perf_event_open(&(0x7f000025c000)={0x2, 0x78, 0x3e3}, 0x0, 0x0, 0xffffffffffffffff, 0x0)
> r1 = syz_open_dev$sndctrl(&(0x7f0000000200)='/dev/snd/controlC#\x00', 0x2, 0x0)
> perf_event_open(&(0x7f0000001000)={0x0, 0x70, 0x0, 0x0, 0x0, 0x0, 0x0, 0x8ce, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x7, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xfffffffffffffff8, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, @perf_bp={&(0x7f0000005000), 0x2}, 0x1000000000c}, 0x0, 0x0, 0xffffffffffffffff, 0x0)
> ioctl$SNDRV_CTL_IOCTL_SUBSCRIBE_EVENTS(r1, 0xc0045516, &(0x7f00000000c0)=0x1)
> r2 = syz_open_dev$sndpcmp(&(0x7f0000000100)='/dev/snd/pcmC#D#p\x00', 0x1, 0x4000)
> ioctl$SNDRV_SEQ_IOCTL_GET_QUEUE_CLIENT(r2, 0xc04c5349, &(0x7f0000000240)={0x200, 0xfffffffffffffcdc, 0x1})
> syz_open_dev$tun(&(0x7f00000003c0)='/dev/net/tun\x00', 0x0, 0x20402)
> ioctl$SNDRV_CTL_IOCTL_PVERSION(r1, 0xc1105517, &(0x7f0000001000)=""/250)
> ioctl$SNDRV_CTL_IOCTL_SUBSCRIBE_EVENTS(r1, 0xc0045516, &(0x7f0000000000))
>
> 2033/05/18 03:36:31 executing program 4:
> syz_emit_ethernet(0x3e, &(0x7f00000000c0)={@broadcast=[0xff, 0xff, 0xff, 0xff, 0xff, 0xff], @empty=[0x0, 0x0, 0xb00000000000000], [], {@ipv4={0x800, {{0x5, 0x4, 0x0, 0x0, 0x30, 0x0, 0x0, 0x0, 0x1, 0x0, @remote={0xac, 0x14, 0x14, 0xbb}, @dev={0xac, 0x14, 0x14}}, @icmp=@parameter_prob={0x5, 0x4, 0x0, 0x0, 0x0, 0x0, {0x5, 0x4, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, @local={0xac, 0x223, 0x14, 0xaa}, @dev={0xac, 0x14, 0x14}}}}}}}, &(0x7f0000000000)={0x0, 0x2, [0x0, 0x2e6]})
>
> 2033/05/18 03:36:31 executing program 1:
> r0 = socket$pppoe(0x18, 0x1, 0x0)
> connect$pppoe(r0, &(0x7f00000000c0)={0x18, 0x0, {0x1, @broadcast=[0xff, 0xff, 0xff, 0xff, 0xff, 0xff], 'ip6_vti0\x00'}}, 0x1e)
> r1 = socket(0x3, 0xb, 0x80000001)
> setsockopt$inet_sctp6_SCTP_ADAPTATION_LAYER(r1, 0x84, 0x7, &(0x7f0000000100)={0x2}, 0x4)
> ioctl$sock_inet_SIOCGIFADDR(r0, 0x8915, &(0x7f0000000040)={'veth1_to_bridge\x00', {0x2, 0x4e21}})
> r2 = syz_open_dev$admmidi(&(0x7f0000000000)='/dev/admmidi#\x00', 0x6, 0x8000)
> setsockopt$SO_VM_SOCKETS_BUFFER_MAX_SIZE(r2, 0x28, 0x2, &(0x7f0000000080)=0xffffffffffffff00, 0x8)
>
> [ 359.306427] snd_virmidi snd_virmidi.0: control 112:0:0:� :0 is already present
> ----------
It's the next one that caused the hang (the number in "Comm:
syz-executor1" matches with the number in "executing program 1"):
[ 359.306427] snd_virmidi snd_virmidi.0: control 112:0:0:� :0 is already present
2033/05/18 03:36:31 executing program 1:
r0 = openat$ptmx(0xffffffffffffff9c, &(0x7f0000000140)='/dev/ptmx\x00', 0x0, 0x0)
ioctl$TCSETS(r0, 0x40045431, &(0x7f00005befdc))
r1 = syz_open_pts(r0, 0x20201)
fcntl$setstatus(r1, 0x4, 0x2800)
ioctl$TCXONC(r1, 0x540a, 0x0)
perf_event_open(&(0x7f000025c000)={0x2, 0x70, 0x3e5, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, @perf_bp={&(0x7f000031f000)}}, 0x0,
0x0, 0xffffffffffffffff, 0x0)
write(r1, &(0x7f0000fd6000)='z', 0x1)
r2 = openat$ipvs(0xffffffffffffff9c, &(0x7f0000000000)='/proc/sys/net/ipv4/vs/sync_ports\x00', 0x2, 0x0)
ioctl$ifreq_SIOCGIFINDEX_team(0xffffffffffffff9c, 0x8933, &(0x7f00000012c0)={'team0\x00', <r3=>0x0})
bind$packet(r2, &(0x7f0000001300)={0x11, 0x1f, r3, 0x1, 0x0, 0x6, @random="31e8917e98e6"}, 0x14)
ioctl$TIOCSETD(r1, 0x5423, &(0x7f00000000c0)=0x3)
ioctl$TCFLSH(r0, 0x540b, 0x0)
close(r0)
* Re: [PATCH 2/2] drivers core: multi-threading device shutdown
From: Tobin C. Harding @ 2018-05-03 5:54 UTC (permalink / raw)
To: Pavel Tatashin
Cc: steven.sistare, daniel.m.jordan, linux-kernel, jeffrey.t.kirsher,
intel-wired-lan, netdev, gregkh
In-Reply-To: <20180503035931.22439-3-pasha.tatashin@oracle.com>
This code was a pleasure to read, super clean.
On Wed, May 02, 2018 at 11:59:31PM -0400, Pavel Tatashin wrote:
> When system is rebooted, halted or kexeced device_shutdown() is
> called.
>
> This function shuts down every single device by calling either:
> dev->bus->shutdown(dev)
> dev->driver->shutdown(dev)
>
> Even on a machine with just a moderate number of devices, device_shutdown()
> may take multiple seconds to complete, because many devices require
> specific delays to perform this operation.
>
> Here is a sample analysis of the time it takes to call device_shutdown() on a
> two-socket Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz machine.
>
> device_shutdown 2.95s
> mlx4_shutdown 1.14s
> megasas_shutdown 0.24s
> ixgbe_shutdown 0.37s x 4 (four ixgbe devices on my machine).
> the rest 0.09s
>
> In mlx4 we spent the most time, but that is because there is a 1 second
> sleep:
> mlx4_shutdown
> mlx4_unload_one
> mlx4_free_ownership
> msleep(1000)
>
> With megasas we spend a quarter of a second, but sometimes longer (up to 0.5s)
> in this path:
>
> megasas_shutdown
> megasas_flush_cache
> megasas_issue_blocked_cmd
> wait_event_timeout
>
> Finally, with ixgbe_shutdown() it takes 0.37 for each device, but that time
> is spread all over the place, with bigger offenders:
>
> ixgbe_shutdown
> __ixgbe_shutdown
> ixgbe_close_suspend
> ixgbe_down
> ixgbe_init_hw_generic
> ixgbe_reset_hw_X540
> msleep(100); 0.104483472
> ixgbe_get_san_mac_addr_generic 0.048414851
> ixgbe_get_wwn_prefix_generic 0.048409893
> ixgbe_start_hw_X540
> ixgbe_start_hw_generic
> ixgbe_clear_hw_cntrs_generic 0.048581502
> ixgbe_setup_fc_generic 0.024225800
>
> All the ixgbe_*generic functions end-up calling:
> ixgbe_read_eerd_X540()
> ixgbe_acquire_swfw_sync_X540
> usleep_range(5000, 6000);
> ixgbe_release_swfw_sync_X540
> usleep_range(5000, 6000);
>
> While these are short sleeps, they end-up calling them over 24 times!
> 24 * 0.0055s = 0.132s. Adding-up to 0.528s for four devices.
>
> While we should keep optimizing the individual device drivers, in some
> cases this is simply a hardware property that forces a specific delay, and
> we must wait.
>
> So, the solution for this problem is to shutdown devices in parallel.
> However, we must shutdown children before shutting down parents, so parent
> device must wait for its children to finish.
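The ordering constraint described above (children first, then the parent) is a post-order walk of the device tree. A minimal userspace sketch follows, assuming a hypothetical `struct node` in place of `struct device` and a sequential walk in place of the kernel threads used by the patch; it only illustrates the ordering, not the parallelism:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 4

/* Hypothetical stand-in for struct device; not the kernel type. */
struct node {
	const char *name;
	struct node *children[MAX_CHILDREN];
	size_t nr_children;
	int shutdown_seq;	/* position in the shutdown order, 0 = not yet */
};

/*
 * Shut down every child subtree first, then the node itself.
 * *seq counts how many nodes have been shut down so far.
 */
static void shutdown_tree(struct node *n, int *seq)
{
	size_t i;

	for (i = 0; i < n->nr_children; i++)
		shutdown_tree(n->children[i], seq);

	n->shutdown_seq = ++(*seq);
}
```

Whatever order the children finish in, the parent always receives the highest sequence number in its subtree, which is the invariant the parallel version has to preserve by waiting on its completion.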
>
> With this patch, on the same machine device_shutdown() takes 1.142s, and
> without the mlx4 one-second delay only 0.38s.
>
> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
> ---
> drivers/base/core.c | 238 +++++++++++++++++++++++++++++++++++---------
> 1 file changed, 189 insertions(+), 49 deletions(-)
>
> diff --git a/drivers/base/core.c b/drivers/base/core.c
> index b610816eb887..f370369a303b 100644
> --- a/drivers/base/core.c
> +++ b/drivers/base/core.c
> @@ -25,6 +25,7 @@
> #include <linux/netdevice.h>
> #include <linux/sched/signal.h>
> #include <linux/sysfs.h>
> +#include <linux/kthread.h>
>
> #include "base.h"
> #include "power/power.h"
> @@ -2102,6 +2103,59 @@ const char *device_get_devnode(struct device *dev,
> return *tmp = s;
> }
>
> +/**
> + * device_children_count - device children count
> + * @parent: parent struct device.
> + *
> + * Returns number of children for this device or 0 if none.
> + */
> +static int device_children_count(struct device *parent)
> +{
> + struct klist_iter i;
> + int children = 0;
> +
> + if (!parent->p)
> + return 0;
> +
> + klist_iter_init(&parent->p->klist_children, &i);
> + while (next_device(&i))
> + children++;
> + klist_iter_exit(&i);
> +
> + return children;
> +}
> +
> +/**
> + * device_get_child_by_index - Return child using the provided index.
> + * @parent: parent struct device.
> + * @index: Index of the child, where 0 is the first child in the children list,
> + * and so on.
> + *
> + * Returns child or NULL if child with this index is not present.
> + */
> +static struct device *
> +device_get_child_by_index(struct device *parent, int index)
> +{
> + struct klist_iter i;
> + struct device *dev = NULL, *d;
> + int child_index = 0;
> +
> + if (!parent->p || index < 0)
> + return NULL;
> +
> + klist_iter_init(&parent->p->klist_children, &i);
> + while ((d = next_device(&i)) != NULL) {
perhaps:
while ((d = next_device(&i))) {
> + if (child_index == index) {
> + dev = d;
> + break;
> + }
> + child_index++;
> + }
> + klist_iter_exit(&i);
> +
> + return dev;
> +}
> +
> /**
> * device_for_each_child - device child iterator.
> * @parent: parent struct device.
> @@ -2765,71 +2819,157 @@ int device_move(struct device *dev, struct device *new_parent,
> }
> EXPORT_SYMBOL_GPL(device_move);
>
> +/*
> + * device_shutdown_one - call ->shutdown() for the device passed as
> + * argument.
> + */
> +static void device_shutdown_one(struct device *dev)
> +{
> + /* Don't allow any more runtime suspends */
> + pm_runtime_get_noresume(dev);
> + pm_runtime_barrier(dev);
> +
> + if (dev->class && dev->class->shutdown_pre) {
> + if (initcall_debug)
> + dev_info(dev, "shutdown_pre\n");
> + dev->class->shutdown_pre(dev);
> + }
> + if (dev->bus && dev->bus->shutdown) {
> + if (initcall_debug)
> + dev_info(dev, "shutdown\n");
> + dev->bus->shutdown(dev);
> + } else if (dev->driver && dev->driver->shutdown) {
> + if (initcall_debug)
> + dev_info(dev, "shutdown\n");
> + dev->driver->shutdown(dev);
> + }
> +
> + /* Release device lock, and decrement the reference counter */
> + device_unlock(dev);
> + put_device(dev);
> +}
> +
> +static DECLARE_COMPLETION(device_root_tasks_complete);
> +static void device_shutdown_tree(struct device *dev);
> +static atomic_t device_root_tasks;
> +
> +/*
> + * Passed as an argument to device_shutdown_task().
> + * child_next_index the next available child index.
> + * tasks_running number of tasks still running. Each task decrements it
> + * when job is finished and the last task signals that the
> + * job is complete.
> + * complete Used to signal job completion.
> + * parent Parent device.
> + */
> +struct device_shutdown_task_data {
> + atomic_t child_next_index;
> + atomic_t tasks_running;
> + struct completion complete;
> + struct device *parent;
> +};
> +
> +static int device_shutdown_task(void *data)
> +{
> + struct device_shutdown_task_data *tdata =
> + (struct device_shutdown_task_data *)data;
perhaps:
struct device_shutdown_task_data *tdata = data;
> + int child_idx = atomic_inc_return(&tdata->child_next_index) - 1;
> + struct device *dev = device_get_child_by_index(tdata->parent,
> + child_idx);
perhaps:
struct device *dev = device_get_child_by_index(tdata->parent, child_idx);
This is over the 80 character limit but only by one character :)
> +
> + if (dev)
> + device_shutdown_tree(dev);
> + if (atomic_dec_return(&tdata->tasks_running) == 0)
> + complete(&tdata->complete);
> + return 0;
> +}
> +
> +/*
> + * Shutdown device tree with root started in dev. If dev has no children
> + * simply shutdown only this device. If dev has children recursively shutdown
> + * children first, and only then the parent. For performance reasons children
> + * are shutdown in parallel using kernel threads.
> + */
> +static void device_shutdown_tree(struct device *dev)
> +{
> + int children_count = device_children_count(dev);
> +
> + if (children_count) {
> + struct device_shutdown_task_data tdata;
> + int i;
> +
> + init_completion(&tdata.complete);
> + atomic_set(&tdata.child_next_index, 0);
> + atomic_set(&tdata.tasks_running, children_count);
> + tdata.parent = dev;
> +
> + for (i = 0; i < children_count; i++) {
> + kthread_run(device_shutdown_task,
> + &tdata, "device_shutdown.%s",
> + dev_name(dev));
> + }
> + wait_for_completion(&tdata.complete);
> + }
> + device_shutdown_one(dev);
> +}
> +
> +/*
> + * On shutdown each root device (the one that does not have a parent) goes
> + * through this function.
> + */
> +static int
> +device_shutdown_root_task(void *data)
> +{
> + struct device *dev = (struct device *)data;
> +
> + device_shutdown_tree(dev);
> + if (atomic_dec_return(&device_root_tasks) == 0)
> + complete(&device_root_tasks_complete);
> + return 0;
> +}
> +
> /**
> * device_shutdown - call ->shutdown() on each device to shutdown.
> */
> void device_shutdown(void)
> {
> - struct device *dev, *parent;
> + struct list_head *pos, *next;
> + int root_devices = 0;
> + struct device *dev;
>
> spin_lock(&devices_kset->list_lock);
> /*
> - * Walk the devices list backward, shutting down each in turn.
> - * Beware that device unplug events may also start pulling
> - * devices offline, even as the system is shutting down.
> + * Prepare devices for shutdown: lock, and increment references in every
> + * devices. Remove child devices from the list, and count number of root
* Prepare devices for shutdown: lock, and increment reference in each
* device. Remove child devices from the list, and count number of root
Hope this helps,
Tobin.
^ permalink raw reply
* Re: INFO: rcu detected stall in __schedule
From: Tetsuo Handa @ 2018-05-03 5:45 UTC (permalink / raw)
To: syzbot, syzkaller-bugs, dvyukov; +Cc: linux-kernel, linux-ppp, netdev, paulus
In-Reply-To: <000000000000d2fe62056b3ccca5@google.com>
I'm not sure whether this is a PPP bug.
As of uptime = 484, RCU says that it stalled for 125 seconds.
----------
[ 484.407032] INFO: rcu_sched self-detected stall on CPU
[ 484.412488] 0-...!: (125000 ticks this GP) idle=f3e/1/4611686018427387906 softirq=112858/112858 fqs=0
[ 484.422300] (t=125000 jiffies g=61626 c=61625 q=1534)
[ 484.427663] rcu_sched kthread starved for 125000 jiffies! g61626 c61625 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=0
----------
484 - 125 = 359, which is when the log was about to start SND-related fuzzing.
----------
2033/05/18 03:36:31 executing program 1:
r0 = socket(0x40000a, 0x5, 0x7)
setsockopt$inet_int(r0, 0x0, 0x18, &(0x7f0000000000)=0x200, 0x4)
bind$inet6(r0, &(0x7f00000000c0)={0xa, 0x0, 0x0, @loopback={0x0, 0x1}}, 0x1c)
perf_event_open(&(0x7f0000000040)={0x2, 0x70, 0x3e5}, 0x0, 0xffffffffffffffff, 0xffffffffffffffff, 0x0)
timer_create(0x0, &(0x7f00000001c0)={0x0, 0x15, 0x0, @thr={&(0x7f0000000440), &(0x7f0000000540)}}, &(0x7f0000000200))
timer_getoverrun(0x0)
perf_event_open(&(0x7f000025c000)={0x2, 0x78, 0x3e3}, 0x0, 0x0, 0xffffffffffffffff, 0x0)
r1 = syz_open_dev$sndctrl(&(0x7f0000000200)='/dev/snd/controlC#\x00', 0x2, 0x0)
perf_event_open(&(0x7f0000001000)={0x0, 0x70, 0x0, 0x0, 0x0, 0x0, 0x0, 0x8ce, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x7, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xfffffffffffffff8, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, @perf_bp={&(0x7f0000005000), 0x2}, 0x1000000000c}, 0x0, 0x0, 0xffffffffffffffff, 0x0)
ioctl$SNDRV_CTL_IOCTL_SUBSCRIBE_EVENTS(r1, 0xc0045516, &(0x7f00000000c0)=0x1)
r2 = syz_open_dev$sndpcmp(&(0x7f0000000100)='/dev/snd/pcmC#D#p\x00', 0x1, 0x4000)
ioctl$SNDRV_SEQ_IOCTL_GET_QUEUE_CLIENT(r2, 0xc04c5349, &(0x7f0000000240)={0x200, 0xfffffffffffffcdc, 0x1})
syz_open_dev$tun(&(0x7f00000003c0)='/dev/net/tun\x00', 0x0, 0x20402)
ioctl$SNDRV_CTL_IOCTL_PVERSION(r1, 0xc1105517, &(0x7f0000001000)=""/250)
ioctl$SNDRV_CTL_IOCTL_SUBSCRIBE_EVENTS(r1, 0xc0045516, &(0x7f0000000000))
2033/05/18 03:36:31 executing program 4:
syz_emit_ethernet(0x3e, &(0x7f00000000c0)={@broadcast=[0xff, 0xff, 0xff, 0xff, 0xff, 0xff], @empty=[0x0, 0x0, 0xb00000000000000], [], {@ipv4={0x800, {{0x5, 0x4, 0x0, 0x0, 0x30, 0x0, 0x0, 0x0, 0x1, 0x0, @remote={0xac, 0x14, 0x14, 0xbb}, @dev={0xac, 0x14, 0x14}}, @icmp=@parameter_prob={0x5, 0x4, 0x0, 0x0, 0x0, 0x0, {0x5, 0x4, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, @local={0xac, 0x223, 0x14, 0xaa}, @dev={0xac, 0x14, 0x14}}}}}}}, &(0x7f0000000000)={0x0, 0x2, [0x0, 0x2e6]})
2033/05/18 03:36:31 executing program 1:
r0 = socket$pppoe(0x18, 0x1, 0x0)
connect$pppoe(r0, &(0x7f00000000c0)={0x18, 0x0, {0x1, @broadcast=[0xff, 0xff, 0xff, 0xff, 0xff, 0xff], 'ip6_vti0\x00'}}, 0x1e)
r1 = socket(0x3, 0xb, 0x80000001)
setsockopt$inet_sctp6_SCTP_ADAPTATION_LAYER(r1, 0x84, 0x7, &(0x7f0000000100)={0x2}, 0x4)
ioctl$sock_inet_SIOCGIFADDR(r0, 0x8915, &(0x7f0000000040)={'veth1_to_bridge\x00', {0x2, 0x4e21}})
r2 = syz_open_dev$admmidi(&(0x7f0000000000)='/dev/admmidi#\x00', 0x6, 0x8000)
setsockopt$SO_VM_SOCKETS_BUFFER_MAX_SIZE(r2, 0x28, 0x2, &(0x7f0000000080)=0xffffffffffffff00, 0x8)
[ 359.306427] snd_virmidi snd_virmidi.0: control 112:0:0:�\b:0 is already present
----------
^ permalink raw reply
* Re: [lkp-robot] 486ad79630 [ 15.532543] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
From: Andrew Morton @ 2018-05-03 5:44 UTC (permalink / raw)
To: Cong Wang
Cc: kernel test robot, kernel test robot,
Linux Memory Management List, Johannes Weiner, LKP, David Miller,
Linux Kernel Network Developers
In-Reply-To: <CAM_iQpVDtrGCqd7NQ1vJXTuLMdz=GwbnN77vdkmY+PxtFmKHTw@mail.gmail.com>
On Wed, 2 May 2018 21:58:25 -0700 Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Wed, May 2, 2018 at 9:27 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > So it's saying that something which got committed into Linus's tree
> > after 4.17-rc3 has caused a NULL deref in
> > sock_release->llc_ui_release+0x3a/0xd0
>
> Do you mean it contains commit 3a04ce7130a7
> ("llc: fix NULL pointer deref for SOCK_ZAPPED")?
That was in 4.17-rc3 so if this report's bisection is correct, that
patch is innocent.
origin.patch (http://ozlabs.org/~akpm/mmots/broken-out/origin.patch)
contains no changes to net/llc/af_llc.c so perhaps this crash is also
occurring in 4.17-rc3 base.
^ permalink raw reply
* Re: [PATCH] net/xfrm: Fix lookups for states with spi == 0
From: Herbert Xu @ 2018-05-03 5:40 UTC (permalink / raw)
To: Dmitry Safonov
Cc: linux-kernel, 0x7f454c46, Steffen Klassert, David S. Miller,
netdev, Masahide NAKAMURA, YOSHIFUJI Hideaki
In-Reply-To: <1525264896.14025.23.camel@arista.com>
On Wed, May 02, 2018 at 01:41:36PM +0100, Dmitry Safonov wrote:
>
> But still it's possible to create ipsec with zero SPI.
> And it seems not making sense to search for a state with SPI hash if
> request has zero SPI.
Fair enough. In fact a zero SPI is legal and defined for IPcomp.
The bug arose from this patch:
commit 7b4dc3600e4877178ba94c7fbf7e520421378aa6
Author: Masahide NAKAMURA <nakam@linux-ipv6.org>
Date: Wed Sep 27 22:21:52 2006 -0700
[XFRM]: Do not add a state whose SPI is zero to the SPI hash.
SPI=0 is used for acquired IPsec SA and MIPv6 RO state.
Such state should not be added to the SPI hash
because we do not care about it on deleting path.
Signed-off-by: Masahide NAKAMURA <nakam@linux-ipv6.org>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
I think it would be better to revert this.
Cheers,
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply
* Re: [PATCH net] ipv4: fix fnhe usage by non-cached routes
From: Julian Anastasov @ 2018-05-03 5:32 UTC (permalink / raw)
To: David Ahern; +Cc: David Miller, netdev, Martin KaFai Lau, kernel-team, Xin Long
In-Reply-To: <a20853ac-e177-0fc3-1537-2b973b5a1713@gmail.com>
Hello,
On Wed, 2 May 2018, David Ahern wrote:
> On 5/2/18 12:41 AM, Julian Anastasov wrote:
> > Allow some non-cached routes to use non-expired fnhe:
> >
> > 1. ip_del_fnhe: moved above and now called by find_exception.
> > The 4.5+ commit deed49df7390 expires fnhe only when caching
> > routes. Change that to:
> >
> > 1.1. use fnhe for non-cached local output routes, with the help
> > from (2)
> >
> > 1.2. allow __mkroute_input to detect expired fnhe (outdated
> > fnhe_gw, for example) when do_cache is false, eg. when itag!=0
> > for unicast destinations.
> >
> > 2. __mkroute_output: keep fi to allow local routes with orig_oif != 0
> > to use fnhe info even when the new route will not be cached into fnhe.
> > After commit 839da4d98960 ("net: ipv4: set orig_oif based on fib
> > result for local traffic") it means all local routes will be affected
> > because they are not cached. This change is used to solve a PMTU
> > problem with IPVS (and probably Netfilter DNAT) setups that redirect
> > local clients from target local IP (local route to Virtual IP)
> > to new remote IP target, eg. IPVS TUN real server. Loopback has
> > 64K MTU and we need to create fnhe on the local route that will
> > keep the reduced PMTU for the Virtual IP. Without this change
> > fnhe_pmtu is updated from ICMP but never exposed to non-cached
> > local routes. This includes routes with flowi4_oif!=0 for 4.6+ and
> > with flowi4_oif=any for 4.14+).
>
> Can you add a test case to tools/testing/selftests/net/pmtu.sh to cover
> this situation?
Sure, I'll give it a try.
> > @@ -1310,8 +1340,14 @@ static struct fib_nh_exception *find_exception(struct fib_nh *nh, __be32 daddr)
> >
> > for (fnhe = rcu_dereference(hash[hval].chain); fnhe;
> > fnhe = rcu_dereference(fnhe->fnhe_next)) {
> > - if (fnhe->fnhe_daddr == daddr)
> > + if (fnhe->fnhe_daddr == daddr) {
> > + if (fnhe->fnhe_expires &&
> > + time_after(jiffies, fnhe->fnhe_expires)) {
> > + ip_del_fnhe(nh, daddr);
>
> I'm surprised this is done in the fast path vs gc time. (the existing
> code does as well; your change is only moving the call to make the input
> and output paths the same)
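For reference, the expiry test in the hunk above depends on a wraparound-safe tick comparison. A small userspace sketch, assuming a hypothetical `exception_expired()` helper and plain `unsigned long` tick values instead of jiffies; `tick_after()` has the same shape as the kernel's `time_after()` and relies on the usual two's-complement conversion to `long`:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * True if tick a is later than tick b. The unsigned subtraction wraps,
 * so the result stays correct across counter overflow as long as the
 * two values are less than half the counter range apart.
 */
static bool tick_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/*
 * Hypothetical helper mirroring the check in the hunk: an expiry of 0
 * means "never expires", otherwise the entry is stale once "now" has
 * passed the expiry tick.
 */
static bool exception_expired(unsigned long now, unsigned long expires)
{
	return expires && tick_after(now, expires);
}
```

The comparison is a handful of instructions, so the cost David mentions comes from the delete that follows a positive result, not from the test itself.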
>
>
> The change looks correct to me and all of my functional tests passed.
>
> Acked-by: David Ahern <dsahern@gmail.com>
Thanks for the review!
Regards
^ permalink raw reply
* [PATCH] sched: fix semicolon.cocci warnings
From: kbuild test robot @ 2018-05-03 5:05 UTC (permalink / raw)
To: Toke Høiland-Jørgensen; +Cc: kbuild-all, netdev, cake
In-Reply-To: <152527386316.14936.5409621935637217368.stgit@alrua-kau>
From: Fengguang Wu <fengguang.wu@intel.com>
net/sched/sch_cake.c:580:2-3: Unneeded semicolon
Remove unneeded semicolon.
Generated by: scripts/coccinelle/misc/semicolon.cocci
Fixes: 907a16741a03 ("sched: Add Common Applications Kept Enhanced (cake) qdisc")
CC: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
sch_cake.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/net/sched/sch_cake.c
+++ b/net/sched/sch_cake.c
@@ -577,7 +577,7 @@ cake_hash(struct cake_tin_data *q, const
default:
dsthost_hash = 0;
srchost_hash = 0;
- };
+ }
/* This *must* be after the above switch, since as a
* side-effect it sorts the src and dst addresses.
^ permalink raw reply
* Re: [PATCH net-next v7 1/7] sched: Add Common Applications Kept Enhanced (cake) qdisc
From: kbuild test robot @ 2018-05-03 5:05 UTC (permalink / raw)
To: Toke Høiland-Jørgensen; +Cc: kbuild-all, netdev, cake
In-Reply-To: <152527386316.14936.5409621935637217368.stgit@alrua-kau>
Hi Toke,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on net-next/master]
url: https://github.com/0day-ci/linux/commits/Toke-H-iland-J-rgensen/sched-Add-Common-Applications-Kept-Enhanced-cake-qdisc/20180503-073002
coccinelle warnings: (new ones prefixed by >>)
>> net/sched/sch_cake.c:580:2-3: Unneeded semicolon
Please review and possibly fold the followup patch.
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
^ permalink raw reply
* Re: [v2 PATCH 1/1] tg3: fix meaningless hw_stats reading after tg3_halt memset 0 hw_stats
From: Michael Chan @ 2018-05-03 5:04 UTC (permalink / raw)
To: Zumeng Chen
Cc: Netdev, open list, Siva Reddy Kallam,
prashant.sreedharan@broadcom.com, David Miller, Zumeng Chen
In-Reply-To: <8f73e98f-0c55-2d96-a1b7-0890bf90bf41@gmail.com>
On Wed, May 2, 2018 at 5:30 PM, Zumeng Chen <zumeng.chen@gmail.com> wrote:
> On 2018年05月03日 01:32, Michael Chan wrote:
>>
>> On Wed, May 2, 2018 at 3:27 AM, Zumeng Chen <zumeng.chen@gmail.com> wrote:
>>>
>>> On 2018年05月02日 13:12, Michael Chan wrote:
>>>>
>>>> On Tue, May 1, 2018 at 5:42 PM, Zumeng Chen <zumeng.chen@gmail.com>
>>>> wrote:
>>>>
>>>>> diff --git a/drivers/net/ethernet/broadcom/tg3.h
>>>>> b/drivers/net/ethernet/broadcom/tg3.h
>>>>> index 3b5e98e..c61d83c 100644
>>>>> --- a/drivers/net/ethernet/broadcom/tg3.h
>>>>> +++ b/drivers/net/ethernet/broadcom/tg3.h
>>>>> @@ -3102,6 +3102,7 @@ enum TG3_FLAGS {
>>>>> TG3_FLAG_ROBOSWITCH,
>>>>> TG3_FLAG_ONE_DMA_AT_ONCE,
>>>>> TG3_FLAG_RGMII_MODE,
>>>>> + TG3_FLAG_HALT,
>>>>
>>>> I think you should be able to use the existing INIT_COMPLETE flag
>>>
>>>
>>> No, it will bring the uncertain factors into the existed complicate
>>> logic
>>> of INIT_COMPLETE.
>>> And I think it's very simple logic here to fix the meaningless hw_stats
>>> reading and the problem
>>> of commit f5992b72. I even suspect if you have read INIT_COMPLETE related
>>> codes carefully.
>>>
>> We should use an existing flag whenever appropriate
>
>
> I disagree. This is sort of blahblah...
>>
I don't want to see another flag added that is practically the same as
!INIT_COMPLETE. The driver already has close to one hundred flags.
Adding a new flag that is similar to an existing flag will just make
the code more difficult to understand and maintain.
If you don't want to fix it the cleaner way, Siva or I will fix it.
^ permalink raw reply
* Re: Silently dropped UDP packets on kernel 4.14
From: Florian Westphal @ 2018-05-03 5:03 UTC (permalink / raw)
To: Kristian Evensen
Cc: Florian Westphal, Netfilter Development Mailing list,
Network Development
In-Reply-To: <CAKfDRXiO7qKHGLf5faSy7g_f1p9budJLXS9-uZWLPgyx3OAJbA@mail.gmail.com>
Kristian Evensen <kristian.evensen@gmail.com> wrote:
> I went for the early-insert approach and have patched
I'm sorry for suggesting that.
It doesn't work, because of NAT.
NAT rewrites packet content and changes the reply tuple, but the tuples
determine the hash insertion location.
I don't know how to solve this problem.
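To make the hashing point concrete, here is a toy sketch. The `struct tuple` layout and the XOR-fold hash are assumptions chosen for illustration only; conntrack's real tuple and hash function differ. The only point is that the bucket is a function of the tuple, so rewriting the tuple moves the entry:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified tuple; not the conntrack structure. */
struct tuple {
	uint32_t src_ip;
	uint32_t dst_ip;
	uint16_t src_port;
	uint16_t dst_port;
};

/* Toy hash: XOR the fields together, then fold the high half down. */
static uint32_t tuple_hash(const struct tuple *t)
{
	uint32_t h = t->src_ip ^ t->dst_ip ^
		     (((uint32_t)t->src_port << 16) | t->dst_port);

	h ^= h >> 16;
	return h;
}

/* The hash slot an entry with this tuple would be inserted into. */
static uint32_t tuple_bucket(const struct tuple *t, uint32_t nbuckets)
{
	return tuple_hash(t) % nbuckets;
}
```

If an entry were inserted using the reply tuple as it looks before NAT, a later lookup using the rewritten reply tuple would search a different bucket and miss it.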
^ permalink raw reply
* Re: [lkp-robot] 486ad79630 [ 15.532543] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
From: Cong Wang @ 2018-05-03 4:58 UTC (permalink / raw)
To: Andrew Morton
Cc: kernel test robot, kernel test robot,
Linux Memory Management List, Johannes Weiner, LKP, David Miller,
Linux Kernel Network Developers
In-Reply-To: <20180502212735.7660515ac03cf61630f5ff6b@linux-foundation.org>
On Wed, May 2, 2018 at 9:27 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
>
> So it's saying that something which got committed into Linus's tree
> after 4.17-rc3 has caused a NULL deref in
> sock_release->llc_ui_release+0x3a/0xd0
Do you mean it contains commit 3a04ce7130a7
("llc: fix NULL pointer deref for SOCK_ZAPPED")?
^ permalink raw reply
* [PATCH RFC v2 net-next 4/4] bpfilter: rough bpfilter codegen example hack
From: Alexei Starovoitov @ 2018-05-03 4:36 UTC (permalink / raw)
To: davem; +Cc: daniel, torvalds, gregkh, luto, netdev, linux-kernel, kernel-team
In-Reply-To: <20180503043604.1604587-1-ast@kernel.org>
From: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
net/bpfilter/Makefile | 2 +-
net/bpfilter/bpfilter_mod.h | 285 ++++++++++++++++++++++++++++++++++++++++++-
net/bpfilter/ctor.c | 57 +++++----
net/bpfilter/gen.c | 290 ++++++++++++++++++++++++++++++++++++++++++++
net/bpfilter/init.c | 11 +-
net/bpfilter/main.c | 15 ++-
net/bpfilter/sockopt.c | 137 ++++++++++++++++-----
net/bpfilter/tables.c | 5 +-
net/bpfilter/tgts.c | 1 +
9 files changed, 737 insertions(+), 66 deletions(-)
create mode 100644 net/bpfilter/gen.c
diff --git a/net/bpfilter/Makefile b/net/bpfilter/Makefile
index bec6181de995..3796651c76cb 100644
--- a/net/bpfilter/Makefile
+++ b/net/bpfilter/Makefile
@@ -4,7 +4,7 @@
#
hostprogs-y := bpfilter_umh
-bpfilter_umh-objs := main.o tgts.o targets.o tables.o init.o ctor.o sockopt.o
+bpfilter_umh-objs := main.o tgts.o targets.o tables.o init.o ctor.o sockopt.o gen.o
HOSTCFLAGS += -I. -Itools/include/
# a bit of elf magic to convert bpfilter_umh binary into a binary blob
diff --git a/net/bpfilter/bpfilter_mod.h b/net/bpfilter/bpfilter_mod.h
index f0de41b20793..b4209985efff 100644
--- a/net/bpfilter/bpfilter_mod.h
+++ b/net/bpfilter/bpfilter_mod.h
@@ -21,8 +21,8 @@ struct bpfilter_table_info {
unsigned int initial_entries;
unsigned int hook_entry[BPFILTER_INET_HOOK_MAX];
unsigned int underflow[BPFILTER_INET_HOOK_MAX];
- unsigned int stacksize;
- void ***jumpstack;
+// unsigned int stacksize;
+// void ***jumpstack;
unsigned char entries[0] __aligned(8);
};
@@ -64,22 +64,55 @@ struct bpfilter_ipt_error {
struct bpfilter_target {
struct list_head all_target_list;
- const char name[BPFILTER_EXTENSION_MAXNAMELEN];
+ char name[BPFILTER_EXTENSION_MAXNAMELEN];
unsigned int size;
int hold;
u16 family;
u8 rev;
};
+struct bpfilter_gen_ctx {
+ struct bpf_insn *img;
+ u32 len_cur;
+ u32 len_max;
+ u32 default_verdict;
+ int fd;
+ int ifindex;
+ bool offloaded;
+};
+
+union bpf_attr;
+int sys_bpf(int cmd, union bpf_attr *attr, unsigned int size);
+
+int bpfilter_gen_init(struct bpfilter_gen_ctx *ctx);
+int bpfilter_gen_prologue(struct bpfilter_gen_ctx *ctx);
+int bpfilter_gen_epilogue(struct bpfilter_gen_ctx *ctx);
+int bpfilter_gen_append(struct bpfilter_gen_ctx *ctx,
+ struct bpfilter_ipt_ip *ent, int verdict);
+int bpfilter_gen_commit(struct bpfilter_gen_ctx *ctx);
+void bpfilter_gen_destroy(struct bpfilter_gen_ctx *ctx);
+
struct bpfilter_target *bpfilter_target_get_by_name(const char *name);
void bpfilter_target_put(struct bpfilter_target *tgt);
int bpfilter_target_add(struct bpfilter_target *tgt);
-struct bpfilter_table_info *bpfilter_ipv4_table_ctor(struct bpfilter_table *tbl);
+struct bpfilter_table_info *
+bpfilter_ipv4_table_alloc(struct bpfilter_table *tbl, __u32 size_ents);
+struct bpfilter_table_info *
+bpfilter_ipv4_table_finalize(struct bpfilter_table *tbl,
+ struct bpfilter_table_info *info,
+ __u32 size_ents, __u32 num_ents);
+struct bpfilter_table_info *
+bpfilter_ipv4_table_finalize2(struct bpfilter_table *tbl,
+ struct bpfilter_table_info *info,
+ __u32 size_ents, __u32 num_ents);
+
int bpfilter_ipv4_register_targets(void);
void bpfilter_tables_init(void);
int bpfilter_get_info(void *addr, int len);
int bpfilter_get_entries(void *cmd, int len);
+int bpfilter_set_replace(void *cmd, int len);
+int bpfilter_set_add_counters(void *cmd, int len);
int bpfilter_ipv4_init(void);
int copy_from_user(void *dst, void *addr, int len);
@@ -93,4 +126,248 @@ extern int pid;
extern int debug_fd;
#define ENOTSUPP 524
+/* Helper macros for filter block array initializers. */
+
+/* ALU ops on registers, bpf_add|sub|...: dst_reg += src_reg */
+
+#define BPF_ALU64_REG(OP, DST, SRC) \
+ ((struct bpf_insn) { \
+ .code = BPF_ALU64 | BPF_OP(OP) | BPF_X, \
+ .dst_reg = DST, \
+ .src_reg = SRC, \
+ .off = 0, \
+ .imm = 0 })
+
+#define BPF_ALU32_REG(OP, DST, SRC) \
+ ((struct bpf_insn) { \
+ .code = BPF_ALU | BPF_OP(OP) | BPF_X, \
+ .dst_reg = DST, \
+ .src_reg = SRC, \
+ .off = 0, \
+ .imm = 0 })
+
+/* ALU ops on immediates, bpf_add|sub|...: dst_reg += imm32 */
+
+#define BPF_ALU64_IMM(OP, DST, IMM) \
+ ((struct bpf_insn) { \
+ .code = BPF_ALU64 | BPF_OP(OP) | BPF_K, \
+ .dst_reg = DST, \
+ .src_reg = 0, \
+ .off = 0, \
+ .imm = IMM })
+
+#define BPF_ALU32_IMM(OP, DST, IMM) \
+ ((struct bpf_insn) { \
+ .code = BPF_ALU | BPF_OP(OP) | BPF_K, \
+ .dst_reg = DST, \
+ .src_reg = 0, \
+ .off = 0, \
+ .imm = IMM })
+
+/* Endianess conversion, cpu_to_{l,b}e(), {l,b}e_to_cpu() */
+
+#define BPF_ENDIAN(TYPE, DST, LEN) \
+ ((struct bpf_insn) { \
+ .code = BPF_ALU | BPF_END | BPF_SRC(TYPE), \
+ .dst_reg = DST, \
+ .src_reg = 0, \
+ .off = 0, \
+ .imm = LEN })
+
+/* Short form of mov, dst_reg = src_reg */
+
+#define BPF_MOV64_REG(DST, SRC) \
+ ((struct bpf_insn) { \
+ .code = BPF_ALU64 | BPF_MOV | BPF_X, \
+ .dst_reg = DST, \
+ .src_reg = SRC, \
+ .off = 0, \
+ .imm = 0 })
+
+#define BPF_MOV32_REG(DST, SRC) \
+ ((struct bpf_insn) { \
+ .code = BPF_ALU | BPF_MOV | BPF_X, \
+ .dst_reg = DST, \
+ .src_reg = SRC, \
+ .off = 0, \
+ .imm = 0 })
+
+/* Short form of mov, dst_reg = imm32 */
+
+#define BPF_MOV64_IMM(DST, IMM) \
+ ((struct bpf_insn) { \
+ .code = BPF_ALU64 | BPF_MOV | BPF_K, \
+ .dst_reg = DST, \
+ .src_reg = 0, \
+ .off = 0, \
+ .imm = IMM })
+
+#define BPF_MOV32_IMM(DST, IMM) \
+ ((struct bpf_insn) { \
+ .code = BPF_ALU | BPF_MOV | BPF_K, \
+ .dst_reg = DST, \
+ .src_reg = 0, \
+ .off = 0, \
+ .imm = IMM })
+
+/* BPF_LD_IMM64 macro encodes single 'load 64-bit immediate' insn */
+#define BPF_LD_IMM64(DST, IMM) \
+ BPF_LD_IMM64_RAW(DST, 0, IMM)
+
+#define BPF_LD_IMM64_RAW(DST, SRC, IMM) \
+ ((struct bpf_insn) { \
+ .code = BPF_LD | BPF_DW | BPF_IMM, \
+ .dst_reg = DST, \
+ .src_reg = SRC, \
+ .off = 0, \
+ .imm = (__u32) (IMM) }), \
+ ((struct bpf_insn) { \
+ .code = 0, /* zero is reserved opcode */ \
+ .dst_reg = 0, \
+ .src_reg = 0, \
+ .off = 0, \
+ .imm = ((__u64) (IMM)) >> 32 })
+
+/* pseudo BPF_LD_IMM64 insn used to refer to process-local map_fd */
+#define BPF_LD_MAP_FD(DST, MAP_FD) \
+ BPF_LD_IMM64_RAW(DST, BPF_PSEUDO_MAP_FD, MAP_FD)
+
+/* Short form of mov based on type, BPF_X: dst_reg = src_reg, BPF_K: dst_reg = imm32 */
+
+#define BPF_MOV64_RAW(TYPE, DST, SRC, IMM) \
+ ((struct bpf_insn) { \
+ .code = BPF_ALU64 | BPF_MOV | BPF_SRC(TYPE), \
+ .dst_reg = DST, \
+ .src_reg = SRC, \
+ .off = 0, \
+ .imm = IMM })
+
+#define BPF_MOV32_RAW(TYPE, DST, SRC, IMM) \
+ ((struct bpf_insn) { \
+ .code = BPF_ALU | BPF_MOV | BPF_SRC(TYPE), \
+ .dst_reg = DST, \
+ .src_reg = SRC, \
+ .off = 0, \
+ .imm = IMM })
+
+/* Direct packet access, R0 = *(uint *) (skb->data + imm32) */
+
+#define BPF_LD_ABS(SIZE, IMM) \
+ ((struct bpf_insn) { \
+ .code = BPF_LD | BPF_SIZE(SIZE) | BPF_ABS, \
+ .dst_reg = 0, \
+ .src_reg = 0, \
+ .off = 0, \
+ .imm = IMM })
+
+/* Indirect packet access, R0 = *(uint *) (skb->data + src_reg + imm32) */
+
+#define BPF_LD_IND(SIZE, SRC, IMM) \
+ ((struct bpf_insn) { \
+ .code = BPF_LD | BPF_SIZE(SIZE) | BPF_IND, \
+ .dst_reg = 0, \
+ .src_reg = SRC, \
+ .off = 0, \
+ .imm = IMM })
+
+/* Memory load, dst_reg = *(uint *) (src_reg + off16) */
+
+#define BPF_LDX_MEM(SIZE, DST, SRC, OFF) \
+ ((struct bpf_insn) { \
+ .code = BPF_LDX | BPF_SIZE(SIZE) | BPF_MEM, \
+ .dst_reg = DST, \
+ .src_reg = SRC, \
+ .off = OFF, \
+ .imm = 0 })
+
+/* Memory store, *(uint *) (dst_reg + off16) = src_reg */
+
+#define BPF_STX_MEM(SIZE, DST, SRC, OFF) \
+ ((struct bpf_insn) { \
+ .code = BPF_STX | BPF_SIZE(SIZE) | BPF_MEM, \
+ .dst_reg = DST, \
+ .src_reg = SRC, \
+ .off = OFF, \
+ .imm = 0 })
+
+/* Atomic memory add, *(uint *)(dst_reg + off16) += src_reg */
+
+#define BPF_STX_XADD(SIZE, DST, SRC, OFF) \
+ ((struct bpf_insn) { \
+ .code = BPF_STX | BPF_SIZE(SIZE) | BPF_XADD, \
+ .dst_reg = DST, \
+ .src_reg = SRC, \
+ .off = OFF, \
+ .imm = 0 })
+
+/* Memory store, *(uint *) (dst_reg + off16) = imm32 */
+
+#define BPF_ST_MEM(SIZE, DST, OFF, IMM) \
+ ((struct bpf_insn) { \
+ .code = BPF_ST | BPF_SIZE(SIZE) | BPF_MEM, \
+ .dst_reg = DST, \
+ .src_reg = 0, \
+ .off = OFF, \
+ .imm = IMM })
+
+/* Conditional jumps against registers, if (dst_reg 'op' src_reg) goto pc + off16 */
+
+#define BPF_JMP_REG(OP, DST, SRC, OFF) \
+ ((struct bpf_insn) { \
+ .code = BPF_JMP | BPF_OP(OP) | BPF_X, \
+ .dst_reg = DST, \
+ .src_reg = SRC, \
+ .off = OFF, \
+ .imm = 0 })
+
+/* Conditional jumps against immediates, if (dst_reg 'op' imm32) goto pc + off16 */
+
+#define BPF_JMP_IMM(OP, DST, IMM, OFF) \
+ ((struct bpf_insn) { \
+ .code = BPF_JMP | BPF_OP(OP) | BPF_K, \
+ .dst_reg = DST, \
+ .src_reg = 0, \
+ .off = OFF, \
+ .imm = IMM })
+
+/* Unconditional jumps, goto pc + off16 */
+
+#define BPF_JMP_A(OFF) \
+ ((struct bpf_insn) { \
+ .code = BPF_JMP | BPF_JA, \
+ .dst_reg = 0, \
+ .src_reg = 0, \
+ .off = OFF, \
+ .imm = 0 })
+
+/* Function call */
+
+#define BPF_EMIT_CALL(FUNC) \
+ ((struct bpf_insn) { \
+ .code = BPF_JMP | BPF_CALL, \
+ .dst_reg = 0, \
+ .src_reg = 0, \
+ .off = 0, \
+ .imm = ((FUNC) - __bpf_call_base) })
+
+/* Raw code statement block */
+
+#define BPF_RAW_INSN(CODE, DST, SRC, OFF, IMM) \
+ ((struct bpf_insn) { \
+ .code = CODE, \
+ .dst_reg = DST, \
+ .src_reg = SRC, \
+ .off = OFF, \
+ .imm = IMM })
+
+/* Program exit */
+
+#define BPF_EXIT_INSN() \
+ ((struct bpf_insn) { \
+ .code = BPF_JMP | BPF_EXIT, \
+ .dst_reg = 0, \
+ .src_reg = 0, \
+ .off = 0, \
+ .imm = 0 })
+
#endif
diff --git a/net/bpfilter/ctor.c b/net/bpfilter/ctor.c
index efb7feef3c42..ba44c21cacfa 100644
--- a/net/bpfilter/ctor.c
+++ b/net/bpfilter/ctor.c
@@ -1,8 +1,12 @@
// SPDX-License-Identifier: GPL-2.0
-#include <sys/socket.h>
-#include <linux/bitops.h>
#include <stdlib.h>
#include <stdio.h>
+#include <string.h>
+
+#include <sys/socket.h>
+
+#include <linux/bitops.h>
+
#include "bpfilter_mod.h"
unsigned int __sw_hweight32(unsigned int w)
@@ -13,35 +17,47 @@ unsigned int __sw_hweight32(unsigned int w)
return (w * 0x01010101) >> 24;
}
-struct bpfilter_table_info *bpfilter_ipv4_table_ctor(struct bpfilter_table *tbl)
+struct bpfilter_table_info *bpfilter_ipv4_table_alloc(struct bpfilter_table *tbl,
+ __u32 size_ents)
{
unsigned int num_hooks = hweight32(tbl->valid_hooks);
- struct bpfilter_ipt_standard *tgts;
struct bpfilter_table_info *info;
- struct bpfilter_ipt_error *term;
- unsigned int mask, offset, h, i;
unsigned int size, alloc_size;
size = sizeof(struct bpfilter_ipt_standard) * num_hooks;
size += sizeof(struct bpfilter_ipt_error);
+ size += size_ents;
alloc_size = size + sizeof(struct bpfilter_table_info);
info = malloc(alloc_size);
- if (!info)
- return NULL;
+ if (info) {
+ memset(info, 0, alloc_size);
+ info->size = size;
+ }
+ return info;
+}
+
+struct bpfilter_table_info *bpfilter_ipv4_table_finalize(struct bpfilter_table *tbl,
+ struct bpfilter_table_info *info,
+ __u32 size_ents, __u32 num_ents)
+{
+ unsigned int num_hooks = hweight32(tbl->valid_hooks);
+ struct bpfilter_ipt_standard *tgts;
+ struct bpfilter_ipt_error *term;
+ struct bpfilter_ipt_entry *ent;
+ unsigned int mask, offset, h, i;
- info->num_entries = num_hooks + 1;
- info->size = size;
+ info->num_entries = num_ents + num_hooks + 1;
- tgts = (struct bpfilter_ipt_standard *) (info + 1);
- term = (struct bpfilter_ipt_error *) (tgts + num_hooks);
+ ent = (struct bpfilter_ipt_entry *)(info + 1);
+ tgts = (struct bpfilter_ipt_standard *)((u8 *)ent + size_ents);
+ term = (struct bpfilter_ipt_error *)(tgts + num_hooks);
mask = tbl->valid_hooks;
offset = 0;
h = 0;
i = 0;
- dprintf(debug_fd, "mask %x num_hooks %d\n", mask, num_hooks);
while (mask) {
struct bpfilter_ipt_standard *t;
@@ -55,7 +71,6 @@ struct bpfilter_table_info *bpfilter_ipv4_table_ctor(struct bpfilter_table *tbl)
BPFILTER_IPT_STANDARD_INIT(BPFILTER_NF_ACCEPT);
t->target.target.u.kernel.target =
bpfilter_target_get_by_name(t->target.target.u.user.name);
- dprintf(debug_fd, "user.name %s\n", t->target.target.u.user.name);
if (!t->target.target.u.kernel.target)
goto out_fail;
@@ -67,14 +82,10 @@ struct bpfilter_table_info *bpfilter_ipv4_table_ctor(struct bpfilter_table *tbl)
*term = (struct bpfilter_ipt_error) BPFILTER_IPT_ERROR_INIT;
term->target.target.u.kernel.target =
bpfilter_target_get_by_name(term->target.target.u.user.name);
- dprintf(debug_fd, "user.name %s\n", term->target.target.u.user.name);
- if (!term->target.target.u.kernel.target)
- goto out_fail;
-
- dprintf(debug_fd, "info %p\n", info);
- return info;
-
+ if (!term->target.target.u.kernel.target) {
out_fail:
- free(info);
- return NULL;
+ free(info);
+ return NULL;
+ }
+ return info;
}
diff --git a/net/bpfilter/gen.c b/net/bpfilter/gen.c
new file mode 100644
index 000000000000..8e08561b78f1
--- /dev/null
+++ b/net/bpfilter/gen.c
@@ -0,0 +1,290 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <errno.h>
+#include <string.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+
+#include <linux/if_ether.h>
+#include <linux/if_link.h>
+#include <linux/rtnetlink.h>
+#include <linux/bpf.h>
+typedef __u16 __bitwise __sum16; /* hack */
+#include <linux/ip.h>
+
+#include <arpa/inet.h>
+
+#include "bpfilter_mod.h"
+
+unsigned int if_nametoindex(const char *ifname);
+
+static inline __u64 bpf_ptr_to_u64(const void *ptr)
+{
+ return (__u64)(unsigned long)ptr;
+}
+
+static int bpf_prog_load(enum bpf_prog_type type,
+ const struct bpf_insn *insns,
+ unsigned int insn_num,
+ __u32 offload_ifindex)
+{
+ union bpf_attr attr = {};
+
+ attr.prog_type = type;
+ attr.insns = bpf_ptr_to_u64(insns);
+ attr.insn_cnt = insn_num;
+ attr.license = bpf_ptr_to_u64("GPL");
+ attr.prog_ifindex = offload_ifindex;
+
+ return sys_bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
+}
+
+static int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags)
+{
+ struct sockaddr_nl sa;
+ int sock, seq = 0, len, ret = -1;
+ char buf[4096];
+ struct nlattr *nla, *nla_xdp;
+ struct {
+ struct nlmsghdr nh;
+ struct ifinfomsg ifinfo;
+ char attrbuf[64];
+ } req;
+ struct nlmsghdr *nh;
+ struct nlmsgerr *err;
+
+ memset(&sa, 0, sizeof(sa));
+ sa.nl_family = AF_NETLINK;
+
+ sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
+ if (sock < 0) {
+ printf("open netlink socket: %s\n", strerror(errno));
+ return -1;
+ }
+
+ if (bind(sock, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
+ printf("bind to netlink: %s\n", strerror(errno));
+ goto cleanup;
+ }
+
+ memset(&req, 0, sizeof(req));
+ req.nh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg));
+ req.nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
+ req.nh.nlmsg_type = RTM_SETLINK;
+ req.nh.nlmsg_pid = 0;
+ req.nh.nlmsg_seq = ++seq;
+ req.ifinfo.ifi_family = AF_UNSPEC;
+ req.ifinfo.ifi_index = ifindex;
+
+ /* start nested attribute for XDP */
+ nla = (struct nlattr *)(((char *)&req)
+ + NLMSG_ALIGN(req.nh.nlmsg_len));
+ nla->nla_type = NLA_F_NESTED | 43/*IFLA_XDP*/;
+ nla->nla_len = NLA_HDRLEN;
+
+ /* add XDP fd */
+ nla_xdp = (struct nlattr *)((char *)nla + nla->nla_len);
+ nla_xdp->nla_type = 1/*IFLA_XDP_FD*/;
+ nla_xdp->nla_len = NLA_HDRLEN + sizeof(int);
+ memcpy((char *)nla_xdp + NLA_HDRLEN, &fd, sizeof(fd));
+ nla->nla_len += nla_xdp->nla_len;
+
+ /* if user passed in any flags, add those too */
+ if (flags) {
+ nla_xdp = (struct nlattr *)((char *)nla + nla->nla_len);
+ nla_xdp->nla_type = 3/*IFLA_XDP_FLAGS*/;
+ nla_xdp->nla_len = NLA_HDRLEN + sizeof(flags);
+ memcpy((char *)nla_xdp + NLA_HDRLEN, &flags, sizeof(flags));
+ nla->nla_len += nla_xdp->nla_len;
+ }
+
+ req.nh.nlmsg_len += NLA_ALIGN(nla->nla_len);
+
+ if (send(sock, &req, req.nh.nlmsg_len, 0) < 0) {
+ printf("send to netlink: %s\n", strerror(errno));
+ goto cleanup;
+ }
+
+ len = recv(sock, buf, sizeof(buf), 0);
+ if (len < 0) {
+ printf("recv from netlink: %s\n", strerror(errno));
+ goto cleanup;
+ }
+
+ for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
+ nh = NLMSG_NEXT(nh, len)) {
+ if (nh->nlmsg_pid != getpid()) {
+ printf("Wrong pid %d, expected %d\n",
+ nh->nlmsg_pid, getpid());
+ goto cleanup;
+ }
+ if (nh->nlmsg_seq != seq) {
+ printf("Wrong seq %d, expected %d\n",
+ nh->nlmsg_seq, seq);
+ goto cleanup;
+ }
+ switch (nh->nlmsg_type) {
+ case NLMSG_ERROR:
+ err = (struct nlmsgerr *)NLMSG_DATA(nh);
+ if (!err->error)
+ continue;
+ printf("nlmsg error %s\n", strerror(-err->error));
+ goto cleanup;
+ case NLMSG_DONE:
+ break;
+ }
+ }
+
+ ret = 0;
+
+cleanup:
+ close(sock);
+ return ret;
+}
+
+static int bpfilter_load_dev(struct bpfilter_gen_ctx *ctx)
+{
+ u32 xdp_flags = 0;
+
+ if (ctx->offloaded)
+ xdp_flags |= XDP_FLAGS_HW_MODE;
+ return bpf_set_link_xdp_fd(ctx->ifindex, ctx->fd, xdp_flags);
+}
+
+int bpfilter_gen_init(struct bpfilter_gen_ctx *ctx)
+{
+ unsigned int len_max = BPF_MAXINSNS;
+
+ memset(ctx, 0, sizeof(*ctx));
+ ctx->img = calloc(len_max, sizeof(struct bpf_insn));
+ if (!ctx->img)
+ return -ENOMEM;
+ ctx->len_max = len_max;
+ ctx->fd = -1;
+ ctx->default_verdict = XDP_PASS;
+
+ return 0;
+}
+
+#define EMIT(x) \
+ do { \
+ if (ctx->len_cur + 1 > ctx->len_max) \
+ return -ENOMEM; \
+ ctx->img[ctx->len_cur++] = x; \
+ } while (0)
+
+int bpfilter_gen_prologue(struct bpfilter_gen_ctx *ctx)
+{
+ EMIT(BPF_MOV64_REG(BPF_REG_9, BPF_REG_1));
+ EMIT(BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_9,
+ offsetof(struct xdp_md, data)));
+ EMIT(BPF_LDX_MEM(BPF_W, BPF_REG_3, BPF_REG_9,
+ offsetof(struct xdp_md, data_end)));
+ EMIT(BPF_MOV64_REG(BPF_REG_1, BPF_REG_2));
+ EMIT(BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, ETH_HLEN));
+ EMIT(BPF_JMP_REG(BPF_JLE, BPF_REG_1, BPF_REG_3, 2));
+ EMIT(BPF_MOV32_IMM(BPF_REG_0, ctx->default_verdict));
+ EMIT(BPF_EXIT_INSN());
+ return 0;
+}
+
+int bpfilter_gen_epilogue(struct bpfilter_gen_ctx *ctx)
+{
+ EMIT(BPF_MOV32_IMM(BPF_REG_0, ctx->default_verdict));
+ EMIT(BPF_EXIT_INSN());
+ return 0;
+}
+
+static int bpfilter_gen_check_entry(const struct bpfilter_ipt_ip *ent)
+{
+#define M_FF "\xff\xff\xff\xff"
+ static const __u8 mask1[IFNAMSIZ] = M_FF M_FF M_FF M_FF;
+ static const __u8 mask0[IFNAMSIZ] = { };
+ int ones = strlen(ent->in_iface); ones += ones > 0;
+#undef M_FF
+ if (strlen(ent->out_iface) > 0)
+ return -ENOTSUPP;
+ if (memcmp(ent->in_iface_mask, mask1, ones) ||
+ memcmp(&ent->in_iface_mask[ones], mask0, sizeof(mask0) - ones))
+ return -ENOTSUPP;
+ if ((ent->src_mask != 0 && ent->src_mask != 0xffffffff) ||
+ (ent->dst_mask != 0 && ent->dst_mask != 0xffffffff))
+ return -ENOTSUPP;
+
+ return 0;
+}
+
+int bpfilter_gen_append(struct bpfilter_gen_ctx *ctx,
+ struct bpfilter_ipt_ip *ent, int verdict)
+{
+ u32 match_xdp = verdict == -1 ? XDP_DROP : XDP_PASS;
+ int ret, ifindex, match_state = 0;
+
+ /* register convention: R1: tmp, R2: data, R3: data_end, R9: xdp_md ctx */
+ ret = bpfilter_gen_check_entry(ent);
+ if (ret < 0)
+ return ret;
+ if (ent->src_mask == 0 && ent->dst_mask == 0)
+ return 0;
+
+ ifindex = if_nametoindex(ent->in_iface);
+ if (!ifindex)
+ return 0;
+ if (ctx->ifindex && ctx->ifindex != ifindex)
+ return -ENOTSUPP;
+
+ ctx->ifindex = ifindex;
+ match_state = !!ent->src_mask + !!ent->dst_mask;
+
+ EMIT(BPF_MOV64_REG(BPF_REG_1, BPF_REG_2));
+ EMIT(BPF_MOV32_IMM(BPF_REG_5, 0));
+ EMIT(BPF_LDX_MEM(BPF_H, BPF_REG_4, BPF_REG_1,
+ offsetof(struct ethhdr, h_proto)));
+ EMIT(BPF_JMP_IMM(BPF_JNE, BPF_REG_4, htons(ETH_P_IP),
+ 3 + match_state * 3));
+ EMIT(BPF_ALU64_IMM(BPF_ADD, BPF_REG_1,
+ sizeof(struct ethhdr) + sizeof(struct iphdr)));
+ EMIT(BPF_JMP_REG(BPF_JGT, BPF_REG_1, BPF_REG_3, 1 + match_state * 3));
+ EMIT(BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, -(int)sizeof(struct iphdr)));
+ if (ent->src_mask) {
+ EMIT(BPF_LDX_MEM(BPF_W, BPF_REG_4, BPF_REG_1,
+ offsetof(struct iphdr, saddr)));
+ EMIT(BPF_JMP_IMM(BPF_JNE, BPF_REG_4, ent->src, 1));
+ EMIT(BPF_ALU32_IMM(BPF_ADD, BPF_REG_5, 1));
+ }
+ if (ent->dst_mask) {
+ EMIT(BPF_LDX_MEM(BPF_W, BPF_REG_4, BPF_REG_1,
+ offsetof(struct iphdr, daddr)));
+ EMIT(BPF_JMP_IMM(BPF_JNE, BPF_REG_4, ent->dst, 1));
+ EMIT(BPF_ALU32_IMM(BPF_ADD, BPF_REG_5, 1));
+ }
+ EMIT(BPF_JMP_IMM(BPF_JNE, BPF_REG_5, match_state, 2));
+ EMIT(BPF_MOV32_IMM(BPF_REG_0, match_xdp));
+ EMIT(BPF_EXIT_INSN());
+ return 0;
+}
+
+int bpfilter_gen_commit(struct bpfilter_gen_ctx *ctx)
+{
+ int ret;
+
+ ret = bpf_prog_load(BPF_PROG_TYPE_XDP, ctx->img,
+ ctx->len_cur, ctx->ifindex);
+ if (ret > 0)
+ ctx->offloaded = true;
+ if (ret < 0)
+ ret = bpf_prog_load(BPF_PROG_TYPE_XDP, ctx->img,
+ ctx->len_cur, 0);
+ if (ret > 0) {
+ ctx->fd = ret;
+ ret = bpfilter_load_dev(ctx);
+ }
+
+ return ret < 0 ? ret : 0;
+}
+
+void bpfilter_gen_destroy(struct bpfilter_gen_ctx *ctx)
+{
+ free(ctx->img);
+ close(ctx->fd);
+}
diff --git a/net/bpfilter/init.c b/net/bpfilter/init.c
index 699f3f623189..14e621a03217 100644
--- a/net/bpfilter/init.c
+++ b/net/bpfilter/init.c
@@ -1,6 +1,8 @@
// SPDX-License-Identifier: GPL-2.0
-#include <sys/socket.h>
#include <errno.h>
+
+#include <sys/socket.h>
+
#include "bpfilter_mod.h"
static struct bpfilter_table filter_table_ipv4 = {
@@ -22,12 +24,13 @@ int bpfilter_ipv4_init(void)
if (err)
return err;
- info = bpfilter_ipv4_table_ctor(t);
+ info = bpfilter_ipv4_table_alloc(t, 0);
+ if (!info)
+ return -ENOMEM;
+ info = bpfilter_ipv4_table_finalize(t, info, 0, 0);
if (!info)
return -ENOMEM;
-
t->info = info;
-
return bpfilter_table_add(&filter_table_ipv4);
}
diff --git a/net/bpfilter/main.c b/net/bpfilter/main.c
index e0273ca201ad..ebd8a4fb1e95 100644
--- a/net/bpfilter/main.c
+++ b/net/bpfilter/main.c
@@ -1,20 +1,23 @@
// SPDX-License-Identifier: GPL-2.0
#define _GNU_SOURCE
-#include <sys/uio.h>
#include <errno.h>
#include <stdio.h>
-#include <sys/socket.h>
#include <fcntl.h>
#include <unistd.h>
-#include "include/uapi/linux/bpf.h"
+
+#include <sys/uio.h>
+#include <sys/socket.h>
+
#include <asm/unistd.h>
+
+#include "include/uapi/linux/bpf.h"
+
#include "bpfilter_mod.h"
#include "msgfmt.h"
extern long int syscall (long int __sysno, ...);
-static inline int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr,
- unsigned int size)
+int sys_bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
return syscall(321, cmd, attr, size);
}
@@ -39,7 +42,7 @@ int copy_to_user(void *addr, const void *src, int len)
struct iovec local;
struct iovec remote;
- local.iov_base = (void *) src;
+ local.iov_base = (void *)src;
local.iov_len = len;
remote.iov_base = addr;
remote.iov_len = len;
diff --git a/net/bpfilter/sockopt.c b/net/bpfilter/sockopt.c
index 43687daf51a3..26ad12a11736 100644
--- a/net/bpfilter/sockopt.c
+++ b/net/bpfilter/sockopt.c
@@ -1,10 +1,14 @@
// SPDX-License-Identifier: GPL-2.0
-#include <sys/socket.h>
#include <errno.h>
#include <string.h>
#include <stdio.h>
+#include <stdlib.h>
+
+#include <sys/socket.h>
+
#include "bpfilter_mod.h"
+/* TODO: Move all of this into a proper encoding/decoding layer. */
static int fetch_name(void *addr, int len, char *name, int name_len)
{
if (copy_from_user(name, addr, name_len))
@@ -55,12 +59,17 @@ int bpfilter_get_info(void *addr, int len)
return err;
}
-static int copy_target(struct bpfilter_standard_target *ut,
- struct bpfilter_standard_target *kt)
+static int target_u2k(struct bpfilter_standard_target *kt)
{
- struct bpfilter_target *tgt;
- int sz;
+ kt->target.u.kernel.target =
+ bpfilter_target_get_by_name(kt->target.u.user.name);
+ return kt->target.u.kernel.target ? 0 : -EINVAL;
+}
+static int target_k2u(struct bpfilter_standard_target *ut,
+ struct bpfilter_standard_target *kt)
+{
+ struct bpfilter_target *tgt;
if (put_user(kt->target.u.target_size,
&ut->target.u.target_size))
@@ -69,12 +78,9 @@ static int copy_target(struct bpfilter_standard_target *ut,
tgt = kt->target.u.kernel.target;
if (copy_to_user(ut->target.u.user.name, tgt->name, strlen(tgt->name)))
return -EFAULT;
-
if (put_user(tgt->rev, &ut->target.u.user.revision))
return -EFAULT;
-
- sz = tgt->size;
- if (copy_to_user(ut->target.data, kt->target.data, sz))
+ if (copy_to_user(ut->target.data, kt->target.data, tgt->size))
return -EFAULT;
return 0;
@@ -84,30 +90,25 @@ static int do_get_entries(void *up,
struct bpfilter_table *tbl,
struct bpfilter_table_info *info)
{
- unsigned int total_size = info->size;
const struct bpfilter_ipt_entry *ent;
+ unsigned int total_size = info->size;
+ void *base = info->entries;
unsigned int off;
- void *base;
-
- base = info->entries;
for (off = 0; off < total_size; off += ent->next_offset) {
- struct bpfilter_xt_counters *cntrs;
struct bpfilter_standard_target *tgt;
+ struct bpfilter_xt_counters *cntrs;
ent = base + off;
if (copy_to_user(up + off, ent, sizeof(*ent)))
return -EFAULT;
-
- /* XXX Just clear counters for now. XXX */
+ /* XXX: Just clear counters for now. */
cntrs = up + off + offsetof(struct bpfilter_ipt_entry, cntrs);
if (put_user(0, &cntrs->packet_cnt) ||
put_user(0, &cntrs->byte_cnt))
return -EINVAL;
-
- tgt = (void *) ent + ent->target_offset;
- dprintf(debug_fd, "target.verdict %d\n", tgt->verdict);
- if (copy_target(up + off + ent->target_offset, tgt))
+ tgt = (void *)ent + ent->target_offset;
+ if (target_k2u(up + off + ent->target_offset, tgt))
return -EFAULT;
}
return 0;
@@ -123,31 +124,113 @@ int bpfilter_get_entries(void *cmd, int len)
if (len < sizeof(struct bpfilter_ipt_get_entries))
return -EINVAL;
-
if (copy_from_user(&req, cmd, sizeof(req)))
return -EFAULT;
-
tbl = bpfilter_table_get_by_name(req.name, strlen(req.name));
if (!tbl)
return -ENOENT;
-
info = tbl->info;
if (!info) {
err = -ENOENT;
goto out_put;
}
-
if (info->size != req.size) {
err = -EINVAL;
goto out_put;
}
-
err = do_get_entries(uptr->entries, tbl, info);
- dprintf(debug_fd, "do_get_entries %d req.size %d\n", err, req.size);
-
out_put:
bpfilter_table_put(tbl);
+ return err;
+}
+static int do_set_replace(struct bpfilter_ipt_replace *req, void *base,
+ struct bpfilter_table *tbl)
+{
+ unsigned int total_size = req->size;
+ struct bpfilter_table_info *info;
+ struct bpfilter_ipt_entry *ent;
+ struct bpfilter_gen_ctx ctx;
+ unsigned int off, sents = 0, ents = 0;
+ int ret;
+
+ ret = bpfilter_gen_init(&ctx);
+ if (ret < 0)
+ return ret;
+ ret = bpfilter_gen_prologue(&ctx);
+ if (ret < 0)
+ return ret;
+ info = bpfilter_ipv4_table_alloc(tbl, total_size);
+ if (!info)
+ return -ENOMEM;
+ if (copy_from_user(&info->entries[0], base, req->size)) {
+ free(info);
+ return -EFAULT;
+ }
+ base = &info->entries[0];
+ for (off = 0; off < total_size; off += ent->next_offset) {
+ struct bpfilter_standard_target *tgt;
+ ent = base + off;
+ ents++;
+ sents += ent->next_offset;
+ tgt = (void *)ent + ent->target_offset;
+ target_u2k(tgt);
+ ret = bpfilter_gen_append(&ctx, &ent->ip, tgt->verdict);
+ if (ret < 0)
+ goto err;
+ }
+ info->num_entries = ents;
+ info->size = sents;
+ memcpy(info->hook_entry, req->hook_entry, sizeof(info->hook_entry));
+ memcpy(info->underflow, req->underflow, sizeof(info->underflow));
+ ret = bpfilter_gen_epilogue(&ctx);
+ if (ret < 0)
+ goto err;
+ ret = bpfilter_gen_commit(&ctx);
+ if (ret < 0)
+ goto err;
+ free(tbl->info);
+ tbl->info = info;
+ bpfilter_gen_destroy(&ctx);
+ dprintf(debug_fd, "offloaded %u\n", ctx.offloaded);
+ return ret;
+err:
+ free(info);
+ return ret;
+}
+
+int bpfilter_set_replace(void *cmd, int len)
+{
+ struct bpfilter_ipt_replace *uptr = cmd;
+ struct bpfilter_ipt_replace req;
+ struct bpfilter_table_info *info;
+ struct bpfilter_table *tbl;
+ int err;
+
+ if (len < sizeof(req))
+ return -EINVAL;
+ if (copy_from_user(&req, cmd, sizeof(req)))
+ return -EFAULT;
+ if (req.num_counters >= INT_MAX / sizeof(struct bpfilter_xt_counters))
+ return -ENOMEM;
+ if (req.num_counters == 0)
+ return -EINVAL;
+ req.name[sizeof(req.name) - 1] = 0;
+ tbl = bpfilter_table_get_by_name(req.name, strlen(req.name));
+ if (!tbl)
+ return -ENOENT;
+ info = tbl->info;
+ if (!info) {
+ err = -ENOENT;
+ goto out_put;
+ }
+ err = do_set_replace(&req, uptr->entries, tbl);
+out_put:
+ bpfilter_table_put(tbl);
return err;
}
+int bpfilter_set_add_counters(void *cmd, int len)
+{
+ return 0;
+}
diff --git a/net/bpfilter/tables.c b/net/bpfilter/tables.c
index 9a96599be634..e0dab283092d 100644
--- a/net/bpfilter/tables.c
+++ b/net/bpfilter/tables.c
@@ -1,8 +1,11 @@
// SPDX-License-Identifier: GPL-2.0
-#include <sys/socket.h>
#include <errno.h>
#include <string.h>
+
+#include <sys/socket.h>
+
#include <linux/hashtable.h>
+
#include "bpfilter_mod.h"
static unsigned int full_name_hash(const void *salt, const char *name, unsigned int len)
diff --git a/net/bpfilter/tgts.c b/net/bpfilter/tgts.c
index eac5e8ac0b4b..0a00bc289d3d 100644
--- a/net/bpfilter/tgts.c
+++ b/net/bpfilter/tgts.c
@@ -1,5 +1,6 @@
// SPDX-License-Identifier: GPL-2.0
#include <sys/socket.h>
+
#include "bpfilter_mod.h"
struct bpfilter_target std_tgt = {
--
2.9.5
* [PATCH RFC v2 net-next 3/4] bpfilter: add iptable get/set parsing
From: Alexei Starovoitov @ 2018-05-03 4:36 UTC
To: davem; +Cc: daniel, torvalds, gregkh, luto, netdev, linux-kernel, kernel-team
In-Reply-To: <20180503043604.1604587-1-ast@kernel.org>
From: "David S. Miller" <davem@davemloft.net>
Parse iptables binary blobs into bpfilter's internal data structures.
bpfilter.ko only passes the [gs]etsockopt commands from the kernel to the umh;
all parsing is done inside the umh.
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
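Note for readers: the get/set paths in this series walk a packed blob of
variable-length entries by following each entry's next_offset. The block
below is a minimal, standalone sketch of that walk with bounds checks added.
It is not part of the patch: struct demo_entry and demo_count_entries() are
hypothetical names used only for illustration, and the real
struct bpfilter_ipt_entry carries more fields than the two offsets kept here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for struct bpfilter_ipt_entry:
 * only the two offsets needed to walk the blob are kept.
 */
struct demo_entry {
	uint16_t target_offset;	/* offset of the target inside this entry */
	uint16_t next_offset;	/* size of this entry, i.e. distance to the next */
};

static_assert(sizeof(struct demo_entry) == 2 * sizeof(uint16_t),
	      "no padding expected in demo_entry");

/* Walk 'size' bytes of packed entries. Returns the number of entries,
 * or -1 if an entry is malformed or runs past the end of the blob.
 */
static int demo_count_entries(const unsigned char *blob, size_t size)
{
	size_t off = 0;
	int n = 0;

	while (off < size) {
		struct demo_entry ent;

		if (size - off < sizeof(ent))
			return -1;
		/* Copy the header out to avoid unaligned access. */
		memcpy(&ent, blob + off, sizeof(ent));
		if (ent.next_offset < sizeof(ent) ||
		    ent.target_offset < sizeof(ent) ||
		    ent.target_offset > ent.next_offset ||
		    ent.next_offset > size - off)
			return -1;
		off += ent.next_offset;
		n++;
	}
	return n;
}
```

A caller would pass the buffer copied from the requesting process together
with the size from the request, and treat a negative return as -EINVAL.
Rejecting a next_offset smaller than the header also guarantees forward
progress, so a zero next_offset cannot make the loop spin.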
include/uapi/linux/bpfilter.h | 179 ++++++++++++++++++++++++++++++++++++++++++
net/bpfilter/Makefile | 2 +-
net/bpfilter/bpfilter_mod.h | 96 ++++++++++++++++++++++
net/bpfilter/ctor.c | 80 +++++++++++++++++++
net/bpfilter/init.c | 33 ++++++++
net/bpfilter/main.c | 51 ++++++++++++
net/bpfilter/sockopt.c | 153 ++++++++++++++++++++++++++++++++++++
net/bpfilter/tables.c | 70 +++++++++++++++++
net/bpfilter/targets.c | 51 ++++++++++++
net/bpfilter/tgts.c | 25 ++++++
10 files changed, 739 insertions(+), 1 deletion(-)
create mode 100644 net/bpfilter/bpfilter_mod.h
create mode 100644 net/bpfilter/ctor.c
create mode 100644 net/bpfilter/init.c
create mode 100644 net/bpfilter/sockopt.c
create mode 100644 net/bpfilter/tables.c
create mode 100644 net/bpfilter/targets.c
create mode 100644 net/bpfilter/tgts.c
diff --git a/include/uapi/linux/bpfilter.h b/include/uapi/linux/bpfilter.h
index 2ec3cc99ea4c..38d54e9947a1 100644
--- a/include/uapi/linux/bpfilter.h
+++ b/include/uapi/linux/bpfilter.h
@@ -18,4 +18,183 @@ enum {
BPFILTER_IPT_GET_MAX,
};
+enum {
+ BPFILTER_XT_TABLE_MAXNAMELEN = 32,
+};
+
+enum {
+ BPFILTER_NF_DROP = 0,
+ BPFILTER_NF_ACCEPT = 1,
+ BPFILTER_NF_STOLEN = 2,
+ BPFILTER_NF_QUEUE = 3,
+ BPFILTER_NF_REPEAT = 4,
+ BPFILTER_NF_STOP = 5,
+ BPFILTER_NF_MAX_VERDICT = BPFILTER_NF_STOP,
+};
+
+enum {
+ BPFILTER_INET_HOOK_PRE_ROUTING = 0,
+ BPFILTER_INET_HOOK_LOCAL_IN = 1,
+ BPFILTER_INET_HOOK_FORWARD = 2,
+ BPFILTER_INET_HOOK_LOCAL_OUT = 3,
+ BPFILTER_INET_HOOK_POST_ROUTING = 4,
+ BPFILTER_INET_HOOK_MAX,
+};
+
+enum {
+ BPFILTER_PROTO_UNSPEC = 0,
+ BPFILTER_PROTO_INET = 1,
+ BPFILTER_PROTO_IPV4 = 2,
+ BPFILTER_PROTO_ARP = 3,
+ BPFILTER_PROTO_NETDEV = 5,
+ BPFILTER_PROTO_BRIDGE = 7,
+ BPFILTER_PROTO_IPV6 = 10,
+ BPFILTER_PROTO_DECNET = 12,
+ BPFILTER_PROTO_NUMPROTO,
+};
+
+#ifndef INT_MAX
+#define INT_MAX ((int)(~0U>>1))
+#endif
+#ifndef INT_MIN
+#define INT_MIN (-INT_MAX - 1)
+#endif
+
+enum {
+ BPFILTER_IP_PRI_FIRST = INT_MIN,
+ BPFILTER_IP_PRI_CONNTRACK_DEFRAG = -400,
+ BPFILTER_IP_PRI_RAW = -300,
+ BPFILTER_IP_PRI_SELINUX_FIRST = -225,
+ BPFILTER_IP_PRI_CONNTRACK = -200,
+ BPFILTER_IP_PRI_MANGLE = -150,
+ BPFILTER_IP_PRI_NAT_DST = -100,
+ BPFILTER_IP_PRI_FILTER = 0,
+ BPFILTER_IP_PRI_SECURITY = 50,
+ BPFILTER_IP_PRI_NAT_SRC = 100,
+ BPFILTER_IP_PRI_SELINUX_LAST = 225,
+ BPFILTER_IP_PRI_CONNTRACK_HELPER = 300,
+ BPFILTER_IP_PRI_CONNTRACK_CONFIRM = INT_MAX,
+ BPFILTER_IP_PRI_LAST = INT_MAX,
+};
+
+#define BPFILTER_FUNCTION_MAXNAMELEN 30
+#define BPFILTER_EXTENSION_MAXNAMELEN 29
+#define BPFILTER_TABLE_MAXNAMELEN 32
+
+struct bpfilter_match;
+struct bpfilter_entry_match {
+ union {
+ struct {
+ __u16 match_size;
+ char name[BPFILTER_EXTENSION_MAXNAMELEN];
+ __u8 revision;
+ } user;
+ struct {
+ __u16 match_size;
+ struct bpfilter_match *match;
+ } kernel;
+ __u16 match_size;
+ } u;
+ unsigned char data[0];
+};
+
+struct bpfilter_target;
+struct bpfilter_entry_target {
+ union {
+ struct {
+ __u16 target_size;
+ char name[BPFILTER_EXTENSION_MAXNAMELEN];
+ __u8 revision;
+ } user;
+ struct {
+ __u16 target_size;
+ struct bpfilter_target *target;
+ } kernel;
+ __u16 target_size;
+ } u;
+ unsigned char data[0];
+};
+
+struct bpfilter_standard_target {
+ struct bpfilter_entry_target target;
+ int verdict;
+};
+
+struct bpfilter_error_target {
+ struct bpfilter_entry_target target;
+ char error_name[BPFILTER_FUNCTION_MAXNAMELEN];
+};
+
+#define __ALIGN_KERNEL(x, a) __ALIGN_KERNEL_MASK(x, (typeof(x))(a) - 1)
+#define __ALIGN_KERNEL_MASK(x, mask) (((x) + (mask)) & ~(mask))
+
+#define BPFILTER_ALIGN(__X) \
+ __ALIGN_KERNEL(__X, __alignof__(__u64))
+
+#define BPFILTER_TARGET_INIT(__name, __size) \
+{ \
+ .target.u.user = { \
+ .target_size = BPFILTER_ALIGN(__size), \
+ .name = (__name), \
+ }, \
+}
+#define BPFILTER_STANDARD_TARGET ""
+#define BPFILTER_ERROR_TARGET "ERROR"
+
+struct bpfilter_xt_counters {
+ __u64 packet_cnt;
+ __u64 byte_cnt;
+};
+
+struct bpfilter_ipt_ip {
+ __u32 src;
+ __u32 dst;
+ __u32 src_mask;
+ __u32 dst_mask;
+ char in_iface[IFNAMSIZ];
+ char out_iface[IFNAMSIZ];
+ __u8 in_iface_mask[IFNAMSIZ];
+ __u8 out_iface_mask[IFNAMSIZ];
+ __u16 protocol;
+ __u8 flags;
+ __u8 inv_flags;
+};
+
+struct bpfilter_ipt_entry {
+ struct bpfilter_ipt_ip ip;
+ __u32 bfcache;
+ __u16 target_offset;
+ __u16 next_offset;
+ __u32 camefrom;
+ struct bpfilter_xt_counters cntrs;
+ __u8 elems[0];
+};
+
+struct bpfilter_ipt_get_info {
+ char name[BPFILTER_XT_TABLE_MAXNAMELEN];
+ __u32 valid_hooks;
+ __u32 hook_entry[BPFILTER_INET_HOOK_MAX];
+ __u32 underflow[BPFILTER_INET_HOOK_MAX];
+ __u32 num_entries;
+ __u32 size;
+};
+
+struct bpfilter_ipt_get_entries {
+ char name[BPFILTER_XT_TABLE_MAXNAMELEN];
+ __u32 size;
+ struct bpfilter_ipt_entry entries[0];
+};
+
+struct bpfilter_ipt_replace {
+ char name[BPFILTER_XT_TABLE_MAXNAMELEN];
+ __u32 valid_hooks;
+ __u32 num_entries;
+ __u32 size;
+ __u32 hook_entry[BPFILTER_INET_HOOK_MAX];
+ __u32 underflow[BPFILTER_INET_HOOK_MAX];
+ __u32 num_counters;
+ struct bpfilter_xt_counters *cntrs;
+ struct bpfilter_ipt_entry entries[0];
+};
+
#endif /* _UAPI_LINUX_BPFILTER_H */
diff --git a/net/bpfilter/Makefile b/net/bpfilter/Makefile
index 897eedae523e..bec6181de995 100644
--- a/net/bpfilter/Makefile
+++ b/net/bpfilter/Makefile
@@ -4,7 +4,7 @@
#
hostprogs-y := bpfilter_umh
-bpfilter_umh-objs := main.o
+bpfilter_umh-objs := main.o tgts.o targets.o tables.o init.o ctor.o sockopt.o
HOSTCFLAGS += -I. -Itools/include/
# a bit of elf magic to convert bpfilter_umh binary into a binary blob
diff --git a/net/bpfilter/bpfilter_mod.h b/net/bpfilter/bpfilter_mod.h
new file mode 100644
index 000000000000..f0de41b20793
--- /dev/null
+++ b/net/bpfilter/bpfilter_mod.h
@@ -0,0 +1,96 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_BPFILTER_INTERNAL_H
+#define _LINUX_BPFILTER_INTERNAL_H
+
+#include "include/uapi/linux/bpfilter.h"
+#include <linux/list.h>
+
+struct bpfilter_table {
+ struct hlist_node hash;
+ u32 valid_hooks;
+ struct bpfilter_table_info *info;
+ int hold;
+ u8 family;
+ int priority;
+ const char name[BPFILTER_XT_TABLE_MAXNAMELEN];
+};
+
+struct bpfilter_table_info {
+ unsigned int size;
+ u32 num_entries;
+ unsigned int initial_entries;
+ unsigned int hook_entry[BPFILTER_INET_HOOK_MAX];
+ unsigned int underflow[BPFILTER_INET_HOOK_MAX];
+ unsigned int stacksize;
+ void ***jumpstack;
+ unsigned char entries[0] __aligned(8);
+};
+
+struct bpfilter_table *bpfilter_table_get_by_name(const char *name, int name_len);
+void bpfilter_table_put(struct bpfilter_table *tbl);
+int bpfilter_table_add(struct bpfilter_table *tbl);
+
+struct bpfilter_ipt_standard {
+ struct bpfilter_ipt_entry entry;
+ struct bpfilter_standard_target target;
+};
+
+struct bpfilter_ipt_error {
+ struct bpfilter_ipt_entry entry;
+ struct bpfilter_error_target target;
+};
+
+#define BPFILTER_IPT_ENTRY_INIT(__sz) \
+{ \
+ .target_offset = sizeof(struct bpfilter_ipt_entry), \
+ .next_offset = (__sz), \
+}
+
+#define BPFILTER_IPT_STANDARD_INIT(__verdict) \
+{ \
+ .entry = BPFILTER_IPT_ENTRY_INIT(sizeof(struct bpfilter_ipt_standard)), \
+ .target = BPFILTER_TARGET_INIT(BPFILTER_STANDARD_TARGET, \
+ sizeof(struct bpfilter_standard_target)),\
+ .target.verdict = -(__verdict) - 1, \
+}
+
+#define BPFILTER_IPT_ERROR_INIT \
+{ \
+ .entry = BPFILTER_IPT_ENTRY_INIT(sizeof(struct bpfilter_ipt_error)), \
+ .target = BPFILTER_TARGET_INIT(BPFILTER_ERROR_TARGET, \
+ sizeof(struct bpfilter_error_target)), \
+ .target.error_name = "ERROR", \
+}
+
+struct bpfilter_target {
+ struct list_head all_target_list;
+ const char name[BPFILTER_EXTENSION_MAXNAMELEN];
+ unsigned int size;
+ int hold;
+ u16 family;
+ u8 rev;
+};
+
+struct bpfilter_target *bpfilter_target_get_by_name(const char *name);
+void bpfilter_target_put(struct bpfilter_target *tgt);
+int bpfilter_target_add(struct bpfilter_target *tgt);
+
+struct bpfilter_table_info *bpfilter_ipv4_table_ctor(struct bpfilter_table *tbl);
+int bpfilter_ipv4_register_targets(void);
+void bpfilter_tables_init(void);
+int bpfilter_get_info(void *addr, int len);
+int bpfilter_get_entries(void *cmd, int len);
+int bpfilter_ipv4_init(void);
+
+int copy_from_user(void *dst, void *addr, int len);
+int copy_to_user(void *addr, const void *src, int len);
+#define put_user(x, ptr) \
+({ \
+ __typeof__(*(ptr)) __x = (x); \
+ copy_to_user(ptr, &__x, sizeof(*(ptr))); \
+})
+extern int pid;
+extern int debug_fd;
+#define ENOTSUPP 524
+
+#endif
diff --git a/net/bpfilter/ctor.c b/net/bpfilter/ctor.c
new file mode 100644
index 000000000000..efb7feef3c42
--- /dev/null
+++ b/net/bpfilter/ctor.c
@@ -0,0 +1,80 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <sys/socket.h>
+#include <linux/bitops.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include "bpfilter_mod.h"
+
+unsigned int __sw_hweight32(unsigned int w)
+{
+ w -= (w >> 1) & 0x55555555;
+ w = (w & 0x33333333) + ((w >> 2) & 0x33333333);
+ w = (w + (w >> 4)) & 0x0f0f0f0f;
+ return (w * 0x01010101) >> 24;
+}
+
+struct bpfilter_table_info *bpfilter_ipv4_table_ctor(struct bpfilter_table *tbl)
+{
+ unsigned int num_hooks = hweight32(tbl->valid_hooks);
+ struct bpfilter_ipt_standard *tgts;
+ struct bpfilter_table_info *info;
+ struct bpfilter_ipt_error *term;
+ unsigned int mask, offset, h, i;
+ unsigned int size, alloc_size;
+
+ size = sizeof(struct bpfilter_ipt_standard) * num_hooks;
+ size += sizeof(struct bpfilter_ipt_error);
+
+ alloc_size = size + sizeof(struct bpfilter_table_info);
+
+ info = malloc(alloc_size);
+ if (!info)
+ return NULL;
+
+ info->num_entries = num_hooks + 1;
+ info->size = size;
+
+ tgts = (struct bpfilter_ipt_standard *) (info + 1);
+ term = (struct bpfilter_ipt_error *) (tgts + num_hooks);
+
+ mask = tbl->valid_hooks;
+ offset = 0;
+ h = 0;
+ i = 0;
+ dprintf(debug_fd, "mask %x num_hooks %d\n", mask, num_hooks);
+ while (mask) {
+ struct bpfilter_ipt_standard *t;
+
+ if (!(mask & 1))
+ goto next;
+
+ info->hook_entry[h] = offset;
+ info->underflow[h] = offset;
+ t = &tgts[i++];
+ *t = (struct bpfilter_ipt_standard)
+ BPFILTER_IPT_STANDARD_INIT(BPFILTER_NF_ACCEPT);
+ t->target.target.u.kernel.target =
+ bpfilter_target_get_by_name(t->target.target.u.user.name);
+ dprintf(debug_fd, "user.name %s\n", t->target.target.u.user.name);
+ if (!t->target.target.u.kernel.target)
+ goto out_fail;
+
+ offset += sizeof(struct bpfilter_ipt_standard);
+ next:
+ mask >>= 1;
+ h++;
+ }
+ *term = (struct bpfilter_ipt_error) BPFILTER_IPT_ERROR_INIT;
+ term->target.target.u.kernel.target =
+ bpfilter_target_get_by_name(term->target.target.u.user.name);
+ dprintf(debug_fd, "user.name %s\n", term->target.target.u.user.name);
+ if (!term->target.target.u.kernel.target)
+ goto out_fail;
+
+ dprintf(debug_fd, "info %p\n", info);
+ return info;
+
+out_fail:
+ free(info);
+ return NULL;
+}
diff --git a/net/bpfilter/init.c b/net/bpfilter/init.c
new file mode 100644
index 000000000000..699f3f623189
--- /dev/null
+++ b/net/bpfilter/init.c
@@ -0,0 +1,33 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <sys/socket.h>
+#include <errno.h>
+#include "bpfilter_mod.h"
+
+static struct bpfilter_table filter_table_ipv4 = {
+ .name = "filter",
+ .valid_hooks = ((1<<BPFILTER_INET_HOOK_LOCAL_IN) |
+ (1<<BPFILTER_INET_HOOK_FORWARD) |
+ (1<<BPFILTER_INET_HOOK_LOCAL_OUT)),
+ .family = BPFILTER_PROTO_IPV4,
+ .priority = BPFILTER_IP_PRI_FILTER,
+};
+
+int bpfilter_ipv4_init(void)
+{
+ struct bpfilter_table *t = &filter_table_ipv4;
+ struct bpfilter_table_info *info;
+ int err;
+
+ err = bpfilter_ipv4_register_targets();
+ if (err)
+ return err;
+
+ info = bpfilter_ipv4_table_ctor(t);
+ if (!info)
+ return -ENOMEM;
+
+ t->info = info;
+
+ return bpfilter_table_add(&filter_table_ipv4);
+}
+
diff --git a/net/bpfilter/main.c b/net/bpfilter/main.c
index 81bbc1684896..e0273ca201ad 100644
--- a/net/bpfilter/main.c
+++ b/net/bpfilter/main.c
@@ -8,13 +8,52 @@
#include <unistd.h>
#include "include/uapi/linux/bpf.h"
#include <asm/unistd.h>
+#include "bpfilter_mod.h"
#include "msgfmt.h"
+extern long int syscall (long int __sysno, ...);
+
+static inline int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr,
+ unsigned int size)
+{
+ return syscall(321, cmd, attr, size);
+}
+
+int pid;
int debug_fd;
+int copy_from_user(void *dst, void *addr, int len)
+{
+ struct iovec local;
+ struct iovec remote;
+
+ local.iov_base = dst;
+ local.iov_len = len;
+ remote.iov_base = addr;
+ remote.iov_len = len;
+ return process_vm_readv(pid, &local, 1, &remote, 1, 0) != len;
+}
+
+int copy_to_user(void *addr, const void *src, int len)
+{
+ struct iovec local;
+ struct iovec remote;
+
+ local.iov_base = (void *) src;
+ local.iov_len = len;
+ remote.iov_base = addr;
+ remote.iov_len = len;
+ return process_vm_writev(pid, &local, 1, &remote, 1, 0) != len;
+}
+
static int handle_get_cmd(struct mbox_request *cmd)
{
+ pid = cmd->pid;
switch (cmd->cmd) {
+ case BPFILTER_IPT_SO_GET_INFO:
+ return bpfilter_get_info((void *)(long)cmd->addr, cmd->len);
+ case BPFILTER_IPT_SO_GET_ENTRIES:
+ return bpfilter_get_entries((void *)(long)cmd->addr, cmd->len);
case 0:
return 0;
default:
@@ -25,11 +64,23 @@ static int handle_get_cmd(struct mbox_request *cmd)
static int handle_set_cmd(struct mbox_request *cmd)
{
+ pid = cmd->pid;
+ switch (cmd->cmd) {
+ case BPFILTER_IPT_SO_SET_REPLACE:
+ return bpfilter_set_replace((void *)(long)cmd->addr, cmd->len);
+ case BPFILTER_IPT_SO_SET_ADD_COUNTERS:
+ return bpfilter_set_add_counters((void *)(long)cmd->addr, cmd->len);
+ default:
+ break;
+ }
return -ENOPROTOOPT;
}
static void loop(void)
{
+ bpfilter_tables_init();
+ bpfilter_ipv4_init();
+
while (1) {
struct mbox_request req;
struct mbox_reply reply;
diff --git a/net/bpfilter/sockopt.c b/net/bpfilter/sockopt.c
new file mode 100644
index 000000000000..43687daf51a3
--- /dev/null
+++ b/net/bpfilter/sockopt.c
@@ -0,0 +1,153 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <sys/socket.h>
+#include <errno.h>
+#include <string.h>
+#include <stdio.h>
+#include "bpfilter_mod.h"
+
+static int fetch_name(void *addr, int len, char *name, int name_len)
+{
+ if (copy_from_user(name, addr, name_len))
+ return -EFAULT;
+
+ name[BPFILTER_XT_TABLE_MAXNAMELEN-1] = '\0';
+ return 0;
+}
+
+int bpfilter_get_info(void *addr, int len)
+{
+ char name[BPFILTER_XT_TABLE_MAXNAMELEN];
+ struct bpfilter_ipt_get_info resp;
+ struct bpfilter_table_info *info;
+ struct bpfilter_table *tbl;
+ int err;
+
+ if (len != sizeof(struct bpfilter_ipt_get_info))
+ return -EINVAL;
+
+ err = fetch_name(addr, len, name, sizeof(name));
+ if (err)
+ return err;
+
+ tbl = bpfilter_table_get_by_name(name, strlen(name));
+ if (!tbl)
+ return -ENOENT;
+
+ info = tbl->info;
+ if (!info) {
+ err = -ENOENT;
+ goto out_put;
+ }
+
+ memset(&resp, 0, sizeof(resp));
+ memcpy(resp.name, name, sizeof(resp.name));
+ resp.valid_hooks = tbl->valid_hooks;
+ memcpy(&resp.hook_entry, info->hook_entry, sizeof(resp.hook_entry));
+ memcpy(&resp.underflow, info->underflow, sizeof(resp.underflow));
+ resp.num_entries = info->num_entries;
+ resp.size = info->size;
+
+ err = 0;
+ if (copy_to_user(addr, &resp, len))
+ err = -EFAULT;
+out_put:
+ bpfilter_table_put(tbl);
+ return err;
+}
+
+static int copy_target(struct bpfilter_standard_target *ut,
+ struct bpfilter_standard_target *kt)
+{
+ struct bpfilter_target *tgt;
+ int sz;
+
+
+ if (put_user(kt->target.u.target_size,
+ &ut->target.u.target_size))
+ return -EFAULT;
+
+ tgt = kt->target.u.kernel.target;
+ if (copy_to_user(ut->target.u.user.name, tgt->name, strlen(tgt->name)))
+ return -EFAULT;
+
+ if (put_user(tgt->rev, &ut->target.u.user.revision))
+ return -EFAULT;
+
+ sz = tgt->size;
+ if (copy_to_user(ut->target.data, kt->target.data, sz))
+ return -EFAULT;
+
+ return 0;
+}
+
+static int do_get_entries(void *up,
+ struct bpfilter_table *tbl,
+ struct bpfilter_table_info *info)
+{
+ unsigned int total_size = info->size;
+ const struct bpfilter_ipt_entry *ent;
+ unsigned int off;
+ void *base;
+
+ base = info->entries;
+
+ for (off = 0; off < total_size; off += ent->next_offset) {
+ struct bpfilter_xt_counters *cntrs;
+ struct bpfilter_standard_target *tgt;
+
+ ent = base + off;
+ if (copy_to_user(up + off, ent, sizeof(*ent)))
+ return -EFAULT;
+
+ /* XXX Just clear counters for now. XXX */
+ cntrs = up + off + offsetof(struct bpfilter_ipt_entry, cntrs);
+ if (put_user(0, &cntrs->packet_cnt) ||
+ put_user(0, &cntrs->byte_cnt))
+ return -EFAULT;
+
+ tgt = (void *) ent + ent->target_offset;
+ dprintf(debug_fd, "target.verdict %d\n", tgt->verdict);
+ if (copy_target(up + off + ent->target_offset, tgt))
+ return -EFAULT;
+ }
+ return 0;
+}
+
+int bpfilter_get_entries(void *cmd, int len)
+{
+ struct bpfilter_ipt_get_entries *uptr = cmd;
+ struct bpfilter_ipt_get_entries req;
+ struct bpfilter_table_info *info;
+ struct bpfilter_table *tbl;
+ int err;
+
+ if (len < sizeof(struct bpfilter_ipt_get_entries))
+ return -EINVAL;
+
+ if (copy_from_user(&req, cmd, sizeof(req)))
+ return -EFAULT;
+
+ tbl = bpfilter_table_get_by_name(req.name, strlen(req.name));
+ if (!tbl)
+ return -ENOENT;
+
+ info = tbl->info;
+ if (!info) {
+ err = -ENOENT;
+ goto out_put;
+ }
+
+ if (info->size != req.size) {
+ err = -EINVAL;
+ goto out_put;
+ }
+
+ err = do_get_entries(uptr->entries, tbl, info);
+ dprintf(debug_fd, "do_get_entries %d req.size %d\n", err, req.size);
+
+out_put:
+ bpfilter_table_put(tbl);
+
+ return err;
+}
+
diff --git a/net/bpfilter/tables.c b/net/bpfilter/tables.c
new file mode 100644
index 000000000000..9a96599be634
--- /dev/null
+++ b/net/bpfilter/tables.c
@@ -0,0 +1,70 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <sys/socket.h>
+#include <errno.h>
+#include <string.h>
+#include <linux/hashtable.h>
+#include "bpfilter_mod.h"
+
+static unsigned int full_name_hash(const void *salt, const char *name, unsigned int len)
+{
+ unsigned int hash = 0;
+ int i;
+
+ for (i = 0; i < len; i++)
+ hash ^= *(name + i);
+ return hash;
+}
+
+DEFINE_HASHTABLE(bpfilter_tables, 4);
+//DEFINE_MUTEX(bpfilter_table_mutex);
+
+struct bpfilter_table *bpfilter_table_get_by_name(const char *name, int name_len)
+{
+ unsigned int hval = full_name_hash(NULL, name, name_len);
+ struct bpfilter_table *tbl;
+
+// mutex_lock(&bpfilter_table_mutex);
+ hash_for_each_possible(bpfilter_tables, tbl, hash, hval) {
+ if (!strcmp(name, tbl->name)) {
+ tbl->hold++;
+ goto out;
+ }
+ }
+ tbl = NULL;
+out:
+// mutex_unlock(&bpfilter_table_mutex);
+ return tbl;
+}
+
+void bpfilter_table_put(struct bpfilter_table *tbl)
+{
+// mutex_lock(&bpfilter_table_mutex);
+ tbl->hold--;
+// mutex_unlock(&bpfilter_table_mutex);
+}
+
+int bpfilter_table_add(struct bpfilter_table *tbl)
+{
+ unsigned int hval = full_name_hash(NULL, tbl->name, strlen(tbl->name));
+ struct bpfilter_table *srch;
+
+// mutex_lock(&bpfilter_table_mutex);
+ hash_for_each_possible(bpfilter_tables, srch, hash, hval) {
+ if (!strcmp(srch->name, tbl->name))
+ goto exists;
+ }
+ hash_add(bpfilter_tables, &tbl->hash, hval);
+// mutex_unlock(&bpfilter_table_mutex);
+
+ return 0;
+
+exists:
+// mutex_unlock(&bpfilter_table_mutex);
+ return -EEXIST;
+}
+
+void bpfilter_tables_init(void)
+{
+ hash_init(bpfilter_tables);
+}
+
diff --git a/net/bpfilter/targets.c b/net/bpfilter/targets.c
new file mode 100644
index 000000000000..4086ac82eaf5
--- /dev/null
+++ b/net/bpfilter/targets.c
@@ -0,0 +1,51 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <sys/socket.h>
+#include <errno.h>
+#include <string.h>
+#include "bpfilter_mod.h"
+
+//DEFINE_MUTEX(bpfilter_target_mutex);
+static LIST_HEAD(bpfilter_targets);
+
+struct bpfilter_target *bpfilter_target_get_by_name(const char *name)
+{
+ struct bpfilter_target *tgt;
+
+// mutex_lock(&bpfilter_target_mutex);
+ list_for_each_entry(tgt, &bpfilter_targets, all_target_list) {
+ if (!strcmp(tgt->name, name)) {
+ tgt->hold++;
+ goto out;
+ }
+ }
+ tgt = NULL;
+out:
+// mutex_unlock(&bpfilter_target_mutex);
+ return tgt;
+}
+
+void bpfilter_target_put(struct bpfilter_target *tgt)
+{
+// mutex_lock(&bpfilter_target_mutex);
+ tgt->hold--;
+// mutex_unlock(&bpfilter_target_mutex);
+}
+
+int bpfilter_target_add(struct bpfilter_target *tgt)
+{
+ struct bpfilter_target *srch;
+
+// mutex_lock(&bpfilter_target_mutex);
+ list_for_each_entry(srch, &bpfilter_targets, all_target_list) {
+ if (!strcmp(srch->name, tgt->name))
+ goto exists;
+ }
+ list_add_tail(&tgt->all_target_list, &bpfilter_targets);
+// mutex_unlock(&bpfilter_target_mutex);
+ return 0;
+
+exists:
+// mutex_unlock(&bpfilter_target_mutex);
+ return -EEXIST;
+}
+
diff --git a/net/bpfilter/tgts.c b/net/bpfilter/tgts.c
new file mode 100644
index 000000000000..eac5e8ac0b4b
--- /dev/null
+++ b/net/bpfilter/tgts.c
@@ -0,0 +1,25 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <sys/socket.h>
+#include "bpfilter_mod.h"
+
+struct bpfilter_target std_tgt = {
+ .name = BPFILTER_STANDARD_TARGET,
+ .family = BPFILTER_PROTO_IPV4,
+ .size = sizeof(int),
+};
+
+struct bpfilter_target err_tgt = {
+ .name = BPFILTER_ERROR_TARGET,
+ .family = BPFILTER_PROTO_IPV4,
+ .size = BPFILTER_FUNCTION_MAXNAMELEN,
+};
+
+int bpfilter_ipv4_register_targets(void)
+{
+ int err = bpfilter_target_add(&std_tgt);
+
+ if (err)
+ return err;
+ return bpfilter_target_add(&err_tgt);
+}
+
--
2.9.5
* [PATCH v2 net-next 2/4] net: add skeleton of bpfilter kernel module
From: Alexei Starovoitov @ 2018-05-03 4:36 UTC (permalink / raw)
To: davem; +Cc: daniel, torvalds, gregkh, luto, netdev, linux-kernel, kernel-team
In-Reply-To: <20180503043604.1604587-1-ast@kernel.org>
bpfilter.ko consists of bpfilter_kern.c (normal kernel module code)
and user mode helper code that is embedded into bpfilter.ko.
The steps to build bpfilter.ko are the following:
- main.c is compiled by HOSTCC into the bpfilter_umh elf executable file
- with quite a bit of objcopy and Makefile magic the bpfilter_umh elf file
is converted into bpfilter_umh.o object file
with _binary_net_bpfilter_bpfilter_umh_start and _end symbols
Example:
$ nm ./bld_x64/net/bpfilter/bpfilter_umh.o
0000000000004cf8 T _binary_net_bpfilter_bpfilter_umh_end
0000000000004cf8 A _binary_net_bpfilter_bpfilter_umh_size
0000000000000000 T _binary_net_bpfilter_bpfilter_umh_start
- bpfilter_umh.o and bpfilter_kern.o are linked together into bpfilter.ko
bpfilter_kern.c is normal kernel module code that calls
the fork_usermode_blob() helper to execute part of its own data
as a user mode process.
Notice that the _binary_net_bpfilter_bpfilter_umh_start - end range
is placed into the .init.rodata section, so it's freed as soon as the __init
function of bpfilter.ko is finished.
As part of __init, bpfilter.ko does a first request/reply action
via the two unix pipes provided by the fork_usermode_blob() helper to
make sure that the umh is healthy. If it is not, bpfilter.ko will kill it via pid.
Later bpfilter_process_sockopt() will be called from bpfilter hooks
in get/setsockopt() to pass iptables commands into the umh via bpfilter.ko.
If the admin does 'rmmod bpfilter', the __exit code of bpfilter.ko will
kill the umh as well.
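The request/reply health check can be sketched in user space. This is only an illustration, not code from the patch: the two structures copy the layout of msgfmt.h using <stdint.h> types, serve_one() mimics one iteration of loop() in main.c, and mbox_roundtrip() is a hypothetical stand-in for the module side that uses pipe(2) where the module uses the pipes handed back by fork_usermode_blob().

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Same field order as msgfmt.h, spelled with <stdint.h> types. */
struct mbox_request {
	uint64_t addr;
	uint32_t len;
	uint32_t is_set;
	uint32_t cmd;
	uint32_t pid;
};

struct mbox_reply {
	uint32_t status;
};

/* Helper side: serve exactly one request, like one pass of loop(). */
static int serve_one(int in_fd, int out_fd)
{
	struct mbox_request req;
	struct mbox_reply reply;

	if (read(in_fd, &req, sizeof(req)) != (ssize_t)sizeof(req))
		return -1;
	/* A "get" with cmd 0 is the health probe; the skeleton rejects the rest. */
	reply.status = (!req.is_set && req.cmd == 0) ?
		0 : (uint32_t)-ENOPROTOOPT;
	if (write(out_fd, &reply, sizeof(reply)) != (ssize_t)sizeof(reply))
		return -1;
	return 0;
}

/*
 * Hypothetical module side: one request/reply exchange inside a single
 * process. The request is far smaller than the pipe buffer, so the write
 * cannot block and no fork is needed for the illustration.
 */
int mbox_roundtrip(uint32_t cmd, uint32_t is_set)
{
	struct mbox_request req;
	struct mbox_reply reply;
	int to_umh[2], from_umh[2];
	int ret = -EIO;

	/* Both sides must agree on the wire layout. */
	assert(sizeof(struct mbox_request) == 24);

	if (pipe(to_umh))
		return -errno;
	if (pipe(from_umh)) {
		ret = -errno;
		goto close_to;
	}

	memset(&req, 0, sizeof(req));
	req.cmd = cmd;
	req.is_set = is_set;
	req.pid = (uint32_t)getpid();

	if (write(to_umh[1], &req, sizeof(req)) == (ssize_t)sizeof(req) &&
	    serve_one(to_umh[0], from_umh[1]) == 0 &&
	    read(from_umh[0], &reply, sizeof(reply)) == (ssize_t)sizeof(reply))
		ret = (int)reply.status;

	close(from_umh[0]);
	close(from_umh[1]);
close_to:
	close(to_umh[0]);
	close(to_umh[1]);
	return ret;
}
```

Calling mbox_roundtrip(0, 0) corresponds to the probe that load_umh() issues; any "set" request, or a "get" with a nonzero cmd, comes back as -ENOPROTOOPT in this skeleton.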
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
include/linux/bpfilter.h | 15 +++++++
include/uapi/linux/bpfilter.h | 21 ++++++++++
net/Kconfig | 2 +
net/Makefile | 1 +
net/bpfilter/Kconfig | 17 ++++++++
net/bpfilter/Makefile | 24 +++++++++++
net/bpfilter/bpfilter_kern.c | 93 +++++++++++++++++++++++++++++++++++++++++++
net/bpfilter/main.c | 63 +++++++++++++++++++++++++++++
net/bpfilter/msgfmt.h | 17 ++++++++
net/ipv4/Makefile | 2 +
net/ipv4/bpfilter/Makefile | 2 +
net/ipv4/bpfilter/sockopt.c | 42 +++++++++++++++++++
net/ipv4/ip_sockglue.c | 17 ++++++++
13 files changed, 316 insertions(+)
create mode 100644 include/linux/bpfilter.h
create mode 100644 include/uapi/linux/bpfilter.h
create mode 100644 net/bpfilter/Kconfig
create mode 100644 net/bpfilter/Makefile
create mode 100644 net/bpfilter/bpfilter_kern.c
create mode 100644 net/bpfilter/main.c
create mode 100644 net/bpfilter/msgfmt.h
create mode 100644 net/ipv4/bpfilter/Makefile
create mode 100644 net/ipv4/bpfilter/sockopt.c
diff --git a/include/linux/bpfilter.h b/include/linux/bpfilter.h
new file mode 100644
index 000000000000..687b1760bb9f
--- /dev/null
+++ b/include/linux/bpfilter.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_BPFILTER_H
+#define _LINUX_BPFILTER_H
+
+#include <uapi/linux/bpfilter.h>
+
+struct sock;
+int bpfilter_ip_set_sockopt(struct sock *sk, int optname, char *optval,
+ unsigned int optlen);
+int bpfilter_ip_get_sockopt(struct sock *sk, int optname, char *optval,
+ int *optlen);
+extern int (*bpfilter_process_sockopt)(struct sock *sk, int optname,
+ char __user *optval,
+ unsigned int optlen, bool is_set);
+#endif
diff --git a/include/uapi/linux/bpfilter.h b/include/uapi/linux/bpfilter.h
new file mode 100644
index 000000000000..2ec3cc99ea4c
--- /dev/null
+++ b/include/uapi/linux/bpfilter.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _UAPI_LINUX_BPFILTER_H
+#define _UAPI_LINUX_BPFILTER_H
+
+#include <linux/if.h>
+
+enum {
+ BPFILTER_IPT_SO_SET_REPLACE = 64,
+ BPFILTER_IPT_SO_SET_ADD_COUNTERS = 65,
+ BPFILTER_IPT_SET_MAX,
+};
+
+enum {
+ BPFILTER_IPT_SO_GET_INFO = 64,
+ BPFILTER_IPT_SO_GET_ENTRIES = 65,
+ BPFILTER_IPT_SO_GET_REVISION_MATCH = 66,
+ BPFILTER_IPT_SO_GET_REVISION_TARGET = 67,
+ BPFILTER_IPT_GET_MAX,
+};
+
+#endif /* _UAPI_LINUX_BPFILTER_H */
diff --git a/net/Kconfig b/net/Kconfig
index b62089fb1332..ed6368b306fa 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -201,6 +201,8 @@ source "net/bridge/netfilter/Kconfig"
endif
+source "net/bpfilter/Kconfig"
+
source "net/dccp/Kconfig"
source "net/sctp/Kconfig"
source "net/rds/Kconfig"
diff --git a/net/Makefile b/net/Makefile
index a6147c61b174..7f982b7682bd 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -20,6 +20,7 @@ obj-$(CONFIG_TLS) += tls/
obj-$(CONFIG_XFRM) += xfrm/
obj-$(CONFIG_UNIX) += unix/
obj-$(CONFIG_NET) += ipv6/
+obj-$(CONFIG_BPFILTER) += bpfilter/
obj-$(CONFIG_PACKET) += packet/
obj-$(CONFIG_NET_KEY) += key/
obj-$(CONFIG_BRIDGE) += bridge/
diff --git a/net/bpfilter/Kconfig b/net/bpfilter/Kconfig
new file mode 100644
index 000000000000..782a732b9a5c
--- /dev/null
+++ b/net/bpfilter/Kconfig
@@ -0,0 +1,17 @@
+menuconfig BPFILTER
+ bool "BPF based packet filtering framework (BPFILTER)"
+ default n
+ depends on NET && BPF
+ help
+ This builds experimental bpfilter framework that is aiming to
+ provide netfilter compatible functionality via BPF
+
+if BPFILTER
+config BPFILTER_UMH
+ tristate "bpfilter kernel module with user mode helper"
+ default m
+ depends on m
+ help
+ This builds bpfilter kernel module with embedded user mode helper
+endif
+
diff --git a/net/bpfilter/Makefile b/net/bpfilter/Makefile
new file mode 100644
index 000000000000..897eedae523e
--- /dev/null
+++ b/net/bpfilter/Makefile
@@ -0,0 +1,24 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+# Makefile for the Linux BPFILTER layer.
+#
+
+hostprogs-y := bpfilter_umh
+bpfilter_umh-objs := main.o
+HOSTCFLAGS += -I. -Itools/include/
+
+# a bit of elf magic to convert bpfilter_umh binary into a binary blob
+# inside bpfilter_umh.o elf file referenced by
+# _binary_net_bpfilter_bpfilter_umh_start symbol
+# which bpfilter_kern.c passes further into umh blob loader at run-time
+quiet_cmd_copy_umh = GEN $@
+ cmd_copy_umh = echo ':' > $(obj)/.bpfilter_umh.o.cmd; \
+ $(OBJCOPY) -I binary -O $(CONFIG_OUTPUT_FORMAT) \
+ -B `$(OBJDUMP) -f $<|grep architecture|cut -d, -f1|cut -d' ' -f2` \
+ --rename-section .data=.init.rodata $< $@
+
+$(obj)/bpfilter_umh.o: $(obj)/bpfilter_umh
+ $(call cmd,copy_umh)
+
+obj-$(CONFIG_BPFILTER_UMH) += bpfilter.o
+bpfilter-objs += bpfilter_kern.o bpfilter_umh.o
diff --git a/net/bpfilter/bpfilter_kern.c b/net/bpfilter/bpfilter_kern.c
new file mode 100644
index 000000000000..e0a6fdd5842b
--- /dev/null
+++ b/net/bpfilter/bpfilter_kern.c
@@ -0,0 +1,93 @@
+// SPDX-License-Identifier: GPL-2.0
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/umh.h>
+#include <linux/bpfilter.h>
+#include <linux/sched.h>
+#include <linux/sched/signal.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include "msgfmt.h"
+
+#define UMH_start _binary_net_bpfilter_bpfilter_umh_start
+#define UMH_end _binary_net_bpfilter_bpfilter_umh_end
+
+extern char UMH_start;
+extern char UMH_end;
+
+static struct umh_info info;
+
+static void shutdown_umh(struct umh_info *info)
+{
+ struct task_struct *tsk;
+
+ tsk = pid_task(find_vpid(info->pid), PIDTYPE_PID);
+ if (tsk)
+ force_sig(SIGKILL, tsk);
+ fput(info->pipe_to_umh);
+ fput(info->pipe_from_umh);
+}
+
+static void stop_umh(void)
+{
+ if (bpfilter_process_sockopt) {
+ bpfilter_process_sockopt = NULL;
+ shutdown_umh(&info);
+ }
+}
+
+static int __bpfilter_process_sockopt(struct sock *sk, int optname,
+ char __user *optval,
+ unsigned int optlen, bool is_set)
+{
+ struct mbox_request req;
+ struct mbox_reply reply;
+ loff_t pos = 0;
+ ssize_t n;
+
+ req.is_set = is_set;
+ req.pid = current->pid;
+ req.cmd = optname;
+ req.addr = (long)optval;
+ req.len = optlen;
+ n = __kernel_write(info.pipe_to_umh, &req, sizeof(req), &pos);
+ if (n != sizeof(req)) {
+ pr_err("write fail %zd\n", n);
+ stop_umh();
+ return -EFAULT;
+ }
+ pos = 0;
+ n = kernel_read(info.pipe_from_umh, &reply, sizeof(reply), &pos);
+ if (n != sizeof(reply)) {
+ pr_err("read fail %zd\n", n);
+ stop_umh();
+ return -EFAULT;
+ }
+ return reply.status;
+}
+
+static int __init load_umh(void)
+{
+ int err;
+
+ err = fork_usermode_blob(&UMH_start, &UMH_end - &UMH_start, &info);
+ if (err)
+ return err;
+ pr_info("Loaded umh pid %d\n", info.pid);
+ bpfilter_process_sockopt = &__bpfilter_process_sockopt;
+
+ if (__bpfilter_process_sockopt(NULL, 0, 0, 0, 0) != 0) {
+ stop_umh();
+ return -EFAULT;
+ }
+ return 0;
+}
+
+static void __exit fini_umh(void)
+{
+ stop_umh();
+}
+module_init(load_umh);
+module_exit(fini_umh);
+MODULE_LICENSE("GPL");
diff --git a/net/bpfilter/main.c b/net/bpfilter/main.c
new file mode 100644
index 000000000000..81bbc1684896
--- /dev/null
+++ b/net/bpfilter/main.c
@@ -0,0 +1,63 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <sys/uio.h>
+#include <errno.h>
+#include <stdio.h>
+#include <sys/socket.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include "include/uapi/linux/bpf.h"
+#include <asm/unistd.h>
+#include "msgfmt.h"
+
+int debug_fd;
+
+static int handle_get_cmd(struct mbox_request *cmd)
+{
+ switch (cmd->cmd) {
+ case 0:
+ return 0;
+ default:
+ break;
+ }
+ return -ENOPROTOOPT;
+}
+
+static int handle_set_cmd(struct mbox_request *cmd)
+{
+ return -ENOPROTOOPT;
+}
+
+static void loop(void)
+{
+ while (1) {
+ struct mbox_request req;
+ struct mbox_reply reply;
+ int n;
+
+ n = read(0, &req, sizeof(req));
+ if (n != sizeof(req)) {
+ dprintf(debug_fd, "invalid request %d\n", n);
+ return;
+ }
+
+ reply.status = req.is_set ?
+ handle_set_cmd(&req) :
+ handle_get_cmd(&req);
+
+ n = write(1, &reply, sizeof(reply));
+ if (n != sizeof(reply)) {
+ dprintf(debug_fd, "reply failed %d\n", n);
+ return;
+ }
+ }
+}
+
+int main(void)
+{
+ debug_fd = open("/dev/console", 00000002 | 00000100);
+ dprintf(debug_fd, "Started bpfilter\n");
+ loop();
+ close(debug_fd);
+ return 0;
+}
diff --git a/net/bpfilter/msgfmt.h b/net/bpfilter/msgfmt.h
new file mode 100644
index 000000000000..94b9ac9e5114
--- /dev/null
+++ b/net/bpfilter/msgfmt.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _NET_BPFILTER_MSGFMT_H
+#define _NET_BPFILTER_MSGFMT_H
+
+struct mbox_request {
+ __u64 addr;
+ __u32 len;
+ __u32 is_set;
+ __u32 cmd;
+ __u32 pid;
+};
+
+struct mbox_reply {
+ __u32 status;
+};
+
+#endif
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index b379520f9133..7018f91c5a39 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -16,6 +16,8 @@ obj-y := route.o inetpeer.o protocol.o \
inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o \
metrics.o
+obj-$(CONFIG_BPFILTER) += bpfilter/
+
obj-$(CONFIG_NET_IP_TUNNEL) += ip_tunnel.o
obj-$(CONFIG_SYSCTL) += sysctl_net_ipv4.o
obj-$(CONFIG_PROC_FS) += proc.o
diff --git a/net/ipv4/bpfilter/Makefile b/net/ipv4/bpfilter/Makefile
new file mode 100644
index 000000000000..ce262d76cc48
--- /dev/null
+++ b/net/ipv4/bpfilter/Makefile
@@ -0,0 +1,2 @@
+obj-$(CONFIG_BPFILTER) += sockopt.o
+
diff --git a/net/ipv4/bpfilter/sockopt.c b/net/ipv4/bpfilter/sockopt.c
new file mode 100644
index 000000000000..42a96d2d8d05
--- /dev/null
+++ b/net/ipv4/bpfilter/sockopt.c
@@ -0,0 +1,42 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/uaccess.h>
+#include <linux/bpfilter.h>
+#include <uapi/linux/bpf.h>
+#include <linux/wait.h>
+#include <linux/kmod.h>
+
+int (*bpfilter_process_sockopt)(struct sock *sk, int optname,
+ char __user *optval,
+ unsigned int optlen, bool is_set);
+EXPORT_SYMBOL_GPL(bpfilter_process_sockopt);
+
+int bpfilter_mbox_request(struct sock *sk, int optname, char __user *optval,
+ unsigned int optlen, bool is_set)
+{
+ if (!bpfilter_process_sockopt) {
+ int err = request_module("bpfilter");
+
+ if (err)
+ return err;
+ if (!bpfilter_process_sockopt)
+ return -ECHILD;
+ }
+ return bpfilter_process_sockopt(sk, optname, optval, optlen, is_set);
+}
+
+int bpfilter_ip_set_sockopt(struct sock *sk, int optname, char __user *optval,
+ unsigned int optlen)
+{
+ return bpfilter_mbox_request(sk, optname, optval, optlen, true);
+}
+
+int bpfilter_ip_get_sockopt(struct sock *sk, int optname, char __user *optval,
+ int __user *optlen)
+{
+ int len;
+
+ if (get_user(len, optlen))
+ return -EFAULT;
+
+ return bpfilter_mbox_request(sk, optname, optval, len, false);
+}
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 5ad2d8ed3a3f..e0791faacb24 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -47,6 +47,8 @@
#include <linux/errqueue.h>
#include <linux/uaccess.h>
+#include <linux/bpfilter.h>
+
/*
* SOL_IP control messages.
*/
@@ -1244,6 +1246,11 @@ int ip_setsockopt(struct sock *sk, int level,
return -ENOPROTOOPT;
err = do_ip_setsockopt(sk, level, optname, optval, optlen);
+#ifdef CONFIG_BPFILTER
+ if (optname >= BPFILTER_IPT_SO_SET_REPLACE &&
+ optname < BPFILTER_IPT_SET_MAX)
+ err = bpfilter_ip_set_sockopt(sk, optname, optval, optlen);
+#endif
#ifdef CONFIG_NETFILTER
/* we need to exclude all possible ENOPROTOOPTs except default case */
if (err == -ENOPROTOOPT && optname != IP_HDRINCL &&
@@ -1552,6 +1559,11 @@ int ip_getsockopt(struct sock *sk, int level,
int err;
err = do_ip_getsockopt(sk, level, optname, optval, optlen, 0);
+#ifdef CONFIG_BPFILTER
+ if (optname >= BPFILTER_IPT_SO_GET_INFO &&
+ optname < BPFILTER_IPT_GET_MAX)
+ err = bpfilter_ip_get_sockopt(sk, optname, optval, optlen);
+#endif
#ifdef CONFIG_NETFILTER
/* we need to exclude all possible ENOPROTOOPTs except default case */
if (err == -ENOPROTOOPT && optname != IP_PKTOPTIONS &&
@@ -1584,6 +1596,11 @@ int compat_ip_getsockopt(struct sock *sk, int level, int optname,
err = do_ip_getsockopt(sk, level, optname, optval, optlen,
MSG_CMSG_COMPAT);
+#ifdef CONFIG_BPFILTER
+ if (optname >= BPFILTER_IPT_SO_GET_INFO &&
+ optname < BPFILTER_IPT_GET_MAX)
+ err = bpfilter_ip_get_sockopt(sk, optname, optval, optlen);
+#endif
#ifdef CONFIG_NETFILTER
/* we need to exclude all possible ENOPROTOOPTs except default case */
if (err == -ENOPROTOOPT && optname != IP_PKTOPTIONS &&
--
2.9.5
* [PATCH v2 net-next 1/4] umh: introduce fork_usermode_blob() helper
From: Alexei Starovoitov @ 2018-05-03 4:36 UTC (permalink / raw)
To: davem; +Cc: daniel, torvalds, gregkh, luto, netdev, linux-kernel, kernel-team
In-Reply-To: <20180503043604.1604587-1-ast@kernel.org>
Introduce helper:
int fork_usermode_blob(void *data, size_t len, struct umh_info *info);
struct umh_info {
struct file *pipe_to_umh;
struct file *pipe_from_umh;
pid_t pid;
};
that GPLed kernel modules (signed or unsigned) can use to execute part
of their own data as a swappable user mode process.
The kernel will do:
- mount "tmpfs"
- allocate a unique file in tmpfs
- populate that file with the len bytes starting at data
- user-mode-helper code will do_execve that file and, before the process
starts, the kernel will create two unix pipes for bidirectional
communication between kernel module and umh
- close tmpfs file, effectively deleting it
- the fork_usermode_blob will return zero on success and populate
'struct umh_info' with two unix pipes and the pid of the user process
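A rough user-space analogue of the pipe and pid part of these steps, as a sketch only. It deliberately skips the tmpfs file and the exec of the blob: the child simply runs a callback. struct helper_info, spawn_helper(), echo_helper() and reap_helper() are hypothetical names that do not appear in the patch, and dup2() stands in for the kernel's replace_fd().

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* User-space stand-in for struct umh_info: fds instead of struct file *. */
struct helper_info {
	int to_helper;		/* parent writes requests here */
	int from_helper;	/* parent reads replies here */
	pid_t pid;
};

static void close_if_not_std(int fd)
{
	if (fd > STDOUT_FILENO)
		close(fd);
}

/* Start helper_main() in a child whose fd 0 and fd 1 are pipes to us. */
int spawn_helper(void (*helper_main)(void), struct helper_info *info)
{
	int to[2], from[2];
	pid_t pid;

	assert(helper_main && info);
	if (pipe(to))
		return -1;
	if (pipe(from)) {
		close(to[0]);
		close(to[1]);
		return -1;
	}
	pid = fork();
	if (pid < 0) {
		close(to[0]);
		close(to[1]);
		close(from[0]);
		close(from[1]);
		return -1;
	}
	if (pid == 0) {
		/* Child: install the pipe ends as stdin and stdout. */
		if (dup2(to[0], STDIN_FILENO) < 0 ||
		    dup2(from[1], STDOUT_FILENO) < 0)
			_exit(127);
		close_if_not_std(to[0]);
		close_if_not_std(to[1]);
		close_if_not_std(from[0]);
		close_if_not_std(from[1]);
		helper_main();
		_exit(0);
	}
	/* Parent keeps only its own ends, as umh_pipe_setup() does. */
	close(to[0]);
	close(from[1]);
	info->to_helper = to[1];
	info->from_helper = from[0];
	info->pid = pid;
	return 0;
}

/* Example helper body: copy stdin to stdout until EOF. */
void echo_helper(void)
{
	char buf[64];
	ssize_t n;

	while ((n = read(STDIN_FILENO, buf, sizeof(buf))) > 0)
		if (write(STDOUT_FILENO, buf, (size_t)n) != n)
			_exit(1);
}

/* Close our pipe ends (the helper sees EOF) and collect its exit status. */
int reap_helper(struct helper_info *info)
{
	int status;

	close(info->to_helper);
	close(info->from_helper);
	if (waitpid(info->pid, &status, 0) != info->pid)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

After spawn_helper(echo_helper, &hi) succeeds, bytes written to hi.to_helper come back on hi.from_helper, and hi.pid identifies the child for later cleanup, which is the same triple that fork_usermode_blob() hands back in struct umh_info.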
As the first step in the development of the bpfilter project
the fork_usermode_blob() helper is introduced to allow user mode code
to be invoked from a kernel module. The idea is that user mode code plus
normal kernel module code are built as part of the kernel build
and installed as a traditional kernel module into a distro-specified location,
such that from a distribution point of view, there is
no difference between regular kernel modules and kernel modules + umh code.
Such modules can be signed, modprobed, rmmod, etc. The use of this new helper
by a kernel module doesn't make it special in any way from the kernel and user
space tooling point of view.
Such an approach enables the kernel to delegate functionality traditionally done
by kernel modules to user space processes (either root or !root) and
reduces the security attack surface of the new code. Buggy umh code would crash
the user process, but not the kernel. Another advantage is that umh code
of the kernel module can be debugged and tested out of user space
(e.g. opening the possibility to run clang sanitizers, fuzzers or
user space test suites on the umh code).
In the case of the bpfilter project, such an architecture allows a complex control
plane to be implemented in user space while the bpf based data plane stays in the kernel.
Since the umh can crash, be oom-killed by the kernel, or be killed by the admin,
the kernel module that uses it (like bpfilter) needs to manage the lifetime
of the umh on its own via the two unix pipes and the pid of the umh.
The exit code of such a kernel module should kill the umh it started,
so that rmmod of the kernel module will clean up the corresponding umh,
just as a kernel module that does kmalloc() should kfree() in its exit code.
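The "kill what you started" rule has a direct user-space counterpart, sketched here under the assumption that the owner is an ordinary parent process. start_idle_helper() and stop_helper() are hypothetical names; kill() plus waitpid() play the role that force_sig(SIGKILL, ...) plays in the module's exit path.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Start a child that only sleeps, standing in for a long-running helper. */
pid_t start_idle_helper(void)
{
	pid_t pid = fork();

	if (pid == 0) {
		for (;;)
			pause();
	}
	return pid;	/* child pid in the parent, -1 if fork failed */
}

/*
 * Kill the helper by pid and reap it. Returns the terminating signal
 * number, 0 if the helper had already exited on its own, -1 on error.
 */
int stop_helper(pid_t pid)
{
	int status;
	pid_t ret;

	/* pid 0 or a negative pid would signal a whole process group. */
	assert(pid > 0);

	kill(pid, SIGKILL);
	do {
		ret = waitpid(pid, &status, 0);
	} while (ret < 0 && errno == EINTR);
	if (ret != pid)
		return -1;
	return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

Pairing every start_idle_helper() with a stop_helper() on the owner's exit path mirrors the kmalloc()/kfree() discipline described above.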
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
fs/exec.c | 38 ++++++++---
include/linux/binfmts.h | 1 +
include/linux/umh.h | 12 ++++
kernel/umh.c | 176 +++++++++++++++++++++++++++++++++++++++++++++++-
4 files changed, 215 insertions(+), 12 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 183059c427b9..30a36c2a39bf 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1706,14 +1706,13 @@ static int exec_binprm(struct linux_binprm *bprm)
/*
* sys_execve() executes a new program.
*/
-static int do_execveat_common(int fd, struct filename *filename,
- struct user_arg_ptr argv,
- struct user_arg_ptr envp,
- int flags)
+static int __do_execve_file(int fd, struct filename *filename,
+ struct user_arg_ptr argv,
+ struct user_arg_ptr envp,
+ int flags, struct file *file)
{
char *pathbuf = NULL;
struct linux_binprm *bprm;
- struct file *file;
struct files_struct *displaced;
int retval;
@@ -1752,7 +1751,8 @@ static int do_execveat_common(int fd, struct filename *filename,
check_unsafe_exec(bprm);
current->in_execve = 1;
- file = do_open_execat(fd, filename, flags);
+ if (!file)
+ file = do_open_execat(fd, filename, flags);
retval = PTR_ERR(file);
if (IS_ERR(file))
goto out_unmark;
@@ -1760,7 +1760,9 @@ static int do_execveat_common(int fd, struct filename *filename,
sched_exec();
bprm->file = file;
- if (fd == AT_FDCWD || filename->name[0] == '/') {
+ if (!filename) {
+ bprm->filename = "none";
+ } else if (fd == AT_FDCWD || filename->name[0] == '/') {
bprm->filename = filename->name;
} else {
if (filename->name[0] == '\0')
@@ -1826,7 +1828,8 @@ static int do_execveat_common(int fd, struct filename *filename,
task_numa_free(current);
free_bprm(bprm);
kfree(pathbuf);
- putname(filename);
+ if (filename)
+ putname(filename);
if (displaced)
put_files_struct(displaced);
return retval;
@@ -1849,10 +1852,27 @@ static int do_execveat_common(int fd, struct filename *filename,
if (displaced)
reset_files_struct(displaced);
out_ret:
- putname(filename);
+ if (filename)
+ putname(filename);
return retval;
}
+static int do_execveat_common(int fd, struct filename *filename,
+ struct user_arg_ptr argv,
+ struct user_arg_ptr envp,
+ int flags)
+{
+ return __do_execve_file(fd, filename, argv, envp, flags, NULL);
+}
+
+int do_execve_file(struct file *file, void *__argv, void *__envp)
+{
+ struct user_arg_ptr argv = { .ptr.native = __argv };
+ struct user_arg_ptr envp = { .ptr.native = __envp };
+
+ return __do_execve_file(AT_FDCWD, NULL, argv, envp, 0, file);
+}
+
int do_execve(struct filename *filename,
const char __user *const __user *__argv,
const char __user *const __user *__envp)
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index 4955e0863b83..c05f24fac4f6 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -150,5 +150,6 @@ extern int do_execveat(int, struct filename *,
const char __user * const __user *,
const char __user * const __user *,
int);
+int do_execve_file(struct file *file, void *__argv, void *__envp);
#endif /* _LINUX_BINFMTS_H */
diff --git a/include/linux/umh.h b/include/linux/umh.h
index 244aff638220..5c812acbb80a 100644
--- a/include/linux/umh.h
+++ b/include/linux/umh.h
@@ -22,8 +22,10 @@ struct subprocess_info {
const char *path;
char **argv;
char **envp;
+ struct file *file;
int wait;
int retval;
+ pid_t pid;
int (*init)(struct subprocess_info *info, struct cred *new);
void (*cleanup)(struct subprocess_info *info);
void *data;
@@ -38,6 +40,16 @@ call_usermodehelper_setup(const char *path, char **argv, char **envp,
int (*init)(struct subprocess_info *info, struct cred *new),
void (*cleanup)(struct subprocess_info *), void *data);
+struct subprocess_info *call_usermodehelper_setup_file(struct file *file,
+ int (*init)(struct subprocess_info *info, struct cred *new),
+ void (*cleanup)(struct subprocess_info *), void *data);
+struct umh_info {
+ struct file *pipe_to_umh;
+ struct file *pipe_from_umh;
+ pid_t pid;
+};
+int fork_usermode_blob(void *data, size_t len, struct umh_info *info);
+
extern int
call_usermodehelper_exec(struct subprocess_info *info, int wait);
diff --git a/kernel/umh.c b/kernel/umh.c
index f76b3ff876cf..c3f418d7d51a 100644
--- a/kernel/umh.c
+++ b/kernel/umh.c
@@ -25,6 +25,8 @@
#include <linux/ptrace.h>
#include <linux/async.h>
#include <linux/uaccess.h>
+#include <linux/shmem_fs.h>
+#include <linux/pipe_fs_i.h>
#include <trace/events/module.h>
@@ -97,9 +99,13 @@ static int call_usermodehelper_exec_async(void *data)
commit_creds(new);
- retval = do_execve(getname_kernel(sub_info->path),
- (const char __user *const __user *)sub_info->argv,
- (const char __user *const __user *)sub_info->envp);
+ if (sub_info->file)
+ retval = do_execve_file(sub_info->file,
+ sub_info->argv, sub_info->envp);
+ else
+ retval = do_execve(getname_kernel(sub_info->path),
+ (const char __user *const __user *)sub_info->argv,
+ (const char __user *const __user *)sub_info->envp);
out:
sub_info->retval = retval;
/*
@@ -185,6 +191,8 @@ static void call_usermodehelper_exec_work(struct work_struct *work)
if (pid < 0) {
sub_info->retval = pid;
umh_complete(sub_info);
+ } else {
+ sub_info->pid = pid;
}
}
}
@@ -393,6 +401,168 @@ struct subprocess_info *call_usermodehelper_setup(const char *path, char **argv,
}
EXPORT_SYMBOL(call_usermodehelper_setup);
+struct subprocess_info *call_usermodehelper_setup_file(struct file *file,
+ int (*init)(struct subprocess_info *info, struct cred *new),
+ void (*cleanup)(struct subprocess_info *info), void *data)
+{
+ struct subprocess_info *sub_info;
+
+ sub_info = kzalloc(sizeof(struct subprocess_info), GFP_KERNEL);
+ if (!sub_info)
+ return NULL;
+
+ INIT_WORK(&sub_info->work, call_usermodehelper_exec_work);
+ sub_info->path = "none";
+ sub_info->file = file;
+ sub_info->init = init;
+ sub_info->cleanup = cleanup;
+ sub_info->data = data;
+ return sub_info;
+}
+
+static struct vfsmount *umh_fs;
+
+static int init_tmpfs(void)
+{
+	struct file_system_type *type;
+
+	if (umh_fs)
+		return 0;
+	type = get_fs_type("tmpfs");
+	if (!type)
+		return -ENODEV;
+	umh_fs = kern_mount(type);
+	if (IS_ERR(umh_fs)) {
+		int err = PTR_ERR(umh_fs);
+
+		put_filesystem(type);
+		umh_fs = NULL;
+		return err;
+	}
+	return 0;
+}
+
+static int alloc_tmpfs_file(size_t size, struct file **filp)
+{
+	struct file *file;
+	int err;
+
+	err = init_tmpfs();
+	if (err)
+		return err;
+	file = shmem_file_setup_with_mnt(umh_fs, "umh", size, VM_NORESERVE);
+	if (IS_ERR(file))
+		return PTR_ERR(file);
+	*filp = file;
+	return 0;
+}
+
+static int populate_file(struct file *file, const void *data, size_t size)
+{
+	size_t offset = 0;
+	int err;
+
+	do {
+		unsigned int len = min_t(typeof(size), size, PAGE_SIZE);
+		struct page *page;
+		void *pgdata, *vaddr;
+
+		err = pagecache_write_begin(file, file->f_mapping, offset, len,
+					    0, &page, &pgdata);
+		if (err < 0)
+			goto fail;
+
+		vaddr = kmap(page);
+		memcpy(vaddr, data, len);
+		kunmap(page);
+
+		err = pagecache_write_end(file, file->f_mapping, offset, len,
+					  len, page, pgdata);
+		if (err < 0)
+			goto fail;
+
+		size -= len;
+		data += len;
+		offset += len;
+	} while (size);
+	return 0;
+fail:
+	return err;
+}
+
+static int umh_pipe_setup(struct subprocess_info *info, struct cred *new)
+{
+	struct umh_info *umh_info = info->data;
+	struct file *from_umh[2];
+	struct file *to_umh[2];
+	int err;
+
+	/* create pipe to send data to umh */
+	err = create_pipe_files(to_umh, 0);
+	if (err)
+		return err;
+	err = replace_fd(0, to_umh[0], 0);
+	fput(to_umh[0]);
+	if (err < 0) {
+		fput(to_umh[1]);
+		return err;
+	}
+
+	/* create pipe to receive data from umh */
+	err = create_pipe_files(from_umh, 0);
+	if (err) {
+		fput(to_umh[1]);
+		replace_fd(0, NULL, 0);
+		return err;
+	}
+	err = replace_fd(1, from_umh[1], 0);
+	fput(from_umh[1]);
+	if (err < 0) {
+		fput(to_umh[1]);
+		replace_fd(0, NULL, 0);
+		fput(from_umh[0]);
+		return err;
+	}
+
+	umh_info->pipe_to_umh = to_umh[1];
+	umh_info->pipe_from_umh = from_umh[0];
+	return 0;
+}
+
+static void umh_save_pid(struct subprocess_info *info)
+{
+	struct umh_info *umh_info = info->data;
+
+	umh_info->pid = info->pid;
+}
+
+int fork_usermode_blob(void *data, size_t len, struct umh_info *info)
+{
+	struct subprocess_info *sub_info;
+	struct file *file = NULL;
+	int err;
+
+	err = alloc_tmpfs_file(len, &file);
+	if (err)
+		return err;
+
+	err = populate_file(file, data, len);
+	if (err)
+		goto out;
+
+	err = -ENOMEM;
+	sub_info = call_usermodehelper_setup_file(file, umh_pipe_setup,
+						  umh_save_pid, info);
+	if (!sub_info)
+		goto out;
+
+	err = call_usermodehelper_exec(sub_info, UMH_WAIT_EXEC);
+out:
+	fput(file);
+	return err;
+}
+EXPORT_SYMBOL_GPL(fork_usermode_blob);
+
/**
* call_usermodehelper_exec - start a usermode application
* @sub_info: information about the subprocessa
--
2.9.5
* [PATCH v2 net-next 0/4] bpfilter
From: Alexei Starovoitov @ 2018-05-03 4:36 UTC (permalink / raw)
To: davem; +Cc: daniel, torvalds, gregkh, luto, netdev, linux-kernel, kernel-team
Hi All,
v1->v2:
This patch set is almost a full rewrite of the earlier umh modules approach.
The v1 patches and the follow-up discussion were covered by LWN:
https://lwn.net/Articles/749108/
I believe v2 addresses all issues brought up by Andy and others.
Mainly, there are zero changes to kernel/module.c.
Instead of teaching the module loading logic to recognize a special
umh module, v2 lets a normal kernel module execute part of its own
.init.rodata as a new user space process (Andy's idea).
Patch 1 introduces this new helper:
  int fork_usermode_blob(void *data, size_t len, struct umh_info *info);
Input:
  data + len == executable file
Output:
  struct umh_info {
          struct file *pipe_to_umh;
          struct file *pipe_from_umh;
          pid_t pid;
  };
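The plumbing behind struct umh_info can be sketched in user space. This is only an analogy under stated assumptions, not the kernel code: spawn_helper(), struct helper_info and echo_helper() are hypothetical names, fork() stands in for the usermode helper thread, and the helper runs a C function rather than exec'ing the blob. The point is just that the helper sees the pipes as fd 0 and fd 1 while the parent keeps the opposite ends.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical user-space mirror of struct umh_info: the parent's pipe ends. */
struct helper_info {
	int pipe_to_helper;	/* parent writes here; helper reads it as fd 0 */
	int pipe_from_helper;	/* parent reads here; helper writes it as fd 1 */
	pid_t pid;
};

/* Fork a helper whose stdin/stdout are the far ends of two pipes. */
static int spawn_helper(int (*helper_main)(void), struct helper_info *info)
{
	int to_helper[2], from_helper[2];
	pid_t pid;

	if (pipe(to_helper) < 0)
		return -1;
	if (pipe(from_helper) < 0) {
		close(to_helper[0]);
		close(to_helper[1]);
		return -1;
	}

	pid = fork();
	if (pid < 0) {
		close(to_helper[0]);
		close(to_helper[1]);
		close(from_helper[0]);
		close(from_helper[1]);
		return -1;
	}
	if (pid == 0) {
		/* Child: install the pipe ends as fd 0 and fd 1, drop the rest. */
		dup2(to_helper[0], STDIN_FILENO);
		dup2(from_helper[1], STDOUT_FILENO);
		close(to_helper[0]);
		close(to_helper[1]);
		close(from_helper[0]);
		close(from_helper[1]);
		_exit(helper_main());
	}

	/* Parent: keep only its own ends and remember the pid. */
	close(to_helper[0]);
	close(from_helper[1]);
	info->pipe_to_helper = to_helper[1];
	info->pipe_from_helper = from_helper[0];
	info->pid = pid;
	return 0;
}

/* Trivial helper body: copy stdin to stdout until EOF. */
static int echo_helper(void)
{
	char buf[256];
	ssize_t n;

	while ((n = read(STDIN_FILENO, buf, sizeof(buf))) > 0)
		if (write(STDOUT_FILENO, buf, (size_t)n) != n)
			return 1;
	return n < 0;
}
```

A caller would invoke spawn_helper(echo_helper, &info), write a request to info.pipe_to_helper, read the reply from info.pipe_from_helper, and use info.pid to reap or kill the helper, which is roughly the role the pid field plays for 'rmmod bpfilter'.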
Advantages vs v1:
- the embedded user mode executable is stored as .init.rodata inside
  a normal kernel module. These pages are freed when the .ko finishes loading
- the ELF file is copied into a tmpfs file, so the user mode process is swappable
- the communication between the user mode process and its 'parent' kernel module
  is done via two unix pipes, hence the protocol is not exposed to
  user space
- it is impossible to launch the umh on its own (that was the main issue of v1)
  and impossible to be a man-in-the-middle due to the pipes
- bpfilter.ko consists of a tiny kernel part that passes the data
  between kernel and umh via pipes, and a much bigger umh part that
  does all the work
- 'lsmod' shows bpfilter.ko as usual.
  'rmmod bpfilter' removes the kernel module and kills the corresponding umh
- a signed bpfilter.ko covers the whole image, including the umh code
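The "ELF file is copied into a tmpfs file" step is done one page at a time by populate_file() in patch 1. A rough user-space sketch of the same chunked copy follows, assuming an ordinary file descriptor in place of the tmpfs struct file and pwrite() in place of pagecache_write_begin()/pagecache_write_end(); populate_fd() is a hypothetical name.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy `size` bytes of `data` into `fd` starting at offset 0, at most one page per write. */
static int populate_fd(int fd, const void *data, size_t size)
{
	const unsigned char *src = data;
	long page = sysconf(_SC_PAGESIZE);
	size_t chunk_max = page > 0 ? (size_t)page : 4096;
	off_t offset = 0;

	while (size) {
		size_t len = size < chunk_max ? size : chunk_max;
		ssize_t n = pwrite(fd, src, len, offset);

		if (n <= 0)
			return -1;
		/* A short write is legal; advance by what was actually written. */
		size -= (size_t)n;
		src += n;
		offset += n;
	}
	return 0;
}
```

Unlike the do/while loop in the patch, this sketch uses a plain while loop, so an empty blob is simply a no-op.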
Few issues:
- architecturally bpfilter.ko can be builtin, but that doesn't work yet.
  Still debugging. Kinda cool to have user mode executables
  be part of vmlinux
- the user can still attach to the process and debug it with
  'gdb /proc/pid/exe pid', but 'gdb -p pid' doesn't work
  (a bit worse compared to v1)
- tinyconfig will notice a small increase in .text:
  +766 | TEXT | 7c8b94806bec umh: introduce fork_usermode_blob() helper
More details are in patches 1 and 2, which are ready to land.
Patches 3 and 4 are still rough. They were mainly used for
testing and to demonstrate how bpfilter builds on top.
The patch 4 approach of converting one iptables rule to a few bpf
instructions will certainly change in the future, since it doesn't
scale to thousands of rules.
Alexei Starovoitov (2):
umh: introduce fork_usermode_blob() helper
net: add skeleton of bpfilter kernel module
Daniel Borkmann (1):
bpfilter: rough bpfilter codegen example hack
David S. Miller (1):
bpfilter: add iptable get/set parsing
fs/exec.c | 38 ++++-
include/linux/binfmts.h | 1 +
include/linux/bpfilter.h | 15 ++
include/linux/umh.h | 12 ++
include/uapi/linux/bpfilter.h | 200 ++++++++++++++++++++++
kernel/umh.c | 176 +++++++++++++++++++-
net/Kconfig | 2 +
net/Makefile | 1 +
net/bpfilter/Kconfig | 17 ++
net/bpfilter/Makefile | 24 +++
net/bpfilter/bpfilter_kern.c | 93 +++++++++++
net/bpfilter/bpfilter_mod.h | 373 ++++++++++++++++++++++++++++++++++++++++++
net/bpfilter/ctor.c | 91 +++++++++++
net/bpfilter/gen.c | 290 ++++++++++++++++++++++++++++++++
net/bpfilter/init.c | 36 ++++
net/bpfilter/main.c | 117 +++++++++++++
net/bpfilter/msgfmt.h | 17 ++
net/bpfilter/sockopt.c | 236 ++++++++++++++++++++++++++
net/bpfilter/tables.c | 73 +++++++++
net/bpfilter/targets.c | 51 ++++++
net/bpfilter/tgts.c | 26 +++
net/ipv4/Makefile | 2 +
net/ipv4/bpfilter/Makefile | 2 +
net/ipv4/bpfilter/sockopt.c | 42 +++++
net/ipv4/ip_sockglue.c | 17 ++
25 files changed, 1940 insertions(+), 12 deletions(-)
create mode 100644 include/linux/bpfilter.h
create mode 100644 include/uapi/linux/bpfilter.h
create mode 100644 net/bpfilter/Kconfig
create mode 100644 net/bpfilter/Makefile
create mode 100644 net/bpfilter/bpfilter_kern.c
create mode 100644 net/bpfilter/bpfilter_mod.h
create mode 100644 net/bpfilter/ctor.c
create mode 100644 net/bpfilter/gen.c
create mode 100644 net/bpfilter/init.c
create mode 100644 net/bpfilter/main.c
create mode 100644 net/bpfilter/msgfmt.h
create mode 100644 net/bpfilter/sockopt.c
create mode 100644 net/bpfilter/tables.c
create mode 100644 net/bpfilter/targets.c
create mode 100644 net/bpfilter/tgts.c
create mode 100644 net/ipv4/bpfilter/Makefile
create mode 100644 net/ipv4/bpfilter/sockopt.c
--
2.9.5
* Re: [lkp-robot] 486ad79630 [ 15.532543] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
From: Andrew Morton @ 2018-05-03 4:27 UTC (permalink / raw)
To: kernel test robot
Cc: kernel test robot, Linux Memory Management List, Johannes Weiner,
LKP, David Miller, netdev, Cong Wang
In-Reply-To: <20180503041450.pq2njvkssxtay64o@shao2-debian>
(networking cc's added)
On Thu, 3 May 2018 12:14:50 +0800 kernel test robot <shun.hao@intel.com> wrote:
> Greetings,
>
> 0day kernel testing robot got the below dmesg and the first bad commit is
>
> git://git.cmpxchg.org/linux-mmotm.git master
>
> commit 486ad79630d0ba0b7205a8db9fe15ba392f5ee32
> Author: Andrew Morton <akpm@linux-foundation.org>
> AuthorDate: Fri Apr 20 22:00:53 2018 +0000
> Commit: Johannes Weiner <hannes@cmpxchg.org>
> CommitDate: Fri Apr 20 22:00:53 2018 +0000
>
> origin
OK, this got confusing. origin.patch is the diff between 4.17-rc3 and
current mainline.
>
> [many lines deleted]
>
> [main] Setsockopt(101 c 1b24000 a) on fd 177 [3:5:240]
> [main] Setsockopt(1 2c 1b24000 4) on fd 178 [5:2:0]
> [main] Setsockopt(29 8 1b24000 4) on fd 180 [10:1:0]
> [main] Setsockopt(1 20 1b24000 4) on fd 181 [26:2:125]
> [main] Setsockopt(11 1 1b24000 4) on fd 183 [2:2:17]
> [ 15.532543] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
> [ 15.534143] PGD 800000001734b067 P4D 800000001734b067 PUD 17350067 PMD 0
> [ 15.535516] Oops: 0002 [#1] PTI
> [ 15.536165] Modules linked in:
> [ 15.536798] CPU: 0 PID: 363 Comm: trinity-main Not tainted 4.17.0-rc1-00001-g486ad79 #2
> [ 15.538396] RIP: 0010:llc_ui_release+0x3a/0xd0
> [ 15.539293] RSP: 0018:ffffc9000015bd70 EFLAGS: 00010202
> [ 15.540345] RAX: 0000000000000001 RBX: ffff88001fa60008 RCX: 0000000000000006
> [ 15.541802] RDX: 0000000000000006 RSI: ffff88001fdda660 RDI: ffff88001fa60008
> [ 15.543139] RBP: ffffc9000015bd80 R08: 0000000000000000 R09: 0000000000000000
> [ 15.544725] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> [ 15.546287] R13: ffff88001fa61730 R14: ffff88001e130a60 R15: ffff880019bdb3f0
> [ 15.547962] FS: 00007f2221bb1700(0000) GS:ffffffff82034000(0000) knlGS:0000000000000000
> [ 15.549848] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 15.551186] CR2: 0000000000000004 CR3: 000000001734e000 CR4: 00000000000006b0
> [ 15.552671] DR0: 0000000002232000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 15.554105] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
> [ 15.555534] Call Trace:
> [ 15.556049] sock_release+0x14/0x60
> [ 15.556767] sock_close+0xd/0x20
> [ 15.557427] __fput+0xba/0x1f0
> [ 15.558058] ____fput+0x9/0x10
> [ 15.558682] task_work_run+0x73/0xa0
> [ 15.559416] do_exit+0x231/0xab0
> [ 15.560079] do_group_exit+0x3f/0xc0
> [ 15.560810] __x64_sys_exit_group+0x13/0x20
> [ 15.561656] do_syscall_64+0x58/0x2f0
> [ 15.562407] ? trace_hardirqs_off_thunk+0x1a/0x1c
> [ 15.563360] entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [ 15.564471] RIP: 0033:0x7f2221696408
> [ 15.565264] RSP: 002b:00007ffe5c544c48 EFLAGS: 00000206 ORIG_RAX: 00000000000000e7
> [ 15.566924] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f2221696408
> [ 15.568485] RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
> [ 15.570046] RBP: 0000000000000000 R08: 00000000000000e7 R09: ffffffffffffffa0
> [ 15.571603] R10: 00007ffe5c5449e0 R11: 0000000000000206 R12: 0000000000000004
> [ 15.573160] R13: 00007ffe5c544e30 R14: 0000000000000000 R15: 0000000000000000
> [ 15.574720] Code: 7b ff 43 78 0f 88 a5 6f 14 00 31 f6 48 89 df e8 ad 33 fb ff 48 89 df e8 55 94 ff ff 85 c0 0f 84 84 00 00 00 4c 8b a3 d8 04 00 00 <41> ff 44 24 04 0f 88 7f 6f 14 00 48 8b 43 58 f6 c4 01 74 58 48
> [ 15.578679] RIP: llc_ui_release+0x3a/0xd0 RSP: ffffc9000015bd70
> [ 15.579874] CR2: 0000000000000004
> [ 15.580553] ---[ end trace 0dd8fdc6b7182234 ]---
>
So it's saying that something which got committed into Linus's tree
after 4.17-rc3 has caused a NULL deref in
sock_release->llc_ui_release+0x3a/0xd0
* [PATCH net] macsonic: Set platform device coherent_dma_mask
From: Finn Thain @ 2018-05-03 4:24 UTC (permalink / raw)
To: David S. Miller; +Cc: linux-m68k, netdev, linux-kernel
Set the device's coherent_dma_mask to avoid a WARNING splat.
Please see commit 205e1b7f51e4 ("dma-mapping: warn when there is
no coherent_dma_mask").
Cc: linux-m68k@lists.linux-m68k.org
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
---
drivers/net/ethernet/natsemi/macsonic.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/drivers/net/ethernet/natsemi/macsonic.c b/drivers/net/ethernet/natsemi/macsonic.c
index 0937fc2a928e..37b1ffa8bb61 100644
--- a/drivers/net/ethernet/natsemi/macsonic.c
+++ b/drivers/net/ethernet/natsemi/macsonic.c
@@ -523,6 +523,10 @@ static int mac_sonic_platform_probe(struct platform_device *pdev)
 	struct sonic_local *lp;
 	int err;
+	err = dma_coerce_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
+	if (err)
+		return err;
+
 	dev = alloc_etherdev(sizeof(struct sonic_local));
 	if (!dev)
 		return -ENOMEM;
--
2.16.1
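
For reference, DMA_BIT_MASK(32) evaluates to 0xffffffff, i.e. the device is declared able to address the low 4 GiB. A minimal user-space sketch, assuming the macro has the same shape as the kernel's definition; addr_fits_mask() is a hypothetical helper used only for illustration and is not part of the DMA API.

```c
#include <stdint.h>

/* Same shape as the kernel's DMA_BIT_MASK(); redefined here for user space. */
#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/* Hypothetical helper: can a device limited by `mask` reach the byte at `addr`? */
static int addr_fits_mask(uint64_t addr, uint64_t mask)
{
	return (addr & ~mask) == 0;
}
```

With a 32-bit mask, an address of 0xffffffff fits and 0x100000000 does not.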