From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Michael Kerrisk (man-pages)"
Subject: Re: Draft 3 of bpf(2) man page for review
Date: Wed, 22 Jul 2015 22:10:39 +0200
Message-ID: <55AFF8BF.3050204@gmail.com>
References: <55AFE46F.3090800@gmail.com> <55AFED75.2030208@plumgrid.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: QUOTED-PRINTABLE
Return-path:
In-Reply-To: <55AFED75.2030208-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>
Sender: linux-man-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
To: Alexei Starovoitov, Daniel Borkmann
Cc: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org, linux-man,
 linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Silvan Jegen,
 Walter Harms
List-Id: linux-man@vger.kernel.org

On 07/22/2015 09:22 PM, Alexei Starovoitov wrote:
> On 7/22/15 11:43 AM, Michael Kerrisk (man-pages) wrote:
>> .TH BPF 2 2015-03-10 "Linux" "Linux Programmer's Manual"
>
> should the date be updated ?

It'll get updated later, by scripts.

>> BPF maps are a generic data structure for storage of different data types.
>> A user process can create multiple maps (with key/value-pairs being
>> opaque bytes of data) and access them via file descriptors.
>> eBPF programs can access maps from inside the kernel in parallel.
>> .\"
>> .\" FIXME!! What does the previous sentence mean?
>> .\"
>> .\" Isn't "from inside the kernel" redundant? (I mean: all eBPF programs
>> .\" are running inside the kernel, right?)
>
> 99.9% of the time. yes. all eBPF programs are running inside the kernel,
> though recently I've seen two versions of 'user space eBPF' where
> kernel interpreter/x64_jit were ported to user space.
> If you think 'from kernel' is redundant, just drop it.

Okay. Done.

>> .\" And what does "in parallel" mean?
>> .\" Would a simpler version of this sentence be correct? As in:
>> .\" "Different eBPF programs can access the same maps in parallel."
>
> yes.
> different eBPF programs and user space processes can access the
> same maps in parallel.

Okay.

>> The new map has the type specified by
>> .IR map_type ,
>> and attributes as specified in
>> .IR key_size ,
>> .IR value_size ,
>> and
>> .IR max_entries .
>> .\" FIXME!! In the next sentence, what does "process-local" mean?
>> On success, this operation returns a process-local file descriptor.
>
> Just drop this unnecessary qualifier. Just 'returns a file descriptor'

Done.

>> .in +4n
>> .nf
>> bpf_map_lookup_elem(map_fd, fp - 4)
>> .fi
>> .in
>>
>> the program will be rejected,
>> since the in-kernel helper function
>>
>> bpf_map_lookup_elem(map_fd, void *key)
>>
>> expects to read 8 bytes from
>> .I key
>> pointer, but
>> .IR "fp\ -\ 4"
>> .\" FIXME!! I'm lost! What is 'fp' in this context?
>
> it refers to 2nd argument of 'bpf_map_lookup_elem(map_fd, fp - 4)'
> fp = top of the stack.
> fp - 4 = pointer to 4 bytes below top of the stack.
> So 8 byte access from there will be out of bounds.

Okay. I added some words mentioning that 'fp' is top of stack.

>> The following map types are supported:
>> .TP
>> .B BPF_MAP_TYPE_HASH
>> .\" commit 0f8e4bd8a1fc8c4185f1630061d0a1f2d197a475
>> .\" FIXME!! Please review the following list of points, which draws
>> .\" heavily from the commit message, but reworks the text significantly
>> .\" and so may have introduced errors.
>> Hash-table maps have the following characteristics:
>> .RS
>> .IP * 3
>> Maps are created and destroyed by user-space programs.
>> Both user-space and eBPF programs
>> can perform lookuo, update, and delete operations.
>
> typo 'lookup'

Thanks, fixed.

>> .IP *
>> The kernel takes care of allocating and freeing key/value pairs.
>> .IP *
>> The
>> .BR map_update_elem ()
>> helper with fail to insert new element when the
>> .I max_entries
>> limit is reached.
>> (This ensures that eBPF programs cannot exhaust memory.)
>> .IP *
>> .BR map_update_elem ()
>> replaces existing elements atomically.
>> .RE
>> .IP
>> Hash-table maps are
>> optimized for speed of lookup.
>> .TP
>> .B BPF_MAP_TYPE_ARRAY
>> .\" commit 28fbcfa08d8ed7c5a50d41a0433aad222835e8e3
>> .\" FIXME!! Please review the following list of points, which draws
>> .\" heavily from the commit message, but reworks the text significantly
>> .\" and so may have introduced errors.
>> Array maps have the following characteristics:
>> .RS
>> .IP * 3
>> Optimized for fastest possible lookup.
>> In the future ithe verifier/JIT compiler
>
> typo 'the'

Fixed.

>> may recognize lookup() operations that employ a constant key
>> and optimize it into constant pointer.
>> It is possible to optimize a non-constant
>> key into direct pointer arithmetic as well, since pointers and
>> .I value_size
>> are constant for the life of the eBPF program.
>> In other words,
>> .BR array_map_lookup_elem ()
>> may be 'inlined' by the verifier/JIT compiler
>> while preserving concurrent access to this map from user space.
>> .IP *
>> All array elements pre-allocated and zero initialized at init time
>> .IP *
>> The key is an array index, and must be exactly four bytes.
>> .IP *
>> .BR map_delete_elem ()
>> fails with the error
>> .BR EINVAL ,
>> since elements cannot be deleted.
>> .IP *
>> .BR map_update_elem ()
>> replaces elements in an non-atomic fashion;
>> for atomic updates, a hash-table map should be used instead.
>
> the description of hash and array maps looks good.

Okay. Thanks for checking.

>> .\" FIXME The following paragraph needs amending. Alexei commented:
>> .\"
>> .\" Actually now in case of SOCKET_FILTER, SCHED_CLS, SCHED_ACT
>> .\" the program can now access skb fields.
>> .\" See 'struct __sk_buff' and commit 9bac3d6d548e5
>> .\"
>> .\" Do we want some text here to explain how the program access __sk_buff?
>
> I think commit 9bac3d6d548e5 tried to explain it, but translating
> that to english would be nice :)

Yes, but my C-to-English translator failed.

>> .\" FIXME!!
>> .\" Alexei, is the following correct?
>> eBPF objects (maps and programs) can be shared between processes.
>> For example, after
>> .BR fork (2),
>> the child inherits file descriptors referring to the same eBPF objects.
>> In addition, file descriptors referring to eBPF objects can be
>> transferred over UNIX domain sockets.
>> File descriptors referring to eBPF objects can be duplicated
>> in the usual way, using
>> .BR dup (2)
>> and similar calls.
>> An eBPF object is deallocated only after all file descriptors
>> referring to the object have been closed.
>
> yes. all correct.

Thanks.

>> eBPF programs can be written in a restricted C that is compiled (using the
>> .B clang
>> compiler) into eBPF bytecode and executed on the in-kernel virtual machine or
>> just-in-time compiled into native code.
>> (Various features are omitted from this restricted C, such as loops,
>> global variables, variadic functions, floating-point numbers,
>> and passing structures as function arguments.)
>> Some examples can be found in the
>> .I samples/bpf/*_kern.c
>> files in the kernel source tree.
>
> thanks. whole thing looks good.

Thanks. Below is the current rendered version of the man page.

Cheers,

Michael


NAME
       bpf - perform a command on an extended BPF map or program

SYNOPSIS
       #include <linux/bpf.h>

       int bpf(int cmd, union bpf_attr *attr, unsigned int size);

DESCRIPTION
       The bpf() system call performs a range of operations related to
       extended Berkeley Packet Filters. Extended BPF (or eBPF) is
       similar to the original ("classic") BPF (cBPF) used to filter
       network packets. For both cBPF and eBPF programs, the kernel
       statically analyzes the programs before loading them, in order
       to ensure that they cannot harm the running system.
       eBPF extends cBPF in multiple ways, including the ability to
       call a fixed set of in-kernel helper functions (via the BPF_CALL
       opcode extension provided by eBPF) and access shared data
       structures such as eBPF maps.

   Extended BPF Design/Architecture
       BPF maps are a generic data structure for storage of different
       data types. A user process can create multiple maps (with
       key/value-pairs being opaque bytes of data) and access them via
       file descriptors. Different eBPF programs can access the same
       maps in parallel. It's up to the user process and eBPF program
       to decide what they store inside maps.

       eBPF programs are similar to kernel modules. They are loaded by
       the user process and automatically unloaded when the process
       exits. Each program is a set of instructions that is safe to run
       until its completion. An in-kernel verifier statically
       determines that the eBPF program terminates and is safe to
       execute. During verification, the kernel increments reference
       counts for each of the maps that the eBPF program uses, so that
       the selected maps cannot be removed until the program is
       unloaded.

       eBPF programs can be attached to different events. These events
       can be the arrival of network packets, tracing events,
       classification events by qdiscs (for eBPF programs attached to a
       tc(8) classifier), and other types that may be added in the
       future. A new event triggers execution of the eBPF program,
       which may store information about the event in eBPF maps.
       Beyond storing data, eBPF programs may call a fixed set of
       in-kernel helper functions.

       The same eBPF program can be attached to multiple events and
       different eBPF programs can access the same map:

           tracing     tracing     tracing     packet      packet
           event A     event B     event C     on eth0     on eth1
            |             |          |           |           |
            |             |          |           |           |
            --> tracing <--      tracing       socket      socket
                 prog_1           prog_2       prog_3      prog_4
                 |  |               |            |
              |---  -----|  |-------|           map_3
            map_1       map_2

   Arguments
       The operation to be performed by the bpf() system call is
       determined by the cmd argument. Each operation takes an
       accompanying argument, provided via attr, which is a pointer to
       a union of type bpf_attr (see below). The size argument is the
       size of the union pointed to by attr.

       The value provided in cmd is one of the following:

       BPF_MAP_CREATE
              Create a map and return a file descriptor that refers to
              the map.

       BPF_MAP_LOOKUP_ELEM
              Look up an element by key in a specified map and return
              its value.

       BPF_MAP_UPDATE_ELEM
              Create or update an element (key/value pair) in a
              specified map.

       BPF_MAP_DELETE_ELEM
              Look up and delete an element by key in a specified map.

       BPF_MAP_GET_NEXT_KEY
              Look up an element by key in a specified map and return
              the key of the next element.

       BPF_PROG_LOAD
              Verify and load an eBPF program, returning a new file
              descriptor associated with the program.
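The SYNOPSIS shows bpf() as an ordinary function, but as far as I know
glibc does not ship a wrapper for it. A minimal sketch of one, assuming
the kernel headers provide <linux/bpf.h> and that __NR_bpf is defined
for the target architecture:

```c
#include <linux/bpf.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper around the raw system call (a sketch, not part of the
   draft page). Returns whatever syscall(2) returns: a file descriptor
   or zero on success, -1 with errno set on failure. */
static int
bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
    return syscall(__NR_bpf, cmd, attr, size);
}
```

The wrapper functions later in the page (bpf_create_map() and friends)
can be built on top of a definition like this one.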
       The bpf_attr union consists of various anonymous structures that
       are used by different bpf() commands:

           union bpf_attr {
               struct {    /* Used by BPF_MAP_CREATE */
                   __u32         map_type;
                   __u32         key_size;    /* size of key in bytes */
                   __u32         value_size;  /* size of value in bytes */
                   __u32         max_entries; /* maximum number of entries
                                                 in a map */
               };

               struct {    /* Used by BPF_MAP_*_ELEM and BPF_MAP_GET_NEXT_KEY
                              commands */
                   __u32         map_fd;
                   __aligned_u64 key;
                   union {
                       __aligned_u64 value;
                       __aligned_u64 next_key;
                   };
                   __u64         flags;
               };

               struct {    /* Used by BPF_PROG_LOAD */
                   __u32         prog_type;
                   __u32         insn_cnt;
                   __aligned_u64 insns;      /* 'const struct bpf_insn *' */
                   __aligned_u64 license;    /* 'const char *' */
                   __u32         log_level;  /* verbosity level of verifier */
                   __u32         log_size;   /* size of user buffer */
                   __aligned_u64 log_buf;    /* user supplied 'char *'
                                                buffer */
                   __u32         kern_version;
                                             /* checked when prog_type=kprobe
                                                (since Linux 4.1) */
               };
           } __attribute__((aligned(8)));

   eBPF maps
       Maps are a generic data structure for storage of different types
       of data. They allow sharing of data between eBPF kernel
       programs, and also between kernel and user-space applications.

       Each map type has the following attributes:

       *  type

       *  maximum number of elements

       *  key size in bytes

       *  value size in bytes

       The following wrapper functions demonstrate how various bpf()
       commands can be used to access the maps. The functions use the
       cmd argument to invoke different operations.

       BPF_MAP_CREATE
              The BPF_MAP_CREATE command creates a new map, returning a
              new file descriptor that refers to the map.

                  int
                  bpf_create_map(enum bpf_map_type map_type, int key_size,
                                 int value_size, int max_entries)
                  {
                      union bpf_attr attr = {
                          .map_type    = map_type,
                          .key_size    = key_size,
                          .value_size  = value_size,
                          .max_entries = max_entries
                      };

                      return bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
                  }

              The new map has the type specified by map_type, and
              attributes as specified in key_size, value_size, and
              max_entries.
              On success, this operation returns a file descriptor. On
              error, -1 is returned and errno is set to EINVAL, EPERM,
              or ENOMEM.

              The attributes key_size and value_size will be used by the
              verifier during program loading to check that the program
              is calling bpf_map_*_elem() helper functions with a
              correctly initialized key and to check that the program
              doesn't access the map element value beyond the specified
              value_size. For example, when a map is created with a
              key_size of 8 and the eBPF program calls

                  bpf_map_lookup_elem(map_fd, fp - 4)

              the program will be rejected, since the in-kernel helper
              function

                  bpf_map_lookup_elem(map_fd, void *key)

              expects to read 8 bytes from the location pointed to by
              key, but the starting address fp - 4 (where fp is the top
              of the stack) leaves only 4 bytes below the top of the
              stack, so an 8-byte read would access the stack out of
              bounds.

              Similarly, when a map is created with a value_size of 1
              and the eBPF program contains

                  value = bpf_map_lookup_elem(...);
                  *(u32 *) value = 1;

              the program will be rejected, since it accesses the value
              pointer beyond the specified 1-byte value_size limit.

              Currently, the following values are supported for
              map_type:

                  enum bpf_map_type {
                      BPF_MAP_TYPE_UNSPEC,  /* Reserve 0 as invalid map type */
                      BPF_MAP_TYPE_HASH,
                      BPF_MAP_TYPE_ARRAY,
                      BPF_MAP_TYPE_PROG_ARRAY,
                  };

              map_type selects one of the available map implementations
              in the kernel. For all map types, eBPF programs access
              maps with the same bpf_map_lookup_elem() and
              bpf_map_update_elem() helper functions. Further details
              of the various map types are given below.

       BPF_MAP_LOOKUP_ELEM
              The BPF_MAP_LOOKUP_ELEM command looks up an element with a
              given key in the map referred to by the file descriptor
              fd.
                  int
                  bpf_lookup_elem(int fd, void *key, void *value)
                  {
                      union bpf_attr attr = {
                          .map_fd = fd,
                          .key    = ptr_to_u64(key),
                          .value  = ptr_to_u64(value),
                      };

                      return bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
                  }

              If an element is found, the operation returns zero and
              stores the element's value into value, which must point to
              a buffer of value_size bytes.

              If no element is found, the operation returns -1 and sets
              errno to ENOENT.

       BPF_MAP_UPDATE_ELEM
              The BPF_MAP_UPDATE_ELEM command creates or updates an
              element with a given key/value in the map referred to by
              the file descriptor fd.

                  int
                  bpf_update_elem(int fd, void *key, void *value,
                                  __u64 flags)
                  {
                      union bpf_attr attr = {
                          .map_fd = fd,
                          .key    = ptr_to_u64(key),
                          .value  = ptr_to_u64(value),
                          .flags  = flags,
                      };

                      return bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
                  }

              The flags argument should be specified as one of the
              following:

              BPF_ANY
                     Create a new element or update an existing element.

              BPF_NOEXIST
                     Create a new element only if it did not exist.

              BPF_EXIST
                     Update an existing element.

              On success, the operation returns zero. On error, -1 is
              returned and errno is set to EINVAL, EPERM, ENOMEM, or
              E2BIG. E2BIG indicates that the number of elements in the
              map reached the max_entries limit specified at map
              creation time. EEXIST will be returned if flags specifies
              BPF_NOEXIST and the element with key already exists in the
              map. ENOENT will be returned if flags specifies BPF_EXIST
              and the element with key doesn't exist in the map.

       BPF_MAP_DELETE_ELEM
              The BPF_MAP_DELETE_ELEM command deletes the element whose
              key is key from the map referred to by the file descriptor
              fd.

                  int
                  bpf_delete_elem(int fd, void *key)
                  {
                      union bpf_attr attr = {
                          .map_fd = fd,
                          .key    = ptr_to_u64(key),
                      };

                      return bpf(BPF_MAP_DELETE_ELEM, &attr, sizeof(attr));
                  }

              On success, zero is returned. If the element is not
              found, -1 is returned and errno is set to ENOENT.
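The wrappers above call ptr_to_u64(), which the draft does not define.
A minimal sketch of such a helper, assuming all it has to do is widen a
user-space pointer to the 64-bit integer stored in the __aligned_u64
fields of union bpf_attr (uint64_t is used here in place of __u64 so
that the fragment needs no kernel headers):

```c
#include <stdint.h>

/* Hypothetical definition of the helper used by the wrappers above.
   Going through uintptr_t first avoids a pointer-to-wider-integer
   cast warning on 32-bit systems and zero-extends the address. */
static inline uint64_t
ptr_to_u64(const void *ptr)
{
    return (uint64_t) (uintptr_t) ptr;
}
```

Taking a const void * lets the same helper accept both the writable
key/value buffers and read-only arguments such as the license string
passed to BPF_PROG_LOAD.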
       BPF_MAP_GET_NEXT_KEY
              The BPF_MAP_GET_NEXT_KEY command looks up an element by
              key in the map referred to by the file descriptor fd and
              sets the next_key pointer to the key of the next element.

                  int
                  bpf_get_next_key(int fd, void *key, void *next_key)
                  {
                      union bpf_attr attr = {
                          .map_fd   = fd,
                          .key      = ptr_to_u64(key),
                          .next_key = ptr_to_u64(next_key),
                      };

                      return bpf(BPF_MAP_GET_NEXT_KEY, &attr, sizeof(attr));
                  }

              If key is found, the operation returns zero and sets the
              next_key pointer to the key of the next element. If key
              is not found, the operation returns zero and sets the
              next_key pointer to the key of the first element. If key
              is the last element, -1 is returned and errno is set to
              ENOENT. Other possible errno values are ENOMEM, EFAULT,
              EPERM, and EINVAL. This method can be used to iterate
              over all elements in the map.

       close(map_fd)
              Delete the map referred to by the file descriptor map_fd.
              When the user-space program that created a map exits, all
              maps will be deleted automatically (but see NOTES).

   eBPF map types
       The following map types are supported:

       BPF_MAP_TYPE_HASH
              Hash-table maps have the following characteristics:

              *  Maps are created and destroyed by user-space programs.
                 Both user-space and eBPF programs can perform lookup,
                 update, and delete operations.

              *  The kernel takes care of allocating and freeing
                 key/value pairs.

              *  The map_update_elem() helper will fail to insert a new
                 element when the max_entries limit is reached. (This
                 ensures that eBPF programs cannot exhaust memory.)

              *  map_update_elem() replaces existing elements
                 atomically.

              Hash-table maps are optimized for speed of lookup.

       BPF_MAP_TYPE_ARRAY
              Array maps have the following characteristics:

              *  Optimized for fastest possible lookup. In the future
                 the verifier/JIT compiler may recognize lookup()
                 operations that employ a constant key and optimize them
                 into a constant pointer.
                 It is possible to optimize a non-constant key into
                 direct pointer arithmetic as well, since pointers and
                 value_size are constant for the life of the eBPF
                 program. In other words, array_map_lookup_elem() may
                 be 'inlined' by the verifier/JIT compiler while
                 preserving concurrent access to this map from user
                 space.

              *  All array elements are preallocated and zero
                 initialized at init time.

              *  The key is an array index, and must be exactly four
                 bytes.

              *  map_delete_elem() fails with the error EINVAL, since
                 elements cannot be deleted.

              *  map_update_elem() replaces elements in a nonatomic
                 fashion; for atomic updates, a hash-table map should be
                 used instead.

              Among the uses for array maps are the following:

              *  As "global" eBPF variables: an array of 1 element whose
                 key is (index) 0 and where the value is a collection of
                 'global' variables which eBPF programs can use to keep
                 state between events.

              *  Aggregation of tracing events into a fixed set of
                 buckets.

       BPF_MAP_TYPE_PROG_ARRAY (since Linux 4.2)
              [To be completed]

   eBPF programs
       The BPF_PROG_LOAD command is used to load an eBPF program into
       the kernel. The return value for this command is a new file
       descriptor associated with this eBPF program.

           char bpf_log_buf[LOG_BUF_SIZE];

           int
           bpf_prog_load(enum bpf_prog_type prog_type,
                         const struct bpf_insn *insns, int insn_cnt,
                         const char *license)
           {
               union bpf_attr attr = {
                   .prog_type = prog_type,
                   .insns     = ptr_to_u64(insns),
                   .insn_cnt  = insn_cnt,
                   .license   = ptr_to_u64(license),
                   .log_buf   = ptr_to_u64(bpf_log_buf),
                   .log_size  = LOG_BUF_SIZE,
                   .log_level = 1,
               };

               return bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
           }

       prog_type is one of the available program types:

           enum bpf_prog_type {
               BPF_PROG_TYPE_UNSPEC,        /* Reserve 0 as invalid
                                               program type */
               BPF_PROG_TYPE_SOCKET_FILTER,
               BPF_PROG_TYPE_KPROBE,
               BPF_PROG_TYPE_SCHED_CLS,
               BPF_PROG_TYPE_SCHED_ACT,
           };

       For further details of eBPF program types, see below.
       The remaining fields of bpf_attr are set as follows:

       *  insns is an array of struct bpf_insn instructions.

       *  insn_cnt is the number of instructions in the program
          referred to by insns.

       *  license is a license string, which must be GPL compatible to
          call helper functions marked gpl_only.

       *  log_buf is a pointer to a caller-allocated buffer in which the
          in-kernel verifier can store the verification log. This log
          is a multi-line string that can be checked by the program
          author in order to understand how the verifier came to the
          conclusion that the BPF program is unsafe. The format of the
          output can change at any time as the verifier evolves.

       *  log_size is the size of the buffer pointed to by log_buf. If
          the size of the buffer is not large enough to store all
          verifier messages, -1 is returned and errno is set to ENOSPC.

       *  log_level is the verbosity level of the verifier. A value of
          zero means that the verifier will not provide a log.

       Applying close(2) to the file descriptor returned by
       BPF_PROG_LOAD will unload the eBPF program (but see NOTES).

       Maps are accessible from eBPF programs and are used to exchange
       data between eBPF programs and between eBPF programs and
       user-space programs. For example, eBPF programs can process
       various events (like kprobe, packets) and store their data into
       a map, and user-space programs can then fetch data from the map.
       Conversely, user-space programs can use a map as a configuration
       mechanism, populating the map with values checked by the eBPF
       program, which then modifies its behavior on the fly according
       to those values.

   eBPF program types
       By picking prog_type, the program author selects a set of helper
       functions that can be called from the eBPF program and the
       corresponding format of struct bpf_context (which is the data
       blob passed into the eBPF program as the first argument).
       For example, programs loaded with a prog_type of
       BPF_PROG_TYPE_SOCKET_FILTER may call the bpf_map_lookup_elem()
       helper, whereas some other program types may not be able to
       employ this helper. The set of functions available to eBPF
       programs of a given type may increase in the future.

       The following program types are supported:

       BPF_PROG_TYPE_SOCKET_FILTER (since Linux 3.19)
              Currently, the set of functions for
              BPF_PROG_TYPE_SOCKET_FILTER is:

                  bpf_map_lookup_elem(map_fd, void *key)
                                      /* look up key in a map_fd */
                  bpf_map_update_elem(map_fd, void *key, void *value)
                                      /* update key/value */
                  bpf_map_delete_elem(map_fd, void *key)
                                      /* delete key in a map_fd */

              The bpf_context argument is a pointer to a struct sk_buff.
              Programs cannot access the fields of sk_buff directly.

       BPF_PROG_TYPE_KPROBE (since Linux 4.1)
              [To be documented]

       BPF_PROG_TYPE_SCHED_CLS (since Linux 4.1)
              [To be documented]

       BPF_PROG_TYPE_SCHED_ACT (since Linux 4.1)
              [To be documented]

   Events
       Once a program is loaded, it can be attached to an event.
       Various kernel subsystems have different ways to do so.

       Since Linux 3.19, the following call will attach the program
       prog_fd to the socket sockfd, which was created by an earlier
       call to socket(2):

           setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_BPF,
                      &prog_fd, sizeof(prog_fd));

       Since Linux 4.1, the following call may be used to attach the
       eBPF program referred to by the file descriptor prog_fd to a
       perf event file descriptor, event_fd, that was created by a
       previous call to perf_event_open(2):

           ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);

EXAMPLES
       /* bpf+sockets example:
        * 1. create array map of 256 elements
        * 2. load program that counts number of packets received
        *    r0 = skb->data[ETH_HLEN + offsetof(struct iphdr, protocol)]
        *    map[r0]++
        * 3. attach prog_fd to raw socket via setsockopt()
        * 4.
        *    print number of received TCP/UDP packets every second
        */
       int
       main(int argc, char **argv)
       {
           int sock, map_fd, prog_fd, key;
           long long value = 0, tcp_cnt, udp_cnt;

           map_fd = bpf_create_map(BPF_MAP_TYPE_ARRAY, sizeof(key),
                                   sizeof(value), 256);
           if (map_fd < 0) {
               printf("failed to create map '%s'\n", strerror(errno));
               /* likely not run as root */
               return 1;
           }

           struct bpf_insn prog[] = {
               BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),        /* r6 = r1 */
               BPF_LD_ABS(BPF_B, ETH_HLEN + offsetof(struct iphdr, protocol)),
                                       /* r0 = ip->proto */
               BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4),
                                       /* *(u32 *)(fp - 4) = r0 */
               BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),       /* r2 = fp */
               BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),      /* r2 = r2 - 4 */
               BPF_LD_MAP_FD(BPF_REG_1, map_fd),           /* r1 = map_fd */
               BPF_CALL_FUNC(BPF_FUNC_map_lookup_elem),
                                       /* r0 = map_lookup(r1, r2) */
               BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
                                       /* if (r0 == 0) goto pc+2 */
               BPF_MOV64_IMM(BPF_REG_1, 1),                /* r1 = 1 */
               BPF_XADD(BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0),
                                       /* lock *(u64 *) r0 += r1 */
               BPF_MOV64_IMM(BPF_REG_0, 0),                /* r0 = 0 */
               BPF_EXIT_INSN(),                            /* return r0 */
           };

           /* bpf_prog_load() as defined above takes an instruction
              count, not a size in bytes */
           prog_fd = bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER, prog,
                                   sizeof(prog) / sizeof(prog[0]), "GPL");

           sock = open_raw_sock("lo");

           assert(setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd,
                             sizeof(prog_fd)) == 0);

           for (;;) {
               key = IPPROTO_TCP;
               assert(bpf_lookup_elem(map_fd, &key, &tcp_cnt) == 0);
               key = IPPROTO_UDP;
               assert(bpf_lookup_elem(map_fd, &key, &udp_cnt) == 0);
               printf("TCP %lld UDP %lld packets\n", tcp_cnt, udp_cnt);
               sleep(1);
           }

           return 0;
       }

       Some complete working code can be found in the samples/bpf
       directory in the kernel source tree.

RETURN VALUE
       For a successful call, the return value depends on the operation:

       BPF_MAP_CREATE
              The new file descriptor associated with the eBPF map.

       BPF_PROG_LOAD
              The new file descriptor associated with the eBPF program.

       All other commands
              Zero.

       On error, -1 is returned, and errno is set appropriately.
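The return conventions above are what make the BPF_MAP_GET_NEXT_KEY
iteration idiom work: the walk ends when the call returns -1 with errno
set to ENOENT. A rough sketch, assuming a map with 4-byte keys in which
the key UINT32_MAX is never present (so that the first call yields the
first key, as described under BPF_MAP_GET_NEXT_KEY). The name
count_map_entries() is made up for illustration, and the bpf() wrapper
is an assumption built on syscall(2):

```c
#include <errno.h>
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumed raw-syscall wrapper; not provided by the draft page. */
static int
bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
    return syscall(__NR_bpf, cmd, attr, size);
}

/* Same shape as the bpf_get_next_key() wrapper shown in the page. */
static int
bpf_get_next_key(int fd, const void *key, void *next_key)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));   /* unused fields must be zero */
    attr.map_fd = fd;
    attr.key = (uint64_t) (uintptr_t) key;
    attr.next_key = (uint64_t) (uintptr_t) next_key;

    return bpf(BPF_MAP_GET_NEXT_KEY, &attr, sizeof(attr));
}

/* Hypothetical helper: count the elements of a map with 4-byte keys.
   Returns the count, or -1 (with errno set) on any error other than
   the ENOENT that marks the end of the walk. */
static long
count_map_entries(int fd)
{
    uint32_t key = UINT32_MAX;   /* assumed absent: start at first key */
    uint32_t next_key;
    long n = 0;

    while (bpf_get_next_key(fd, &key, &next_key) == 0) {
        n++;
        key = next_key;          /* continue from the key just seen */
    }

    return (errno == ENOENT) ? n : -1;
}
```

Calling count_map_entries(map_fd) on the array map from EXAMPLES would
be expected to visit all 256 slots, since array elements always exist;
a caller that wants the values as well could call bpf_lookup_elem()
inside the loop.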
ERRORS
       EPERM  The call was made without sufficient privilege (without
              the CAP_SYS_ADMIN capability).

       ENOMEM Cannot allocate sufficient memory.

       EBADF  fd is not an open file descriptor.

       EFAULT One of the pointers (key or value or log_buf or insns) is
              outside the accessible address space.

       EINVAL The value specified in cmd is not recognized by this
              kernel.

       EINVAL For BPF_MAP_CREATE, either map_type or attributes are
              invalid.

       EINVAL For BPF_MAP_*_ELEM commands, some of the fields of union
              bpf_attr that are not used by this command are not set to
              zero.

       EINVAL For BPF_PROG_LOAD, indicates an attempt to load an invalid
              program. BPF programs can be deemed invalid due to
              unrecognized instructions, the use of reserved fields,
              jumps out of range, infinite loops or calls of unknown
              functions.

       EACCES For BPF_PROG_LOAD, even though all program instructions
              are valid, the program has been rejected because it was
              deemed unsafe. This may be because it may have accessed a
              disallowed memory region or an uninitialized
              stack/register or because the function constraints don't
              match the actual types or because there was a misaligned
              memory access. In this case, it is recommended to call
              bpf() again with log_level = 1 and examine log_buf for the
              specific reason provided by the verifier.

       ENOENT For BPF_MAP_LOOKUP_ELEM or BPF_MAP_DELETE_ELEM, indicates
              that the element with the given key was not found.

       E2BIG  The BPF program is too large or a map reached the
              max_entries limit (maximum number of elements).

VERSIONS
       The bpf() system call first appeared in Linux 3.18.

CONFORMING TO
       The bpf() system call is Linux-specific.

NOTES
       In the current implementation, all bpf() commands require the
       caller to have the CAP_SYS_ADMIN capability.

       eBPF objects (maps and programs) can be shared between processes.
       For example, after fork(2), the child inherits file descriptors
       referring to the same eBPF objects.
       In addition, file descriptors referring to eBPF objects can be
       transferred over UNIX domain sockets. File descriptors referring
       to eBPF objects can be duplicated in the usual way, using dup(2)
       and similar calls. An eBPF object is deallocated only after all
       file descriptors referring to the object have been closed.

       eBPF programs can be written in a restricted C that is compiled
       (using the clang compiler) into eBPF bytecode and executed on the
       in-kernel virtual machine or just-in-time compiled into native
       code. (Various features are omitted from this restricted C, such
       as loops, global variables, variadic functions, floating-point
       numbers, and passing structures as function arguments.) Some
       examples can be found in the samples/bpf/*_kern.c files in the
       kernel source tree.

SEE ALSO
       seccomp(2), socket(7), tc(8), tc-bpf(8)

       Both classic and extended BPF are explained in the kernel source
       file Documentation/networking/filter.txt.

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/