* Re: [RFC bpf-next] bpf: document eBPF helpers and add a script to generate man page
From: Markus Heiser @ 2018-04-09 9:26 UTC (permalink / raw)
To: Daniel Borkmann, Jonathan Corbet
Cc: Quentin Monnet, ast, netdev, oss-drivers, Linux Doc Mailing List,
linux-man
In-Reply-To: <05d2d03a-0b39-9d0c-9ba0-3461afc45fac@iogearbox.net>
On 04/06/2018 01:11 PM, Quentin Monnet wrote:
>> eBPF helper functions can be called from within eBPF programs to perform
>> a variety of tasks that would be otherwise hard or impossible to do with
>> eBPF itself. There is a growing number of such helper functions in the
>> kernel, but documentation is scarce. The main user space header file
>> does contain a short commented description of most helpers, but it is
>> somewhat outdated and not complete. It is more a "cheat sheet" than a
>> real documentation accessible to new eBPF developers.
>>
>> This commit attempts to improve the situation by replacing the existing
>> overview for the helpers with a more developed description. Furthermore,
>> a Python script is added to generate a manual page for eBPF helpers. The
>> workflow is the following, and requires the rst2man utility:
>>
>> $ ./scripts/bpf_helpers_doc.py \
>> --filename include/uapi/linux/bpf.h > /tmp/bpf-helpers.rst
>> $ rst2man /tmp/bpf-helpers.rst > /tmp/bpf-helpers.7
>> $ man /tmp/bpf-helpers.7
>>
>> The objective is to keep all documentation related to the helpers in a
>> single place,
Do we really need another kernel-doc parser?
./scripts/kernel-doc include/uapi/linux/bpf.h
should already do the job (producing .rst). For more infos, take a look at
https://www.kernel.org/doc/html/latest/doc-guide/index.html
-- Markus --
PS: sorry for re-post, first post was HTML which is not accepted by ML :o
^ permalink raw reply
* Re: [RFC bpf-next] bpf: document eBPF helpers and add a script to generate man page
From: Daniel Borkmann @ 2018-04-09 9:25 UTC (permalink / raw)
To: Markus Heiser, Jonathan Corbet
Cc: Quentin Monnet, ast, netdev, oss-drivers, Linux Doc Mailing List,
linux-man
In-Reply-To: <AB89F6E2-DE8A-4259-B361-BE96AD32E84E@darmarit.de>
On 04/09/2018 11:21 AM, Markus Heiser wrote:
[...]
> Do we really need another kernel-doc parser?
>
> ./scripts/kernel-doc include/uapi/linux/bpf.h
>
> should already do the job (producing .rst). For more infos, take a look at
This has absolutely zero to do with kernel-doc, but rather producing
a description of BPF helper function that are later assembled into an
actual man-page that BPF program developers (user space) can use.
^ permalink raw reply
* Re: [RFC bpf-next] bpf: document eBPF helpers and add a script to generate man page
From: Daniel Borkmann @ 2018-04-09 9:01 UTC (permalink / raw)
To: Quentin Monnet, ast; +Cc: netdev, oss-drivers, linux-doc, linux-man
In-Reply-To: <20180406111122.11038-1-quentin.monnet@netronome.com>
On 04/06/2018 01:11 PM, Quentin Monnet wrote:
> eBPF helper functions can be called from within eBPF programs to perform
> a variety of tasks that would be otherwise hard or impossible to do with
> eBPF itself. There is a growing number of such helper functions in the
> kernel, but documentation is scarce. The main user space header file
> does contain a short commented description of most helpers, but it is
> somewhat outdated and not complete. It is more a "cheat sheet" than a
> real documentation accessible to new eBPF developers.
>
> This commit attempts to improve the situation by replacing the existing
> overview for the helpers with a more developed description. Furthermore,
> a Python script is added to generate a manual page for eBPF helpers. The
> workflow is the following, and requires the rst2man utility:
>
> $ ./scripts/bpf_helpers_doc.py \
> --filename include/uapi/linux/bpf.h > /tmp/bpf-helpers.rst
> $ rst2man /tmp/bpf-helpers.rst > /tmp/bpf-helpers.7
> $ man /tmp/bpf-helpers.7
>
> The objective is to keep all documentation related to the helpers in a
> single place, and to be able to generate from here a manual page that
> could be packaged in the man-pages repository and shipped with most
> distributions [1].
>
> Additionally, parsing the prototypes of the helper functions could
> hopefully be reused, with a different Printer object, to generate
> header files needed in some eBPF-related projects.
>
> Regarding the description of each helper, it comprises several items:
>
> - The function prototype.
> - A description of the function and of its arguments (except for a
> couple of cases, when there are no arguments and the return value
> makes the function usage really obvious).
> - A description of return values (if not void).
> - A listing of eBPF program types (if relevant, map types) compatible
> with the helper.
> - Information about the helper being restricted to GPL programs, or not.
> - The kernel version in which the helper was introduced.
> - The commit that introduced the helper (this is mostly to have it in
> the source of the man page, as it can be used to track changes and
> update the page).
>
> For several helpers, descriptions are inspired (at times, nearly copied)
> from the commit logs introducing them in the kernel--Many thanks to
> their respective authors! They were completed as much as possible, the
> objective being to have something easily accessible even for people just
> starting with eBPF. There is probably a bit more work to do in this
> direction for some helpers.
>
> Some RST formatting is used in the descriptions (not in function
> prototypes, to keep them readable, but the Python script provided in
> order to generate the RST for the manual page does add formatting to
> prototypes, to produce something pretty) to get "bold" and "italics" in
> manual pages. Hopefully, the descriptions in bpf.h file remains
> perfectly readable. Note that the few trailing white spaces are
> intentional, removing them would break paragraphs for rst2man.
>
> The descriptions should ideally be updated each time someone adds a new
> helper, or updates the behaviour (compatibility extended to new program
> types, new socket option supported...) or the interface (new flags
> available, ...) of existing ones.
>
> [1] I have not contacted people from the man-pages project prior to
> sending this RFC, so I can offer no guaranty at this time that they
> would accept to take the generated man page.
>
> Cc: linux-doc@vger.kernel.org
> Cc: linux-man@vger.kernel.org
> Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Great work, thanks a lot for doing this!
[...]
> + * int bpf_probe_read(void *dst, u32 size, const void *src)
> + * Description
> + * For tracing programs, safely attempt to read *size* bytes from
> + * address *src* and store the data in *dst*.
> + * Return
> + * 0 on success, or a negative error in case of failure.
> + * For
> + * *Tracing programs*.
> + * GPL only
> + * Yes
> + * Since
> + * Linux 4.1
> + *
> + * .. commit 2541517c32be
> *
> * u64 bpf_ktime_get_ns(void)
> - * Return: current ktime
> - *
> - * int bpf_trace_printk(const char *fmt, int fmt_size, ...)
> - * Return: length of buffer written or negative error
> + * Description
> + * Return the time elapsed since system boot, in nanoseconds.
> + * Return
> + * Current *ktime*.
> + * For
> + * All program types, except
> + * **BPF_PROG_TYPE_CGROUP_DEVICE**.
I think we should probably always just list the actual program types instead,
since when new types get added to the kernel not supporting bpf_ktime_get_ns()
helper in this example, the above statement would be misleading (potentially
more misleading than the other way around when it's not yet mentioned to be
supported).
> + * GPL only
> + * Yes
> + * Since
> + * Linux 4.1
> + *
> + * .. commit d9847d310ab4
> + *
> + * int bpf_trace_printk(const char *fmt, u32 fmt_size, ...)
> + * Description
> + * This helper is a "printk()-like" facility for debugging. It
> + * prints a message defined by format *fmt* (of size *fmt_size*)
> + * to file *\/sys/kernel/debug/tracing/trace* from DebugFS, if
> + * available. It can take up to three additional **u64**
> + * arguments (as an eBPF helpers, the total number of arguments is
> + * limited to five). Each time the helper is called, it appends a
> + * line that looks like the following:
> + *
> + * ::
> + *
> + * telnet-470 [001] .N.. 419421.045894: 0x00000001: BPF command: 2
> + *
> + * In the above:
> + *
> + * * ``telnet`` is the name of the current task.
> + * * ``470`` is the PID of the current task.
> + * * ``001`` is the CPU number on which the task is
> + * running.
> + * * In ``.N..``, each character refers to a set of
> + * options (whether irqs are enabled, scheduling
> + * options, whether hard/softirqs are running, level of
> + * preempt_disabled respectively). **N** means that
> + * **TIF_NEED_RESCHED** and **PREEMPT_NEED_RESCHED**
> + * are set.
> + * * ``419421.045894`` is a timestamp.
> + * * ``0x00000001`` is a fake value used by BPF for the
> + * instruction pointer register.
> + * * ``BPF command: 2`` is the message formatted with
> + * *fmt*.
> + *
> + * The conversion specifiers supported by *fmt* are similar, but
> + * more limited than for printk(). They are **%d**, **%i**,
> + * **%u**, **%x**, **%ld**, **%li**, **%lu**, **%lx**, **%lld**,
> + * **%lli**, **%llu**, **%llx**, **%p**, **%s**. No modifier (size
> + * of field, padding with zeroes, etc.) is available, and the
> + * helper will silently fail if it encounters an unknown
> + * specifier.
> + *
> + * Also, note that **bpf_trace_printk**\ () is slow, and should
> + * only be used for debugging purposes. For passing values to user
> + * space, perf events should be preferred.
> + * Return
> + * The number of bytes written to the buffer, or a negative error
> + * in case of failure.
> + * For
> + * All types of programs.
Ditto, and various other places.
> + * GPL only
> + * Yes
> + * Since
> + * Linux 4.1
> + *
> + * .. commit 9c959c863f82
> *
> * u32 bpf_prandom_u32(void)
> - * Return: random value
> - *
> - * u32 bpf_raw_smp_processor_id(void)
> - * Return: SMP processor ID
> - *
[...]
> + * u32 bpf_get_smp_processor_id(void)
> + * Return
> + * The SMP (Symmetric multiprocessing) processor id.
> + * For
> + * *Networking programs*.
Ditto plus above is actually not correct. See tracing_func_proto():
[...]
case BPF_FUNC_get_smp_processor_id:
return &bpf_get_smp_processor_id_proto;
[...]
> + * GPL only
> + * No
> + * Since
> + * Linux 4.1
> + *
> + * .. commit c04167ce2ca0
> + *
> + * int bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void *from, u32 len, u64 flags)
> + * Description
> + * Store *len* bytes from address *from* into the packet
> + * associated to *skb*, at *offset*. *flags* are a combination of
> + * **BPF_F_RECOMPUTE_CSUM** (automatically recompute the
> + * checksum for the packet after storing the bytes) and
> + * **BPF_F_INVALIDATE_HASH** (set *skb*\ **->hash**, *skb*\
> + * **->swhash** and *skb*\ **->l4hash** to 0).
> + *
> + * A call to this helper is susceptible to change data from the
> + * packet. Therefore, at load time, all checks on pointers
> + * previously done by the verifier are invalidated and must be
> + * performed again.
> + * Return
[...]
Agree with Alexei that 'Description' but also 'Return' is the most useful
(the latter we can still improve a bit wrt errors). 'For' gets indeed quickly
out of hand but I do think that for BPF program developers 'Since' has good
value to quickly check when a helper could be available, but then again
this is actually tied to an individual kconfig and/or program type, and could
potentially require to add or (worst case) remove the info in a second commit
e.g. after the merge window similar with the commit sha. Probably best indeed
to stick to the signature, 'Description' and 'Return' initially.
We should probably also have the bpf_helpers_doc.py workflow in a separate
comment above this one such that developers don't have to look up the original
commit message, but have the required steps to render the rst/man page available
right where they need it.
Thanks,
Daniel
^ permalink raw reply
* [PATCH] slip: Check if rstate is initialized before uncompressing
From: Tejaswi Tanikella @ 2018-04-09 8:53 UTC (permalink / raw)
To: netdev
On receiving a packet the state index points to the rstate which must be
used to fill up IP and TCP headers. But if the state index points to a
rstate which is unitialized, i.e. filled with zeros, it gets stuck in an
infinite loop inside ip_fast_csum trying to compute the ip checsum of a
header with zero length.
89.666953: <2> [<ffffff9dd3e94d38>] slhc_uncompress+0x464/0x468
89.666965: <2> [<ffffff9dd3e87d88>] ppp_receive_nonmp_frame+0x3b4/0x65c
89.666978: <2> [<ffffff9dd3e89dd4>] ppp_receive_frame+0x64/0x7e0
89.666991: <2> [<ffffff9dd3e8a708>] ppp_input+0x104/0x198
89.667005: <2> [<ffffff9dd3e93868>] pppopns_recv_core+0x238/0x370
89.667027: <2> [<ffffff9dd4428fc8>] __sk_receive_skb+0xdc/0x250
89.667040: <2> [<ffffff9dd3e939e4>] pppopns_recv+0x44/0x60
89.667053: <2> [<ffffff9dd4426848>] __sock_queue_rcv_skb+0x16c/0x24c
89.667065: <2> [<ffffff9dd4426954>] sock_queue_rcv_skb+0x2c/0x38
89.667085: <2> [<ffffff9dd44f7358>] raw_rcv+0x124/0x154
89.667098: <2> [<ffffff9dd44f7568>] raw_local_deliver+0x1e0/0x22c
89.667117: <2> [<ffffff9dd44c8ba0>] ip_local_deliver_finish+0x70/0x24c
89.667131: <2> [<ffffff9dd44c92f4>] ip_local_deliver+0x100/0x10c
./scripts/faddr2line vmlinux slhc_uncompress+0x464/0x468 output:
ip_fast_csum at arch/arm64/include/asm/checksum.h:40
(inlined by) slhc_uncompress at drivers/net/slip/slhc.c:615
Adding a variable to indicate if the current rstate is initialized. If
such a packet arrives, move to toss state.
Signed-off-by: Tejaswi Tanikella <tejaswit@codeaurora.org>
---
drivers/net/slip/slhc.c | 5 +++++
include/net/slhc_vj.h | 1 +
2 files changed, 6 insertions(+)
diff --git a/drivers/net/slip/slhc.c b/drivers/net/slip/slhc.c
index 5782733..e710b0b 100644
--- a/drivers/net/slip/slhc.c
+++ b/drivers/net/slip/slhc.c
@@ -509,6 +509,10 @@ struct slcompress *
if(x < 0 || x > comp->rslot_limit)
goto bad;
+ /* Check if the cstate is initialized */
+ if(!comp->rstate[x].initialized)
+ goto bad;
+
comp->flags &=~ SLF_TOSS;
comp->recv_current = x;
} else {
@@ -673,6 +677,7 @@ struct slcompress *
if (cs->cs_tcp.doff > 5)
memcpy(cs->cs_tcpopt, icp + ihl*4 + sizeof(struct tcphdr), (cs->cs_tcp.doff - 5) * 4);
cs->cs_hsize = ihl*2 + cs->cs_tcp.doff*2;
+ cs->initialized = 1;
/* Put headers back on packet
* Neither header checksum is recalculated
*/
diff --git a/include/net/slhc_vj.h b/include/net/slhc_vj.h
index 8716d59..d673365 100644
--- a/include/net/slhc_vj.h
+++ b/include/net/slhc_vj.h
@@ -127,6 +127,7 @@
*/
struct cstate {
byte_t cs_this; /* connection id number (xmit) */
+ byte_t initialized; /* non-zero if initialized */
struct cstate *next; /* next in ring (xmit) */
struct iphdr cs_ip; /* ip/tcp hdr from most recent packet */
struct tcphdr cs_tcp;
--
1.9.1
^ permalink raw reply related
* Re: [PATCH net-next 6/6] netdevsim: Add simple FIB resource controller via devlink
From: Jiri Pirko @ 2018-04-09 8:18 UTC (permalink / raw)
To: David Ahern
Cc: netdev, davem, roopa, shm, jiri, idosch, jakub.kicinski,
andy.roulin
In-Reply-To: <c85f2b41-6100-fe12-b30a-8f12f63e8aab@cumulusnetworks.com>
Fri, Apr 06, 2018 at 11:22:29PM CEST, dsa@cumulusnetworks.com wrote:
>On 4/5/18 11:52 PM, Jiri Pirko wrote:
>> Thu, Apr 05, 2018 at 11:06:41PM CEST, dsa@cumulusnetworks.com wrote:
>>> On 4/5/18 2:10 PM, David Ahern wrote:
>>>>
>>>> The ASIC here is the kernel tables in a namespace. It does not make
>>>> sense to have 2 devlink instances for a single namespace.
>>>
>>> I put this example controller in netdevsim per a suggestion from Ido.
>>> The netdevsim seemed like a good idea given that modules intention --
>>> testing network facilities. Perhaps I should have done this as a
>>> completely standalone module ...
>>>
>>> The intention is to treat the kernel's tables *per namespace* as a
>>> standalone entity that can be managed very similar to ASIC resources.
>>
>> So you say you want to treat a namespace as an ASIC? That sounds very
>> odd to me :/
>
>Why? The kernel has forwarding tables, acl's, etc just like the ASIC,
>and each namespace is a separate set of tables.
I don't get it. What's the point? For HW, the reason is it has limited
resources and those resources are not mapped 1:1 with kernel object.
However, for kernel, that is meaningless.
>
>If you think about it, userspace "programs" the kernel just like mlxsw
>and userspace SDKs "program" an asic.
I don't give a **** about sdks. I have no clue why you mention that here.
>
>
>>> Given that I can add a resource controller module
>>> (drivers/net/kern_res_mgr.c?) that creates a 'struct device' per network
>>> namespace with a devlink instance. In this case the device would very
>>> much be tied to the namespace 1:1.
>>
>> That sounds more reasonable and accurate, yet still odd. You would not
>> have any netdevices there? Any ports?
>>
>
>Sure, what ever ports are assigned to or created in the namespace.
>
>Nothing about the devlink API says it has to be a real h/w device.
Sure, it could represent something made-up, like netdevsim. However I
see a big misfit when you want to represent a namespace.
>Nothing about the devlink API says it can only be used for real h/w that
>has ports represented by netdevices that the devlink instance some how
>has "control" over.
>
>As the netdevsim demo shows, I can build an L3 resource controller for
>the kernel tables using just the devlink API and the in-kernel notifiers.
^ permalink raw reply
* Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available
From: Jiri Pirko @ 2018-04-09 8:07 UTC (permalink / raw)
To: Samudrala, Sridhar
Cc: mst, stephen, davem, netdev, virtualization, virtio-dev,
jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
loseweigh
In-Reply-To: <bb4f0821-246f-1052-de8b-d96be3388ed3@intel.com>
Sat, Apr 07, 2018 at 12:59:14AM CEST, sridhar.samudrala@intel.com wrote:
>On 4/6/2018 5:48 AM, Jiri Pirko wrote:
>> Thu, Apr 05, 2018 at 11:08:22PM CEST, sridhar.samudrala@intel.com wrote:
[...]
>> > +static int virtnet_bypass_join_child(struct net_device *bypass_netdev,
>> > + struct net_device *child_netdev)
>> > +{
>> > + struct virtnet_bypass_info *vbi;
>> > + bool backup;
>> > +
>> > + vbi = netdev_priv(bypass_netdev);
>> > + backup = (child_netdev->dev.parent == bypass_netdev->dev.parent);
>> > + if (backup ? rtnl_dereference(vbi->backup_netdev) :
>> > + rtnl_dereference(vbi->active_netdev)) {
>> > + netdev_info(bypass_netdev,
>> > + "%s attempting to join bypass dev when %s already present\n",
>> > + child_netdev->name, backup ? "backup" : "active");
>> Bypass module should check if there is already some other netdev
>> enslaved and refuse right there.
>
>This will work for virtio-net with 3 netdev model, but this check has to be done by netvsc
>as its bypass_netdev is same as the backup_netdev.
>Will add a flag while registering with the bypass module to indicate if the driver is doing
>a 2 netdev or 3 netdev model and based on that flag this check can be done in bypass module
>for 3 netdev scenario.
Just let me undestand it clearly. What I expect the difference would be
between 2netdev and3 netdev model is this:
2netdev:
bypass_master
/
/
VF_slave
3netdev:
bypass_master
/ \
/ \
VF_slave backup_slave
Is that correct? If not, how does it look like?
Thanks!
^ permalink raw reply
* [PATCH RESEND v2] vhost-net: set packet weight of tx polling to 2 * vq size
From: haibinzhang(张海斌) @ 2018-04-09 7:22 UTC (permalink / raw)
To: Michael S. Tsirkin, Jason Wang, kvm@vger.kernel.org,
virtualization@lists.linux-foundation.org, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org
Cc: lidongchen(陈立东),
yunfangtai(台运方)
handle_tx will delay rx for tens or even hundreds of milliseconds when tx busy
polling udp packets with small length(e.g. 1byte udp payload), because setting
VHOST_NET_WEIGHT takes into account only sent-bytes but no single packet length.
Ping-Latencies shown below were tested between two Virtual Machines using
netperf (UDP_STREAM, len=1), and then another machine pinged the client:
vq size=256
Packet-Weight Ping-Latencies(millisecond)
min avg max
Origin 3.319 18.489 57.303
64 1.643 2.021 2.552
128 1.825 2.600 3.224
256 1.997 2.710 4.295
512 1.860 3.171 4.631
1024 2.002 4.173 9.056
2048 2.257 5.650 9.688
4096 2.093 8.508 15.943
vq size=512
Packet-Weight Ping-Latencies(millisecond)
min avg max
Origin 6.537 29.177 66.245
64 2.798 3.614 4.403
128 2.861 3.820 4.775
256 3.008 4.018 4.807
512 3.254 4.523 5.824
1024 3.079 5.335 7.747
2048 3.944 8.201 12.762
4096 4.158 11.057 19.985
Seems pretty consistent, a small dip at 2 VQ sizes.
Ring size is a hint from device about a burst size it can tolerate. Based on
benchmarks, set the weight to 2 * vq size.
To evaluate this change, another tests were done using netperf(RR, TX) between
two machines with Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz, and vq size was
tweaked through qemu. Results shown below does not show obvious changes.
vq size=256 TCP_RR vq size=512 TCP_RR
size/sessions/+thu%/+normalize% size/sessions/+thu%/+normalize%
1/ 1/ -7%/ -2% 1/ 1/ 0%/ -2%
1/ 4/ +1%/ 0% 1/ 4/ +1%/ 0%
1/ 8/ +1%/ -2% 1/ 8/ 0%/ +1%
64/ 1/ -6%/ 0% 64/ 1/ +7%/ +3%
64/ 4/ 0%/ +2% 64/ 4/ -1%/ +1%
64/ 8/ 0%/ 0% 64/ 8/ -1%/ -2%
256/ 1/ -3%/ -4% 256/ 1/ -4%/ -2%
256/ 4/ +3%/ +4% 256/ 4/ +1%/ +2%
256/ 8/ +2%/ 0% 256/ 8/ +1%/ -1%
vq size=256 UDP_RR vq size=512 UDP_RR
size/sessions/+thu%/+normalize% size/sessions/+thu%/+normalize%
1/ 1/ -5%/ +1% 1/ 1/ -3%/ -2%
1/ 4/ +4%/ +1% 1/ 4/ -2%/ +2%
1/ 8/ -1%/ -1% 1/ 8/ -1%/ 0%
64/ 1/ -2%/ -3% 64/ 1/ +1%/ +1%
64/ 4/ -5%/ -1% 64/ 4/ +2%/ 0%
64/ 8/ 0%/ -1% 64/ 8/ -2%/ +1%
256/ 1/ +7%/ +1% 256/ 1/ -7%/ 0%
256/ 4/ +1%/ +1% 256/ 4/ -3%/ -4%
256/ 8/ +2%/ +2% 256/ 8/ +1%/ +1%
vq size=256 TCP_STREAM vq size=512 TCP_STREAM
size/sessions/+thu%/+normalize% size/sessions/+thu%/+normalize%
64/ 1/ 0%/ -3% 64/ 1/ 0%/ 0%
64/ 4/ +3%/ -1% 64/ 4/ -2%/ +4%
64/ 8/ +9%/ -4% 64/ 8/ -1%/ +2%
256/ 1/ +1%/ -4% 256/ 1/ +1%/ +1%
256/ 4/ -1%/ -1% 256/ 4/ -3%/ 0%
256/ 8/ +7%/ +5% 256/ 8/ -3%/ 0%
512/ 1/ +1%/ 0% 512/ 1/ -1%/ -1%
512/ 4/ +1%/ -1% 512/ 4/ 0%/ 0%
512/ 8/ +7%/ -5% 512/ 8/ +6%/ -1%
1024/ 1/ 0%/ -1% 1024/ 1/ 0%/ +1%
1024/ 4/ +3%/ 0% 1024/ 4/ +1%/ 0%
1024/ 8/ +8%/ +5% 1024/ 8/ -1%/ 0%
2048/ 1/ +2%/ +2% 2048/ 1/ -1%/ 0%
2048/ 4/ +1%/ 0% 2048/ 4/ 0%/ -1%
2048/ 8/ -2%/ 0% 2048/ 8/ 5%/ -1%
4096/ 1/ -2%/ 0% 4096/ 1/ -2%/ 0%
4096/ 4/ +2%/ 0% 4096/ 4/ 0%/ 0%
4096/ 8/ +9%/ -2% 4096/ 8/ -5%/ -1%
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Haibin Zhang <haibinzhang@tencent.com>
Signed-off-by: Yunfang Tai <yunfangtai@tencent.com>
Signed-off-by: Lidong Chen <lidongchen@tencent.com>
---
drivers/vhost/net.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 8139bc70ad7d..3563a305cc0a 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -44,6 +44,10 @@ MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;"
* Using this limit prevents one virtqueue from starving others. */
#define VHOST_NET_WEIGHT 0x80000
+/* Max number of packets transferred before requeueing the job.
+ * Using this limit prevents one virtqueue from starving rx. */
+#define VHOST_NET_PKT_WEIGHT(vq) ((vq)->num * 2)
+
/* MAX number of TX used buffers for outstanding zerocopy */
#define VHOST_MAX_PEND 128
#define VHOST_GOODCOPY_LEN 256
@@ -473,6 +477,7 @@ static void handle_tx(struct vhost_net *net)
struct socket *sock;
struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
bool zcopy, zcopy_used;
+ int sent_pkts = 0;
mutex_lock(&vq->mutex);
sock = vq->private_data;
@@ -580,7 +585,8 @@ static void handle_tx(struct vhost_net *net)
else
vhost_zerocopy_signal_used(net, vq);
vhost_net_tx_packet(net);
- if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
+ if (unlikely(total_len >= VHOST_NET_WEIGHT) ||
+ unlikely(++sent_pkts >= VHOST_NET_PKT_WEIGHT(vq))) {
vhost_poll_queue(&vq->poll);
break;
}
--
2.12.3
^ permalink raw reply related
* Re: [PATCH net-next v2] ipvs: fix multiplicative hashing in sh/dh/lblc/lblcr algorithms
From: Simon Horman @ 2018-04-09 7:16 UTC (permalink / raw)
To: Julian Anastasov
Cc: Vincent Bernat, Wensong Zhang, David S. Miller, netdev, lvs-devel
In-Reply-To: <alpine.LFD.2.20.1804011407500.9304@ja.home.ssi.bg>
On Sun, Apr 01, 2018 at 02:11:51PM +0300, Julian Anastasov wrote:
>
> Hello,
>
> On Sun, 1 Apr 2018, Vincent Bernat wrote:
>
> > The sh/dh/lblc/lblcr algorithms are using Knuth's multiplicative
> > hashing incorrectly. Replace its use by the hash_32() macro, which
> > correctly implements this algorithm. It doesn't use the same constant,
> > but it shouldn't matter.
> >
> > Signed-off-by: Vincent Bernat <vincent@bernat.im>
>
> Looks good to me, thanks! Simon, please apply, if possible
> with the extra space removed, see below...
>
> Acked-by: Julian Anastasov <ja@ssi.bg>
Thanks, I have applied this with the extra space removed.
I will submit it for inclusion in v4.18.
^ permalink raw reply
* [PATCH v2] vhost-net: set packet weight of tx polling to 2 * vq size
From: haibinzhang(张海斌) @ 2018-04-09 7:07 UTC (permalink / raw)
To: Michael S. Tsirkin, Jason Wang, kvm@vger.kernel.org,
virtualization@lists.linux-foundation.org, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org
Cc: lidongchen(陈立东),
yunfangtai(台运方)
handle_tx will delay rx for tens or even hundreds of milliseconds when tx busy
polling udp packets with small length(e.g. 1byte udp payload), because setting
VHOST_NET_WEIGHT takes into account only sent-bytes but no single packet length.
Ping-Latencies shown below were tested between two Virtual Machines using
netperf (UDP_STREAM, len=1), and then another machine pinged the client:
vq size=256
Packet-Weight Ping-Latencies(millisecond)
min avg max
Origin 3.319 18.489 57.303
64 1.643 2.021 2.552
128 1.825 2.600 3.224
256 1.997 2.710 4.295
512 1.860 3.171 4.631
1024 2.002 4.173 9.056
2048 2.257 5.650 9.688
4096 2.093 8.508 15.943
vq size=512
Packet-Weight Ping-Latencies(millisecond)
min avg max
Origin 6.537 29.177 66.245
64 2.798 3.614 4.403
128 2.861 3.820 4.775
256 3.008 4.018 4.807
512 3.254 4.523 5.824
1024 3.079 5.335 7.747
2048 3.944 8.201 12.762
4096 4.158 11.057 19.985
Ring size is a hint from device about a burst size it can tolerate. Based on
benchmarks, set the weight to 2 * vq size.
To evaluate this change, another tests were done using netperf(RR, TX) between
two machines with Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz, and vq size was
tweaked through qemu. Results shown below does not show obvious changes.
vq size=256 TCP_RR vq size=512 TCP_RR
size/sessions/+thu%/+normalize% size/sessions/+thu%/+normalize%
1/ 1/ -7%/ -2% 1/ 1/ 0%/ -2%
1/ 4/ +1%/ 0% 1/ 4/ +1%/ 0%
1/ 8/ +1%/ -2% 1/ 8/ 0%/ +1%
64/ 1/ -6%/ 0% 64/ 1/ +7%/ +3%
64/ 4/ 0%/ +2% 64/ 4/ -1%/ +1%
64/ 8/ 0%/ 0% 64/ 8/ -1%/ -2%
256/ 1/ -3%/ -4% 256/ 1/ -4%/ -2%
256/ 4/ +3%/ +4% 256/ 4/ +1%/ +2%
256/ 8/ +2%/ 0% 256/ 8/ +1%/ -1%
vq size=256 UDP_RR vq size=512 UDP_RR
size/sessions/+thu%/+normalize% size/sessions/+thu%/+normalize%
1/ 1/ -5%/ +1% 1/ 1/ -3%/ -2%
1/ 4/ +4%/ +1% 1/ 4/ -2%/ +2%
1/ 8/ -1%/ -1% 1/ 8/ -1%/ 0%
64/ 1/ -2%/ -3% 64/ 1/ +1%/ +1%
64/ 4/ -5%/ -1% 64/ 4/ +2%/ 0%
64/ 8/ 0%/ -1% 64/ 8/ -2%/ +1%
256/ 1/ +7%/ +1% 256/ 1/ -7%/ 0%
256/ 4/ +1%/ +1% 256/ 4/ -3%/ -4%
256/ 8/ +2%/ +2% 256/ 8/ +1%/ +1%
vq size=256 TCP_STREAM vq size=512 TCP_STREAM
size/sessions/+thu%/+normalize% size/sessions/+thu%/+normalize%
64/ 1/ 0%/ -3% 64/ 1/ 0%/ 0%
64/ 4/ +3%/ -1% 64/ 4/ -2%/ +4%
64/ 8/ +9%/ -4% 64/ 8/ -1%/ +2%
256/ 1/ +1%/ -4% 256/ 1/ +1%/ +1%
256/ 4/ -1%/ -1% 256/ 4/ -3%/ 0%
256/ 8/ +7%/ +5% 256/ 8/ -3%/ 0%
512/ 1/ +1%/ 0% 512/ 1/ -1%/ -1%
512/ 4/ +1%/ -1% 512/ 4/ 0%/ 0%
512/ 8/ +7%/ -5% 512/ 8/ +6%/ -1%
1024/ 1/ 0%/ -1% 1024/ 1/ 0%/ +1%
1024/ 4/ +3%/ 0% 1024/ 4/ +1%/ 0%
1024/ 8/ +8%/ +5% 1024/ 8/ -1%/ 0%
2048/ 1/ +2%/ +2% 2048/ 1/ -1%/ 0%
2048/ 4/ +1%/ 0% 2048/ 4/ 0%/ -1%
2048/ 8/ -2%/ 0% 2048/ 8/ 5%/ -1%
4096/ 1/ -2%/ 0% 4096/ 1/ -2%/ 0%
4096/ 4/ +2%/ 0% 4096/ 4/ 0%/ 0%
4096/ 8/ +9%/ -2% 4096/ 8/ -5%/ -1%
Signed-off-by: Haibin Zhang <haibinzhang@tencent.com>
Signed-off-by: Yunfang Tai <yunfangtai@tencent.com>
Signed-off-by: Lidong Chen <lidongchen@tencent.com>
---
drivers/vhost/net.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 8139bc70ad7d..3563a305cc0a 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -44,6 +44,10 @@ MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;"
* Using this limit prevents one virtqueue from starving others. */
#define VHOST_NET_WEIGHT 0x80000
+/* Max number of packets transferred before requeueing the job.
+ * Using this limit prevents one virtqueue from starving rx. */
+#define VHOST_NET_PKT_WEIGHT(vq) ((vq)->num * 2)
+
/* MAX number of TX used buffers for outstanding zerocopy */
#define VHOST_MAX_PEND 128
#define VHOST_GOODCOPY_LEN 256
@@ -473,6 +477,7 @@ static void handle_tx(struct vhost_net *net)
struct socket *sock;
struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
bool zcopy, zcopy_used;
+ int sent_pkts = 0;
mutex_lock(&vq->mutex);
sock = vq->private_data;
@@ -580,7 +585,8 @@ static void handle_tx(struct vhost_net *net)
else
vhost_zerocopy_signal_used(net, vq);
vhost_net_tx_packet(net);
- if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
+ if (unlikely(total_len >= VHOST_NET_WEIGHT) ||
+ unlikely(++sent_pkts >= VHOST_NET_PKT_WEIGHT(vq))) {
vhost_poll_queue(&vq->poll);
break;
}
--
2.12.3
^ permalink raw reply related
* [PATCH net] net: dsa: mv88e6xxx: Fix receive time stamp race condition.
From: Richard Cochran @ 2018-04-09 7:03 UTC (permalink / raw)
To: netdev
Cc: linux-kernel, Andrew Lunn, David Miller, Florian Fainelli,
Vivien Didelot
The DSA stack passes received PTP frames to this driver via
mv88e6xxx_port_rxtstamp() for deferred delivery. The driver then
queues the frame and kicks the worker thread. The work callback reads
out the latched receive time stamp and then works through the queue,
delivering any non-matching frames without a time stamp.
If a new frame arrives after the worker thread has read out the time
stamp register but enters the queue before the worker finishes
processing the queue, that frame will be delivered without a time
stamp.
This patch fixes the race by moving the queue onto a list on the stack
before reading out the latched time stamp value.
Fixes: c6fe0ad2c3499 ("net: dsa: mv88e6xxx: add rx/tx timestamping support")
Signed-off-by: Richard Cochran <richardcochran@gmail.com>
---
drivers/net/dsa/mv88e6xxx/hwtstamp.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/drivers/net/dsa/mv88e6xxx/hwtstamp.c b/drivers/net/dsa/mv88e6xxx/hwtstamp.c
index ac7694c71266..a036c490b7ce 100644
--- a/drivers/net/dsa/mv88e6xxx/hwtstamp.c
+++ b/drivers/net/dsa/mv88e6xxx/hwtstamp.c
@@ -285,10 +285,18 @@ static void mv88e6xxx_get_rxts(struct mv88e6xxx_chip *chip,
struct sk_buff_head *rxq)
{
u16 buf[4] = { 0 }, status, seq_id;
- u64 ns, timelo, timehi;
struct skb_shared_hwtstamps *shwt;
+ struct sk_buff_head received;
+ u64 ns, timelo, timehi;
+ unsigned long flags;
int err;
+ /* The latched timestamp belongs to one of the received frames. */
+ __skb_queue_head_init(&received);
+ spin_lock_irqsave(&rxq->lock, flags);
+ skb_queue_splice_tail_init(rxq, &received);
+ spin_unlock_irqrestore(&rxq->lock, flags);
+
mutex_lock(&chip->reg_lock);
err = mv88e6xxx_port_ptp_read(chip, ps->port_id,
reg, buf, ARRAY_SIZE(buf));
@@ -311,7 +319,7 @@ static void mv88e6xxx_get_rxts(struct mv88e6xxx_chip *chip,
/* Since the device can only handle one time stamp at a time,
* we purge any extra frames from the queue.
*/
- for ( ; skb; skb = skb_dequeue(rxq)) {
+ for ( ; skb; skb = __skb_dequeue(&received)) {
if (mv88e6xxx_ts_valid(status) && seq_match(skb, seq_id)) {
ns = timehi << 16 | timelo;
--
2.11.0
^ permalink raw reply related
* Dear Talented
From: Lisa Clement @ 2018-04-09 6:45 UTC (permalink / raw)
To: Recipients
Dear Talented,
I am Talent Scout For BLUE SKY FILM STUDIO, Present Blue sky Studio a
Film Corporation Located in the United State, is Soliciting for the
Right to use Your Photo/Face and Personality as One of the Semi -Major
Role/ Character in our Upcoming ANIMATED Stereoscope 3D Movie-The Story
of Spies in Disguise (Spies in Disguise 2019) The Movie is Currently Filming (In
Production) Please Note That There Will Be No Auditions, Traveling or
Any Special / Professional Acting Skills, Since the Production of This
Movie Will Be Done with our State of Art Computer -Generating Imagery
Equipment. We Are Prepared to Pay the Total Sum of $620,000.00 USD. For
More Information/Understanding, Please Write us on the E-Mail Below.
CONTACT EMAIL: blueskystudios7@gmail.com
All Reply to: blueskystudios7@gmail.com
Note: Only the Response send to this mail will be Given a Prior
Consideration.
Talent Scout
Lisa Clement
^ permalink raw reply
* [PATCH net-next 3/3] net: ethernet: ave: add support for phy-mode setting of system controller
From: Kunihiko Hayashi @ 2018-04-09 6:38 UTC (permalink / raw)
To: David Miller, netdev
Cc: Andrew Lunn, Florian Fainelli, Rob Herring, Mark Rutland,
linux-arm-kernel, linux-kernel, devicetree, Masahiro Yamada,
Masami Hiramatsu, Jassi Brar, Kunihiko Hayashi
In-Reply-To: <1523255925-6469-1-git-send-email-hayashi.kunihiko@socionext.com>
This patch adds support for specifying system controller that configures
phy-mode setting.
According to the DT property "phy-mode", it's necessary to configure the
controller, which is used to choose the settings of the MAC suitable,
for example, mdio pin connections, internal clocks, and so on.
Supported phy-modes are SoC-dependent. The driver allows phy-mode to set
"internal" if the SoC has a built-in PHY, and {"mii", "rmii", "rgmii"}
if the SoC supports each mode. So we have to check whether the phy-mode
is valid or not.
This adds the following features for each SoC:
- check whether the SoC supports the specified phy-mode
- configure the controller accroding to phy-mode
The DT property accepts one argument to distinguish them for multiple MAC
instances.
ethernet@65000000 {
...
socionext,syscon-phy-mode = <&soc_glue 0>;
};
ethernet@65200000 {
...
socionext,syscon-phy-mode = <&soc_glue 1>;
};
Signed-off-by: Kunihiko Hayashi <hayashi.kunihiko@socionext.com>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
---
drivers/net/ethernet/socionext/Kconfig | 2 +
drivers/net/ethernet/socionext/sni_ave.c | 150 ++++++++++++++++++++++++++++---
2 files changed, 140 insertions(+), 12 deletions(-)
diff --git a/drivers/net/ethernet/socionext/Kconfig b/drivers/net/ethernet/socionext/Kconfig
index 6bcfe27..b80048c 100644
--- a/drivers/net/ethernet/socionext/Kconfig
+++ b/drivers/net/ethernet/socionext/Kconfig
@@ -14,6 +14,8 @@ if NET_VENDOR_SOCIONEXT
config SNI_AVE
tristate "Socionext AVE ethernet support"
depends on (ARCH_UNIPHIER || COMPILE_TEST) && OF
+ depends on HAS_IOMEM
+ select MFD_SYSCON
select PHYLIB
---help---
Driver for gigabit ethernet MACs, called AVE, in the
diff --git a/drivers/net/ethernet/socionext/sni_ave.c b/drivers/net/ethernet/socionext/sni_ave.c
index 52940bd..f7eccee 100644
--- a/drivers/net/ethernet/socionext/sni_ave.c
+++ b/drivers/net/ethernet/socionext/sni_ave.c
@@ -11,6 +11,7 @@
#include <linux/interrupt.h>
#include <linux/io.h>
#include <linux/iopoll.h>
+#include <linux/mfd/syscon.h>
#include <linux/mii.h>
#include <linux/module.h>
#include <linux/netdevice.h>
@@ -18,6 +19,7 @@
#include <linux/of_mdio.h>
#include <linux/of_platform.h>
#include <linux/phy.h>
+#include <linux/regmap.h>
#include <linux/reset.h>
#include <linux/types.h>
#include <linux/u64_stats_sync.h>
@@ -197,6 +199,11 @@
#define AVE_INTM_COUNT 20
#define AVE_FORCE_TXINTCNT 1
+/* SG */
+#define SG_ETPINMODE 0x540
+#define SG_ETPINMODE_EXTPHY BIT(1) /* for LD11 */
+#define SG_ETPINMODE_RMII(ins) BIT(ins)
+
#define IS_DESC_64BIT(p) ((p)->data->is_desc_64bit)
#define AVE_MAX_CLKS 4
@@ -228,12 +235,6 @@ struct ave_desc_info {
struct ave_desc *desc; /* skb info related descriptor */
};
-struct ave_soc_data {
- bool is_desc_64bit;
- const char *clock_names[AVE_MAX_CLKS];
- const char *reset_names[AVE_MAX_RSTS];
-};
-
struct ave_stats {
struct u64_stats_sync syncp;
u64 packets;
@@ -257,6 +258,9 @@ struct ave_private {
phy_interface_t phy_mode;
struct phy_device *phydev;
struct mii_bus *mdio;
+ struct regmap *regmap;
+ unsigned int pinmode_mask;
+ unsigned int pinmode_val;
/* stats */
struct ave_stats stats_rx;
@@ -279,6 +283,14 @@ struct ave_private {
const struct ave_soc_data *data;
};
+struct ave_soc_data {
+ bool is_desc_64bit;
+ const char *clock_names[AVE_MAX_CLKS];
+ const char *reset_names[AVE_MAX_RSTS];
+ int (*get_pinmode)(struct ave_private *priv,
+ phy_interface_t phy_mode, u32 arg);
+};
+
static u32 ave_desc_read(struct net_device *ndev, enum desc_id id, int entry,
int offset)
{
@@ -1179,6 +1191,11 @@ static int ave_init(struct net_device *ndev)
}
}
+ ret = regmap_update_bits(priv->regmap, SG_ETPINMODE,
+ priv->pinmode_mask, priv->pinmode_val);
+ if (ret)
+ return ret;
+
ave_global_reset(ndev);
mdio_np = of_get_child_by_name(np, "mdio");
@@ -1537,6 +1554,7 @@ static int ave_probe(struct platform_device *pdev)
const struct ave_soc_data *data;
struct device *dev = &pdev->dev;
char buf[ETHTOOL_FWVERS_LEN];
+ struct of_phandle_args args;
phy_interface_t phy_mode;
struct ave_private *priv;
struct net_device *ndev;
@@ -1559,12 +1577,6 @@ static int ave_probe(struct platform_device *pdev)
dev_err(dev, "phy-mode not found\n");
return -EINVAL;
}
- if ((!phy_interface_mode_is_rgmii(phy_mode)) &&
- phy_mode != PHY_INTERFACE_MODE_RMII &&
- phy_mode != PHY_INTERFACE_MODE_MII) {
- dev_err(dev, "phy-mode is invalid\n");
- return -EINVAL;
- }
irq = platform_get_irq(pdev, 0);
if (irq < 0) {
@@ -1656,6 +1668,26 @@ static int ave_probe(struct platform_device *pdev)
priv->nrsts++;
}
+ ret = of_parse_phandle_with_fixed_args(np,
+ "socionext,syscon-phy-mode",
+ 1, 0, &args);
+ if (ret) {
+ netdev_err(ndev, "can't get syscon-phy-mode property\n");
+ goto out_free_netdev;
+ }
+ priv->regmap = syscon_node_to_regmap(args.np);
+ of_node_put(args.np);
+ if (IS_ERR(priv->regmap)) {
+ netdev_err(ndev, "can't map syscon-phy-mode\n");
+ ret = PTR_ERR(priv->regmap);
+ goto out_free_netdev;
+ }
+ ret = priv->data->get_pinmode(priv, phy_mode, args.args[0]);
+ if (ret) {
+ netdev_err(ndev, "invalid phy-mode setting\n");
+ goto out_free_netdev;
+ }
+
priv->mdio = devm_mdiobus_alloc(dev);
if (!priv->mdio) {
ret = -ENOMEM;
@@ -1715,6 +1747,95 @@ static int ave_remove(struct platform_device *pdev)
return 0;
}
+static int ave_pro4_get_pinmode(struct ave_private *priv,
+ phy_interface_t phy_mode, u32 arg)
+{
+ if (arg > 0)
+ return -EINVAL;
+
+ priv->pinmode_mask = SG_ETPINMODE_RMII(0);
+
+ switch (phy_mode) {
+ case PHY_INTERFACE_MODE_RMII:
+ priv->pinmode_val = SG_ETPINMODE_RMII(0);
+ break;
+ case PHY_INTERFACE_MODE_MII:
+ case PHY_INTERFACE_MODE_RGMII:
+ priv->pinmode_val = 0;
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static int ave_ld11_get_pinmode(struct ave_private *priv,
+ phy_interface_t phy_mode, u32 arg)
+{
+ if (arg > 0)
+ return -EINVAL;
+
+ priv->pinmode_mask = SG_ETPINMODE_EXTPHY | SG_ETPINMODE_RMII(0);
+
+ switch (phy_mode) {
+ case PHY_INTERFACE_MODE_INTERNAL:
+ priv->pinmode_val = 0;
+ break;
+ case PHY_INTERFACE_MODE_RMII:
+ priv->pinmode_val = SG_ETPINMODE_EXTPHY | SG_ETPINMODE_RMII(0);
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static int ave_ld20_get_pinmode(struct ave_private *priv,
+ phy_interface_t phy_mode, u32 arg)
+{
+ if (arg > 0)
+ return -EINVAL;
+
+ priv->pinmode_mask = SG_ETPINMODE_RMII(0);
+
+ switch (phy_mode) {
+ case PHY_INTERFACE_MODE_RMII:
+ priv->pinmode_val = SG_ETPINMODE_RMII(0);
+ break;
+ case PHY_INTERFACE_MODE_RGMII:
+ priv->pinmode_val = 0;
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static int ave_pxs3_get_pinmode(struct ave_private *priv,
+ phy_interface_t phy_mode, u32 arg)
+{
+ if (arg > 1)
+ return -EINVAL;
+
+ priv->pinmode_mask = SG_ETPINMODE_RMII(arg);
+
+ switch (phy_mode) {
+ case PHY_INTERFACE_MODE_RMII:
+ priv->pinmode_val = SG_ETPINMODE_RMII(arg);
+ break;
+ case PHY_INTERFACE_MODE_RGMII:
+ priv->pinmode_val = 0;
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
static const struct ave_soc_data ave_pro4_data = {
.is_desc_64bit = false,
.clock_names = {
@@ -1723,6 +1844,7 @@ static const struct ave_soc_data ave_pro4_data = {
.reset_names = {
"gio", "ether",
},
+ .get_pinmode = ave_pro4_get_pinmode,
};
static const struct ave_soc_data ave_pxs2_data = {
@@ -1733,6 +1855,7 @@ static const struct ave_soc_data ave_pxs2_data = {
.reset_names = {
"ether",
},
+ .get_pinmode = ave_pro4_get_pinmode,
};
static const struct ave_soc_data ave_ld11_data = {
@@ -1743,6 +1866,7 @@ static const struct ave_soc_data ave_ld11_data = {
.reset_names = {
"ether",
},
+ .get_pinmode = ave_ld11_get_pinmode,
};
static const struct ave_soc_data ave_ld20_data = {
@@ -1753,6 +1877,7 @@ static const struct ave_soc_data ave_ld20_data = {
.reset_names = {
"ether",
},
+ .get_pinmode = ave_ld20_get_pinmode,
};
static const struct ave_soc_data ave_pxs3_data = {
@@ -1763,6 +1888,7 @@ static const struct ave_soc_data ave_pxs3_data = {
.reset_names = {
"ether",
},
+ .get_pinmode = ave_pxs3_get_pinmode,
};
static const struct of_device_id of_ave_match[] = {
--
2.7.4
^ permalink raw reply related
* [PATCH net-next 2/3] dt-bindings: net: ave: add syscon-phy-mode property to configure phy-mode setting
From: Kunihiko Hayashi @ 2018-04-09 6:38 UTC (permalink / raw)
To: David Miller, netdev
Cc: Mark Rutland, Andrew Lunn, Florian Fainelli, Kunihiko Hayashi,
Masami Hiramatsu, devicetree, linux-kernel, Masahiro Yamada,
Rob Herring, Jassi Brar, linux-arm-kernel
In-Reply-To: <1523255925-6469-1-git-send-email-hayashi.kunihiko@socionext.com>
Add "socionext,syscon-phy-mode" property to specify system controller that
configures the settings about phy-mode.
Signed-off-by: Kunihiko Hayashi <hayashi.kunihiko@socionext.com>
---
Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt b/Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt
index 85e0c49..fc8f017 100644
--- a/Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt
+++ b/Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt
@@ -13,7 +13,8 @@ Required properties:
- reg: Address where registers are mapped and size of region.
- interrupts: Should contain the MAC interrupt.
- phy-mode: See ethernet.txt in the same directory. Allow to choose
- "rgmii", "rmii", or "mii" according to the PHY.
+ "rgmii", "rmii", "mii", or "internal" according to the PHY.
+ The acceptable mode is SoC-dependent.
- phy-handle: Should point to the external phy device.
See ethernet.txt file in the same directory.
- clocks: A phandle to the clock for the MAC.
@@ -27,6 +28,8 @@ Required properties:
- reset-names: Should contain
- "ether", "gio" for Pro4 SoC
- "ether" for others
+ - socionext,syscon-phy-mode: A phandle to syscon with one argument
+ that configures phy mode. The argument is the ID of MAC instance.
Optional properties:
- local-mac-address: See ethernet.txt in the same directory.
@@ -47,6 +50,7 @@ Example:
clocks = <&sys_clk 6>;
reset-names = "ether";
resets = <&sys_rst 6>;
+ socionext,syscon-phy-mode = <&soc_glue 0>;
local-mac-address = [00 00 00 00 00 00];
mdio {
--
2.7.4
^ permalink raw reply related
* [PATCH net-next 1/3] net: ethernet: ave: add multiple clocks and resets support as required property
From: Kunihiko Hayashi @ 2018-04-09 6:38 UTC (permalink / raw)
To: David Miller, netdev
Cc: Andrew Lunn, Florian Fainelli, Rob Herring, Mark Rutland,
linux-arm-kernel, linux-kernel, devicetree, Masahiro Yamada,
Masami Hiramatsu, Jassi Brar, Kunihiko Hayashi
In-Reply-To: <1523255925-6469-1-git-send-email-hayashi.kunihiko@socionext.com>
When the link is becoming up for Pro4 SoC, the kernel is stalled
due to some missing clocks and resets.
The AVE block for Pro4 is connected to the GIO bus in the SoC.
Without its clock/reset, the access to the AVE register makes the
system stall.
In the same way, another MAC clock for Giga-bit Connection and
the PHY clock are also required for Pro4 to activate the Giga-bit feature
and to recognize the PHY.
To satisfy these requirements, this patch adds support for multiple clocks
and resets, and adds the clock-names and reset-names to the binding because
we need to distinguish clock/reset for the AVE main block and the others.
Also, make the resets a required property. Currently, "reset is
optional" relies on that the bootloader or firmware has deasserted
the reset before booting the kernel. Drivers should work without
such expectation.
Fixes: 4c270b55a5af ("net: ethernet: socionext: add AVE ethernet driver")
Suggested-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Kunihiko Hayashi <hayashi.kunihiko@socionext.com>
---
.../bindings/net/socionext,uniphier-ave4.txt | 13 ++-
drivers/net/ethernet/socionext/sni_ave.c | 108 ++++++++++++++++-----
2 files changed, 96 insertions(+), 25 deletions(-)
diff --git a/Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt b/Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt
index 96398cc..85e0c49 100644
--- a/Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt
+++ b/Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt
@@ -17,9 +17,18 @@ Required properties:
- phy-handle: Should point to the external phy device.
See ethernet.txt file in the same directory.
- clocks: A phandle to the clock for the MAC.
+ For Pro4 SoC, that is "socionext,uniphier-pro4-ave4",
+ another MAC clock, GIO bus clock and PHY clock are also required.
+ - clock-names: Should contain
+ - "ether", "ether-gb", "gio", "ether-phy" for Pro4 SoC
+ - "ether" for others
+ - resets: A phandle to the reset control for the MAC. For Pro4 SoC,
+ GIO bus reset is also required.
+ - reset-names: Should contain
+ - "ether", "gio" for Pro4 SoC
+ - "ether" for others
Optional properties:
- - resets: A phandle to the reset control for the MAC.
- local-mac-address: See ethernet.txt in the same directory.
Required subnode:
@@ -34,7 +43,9 @@ Example:
interrupts = <0 66 4>;
phy-mode = "rgmii";
phy-handle = <ðphy>;
+ clock-names = "ether";
clocks = <&sys_clk 6>;
+ reset-names = "ether";
resets = <&sys_rst 6>;
local-mac-address = [00 00 00 00 00 00];
diff --git a/drivers/net/ethernet/socionext/sni_ave.c b/drivers/net/ethernet/socionext/sni_ave.c
index 0b3b7a4..52940bd 100644
--- a/drivers/net/ethernet/socionext/sni_ave.c
+++ b/drivers/net/ethernet/socionext/sni_ave.c
@@ -199,6 +199,9 @@
#define IS_DESC_64BIT(p) ((p)->data->is_desc_64bit)
+#define AVE_MAX_CLKS 4
+#define AVE_MAX_RSTS 2
+
enum desc_id {
AVE_DESCID_RX,
AVE_DESCID_TX,
@@ -227,6 +230,8 @@ struct ave_desc_info {
struct ave_soc_data {
bool is_desc_64bit;
+ const char *clock_names[AVE_MAX_CLKS];
+ const char *reset_names[AVE_MAX_RSTS];
};
struct ave_stats {
@@ -245,8 +250,10 @@ struct ave_private {
int phy_id;
unsigned int desc_size;
u32 msg_enable;
- struct clk *clk;
- struct reset_control *rst;
+ int nclks;
+ struct clk *clk[AVE_MAX_CLKS];
+ int nrsts;
+ struct reset_control *rst[AVE_MAX_RSTS];
phy_interface_t phy_mode;
struct phy_device *phydev;
struct mii_bus *mdio;
@@ -1153,18 +1160,23 @@ static int ave_init(struct net_device *ndev)
struct device_node *np = dev->of_node;
struct device_node *mdio_np;
struct phy_device *phydev;
- int ret;
+ int nc, nr, ret;
/* enable clk because of hw access until ndo_open */
- ret = clk_prepare_enable(priv->clk);
- if (ret) {
- dev_err(dev, "can't enable clock\n");
- return ret;
+ for (nc = 0; nc < priv->nclks; nc++) {
+ ret = clk_prepare_enable(priv->clk[nc]);
+ if (ret) {
+ dev_err(dev, "can't enable clock\n");
+ goto out_clk_disable;
+ }
}
- ret = reset_control_deassert(priv->rst);
- if (ret) {
- dev_err(dev, "can't deassert reset\n");
- goto out_clk_disable;
+
+ for (nr = 0; nr < priv->nrsts; nr++) {
+ ret = reset_control_deassert(priv->rst[nr]);
+ if (ret) {
+ dev_err(dev, "can't deassert reset\n");
+ goto out_reset_assert;
+ }
}
ave_global_reset(ndev);
@@ -1207,9 +1219,11 @@ static int ave_init(struct net_device *ndev)
out_mdio_unregister:
mdiobus_unregister(priv->mdio);
out_reset_assert:
- reset_control_assert(priv->rst);
+ while (--nr >= 0)
+ reset_control_assert(priv->rst[nr]);
out_clk_disable:
- clk_disable_unprepare(priv->clk);
+ while (--nc >= 0)
+ clk_disable_unprepare(priv->clk[nc]);
return ret;
}
@@ -1217,13 +1231,16 @@ static int ave_init(struct net_device *ndev)
static void ave_uninit(struct net_device *ndev)
{
struct ave_private *priv = netdev_priv(ndev);
+ int i;
phy_disconnect(priv->phydev);
mdiobus_unregister(priv->mdio);
/* disable clk because of hw access after ndo_stop */
- reset_control_assert(priv->rst);
- clk_disable_unprepare(priv->clk);
+ for (i = 0; i < priv->nrsts; i++)
+ reset_control_assert(priv->rst[i]);
+ for (i = 0; i < priv->nclks; i++)
+ clk_disable_unprepare(priv->clk[i]);
}
static int ave_open(struct net_device *ndev)
@@ -1527,8 +1544,9 @@ static int ave_probe(struct platform_device *pdev)
struct resource *res;
const void *mac_addr;
void __iomem *base;
+ const char *name;
+ int i, irq, ret;
u64 dma_mask;
- int irq, ret;
u32 ave_id;
data = of_device_get_match_data(dev);
@@ -1614,16 +1632,28 @@ static int ave_probe(struct platform_device *pdev)
u64_stats_init(&priv->stats_tx.syncp);
u64_stats_init(&priv->stats_rx.syncp);
- priv->clk = devm_clk_get(dev, NULL);
- if (IS_ERR(priv->clk)) {
- ret = PTR_ERR(priv->clk);
- goto out_free_netdev;
+ for (i = 0; i < AVE_MAX_CLKS; i++) {
+ name = priv->data->clock_names[i];
+ if (!name)
+ break;
+ priv->clk[i] = devm_clk_get(dev, name);
+ if (IS_ERR(priv->clk[i])) {
+ ret = PTR_ERR(priv->clk[i]);
+ goto out_free_netdev;
+ }
+ priv->nclks++;
}
- priv->rst = devm_reset_control_get_optional_shared(dev, NULL);
- if (IS_ERR(priv->rst)) {
- ret = PTR_ERR(priv->rst);
- goto out_free_netdev;
+ for (i = 0; i < AVE_MAX_RSTS; i++) {
+ name = priv->data->reset_names[i];
+ if (!name)
+ break;
+ priv->rst[i] = devm_reset_control_get_shared(dev, name);
+ if (IS_ERR(priv->rst[i])) {
+ ret = PTR_ERR(priv->rst[i]);
+ goto out_free_netdev;
+ }
+ priv->nrsts++;
}
priv->mdio = devm_mdiobus_alloc(dev);
@@ -1687,22 +1717,52 @@ static int ave_remove(struct platform_device *pdev)
static const struct ave_soc_data ave_pro4_data = {
.is_desc_64bit = false,
+ .clock_names = {
+ "gio", "ether", "ether-gb", "ether-phy",
+ },
+ .reset_names = {
+ "gio", "ether",
+ },
};
static const struct ave_soc_data ave_pxs2_data = {
.is_desc_64bit = false,
+ .clock_names = {
+ "ether",
+ },
+ .reset_names = {
+ "ether",
+ },
};
static const struct ave_soc_data ave_ld11_data = {
.is_desc_64bit = false,
+ .clock_names = {
+ "ether",
+ },
+ .reset_names = {
+ "ether",
+ },
};
static const struct ave_soc_data ave_ld20_data = {
.is_desc_64bit = true,
+ .clock_names = {
+ "ether",
+ },
+ .reset_names = {
+ "ether",
+ },
};
static const struct ave_soc_data ave_pxs3_data = {
.is_desc_64bit = false,
+ .clock_names = {
+ "ether",
+ },
+ .reset_names = {
+ "ether",
+ },
};
static const struct of_device_id of_ave_match[] = {
--
2.7.4
^ permalink raw reply related
* [PATCH net-next 0/3] ave: fix the activation issues for some UniPhier SoCs
From: Kunihiko Hayashi @ 2018-04-09 6:38 UTC (permalink / raw)
To: David Miller, netdev
Cc: Andrew Lunn, Florian Fainelli, Rob Herring, Mark Rutland,
linux-arm-kernel, linux-kernel, devicetree, Masahiro Yamada,
Masami Hiramatsu, Jassi Brar, Kunihiko Hayashi
This add the following stuffs to fix the activation issues and satisfy
requirements for AVE ethernet driver implemented on some UniPhier SoCs.
- Add support for additional necessary clocks and resets, because the kernel
is stalled on Pro4 due to lack of them.
- Check whether the SoC supports the specified phy-mode
- Add DT property support indicating system controller that has the feature
for configurating phy-mode including built-in phy on LD11.
Kunihiko Hayashi (3):
net: ethernet: ave: add multiple clocks and resets support as required
property
dt-bindings: net: ave: add syscon-phy-mode property to configure
phy-mode setting
net: ethernet: ave: add support for phy-mode setting of system
controller
.../bindings/net/socionext,uniphier-ave4.txt | 19 +-
drivers/net/ethernet/socionext/Kconfig | 2 +
drivers/net/ethernet/socionext/sni_ave.c | 252 ++++++++++++++++++---
3 files changed, 238 insertions(+), 35 deletions(-)
--
2.7.4
^ permalink raw reply
* Re: WARNING in ip_rt_bug
From: Dmitry Vyukov @ 2018-04-09 6:06 UTC (permalink / raw)
To: syzbot
Cc: David Miller, Alexey Kuznetsov, LKML, netdev, syzkaller-bugs,
Hideaki YOSHIFUJI, Eric Dumazet
In-Reply-To: <001a11408d9c43ff4a0569641a3f@google.com>
On Mon, Apr 9, 2018 at 7:59 AM, syzbot
<syzbot+b09ac67a2af842b12eab@syzkaller.appspotmail.com> wrote:
> Hello,
>
> syzbot hit the following crash on net-next commit
> 8bde261e535257e81087d39ff808414e2f5aa39d (Sun Apr 1 02:31:43 2018 +0000)
> Merge tag 'mlx5-updates-2018-03-30' of
> git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
> syzbot dashboard link:
> https://syzkaller.appspot.com/bug?extid=b09ac67a2af842b12eab
>
> Unfortunately, I don't have any reproducer for this crash yet.
> Raw console output:
> https://syzkaller.appspot.com/x/log.txt?id=5991727739437056
> Kernel config:
> https://syzkaller.appspot.com/x/.config?id=3327544840960562528
> compiler: gcc (GCC) 7.1.1 20170620
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+b09ac67a2af842b12eab@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed. See footer for
> details.
> If you forward the report, please keep this part and the footer.
+Eric said that perhaps we just need to revert:
commit c378a9c019cf5e017d1ed24954b54fae7bebd2bc
Date: Sat May 21 07:16:42 2011 +0000
ipv4: Give backtrace in ip_rt_bug().
> netlink: 'syz-executor6': attribute type 3 has an invalid length.
> WARNING: CPU: 0 PID: 11678 at net/ipv4/route.c:1213 ip_rt_bug+0x15/0x20
> net/ipv4/route.c:1212
> Kernel panic - not syncing: panic_on_warn set ...
>
> CPU: 0 PID: 11678 Comm: kworker/u4:7 Not tainted 4.16.0-rc6+ #289
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
> <IRQ>
> __dump_stack lib/dump_stack.c:17 [inline]
> dump_stack+0x194/0x24d lib/dump_stack.c:53
> panic+0x1e4/0x41c kernel/panic.c:183
> __warn+0x1dc/0x200 kernel/panic.c:547
> report_bug+0x1f4/0x2b0 lib/bug.c:186
> fixup_bug.part.10+0x37/0x80 arch/x86/kernel/traps.c:178
> fixup_bug arch/x86/kernel/traps.c:247 [inline]
> do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:296
> do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
> invalid_op+0x1b/0x40 arch/x86/entry/entry_64.S:986
> RIP: 0010:ip_rt_bug+0x15/0x20 net/ipv4/route.c:1212
> RSP: 0018:ffff8801db007290 EFLAGS: 00010282
> RAX: dffffc0000000000 RBX: ffff8801d8dda3c0 RCX: ffffffff856c31ca
> RDX: 0000000000000100 RSI: ffffffff8858c300 RDI: 0000000000000282
> RBP: ffff8801db007298 R08: 1ffff1003b600de1 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801d8dda3c0
> R13: ffff88019bdb2200 R14: ffff88019bdeed80 R15: ffff8801d8dda418
> dst_output include/net/dst.h:444 [inline]
> ip_local_out+0x95/0x160 net/ipv4/ip_output.c:124
> ip_send_skb+0x3c/0xc0 net/ipv4/ip_output.c:1414
> ip_push_pending_frames+0x64/0x80 net/ipv4/ip_output.c:1434
> icmp_push_reply+0x395/0x4f0 net/ipv4/icmp.c:394
> icmp_send+0x1136/0x19b0 net/ipv4/icmp.c:741
> ipv4_link_failure+0x2a/0x1b0 net/ipv4/route.c:1200
> dst_link_failure include/net/dst.h:427 [inline]
> arp_error_report+0xae/0x180 net/ipv4/arp.c:297
> neigh_invalidate+0x225/0x530 net/core/neighbour.c:883
> neigh_timer_handler+0x897/0xd60 net/core/neighbour.c:969
> call_timer_fn+0x228/0x820 kernel/time/timer.c:1326
> expire_timers kernel/time/timer.c:1363 [inline]
> __run_timers+0x7ee/0xb70 kernel/time/timer.c:1666
> run_timer_softirq+0x4c/0x70 kernel/time/timer.c:1692
> __do_softirq+0x2d7/0xb85 kernel/softirq.c:285
> invoke_softirq kernel/softirq.c:365 [inline]
> irq_exit+0x1cc/0x200 kernel/softirq.c:405
> exiting_irq arch/x86/include/asm/apic.h:541 [inline]
> smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052
> apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:857
> </IRQ>
> RIP: 0010:arch_local_irq_restore arch/x86/include/asm/paravirt.h:778
> [inline]
> RIP: 0010:lock_acquire+0x256/0x580 kernel/locking/lockdep.c:3923
> RSP: 0018:ffff880197b3f980 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff12
> RAX: dffffc0000000000 RBX: ffff8801d225e400 RCX: 0000000000000000
> RDX: 1ffffffff10a24e5 RSI: 00000000b98b8227 RDI: 0000000000000282
> RBP: ffff880197b3fa78 R08: 1ffff10032f67e93 R09: 0000000000000004
> R10: ffff880197b3f960 R11: 0000000000000003 R12: 1ffff10032f67f36
> R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
> down_write_killable+0x8a/0x140 kernel/locking/rwsem.c:84
> __bprm_mm_init fs/exec.c:297 [inline]
> bprm_mm_init fs/exec.c:414 [inline]
> do_execveat_common.isra.30+0xc8e/0x23c0 fs/exec.c:1771
> do_execve+0x31/0x40 fs/exec.c:1847
> call_usermodehelper_exec_async+0x457/0x8f0 kernel/umh.c:100
> ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:406
> Dumping ftrace buffer:
> (ftrace buffer empty)
> Kernel Offset: disabled
> Rebooting in 86400 seconds..
>
>
> ---
> This bug is generated by a dumb bot. It may contain errors.
> See https://goo.gl/tpsmEJ for details.
> Direct all questions to syzkaller@googlegroups.com.
>
> syzbot will keep track of this bug report.
> If you forgot to add the Reported-by tag, once the fix for this bug is
> merged
> into any tree, please reply to this email with:
> #syz fix: exact-commit-title
> To mark this as a duplicate of another syzbot report, please reply with:
> #syz dup: exact-subject-of-another-report
> If it's a one-off invalid bug report, please reply with:
> #syz invalid
> Note: if the crash happens again, it will cause creation of a new bug
> report.
> Note: all commands must start from beginning of the line in the email body.
>
> --
> You received this message because you are subscribed to the Google Groups
> "syzkaller-bugs" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to syzkaller-bugs+unsubscribe@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/syzkaller-bugs/001a11408d9c43ff4a0569641a3f%40google.com.
>
> For more options, visit https://groups.google.com/d/optout.
^ permalink raw reply
* WARNING in ip_rt_bug
From: syzbot @ 2018-04-09 5:59 UTC (permalink / raw)
To: davem, kuznet, linux-kernel, netdev, syzkaller-bugs, yoshfuji
Hello,
syzbot hit the following crash on net-next commit
8bde261e535257e81087d39ff808414e2f5aa39d (Sun Apr 1 02:31:43 2018 +0000)
Merge tag 'mlx5-updates-2018-03-30' of
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
syzbot dashboard link:
https://syzkaller.appspot.com/bug?extid=b09ac67a2af842b12eab
Unfortunately, I don't have any reproducer for this crash yet.
Raw console output:
https://syzkaller.appspot.com/x/log.txt?id=5991727739437056
Kernel config:
https://syzkaller.appspot.com/x/.config?id=3327544840960562528
compiler: gcc (GCC) 7.1.1 20170620
IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+b09ac67a2af842b12eab@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed. See footer for
details.
If you forward the report, please keep this part and the footer.
netlink: 'syz-executor6': attribute type 3 has an invalid length.
WARNING: CPU: 0 PID: 11678 at net/ipv4/route.c:1213 ip_rt_bug+0x15/0x20
net/ipv4/route.c:1212
Kernel panic - not syncing: panic_on_warn set ...
CPU: 0 PID: 11678 Comm: kworker/u4:7 Not tainted 4.16.0-rc6+ #289
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
<IRQ>
__dump_stack lib/dump_stack.c:17 [inline]
dump_stack+0x194/0x24d lib/dump_stack.c:53
panic+0x1e4/0x41c kernel/panic.c:183
__warn+0x1dc/0x200 kernel/panic.c:547
report_bug+0x1f4/0x2b0 lib/bug.c:186
fixup_bug.part.10+0x37/0x80 arch/x86/kernel/traps.c:178
fixup_bug arch/x86/kernel/traps.c:247 [inline]
do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:296
do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
invalid_op+0x1b/0x40 arch/x86/entry/entry_64.S:986
RIP: 0010:ip_rt_bug+0x15/0x20 net/ipv4/route.c:1212
RSP: 0018:ffff8801db007290 EFLAGS: 00010282
RAX: dffffc0000000000 RBX: ffff8801d8dda3c0 RCX: ffffffff856c31ca
RDX: 0000000000000100 RSI: ffffffff8858c300 RDI: 0000000000000282
RBP: ffff8801db007298 R08: 1ffff1003b600de1 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801d8dda3c0
R13: ffff88019bdb2200 R14: ffff88019bdeed80 R15: ffff8801d8dda418
dst_output include/net/dst.h:444 [inline]
ip_local_out+0x95/0x160 net/ipv4/ip_output.c:124
ip_send_skb+0x3c/0xc0 net/ipv4/ip_output.c:1414
ip_push_pending_frames+0x64/0x80 net/ipv4/ip_output.c:1434
icmp_push_reply+0x395/0x4f0 net/ipv4/icmp.c:394
icmp_send+0x1136/0x19b0 net/ipv4/icmp.c:741
ipv4_link_failure+0x2a/0x1b0 net/ipv4/route.c:1200
dst_link_failure include/net/dst.h:427 [inline]
arp_error_report+0xae/0x180 net/ipv4/arp.c:297
neigh_invalidate+0x225/0x530 net/core/neighbour.c:883
neigh_timer_handler+0x897/0xd60 net/core/neighbour.c:969
call_timer_fn+0x228/0x820 kernel/time/timer.c:1326
expire_timers kernel/time/timer.c:1363 [inline]
__run_timers+0x7ee/0xb70 kernel/time/timer.c:1666
run_timer_softirq+0x4c/0x70 kernel/time/timer.c:1692
__do_softirq+0x2d7/0xb85 kernel/softirq.c:285
invoke_softirq kernel/softirq.c:365 [inline]
irq_exit+0x1cc/0x200 kernel/softirq.c:405
exiting_irq arch/x86/include/asm/apic.h:541 [inline]
smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052
apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:857
</IRQ>
RIP: 0010:arch_local_irq_restore arch/x86/include/asm/paravirt.h:778
[inline]
RIP: 0010:lock_acquire+0x256/0x580 kernel/locking/lockdep.c:3923
RSP: 0018:ffff880197b3f980 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff12
RAX: dffffc0000000000 RBX: ffff8801d225e400 RCX: 0000000000000000
RDX: 1ffffffff10a24e5 RSI: 00000000b98b8227 RDI: 0000000000000282
RBP: ffff880197b3fa78 R08: 1ffff10032f67e93 R09: 0000000000000004
R10: ffff880197b3f960 R11: 0000000000000003 R12: 1ffff10032f67f36
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
down_write_killable+0x8a/0x140 kernel/locking/rwsem.c:84
__bprm_mm_init fs/exec.c:297 [inline]
bprm_mm_init fs/exec.c:414 [inline]
do_execveat_common.isra.30+0xc8e/0x23c0 fs/exec.c:1771
do_execve+0x31/0x40 fs/exec.c:1847
call_usermodehelper_exec_async+0x457/0x8f0 kernel/umh.c:100
ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:406
Dumping ftrace buffer:
(ftrace buffer empty)
Kernel Offset: disabled
Rebooting in 86400 seconds..
---
This bug is generated by a dumb bot. It may contain errors.
See https://goo.gl/tpsmEJ for details.
Direct all questions to syzkaller@googlegroups.com.
syzbot will keep track of this bug report.
If you forgot to add the Reported-by tag, once the fix for this bug is
merged
into any tree, please reply to this email with:
#syz fix: exact-commit-title
To mark this as a duplicate of another syzbot report, please reply with:
#syz dup: exact-subject-of-another-report
If it's a one-off invalid bug report, please reply with:
#syz invalid
Note: if the crash happens again, it will cause creation of a new bug
report.
Note: all commands must start from beginning of the line in the email body.
^ permalink raw reply
* WARNING in ip_rt_bug
From: syzbot @ 2018-04-09 5:59 UTC (permalink / raw)
To: davem, kuznet, linux-kernel, netdev, syzkaller-bugs, yoshfuji
Hello,
syzbot hit the following crash on net-next commit
8bde261e535257e81087d39ff808414e2f5aa39d (Sun Apr 1 02:31:43 2018 +0000)
Merge tag 'mlx5-updates-2018-03-30' of
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
syzbot dashboard link:
https://syzkaller.appspot.com/bug?extid=b09ac67a2af842b12eab
Unfortunately, I don't have any reproducer for this crash yet.
Raw console output:
https://syzkaller.appspot.com/x/log.txt?id=5991727739437056
Kernel config:
https://syzkaller.appspot.com/x/.config?id=3327544840960562528
compiler: gcc (GCC) 7.1.1 20170620
IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+b09ac67a2af842b12eab@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed. See footer for
details.
If you forward the report, please keep this part and the footer.
netlink: 'syz-executor6': attribute type 3 has an invalid length.
WARNING: CPU: 0 PID: 11678 at net/ipv4/route.c:1213 ip_rt_bug+0x15/0x20
net/ipv4/route.c:1212
Kernel panic - not syncing: panic_on_warn set ...
CPU: 0 PID: 11678 Comm: kworker/u4:7 Not tainted 4.16.0-rc6+ #289
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
<IRQ>
__dump_stack lib/dump_stack.c:17 [inline]
dump_stack+0x194/0x24d lib/dump_stack.c:53
panic+0x1e4/0x41c kernel/panic.c:183
__warn+0x1dc/0x200 kernel/panic.c:547
report_bug+0x1f4/0x2b0 lib/bug.c:186
fixup_bug.part.10+0x37/0x80 arch/x86/kernel/traps.c:178
fixup_bug arch/x86/kernel/traps.c:247 [inline]
do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:296
do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
invalid_op+0x1b/0x40 arch/x86/entry/entry_64.S:986
RIP: 0010:ip_rt_bug+0x15/0x20 net/ipv4/route.c:1212
RSP: 0018:ffff8801db007290 EFLAGS: 00010282
RAX: dffffc0000000000 RBX: ffff8801d8dda3c0 RCX: ffffffff856c31ca
RDX: 0000000000000100 RSI: ffffffff8858c300 RDI: 0000000000000282
RBP: ffff8801db007298 R08: 1ffff1003b600de1 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801d8dda3c0
R13: ffff88019bdb2200 R14: ffff88019bdeed80 R15: ffff8801d8dda418
dst_output include/net/dst.h:444 [inline]
ip_local_out+0x95/0x160 net/ipv4/ip_output.c:124
ip_send_skb+0x3c/0xc0 net/ipv4/ip_output.c:1414
ip_push_pending_frames+0x64/0x80 net/ipv4/ip_output.c:1434
icmp_push_reply+0x395/0x4f0 net/ipv4/icmp.c:394
icmp_send+0x1136/0x19b0 net/ipv4/icmp.c:741
ipv4_link_failure+0x2a/0x1b0 net/ipv4/route.c:1200
dst_link_failure include/net/dst.h:427 [inline]
arp_error_report+0xae/0x180 net/ipv4/arp.c:297
neigh_invalidate+0x225/0x530 net/core/neighbour.c:883
neigh_timer_handler+0x897/0xd60 net/core/neighbour.c:969
call_timer_fn+0x228/0x820 kernel/time/timer.c:1326
expire_timers kernel/time/timer.c:1363 [inline]
__run_timers+0x7ee/0xb70 kernel/time/timer.c:1666
run_timer_softirq+0x4c/0x70 kernel/time/timer.c:1692
__do_softirq+0x2d7/0xb85 kernel/softirq.c:285
invoke_softirq kernel/softirq.c:365 [inline]
irq_exit+0x1cc/0x200 kernel/softirq.c:405
exiting_irq arch/x86/include/asm/apic.h:541 [inline]
smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052
apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:857
</IRQ>
RIP: 0010:arch_local_irq_restore arch/x86/include/asm/paravirt.h:778
[inline]
RIP: 0010:lock_acquire+0x256/0x580 kernel/locking/lockdep.c:3923
RSP: 0018:ffff880197b3f980 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff12
RAX: dffffc0000000000 RBX: ffff8801d225e400 RCX: 0000000000000000
RDX: 1ffffffff10a24e5 RSI: 00000000b98b8227 RDI: 0000000000000282
RBP: ffff880197b3fa78 R08: 1ffff10032f67e93 R09: 0000000000000004
R10: ffff880197b3f960 R11: 0000000000000003 R12: 1ffff10032f67f36
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
down_write_killable+0x8a/0x140 kernel/locking/rwsem.c:84
__bprm_mm_init fs/exec.c:297 [inline]
bprm_mm_init fs/exec.c:414 [inline]
do_execveat_common.isra.30+0xc8e/0x23c0 fs/exec.c:1771
do_execve+0x31/0x40 fs/exec.c:1847
call_usermodehelper_exec_async+0x457/0x8f0 kernel/umh.c:100
ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:406
Dumping ftrace buffer:
(ftrace buffer empty)
Kernel Offset: disabled
Rebooting in 86400 seconds..
---
This bug is generated by a dumb bot. It may contain errors.
See https://goo.gl/tpsmEJ for details.
Direct all questions to syzkaller@googlegroups.com.
syzbot will keep track of this bug report.
If you forgot to add the Reported-by tag, once the fix for this bug is
merged
into any tree, please reply to this email with:
#syz fix: exact-commit-title
To mark this as a duplicate of another syzbot report, please reply with:
#syz dup: exact-subject-of-another-report
If it's a one-off invalid bug report, please reply with:
#syz invalid
Note: if the crash happens again, it will cause creation of a new bug
report.
Note: all commands must start from beginning of the line in the email body.
^ permalink raw reply
* Re: KASAN: slab-out-of-bounds Read in pfkey_add
From: Kevin Easton @ 2018-04-09 5:56 UTC (permalink / raw)
To: Eric Biggers
Cc: davem, herbert, linux-kernel, netdev, steffen.klassert,
syzkaller-bugs, syzbot
In-Reply-To: <20180409040433.GJ685@sol.localdomain>
On Sun, Apr 08, 2018 at 09:04:33PM -0700, Eric Biggers wrote:
...
>
> Looks like this is going to be fixed by
> https://patchwork.kernel.org/patch/10327883/ ("af_key: Always verify length of
> provided sadb_key"), but it's not applied yet to the ipsec tree yet. Kevin, for
> future reference, for syzbot bugs it would be helpful to reply to the original
> bug report and say that a patch was sent out, or even better send the patch as a
> reply to the bug report email, e.g.
>
> git format-patch --in-reply-to="<001a114292fadd3e2505607060a8@google.com>"
>
> for this one (and the Message ID can be found in the syzkaller-bugs archive even
> if the email isn't in your inbox).
Sure, I can do that.
- Kevin
^ permalink raw reply
* Re: [PATCH] vhost-net: set packet weight of tx polling to 2 * vq size
From: Michael S. Tsirkin @ 2018-04-09 5:46 UTC (permalink / raw)
To: haibinzhang(张海斌)
Cc: kvm@vger.kernel.org, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org,
virtualization@lists.linux-foundation.org,
yunfangtai(台运方),
lidongchen(陈立东)
In-Reply-To: <88D661ADF6AFBF42B2AB88D8E7682B0901FC49BE@EXMBX-SZMAIL011.tencent.com>
On Mon, Apr 09, 2018 at 04:09:20AM +0000, haibinzhang(张海斌) wrote:
>
> > On Fri, Apr 06, 2018 at 08:22:37AM +0000, haibinzhang(张海斌) wrote:
> > > handle_tx will delay rx for tens or even hundreds of milliseconds when tx busy
> > > polling udp packets with small length(e.g. 1byte udp payload), because setting
> > > VHOST_NET_WEIGHT takes into account only sent-bytes but no single packet length.
> > >
> > > Ping-Latencies shown below were tested between two Virtual Machines using
> > > netperf (UDP_STREAM, len=1), and then another machine pinged the client:
> > >
> > > Packet-Weight Ping-Latencies(millisecond)
> > > min avg max
> > > Origin 3.319 18.489 57.303
> > > 64 1.643 2.021 2.552
> > > 128 1.825 2.600 3.224
> > > 256 1.997 2.710 4.295
> > > 512 1.860 3.171 4.631
> > > 1024 2.002 4.173 9.056
> > > 2048 2.257 5.650 9.688
> > > 4096 2.093 8.508 15.943
> >
> > And this is with Q size 256 right?
>
> Yes. Ping-latencies with 512 VQ size show below.
>
> Packet-Weight Ping-Latencies(millisecond)
> min avg max
> Origin 6.357 29.177 66.245
> 64 2.798 3.614 4.403
> 128 2.861 3.820 4.775
> 256 3.008 4.018 4.807
> 512 3.254 4.523 5.824
> 1024 3.079 5.335 7.747
> 2048 3.944 8.201 12.762
> 4096 4.158 11.057 19.985
>
> We will submit again. Is there anything else?
Seems pretty consistent, a small dip at 2 VQ sizes.
Acked-by: Michael S. Tsirkin <mst@redhat.com>
> >
> > > Ring size is a hint from device about a burst size it can tolerate. Based on
> > > benchmarks, set the weight to 2 * vq size.
> > >
> > > To evaluate this change, another tests were done using netperf(RR, TX) between
> > > two machines with Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz, and vq size was
> > > tweaked through qemu. Results shown below does not show obvious changes.
> >
> > What I asked for is ping-latency with different VQ sizes,
> > streaming below does not show anything.
> >
> > > vq size=256 TCP_RR vq size=512 TCP_RR
> > > size/sessions/+thu%/+normalize% size/sessions/+thu%/+normalize%
> > > 1/ 1/ -7%/ -2% 1/ 1/ 0%/ -2%
> > > 1/ 4/ +1%/ 0% 1/ 4/ +1%/ 0%
> > > 1/ 8/ +1%/ -2% 1/ 8/ 0%/ +1%
> > > 64/ 1/ -6%/ 0% 64/ 1/ +7%/ +3%
> > > 64/ 4/ 0%/ +2% 64/ 4/ -1%/ +1%
> > > 64/ 8/ 0%/ 0% 64/ 8/ -1%/ -2%
> > > 256/ 1/ -3%/ -4% 256/ 1/ -4%/ -2%
> > > 256/ 4/ +3%/ +4% 256/ 4/ +1%/ +2%
> > > 256/ 8/ +2%/ 0% 256/ 8/ +1%/ -1%
> > >
> > > vq size=256 UDP_RR vq size=512 UDP_RR
> > > size/sessions/+thu%/+normalize% size/sessions/+thu%/+normalize%
> > > 1/ 1/ -5%/ +1% 1/ 1/ -3%/ -2%
> > > 1/ 4/ +4%/ +1% 1/ 4/ -2%/ +2%
> > > 1/ 8/ -1%/ -1% 1/ 8/ -1%/ 0%
> > > 64/ 1/ -2%/ -3% 64/ 1/ +1%/ +1%
> > > 64/ 4/ -5%/ -1% 64/ 4/ +2%/ 0%
> > > 64/ 8/ 0%/ -1% 64/ 8/ -2%/ +1%
> > > 256/ 1/ +7%/ +1% 256/ 1/ -7%/ 0%
> > > 256/ 4/ +1%/ +1% 256/ 4/ -3%/ -4%
> > > 256/ 8/ +2%/ +2% 256/ 8/ +1%/ +1%
> > >
> > > vq size=256 TCP_STREAM vq size=512 TCP_STREAM
> > > size/sessions/+thu%/+normalize% size/sessions/+thu%/+normalize%
> > > 64/ 1/ 0%/ -3% 64/ 1/ 0%/ 0%
> > > 64/ 4/ +3%/ -1% 64/ 4/ -2%/ +4%
> > > 64/ 8/ +9%/ -4% 64/ 8/ -1%/ +2%
> > > 256/ 1/ +1%/ -4% 256/ 1/ +1%/ +1%
> > > 256/ 4/ -1%/ -1% 256/ 4/ -3%/ 0%
> > > 256/ 8/ +7%/ +5% 256/ 8/ -3%/ 0%
> > > 512/ 1/ +1%/ 0% 512/ 1/ -1%/ -1%
> > > 512/ 4/ +1%/ -1% 512/ 4/ 0%/ 0%
> > > 512/ 8/ +7%/ -5% 512/ 8/ +6%/ -1%
> > > 1024/ 1/ 0%/ -1% 1024/ 1/ 0%/ +1%
> > > 1024/ 4/ +3%/ 0% 1024/ 4/ +1%/ 0%
> > > 1024/ 8/ +8%/ +5% 1024/ 8/ -1%/ 0%
> > > 2048/ 1/ +2%/ +2% 2048/ 1/ -1%/ 0%
> > > 2048/ 4/ +1%/ 0% 2048/ 4/ 0%/ -1%
> > > 2048/ 8/ -2%/ 0% 2048/ 8/ 5%/ -1%
> > > 4096/ 1/ -2%/ 0% 4096/ 1/ -2%/ 0%
> > > 4096/ 4/ +2%/ 0% 4096/ 4/ 0%/ 0%
> > > 4096/ 8/ +9%/ -2% 4096/ 8/ -5%/ -1%
> > >
> > > Signed-off-by: Haibin Zhang <haibinzhang@tencent.com>
> > > Signed-off-by: Yunfang Tai <yunfangtai@tencent.com>
> > > Signed-off-by: Lidong Chen <lidongchen@tencent.com>
> >
> > Code is fine but I'd like to see validation of the heuristic
> > 2*vq->num with another vq size.
> >
> >
> >
> > > ---
> > > drivers/vhost/net.c | 8 +++++++-
> > > 1 file changed, 7 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > > index 8139bc70ad7d..3563a305cc0a 100644
> > > --- a/drivers/vhost/net.c
> > > +++ b/drivers/vhost/net.c
> > > @@ -44,6 +44,10 @@ MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;"
> > > * Using this limit prevents one virtqueue from starving others. */
> > > #define VHOST_NET_WEIGHT 0x80000
> > >
> > > +/* Max number of packets transferred before requeueing the job.
> > > + * Using this limit prevents one virtqueue from starving rx. */
> > > +#define VHOST_NET_PKT_WEIGHT(vq) ((vq)->num * 2)
> > > +
> > > /* MAX number of TX used buffers for outstanding zerocopy */
> > > #define VHOST_MAX_PEND 128
> > > #define VHOST_GOODCOPY_LEN 256
> > > @@ -473,6 +477,7 @@ static void handle_tx(struct vhost_net *net)
> > > struct socket *sock;
> > > struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
> > > bool zcopy, zcopy_used;
> > > + int sent_pkts = 0;
> > >
> > > mutex_lock(&vq->mutex);
> > > sock = vq->private_data;
> > > @@ -580,7 +585,8 @@ static void handle_tx(struct vhost_net *net)
> > > else
> > > vhost_zerocopy_signal_used(net, vq);
> > > vhost_net_tx_packet(net);
> > > - if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
> > > + if (unlikely(total_len >= VHOST_NET_WEIGHT) ||
> > > + unlikely(++sent_pkts >= VHOST_NET_PKT_WEIGHT(vq))) {
> > > vhost_poll_queue(&vq->poll);
> > > break;
> > > }
> > > --
> > > 2.12.3
> > >
>
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [RFC PATCH bpf-next 2/6] bpf: add bpf_get_stack helper
From: Alexei Starovoitov @ 2018-04-09 5:02 UTC (permalink / raw)
To: Yonghong Song, daniel, netdev; +Cc: kernel-team
In-Reply-To: <625de1bb-7cb2-5f9e-01c3-a863cd27b0e6@fb.com>
On 4/8/18 9:53 PM, Yonghong Song wrote:
>>> @@ -1004,7 +1007,8 @@ static void __bpf_prog_put(struct bpf_prog
>>> *prog, bool do_idr_lock)
>>> bpf_prog_kallsyms_del(prog->aux->func[i]);
>>> bpf_prog_kallsyms_del(prog);
>>>
>>> - call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu);
>>> + synchronize_rcu();
>>> + __bpf_prog_put_rcu(&prog->aux->rcu);
>>
>> there should have been lockdep splat.
>> We cannot call synchronize_rcu here, since we cannot sleep
>> in some cases.
>
> Let me double check this. The following is the reason
> why I am using synchronize_rcu().
>
> With call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu) and
> _bpf_prog_put_rcu calls put_callchain_buffers() which
> calls mutex_lock(), the runtime with CONFIG_DEBUG_ATOMIC_SLEEP=y
> will complains since potential sleep inside the call_rcu is not
> allowed.
I see. Indeed. We cannot call put_callchain_buffers() from rcu callback,
but doing synchronize_rcu() here is also not possible.
How about moving put_callchain into bpf_prog_free_deferred() ?
^ permalink raw reply
* Re: [RFC PATCH bpf-next 2/6] bpf: add bpf_get_stack helper
From: Yonghong Song @ 2018-04-09 4:53 UTC (permalink / raw)
To: Alexei Starovoitov, daniel, netdev; +Cc: kernel-team
In-Reply-To: <e532530c-d5ea-a889-6467-1d8dd2b4f098@fb.com>
On 4/8/18 8:34 PM, Alexei Starovoitov wrote:
> On 4/6/18 2:48 PM, Yonghong Song wrote:
>> Currently, stackmap and bpf_get_stackid helper are provided
>> for bpf program to get the stack trace. This approach has
>> a limitation though. If two stack traces have the same hash,
>> only one will get stored in the stackmap table,
>> so some stack traces are missing from user perspective.
>>
>> This patch implements a new helper, bpf_get_stack, will
>> send stack traces directly to bpf program. The bpf program
>> is able to see all stack traces, and then can do in-kernel
>> processing or send stack traces to user space through
>> shared map or bpf_perf_event_output.
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>> include/linux/bpf.h | 1 +
>> include/linux/filter.h | 3 ++-
>> include/uapi/linux/bpf.h | 17 +++++++++++++--
>> kernel/bpf/stackmap.c | 56
>> ++++++++++++++++++++++++++++++++++++++++++++++++
>> kernel/bpf/syscall.c | 12 ++++++++++-
>> kernel/bpf/verifier.c | 3 +++
>> kernel/trace/bpf_trace.c | 50 +++++++++++++++++++++++++++++++++++++++++-
>> 7 files changed, 137 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> index 95a7abd..72ccb9a 100644
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
>> @@ -676,6 +676,7 @@ extern const struct bpf_func_proto
>> bpf_get_current_comm_proto;
>> extern const struct bpf_func_proto bpf_skb_vlan_push_proto;
>> extern const struct bpf_func_proto bpf_skb_vlan_pop_proto;
>> extern const struct bpf_func_proto bpf_get_stackid_proto;
>> +extern const struct bpf_func_proto bpf_get_stack_proto;
>> extern const struct bpf_func_proto bpf_sock_map_update_proto;
>>
>> /* Shared helpers among cBPF and eBPF. */
>> diff --git a/include/linux/filter.h b/include/linux/filter.h
>> index fc4e8f9..9b64f63 100644
>> --- a/include/linux/filter.h
>> +++ b/include/linux/filter.h
>> @@ -467,7 +467,8 @@ struct bpf_prog {
>> dst_needed:1, /* Do we need dst entry? */
>> blinded:1, /* Was blinded */
>> is_func:1, /* program is a bpf function */
>> - kprobe_override:1; /* Do we override a kprobe? */
>> + kprobe_override:1, /* Do we override a kprobe? */
>> + need_callchain_buf:1; /* Needs callchain buffer? */
>> enum bpf_prog_type type; /* Type of BPF program */
>> enum bpf_attach_type expected_attach_type; /* For some prog
>> types */
>> u32 len; /* Number of filter blocks */
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index c5ec897..a4ff5b7 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -517,6 +517,17 @@ union bpf_attr {
>> * other bits - reserved
>> * Return: >= 0 stackid on success or negative error
>> *
>> + * int bpf_get_stack(ctx, buf, size, flags)
>> + * walk user or kernel stack and store the ips in buf
>> + * @ctx: struct pt_regs*
>> + * @buf: user buffer to fill stack
>> + * @size: the buf size
>> + * @flags: bits 0-7 - numer of stack frames to skip
>> + * bit 8 - collect user stack instead of kernel
>> + * bit 11 - get build-id as well if user stack
>> + * other bits - reserved
>> + * Return: >= 0 size copied on success or negative error
>> + *
>> * s64 bpf_csum_diff(from, from_size, to, to_size, seed)
>> * calculate csum diff
>> * @from: raw from buffer
>> @@ -821,7 +832,8 @@ union bpf_attr {
>> FN(msg_apply_bytes), \
>> FN(msg_cork_bytes), \
>> FN(msg_pull_data), \
>> - FN(bind),
>> + FN(bind), \
>> + FN(get_stack),
>>
>> /* integer value in 'imm' field of BPF_CALL instruction selects which
>> helper
>> * function eBPF program intends to call
>> @@ -855,11 +867,12 @@ enum bpf_func_id {
>> /* BPF_FUNC_skb_set_tunnel_key and BPF_FUNC_skb_get_tunnel_key flags. */
>> #define BPF_F_TUNINFO_IPV6 (1ULL << 0)
>>
>> -/* BPF_FUNC_get_stackid flags. */
>> +/* BPF_FUNC_get_stackid and BPF_FUNC_get_stack flags. */
>> #define BPF_F_SKIP_FIELD_MASK 0xffULL
>> #define BPF_F_USER_STACK (1ULL << 8)
>> #define BPF_F_FAST_STACK_CMP (1ULL << 9)
>> #define BPF_F_REUSE_STACKID (1ULL << 10)
>> +#define BPF_F_USER_BUILD_ID (1ULL << 11)
>
> the comment above is not quite correct.
> This new flag is only available for new helper.
Right, some flags are used for both helpers and some are only used for
one of them. Will make it clear in the next revision.
>
>>
>> /* BPF_FUNC_skb_set_tunnel_key flags. */
>> #define BPF_F_ZERO_CSUM_TX (1ULL << 1)
>> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
>> index 04f6ec1..371c72e 100644
>> --- a/kernel/bpf/stackmap.c
>> +++ b/kernel/bpf/stackmap.c
>> @@ -402,6 +402,62 @@ const struct bpf_func_proto bpf_get_stackid_proto
>> = {
>> .arg3_type = ARG_ANYTHING,
>> };
>>
>> +BPF_CALL_4(bpf_get_stack, struct pt_regs *, regs, void *, buf, u32,
>> size,
>> + u64, flags)
>> +{
>> + u32 init_nr, trace_nr, copy_len, elem_size, num_elem;
>> + bool user_build_id = flags & BPF_F_USER_BUILD_ID;
>> + u32 skip = flags & BPF_F_SKIP_FIELD_MASK;
>> + bool user = flags & BPF_F_USER_STACK;
>> + struct perf_callchain_entry *trace;
>> + bool kernel = !user;
>> + u64 *ips;
>> +
>> + if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK |
>> + BPF_F_USER_BUILD_ID)))
>> + return -EINVAL;
>> +
>> + elem_size = (user && user_build_id) ? sizeof(struct
>> bpf_stack_build_id)
>> + : sizeof(u64);
>> + if (unlikely(size % elem_size))
>> + return -EINVAL;
>> +
>> + num_elem = size / elem_size;
>> + if (sysctl_perf_event_max_stack < num_elem)
>> + init_nr = 0;
>
> prog's buffer should be zero padded in this case since it
> points to uninit_mem.
Will make the change in the next revision.
>
>> + else
>> + init_nr = sysctl_perf_event_max_stack - num_elem;
>> + trace = get_perf_callchain(regs, init_nr, kernel, user,
>> + sysctl_perf_event_max_stack, false, false);
>> + if (unlikely(!trace))
>> + return -EFAULT;
>> +
>> + trace_nr = trace->nr - init_nr;
>> + if (trace_nr <= skip)
>> + return -EFAULT;
>> +
>> + trace_nr -= skip;
>> + trace_nr = (trace_nr <= num_elem) ? trace_nr : num_elem;
>> + copy_len = trace_nr * elem_size;
>> + ips = trace->ip + skip + init_nr;
>> + if (user && user_build_id)
>
> the combination of kern + user_build_id should probably be rejected
> earlier with einval.
Right, I missed this case. It should be rejected.
>
>> + stack_map_get_build_id_offset(buf, ips, trace_nr, user);
>> + else
>> + memcpy(buf, ips, copy_len);
>> +
>> + return copy_len;
>> +}
>> +
>> +const struct bpf_func_proto bpf_get_stack_proto = {
>> + .func = bpf_get_stack,
>> + .gpl_only = true,
>> + .ret_type = RET_INTEGER,
>> + .arg1_type = ARG_PTR_TO_CTX,
>> + .arg2_type = ARG_PTR_TO_UNINIT_MEM,
>> + .arg3_type = ARG_CONST_SIZE_OR_ZERO,
>
> why allow zero size?
> I'm not sure the helper will work correctly when size=0
The only reason I had is to make bpf program easier so
they do not need to test zero size, similiar to
bpf_probe_read/bpf_perf_event_output.
I have double checked my implementation and it should
handle zero size properly.
Let me double check whether whether not allowing zero
can still have reasonable bpf program coding which
passes verifier.
>
>> + .arg4_type = ARG_ANYTHING,
>> +};
>> +
>> /* Called from eBPF program */
>> static void *stack_map_lookup_elem(struct bpf_map *map, void *key)
>> {
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index 0244973..2aa3a65 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -984,10 +984,13 @@ void bpf_prog_free_id(struct bpf_prog *prog,
>> bool do_idr_lock)
>> static void __bpf_prog_put_rcu(struct rcu_head *rcu)
>> {
>> struct bpf_prog_aux *aux = container_of(rcu, struct bpf_prog_aux,
>> rcu);
>> + bool need_callchain_buf = aux->prog->need_callchain_buf;
>>
>> free_used_maps(aux);
>> bpf_prog_uncharge_memlock(aux->prog);
>> security_bpf_prog_free(aux);
>> + if (need_callchain_buf)
>> + put_callchain_buffers();
>> bpf_prog_free(aux->prog);
>> }
>>
>> @@ -1004,7 +1007,8 @@ static void __bpf_prog_put(struct bpf_prog
>> *prog, bool do_idr_lock)
>> bpf_prog_kallsyms_del(prog->aux->func[i]);
>> bpf_prog_kallsyms_del(prog);
>>
>> - call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu);
>> + synchronize_rcu();
>> + __bpf_prog_put_rcu(&prog->aux->rcu);
>
> there should have been lockdep splat.
> We cannot call synchronize_rcu here, since we cannot sleep
> in some cases.
Let me double check this. The following is the reason
why I am using synchronize_rcu().
With call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu) and
_bpf_prog_put_rcu calls put_callchain_buffers() which
calls mutex_lock(), the runtime with CONFIG_DEBUG_ATOMIC_SLEEP=y
will complains since potential sleep inside the call_rcu is not
allowed.
^ permalink raw reply
* Re: [PATCH] vhost-net: set packet weight of tx polling to 2 * vq size
From: haibinzhang(张海斌) @ 2018-04-09 4:09 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Jason Wang, kvm@vger.kernel.org,
virtualization@lists.linux-foundation.org, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org,
lidongchen(陈立东),
yunfangtai(台运方)
> On Fri, Apr 06, 2018 at 08:22:37AM +0000, haibinzhang(张海斌) wrote:
> > handle_tx will delay rx for tens or even hundreds of milliseconds when tx busy
> > polling udp packets with small length(e.g. 1byte udp payload), because setting
> > VHOST_NET_WEIGHT takes into account only sent-bytes but no single packet length.
> >
> > Ping-Latencies shown below were tested between two Virtual Machines using
> > netperf (UDP_STREAM, len=1), and then another machine pinged the client:
> >
> > Packet-Weight Ping-Latencies(millisecond)
> > min avg max
> > Origin 3.319 18.489 57.303
> > 64 1.643 2.021 2.552
> > 128 1.825 2.600 3.224
> > 256 1.997 2.710 4.295
> > 512 1.860 3.171 4.631
> > 1024 2.002 4.173 9.056
> > 2048 2.257 5.650 9.688
> > 4096 2.093 8.508 15.943
>
> And this is with Q size 256 right?
Yes. Ping-latencies with 512 VQ size show below.
Packet-Weight Ping-Latencies(millisecond)
min avg max
Origin 6.357 29.177 66.245
64 2.798 3.614 4.403
128 2.861 3.820 4.775
256 3.008 4.018 4.807
512 3.254 4.523 5.824
1024 3.079 5.335 7.747
2048 3.944 8.201 12.762
4096 4.158 11.057 19.985
We will submit again. Is there anything else?
>
> > Ring size is a hint from device about a burst size it can tolerate. Based on
> > benchmarks, set the weight to 2 * vq size.
> >
> > To evaluate this change, another tests were done using netperf(RR, TX) between
> > two machines with Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz, and vq size was
> > tweaked through qemu. Results shown below does not show obvious changes.
>
> What I asked for is ping-latency with different VQ sizes,
> streaming below does not show anything.
>
> > vq size=256 TCP_RR vq size=512 TCP_RR
> > size/sessions/+thu%/+normalize% size/sessions/+thu%/+normalize%
> > 1/ 1/ -7%/ -2% 1/ 1/ 0%/ -2%
> > 1/ 4/ +1%/ 0% 1/ 4/ +1%/ 0%
> > 1/ 8/ +1%/ -2% 1/ 8/ 0%/ +1%
> > 64/ 1/ -6%/ 0% 64/ 1/ +7%/ +3%
> > 64/ 4/ 0%/ +2% 64/ 4/ -1%/ +1%
> > 64/ 8/ 0%/ 0% 64/ 8/ -1%/ -2%
> > 256/ 1/ -3%/ -4% 256/ 1/ -4%/ -2%
> > 256/ 4/ +3%/ +4% 256/ 4/ +1%/ +2%
> > 256/ 8/ +2%/ 0% 256/ 8/ +1%/ -1%
> >
> > vq size=256 UDP_RR vq size=512 UDP_RR
> > size/sessions/+thu%/+normalize% size/sessions/+thu%/+normalize%
> > 1/ 1/ -5%/ +1% 1/ 1/ -3%/ -2%
> > 1/ 4/ +4%/ +1% 1/ 4/ -2%/ +2%
> > 1/ 8/ -1%/ -1% 1/ 8/ -1%/ 0%
> > 64/ 1/ -2%/ -3% 64/ 1/ +1%/ +1%
> > 64/ 4/ -5%/ -1% 64/ 4/ +2%/ 0%
> > 64/ 8/ 0%/ -1% 64/ 8/ -2%/ +1%
> > 256/ 1/ +7%/ +1% 256/ 1/ -7%/ 0%
> > 256/ 4/ +1%/ +1% 256/ 4/ -3%/ -4%
> > 256/ 8/ +2%/ +2% 256/ 8/ +1%/ +1%
> >
> > vq size=256 TCP_STREAM vq size=512 TCP_STREAM
> > size/sessions/+thu%/+normalize% size/sessions/+thu%/+normalize%
> > 64/ 1/ 0%/ -3% 64/ 1/ 0%/ 0%
> > 64/ 4/ +3%/ -1% 64/ 4/ -2%/ +4%
> > 64/ 8/ +9%/ -4% 64/ 8/ -1%/ +2%
> > 256/ 1/ +1%/ -4% 256/ 1/ +1%/ +1%
> > 256/ 4/ -1%/ -1% 256/ 4/ -3%/ 0%
> > 256/ 8/ +7%/ +5% 256/ 8/ -3%/ 0%
> > 512/ 1/ +1%/ 0% 512/ 1/ -1%/ -1%
> > 512/ 4/ +1%/ -1% 512/ 4/ 0%/ 0%
> > 512/ 8/ +7%/ -5% 512/ 8/ +6%/ -1%
> > 1024/ 1/ 0%/ -1% 1024/ 1/ 0%/ +1%
> > 1024/ 4/ +3%/ 0% 1024/ 4/ +1%/ 0%
> > 1024/ 8/ +8%/ +5% 1024/ 8/ -1%/ 0%
> > 2048/ 1/ +2%/ +2% 2048/ 1/ -1%/ 0%
> > 2048/ 4/ +1%/ 0% 2048/ 4/ 0%/ -1%
> > 2048/ 8/ -2%/ 0% 2048/ 8/ 5%/ -1%
> > 4096/ 1/ -2%/ 0% 4096/ 1/ -2%/ 0%
> > 4096/ 4/ +2%/ 0% 4096/ 4/ 0%/ 0%
> > 4096/ 8/ +9%/ -2% 4096/ 8/ -5%/ -1%
> >
> > Signed-off-by: Haibin Zhang <haibinzhang@tencent.com>
> > Signed-off-by: Yunfang Tai <yunfangtai@tencent.com>
> > Signed-off-by: Lidong Chen <lidongchen@tencent.com>
>
> Code is fine but I'd like to see validation of the heuristic
> 2*vq->num with another vq size.
>
>
>
> > ---
> > drivers/vhost/net.c | 8 +++++++-
> > 1 file changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > index 8139bc70ad7d..3563a305cc0a 100644
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -44,6 +44,10 @@ MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;"
> > * Using this limit prevents one virtqueue from starving others. */
> > #define VHOST_NET_WEIGHT 0x80000
> >
> > +/* Max number of packets transferred before requeueing the job.
> > + * Using this limit prevents one virtqueue from starving rx. */
> > +#define VHOST_NET_PKT_WEIGHT(vq) ((vq)->num * 2)
> > +
> > /* MAX number of TX used buffers for outstanding zerocopy */
> > #define VHOST_MAX_PEND 128
> > #define VHOST_GOODCOPY_LEN 256
> > @@ -473,6 +477,7 @@ static void handle_tx(struct vhost_net *net)
> > struct socket *sock;
> > struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
> > bool zcopy, zcopy_used;
> > + int sent_pkts = 0;
> >
> > mutex_lock(&vq->mutex);
> > sock = vq->private_data;
> > @@ -580,7 +585,8 @@ static void handle_tx(struct vhost_net *net)
> > else
> > vhost_zerocopy_signal_used(net, vq);
> > vhost_net_tx_packet(net);
> > - if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
> > + if (unlikely(total_len >= VHOST_NET_WEIGHT) ||
> > + unlikely(++sent_pkts >= VHOST_NET_PKT_WEIGHT(vq))) {
> > vhost_poll_queue(&vq->poll);
> > break;
> > }
> > --
> > 2.12.3
> >
^ permalink raw reply
* Re: KASAN: slab-out-of-bounds Read in pfkey_add
From: Eric Biggers @ 2018-04-09 4:04 UTC (permalink / raw)
To: Kevin Easton
Cc: davem, herbert, linux-kernel, netdev, steffen.klassert,
syzkaller-bugs, syzbot
In-Reply-To: <001a114292fadd3e2505607060a8@google.com>
On Fri, Dec 15, 2017 at 11:51:01PM -0800, syzbot wrote:
> Hello,
>
> syzkaller hit the following crash on
> 50c4c4e268a2d7a3e58ebb698ac74da0de40ae36
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached
> Raw console output is attached.
> C reproducer is attached
> syzkaller reproducer is attached. See https://goo.gl/kgGztJ
> for information about syzkaller reproducers
>
>
> audit: type=1400 audit(1513021744.055:7): avc: denied { map } for
> pid=3149 comm="syzkaller428285" path="/root/syzkaller428285483" dev="sda1"
> ino=16481 scontext=unconfined_u:system_r:insmod_t:s0-s0:c0.c1023
> tcontext=unconfined_u:object_r:user_home_t:s0 tclass=file permissive=1
> ==================================================================
> BUG: KASAN: slab-out-of-bounds in memcpy include/linux/string.h:341 [inline]
> BUG: KASAN: slab-out-of-bounds in pfkey_msg2xfrm_state net/key/af_key.c:1212
> [inline]
> BUG: KASAN: slab-out-of-bounds in pfkey_add+0x1634/0x3270
> net/key/af_key.c:1491
> Read of size 8192 at addr ffff8801c5197318 by task syzkaller428285/3149
>
> CPU: 0 PID: 3149 Comm: syzkaller428285 Not tainted 4.15.0-rc3+ #127
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
> __dump_stack lib/dump_stack.c:17 [inline]
> dump_stack+0x194/0x257 lib/dump_stack.c:53
> print_address_description+0x73/0x250 mm/kasan/report.c:252
> kasan_report_error mm/kasan/report.c:351 [inline]
> kasan_report+0x25b/0x340 mm/kasan/report.c:409
> check_memory_region_inline mm/kasan/kasan.c:260 [inline]
> check_memory_region+0x137/0x190 mm/kasan/kasan.c:267
> memcpy+0x23/0x50 mm/kasan/kasan.c:302
> memcpy include/linux/string.h:341 [inline]
> pfkey_msg2xfrm_state net/key/af_key.c:1212 [inline]
> pfkey_add+0x1634/0x3270 net/key/af_key.c:1491
> pfkey_process+0x60b/0x720 net/key/af_key.c:2809
> pfkey_sendmsg+0x4d6/0x9f0 net/key/af_key.c:3648
> sock_sendmsg_nosec net/socket.c:636 [inline]
> sock_sendmsg+0xca/0x110 net/socket.c:646
> ___sys_sendmsg+0x75b/0x8a0 net/socket.c:2026
> __sys_sendmsg+0xe5/0x210 net/socket.c:2060
> C_SYSC_sendmsg net/compat.c:739 [inline]
> compat_SyS_sendmsg+0x2a/0x40 net/compat.c:737
> do_syscall_32_irqs_on arch/x86/entry/common.c:327 [inline]
> do_fast_syscall_32+0x3ee/0xf9d arch/x86/entry/common.c:389
> entry_SYSENTER_compat+0x51/0x60 arch/x86/entry/entry_64_compat.S:125
> RIP: 0023:0xf7fd4c79
> RSP: 002b:00000000ff9d7c1c EFLAGS: 00000203 ORIG_RAX: 0000000000000172
> RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00000000205f5000
> RDX: 0000000000000000 RSI: 0000000000000167 RDI: 000000000000000f
> RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>
Looks like this is going to be fixed by
https://patchwork.kernel.org/patch/10327883/ ("af_key: Always verify length of
provided sadb_key"), but it's not applied yet to the ipsec tree yet. Kevin, for
future reference, for syzbot bugs it would be helpful to reply to the original
bug report and say that a patch was sent out, or even better send the patch as a
reply to the bug report email, e.g.
git format-patch --in-reply-to="<001a114292fadd3e2505607060a8@google.com>"
for this one (and the Message ID can be found in the syzkaller-bugs archive even
if the email isn't in your inbox). Otherwise people may not know that a patch
was sent out and do redundant work. Thanks!
I also simplified the reproducer for this, so here it is just in case someone
wants it anyway:
#include <sys/socket.h>
#include <unistd.h>
int main()
{
int fd = socket(AF_KEY, SOCK_RAW, 2);
char msg[96] =
"\x02\x03\x00\x02\x0c\x00\x00\x00\x00\x00\x00\x01\x02\x00\x00\x00"
"\x03\x00\x05\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00"
"\x00\x00\x00\x00\x00\x00\x00\x00"
"\x03\x00\x06\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00"
"\x00\x00\x00\x00\x00\x00\x00\x00"
"\x02\x00\x01\x00\x00\x00\x00\x00\x00\x00\xfb\x00\x00\x00\x00\x00"
"\x02\x00\x08\x00\xff\xff\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00";
write(fd, msg, sizeof(msg));
}
It causes a 8192-byte out-of-bounds read.
Eric
^ permalink raw reply
* Re: [RFC bpf-next] bpf: document eBPF helpers and add a script to generate man page
From: Alexei Starovoitov @ 2018-04-09 3:48 UTC (permalink / raw)
To: Quentin Monnet; +Cc: daniel, ast, netdev, oss-drivers, linux-doc, linux-man
In-Reply-To: <20180406111122.11038-1-quentin.monnet@netronome.com>
On Fri, Apr 06, 2018 at 12:11:22PM +0100, Quentin Monnet wrote:
> eBPF helper functions can be called from within eBPF programs to perform
> a variety of tasks that would be otherwise hard or impossible to do with
> eBPF itself. There is a growing number of such helper functions in the
> kernel, but documentation is scarce. The main user space header file
> does contain a short commented description of most helpers, but it is
> somewhat outdated and not complete. It is more a "cheat sheet" than a
> real documentation accessible to new eBPF developers.
>
> This commit attempts to improve the situation by replacing the existing
> overview for the helpers with a more developed description. Furthermore,
> a Python script is added to generate a manual page for eBPF helpers. The
> workflow is the following, and requires the rst2man utility:
>
> $ ./scripts/bpf_helpers_doc.py \
> --filename include/uapi/linux/bpf.h > /tmp/bpf-helpers.rst
> $ rst2man /tmp/bpf-helpers.rst > /tmp/bpf-helpers.7
> $ man /tmp/bpf-helpers.7
>
> The objective is to keep all documentation related to the helpers in a
> single place, and to be able to generate from here a manual page that
> could be packaged in the man-pages repository and shipped with most
> distributions [1].
>
> Additionally, parsing the prototypes of the helper functions could
> hopefully be reused, with a different Printer object, to generate
> header files needed in some eBPF-related projects.
>
> Regarding the description of each helper, it comprises several items:
>
> - The function prototype.
> - A description of the function and of its arguments (except for a
> couple of cases, when there are no arguments and the return value
> makes the function usage really obvious).
> - A description of return values (if not void).
> - A listing of eBPF program types (if relevant, map types) compatible
> with the helper.
> - Information about the helper being restricted to GPL programs, or not.
> - The kernel version in which the helper was introduced.
> - The commit that introduced the helper (this is mostly to have it in
> the source of the man page, as it can be used to track changes and
> update the page).
>
> For several helpers, descriptions are inspired (at times, nearly copied)
> from the commit logs introducing them in the kernel--Many thanks to
> their respective authors! They were completed as much as possible, the
> objective being to have something easily accessible even for people just
> starting with eBPF. There is probably a bit more work to do in this
> direction for some helpers.
>
> Some RST formatting is used in the descriptions (not in function
> prototypes, to keep them readable, but the Python script provided in
> order to generate the RST for the manual page does add formatting to
> prototypes, to produce something pretty) to get "bold" and "italics" in
> manual pages. Hopefully, the descriptions in bpf.h file remains
> perfectly readable. Note that the few trailing white spaces are
> intentional, removing them would break paragraphs for rst2man.
>
> The descriptions should ideally be updated each time someone adds a new
> helper, or updates the behaviour (compatibility extended to new program
> types, new socket option supported...) or the interface (new flags
> available, ...) of existing ones.
>
> [1] I have not contacted people from the man-pages project prior to
> sending this RFC, so I can offer no guaranty at this time that they
> would accept to take the generated man page.
>
> Cc: linux-doc@vger.kernel.org
> Cc: linux-man@vger.kernel.org
> Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
> ---
> include/uapi/linux/bpf.h | 2237 ++++++++++++++++++++++++++++++++++++--------
> scripts/bpf_helpers_doc.py | 568 +++++++++++
> 2 files changed, 2429 insertions(+), 376 deletions(-)
> create mode 100755 scripts/bpf_helpers_doc.py
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index c5ec89732a8d..f47aeddbbe0a 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -367,394 +367,1879 @@ union bpf_attr {
>
> /* BPF helper function descriptions:
> *
> - * void *bpf_map_lookup_elem(&map, &key)
> - * Return: Map value or NULL
> - *
> - * int bpf_map_update_elem(&map, &key, &value, flags)
> - * Return: 0 on success or negative error
> - *
> - * int bpf_map_delete_elem(&map, &key)
> - * Return: 0 on success or negative error
> - *
> - * int bpf_probe_read(void *dst, int size, void *src)
> - * Return: 0 on success or negative error
> + * void *bpf_map_lookup_elem(struct bpf_map *map, void *key)
> + * Description
> + * Perform a lookup in *map* for an entry associated to *key*.
> + * Return
> + * Map value associated to *key*, or **NULL** if no entry was
> + * found.
> + * For
> + * All types of programs. Limited to maps of types
> + * **BPF_MAP_TYPE_HASH**,
> + * **BPF_MAP_TYPE_ARRAY**,
> + * **BPF_MAP_TYPE_PERCPU_HASH**,
> + * **BPF_MAP_TYPE_PERCPU_ARRAY**,
> + * **BPF_MAP_TYPE_LRU_HASH**,
> + * **BPF_MAP_TYPE_LRU_PERCPU_HASH**,
> + * **BPF_MAP_TYPE_LPM_TRIE**,
> + * **BPF_MAP_TYPE_ARRAY_OF_MAPS**,
> + * and **BPF_MAP_TYPE_HASH_OF_MAPS**.
> + * GPL only
> + * No
> + * Since
> + * Linux 3.19
I think we should give it a try. There is a risk that it will become stale
and to reduce that I'd propose to remove 'For', 'GPL only' and 'Since',
since for new helpers 'Since' is kinda hard to fill in before it lands
all the way, and 'For' keeps changing as new types are added.
'Description' is the most useful and it needs separate thorough review
for every helper.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox