* pull-request: bpf 2018-03-21
From: Daniel Borkmann @ 2018-03-21 1:50 UTC
To: davem; +Cc: daniel, ast, netdev
Hi David,
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Follow-up fix to the fault injection framework to prevent jump
optimization on the kprobe by installing a dummy post-handler,
from Masami.
2) Drop bpf_perf_prog_read_value helper from tracepoint type programs
which was mistakenly added there and would otherwise crash due to
wrong input context, from Yonghong.
3) Fix a crash in BPF fs when compiled with clang. The code appears to
be fine; clang just optimizes overly aggressively in ways that do
not conform to C, therefore fix the kernel's Makefile to
generally prevent such issues, from Daniel.
4) Skip unnecessary capability checks in bpf syscall, which is otherwise
triggering unnecessary security hooks on capability checking and
causing false alarms on unprivileged processes trying to access
CAP_SYS_ADMIN restricted infra, from Chenbo.
5) Fix the test_bpf.ko module when CONFIG_BPF_JIT_ALWAYS_ON is set
with regard to a test case that is really just supposed to fail
on the x86_64 JIT but not others, from Thadeu.
Please consider pulling these changes from:
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git
Thanks a lot!
----------------------------------------------------------------
The following changes since commit 9e5fb7207024e53700bdac23f53d1e44d530a7f6:
Merge branch 'bnxt_en-Bug-fixes' (2018-03-12 10:58:28 -0400)
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git
for you to fetch changes up to 87e0d4f0f37fb0c8c4aeeac46fff5e957738df79:
kbuild: disable clang's default use of -fmerge-all-constants (2018-03-20 17:43:15 -0700)
----------------------------------------------------------------
Chenbo Feng (1):
bpf: skip unnecessary capability check
Daniel Borkmann (1):
kbuild: disable clang's default use of -fmerge-all-constants
Masami Hiramatsu (1):
error-injection: Fix to prohibit jump optimization
Thadeu Lima de Souza Cascardo (1):
test_bpf: Fix testing with CONFIG_BPF_JIT_ALWAYS_ON=y on other arches
Yonghong Song (1):
trace/bpf: remove helper bpf_perf_prog_read_value from tracepoint type programs
Makefile | 9 +++++++
kernel/bpf/syscall.c | 2 +-
kernel/fail_function.c | 10 +++++++
kernel/trace/bpf_trace.c | 68 ++++++++++++++++++++++++++++--------------------
lib/test_bpf.c | 2 +-
5 files changed, 61 insertions(+), 30 deletions(-)
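As a generic illustration of item 3 (the affected kernel code is not shown in this message, so every name below is hypothetical): C gives distinct objects distinct addresses, and code may rely on that to tell two identical-looking tables apart. A compiler that merges equal constants, which is what -fmerge-all-constants permits, could make both tables share one address and break the comparison. A minimal sketch:

```c
#include <stddef.h>

/* Hypothetical stand-in for a kernel operations table. */
struct ops {
	int (*open)(void);
};

/* Two distinct objects whose contents happen to be identical. */
static const struct ops prog_ops = { NULL };
static const struct ops map_ops = { NULL };

enum obj_type { OBJ_UNKNOWN, OBJ_PROG, OBJ_MAP };

/* Classify an object purely by which table it points to.
 * This is only correct if prog_ops and map_ops keep distinct
 * addresses, which the C standard guarantees.
 */
enum obj_type classify(const struct ops *ops)
{
	if (ops == &prog_ops)
		return OBJ_PROG;
	if (ops == &map_ops)
		return OBJ_MAP;
	return OBJ_UNKNOWN;
}
```

If the two tables were merged, classify(&map_ops) would report OBJ_PROG. Building with -fno-merge-all-constants keeps the objects separate.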
* [PATCH net-next] devlink: Remove top_hierarchy arg to devlink_resource_register
From: David Ahern @ 2018-03-21 2:31 UTC
To: netdev; +Cc: arkadis, jiri, David Ahern
top_hierarchy arg can be determined by comparing parent_resource_id to
DEVLINK_RESOURCE_ID_PARENT_TOP so it does not need to be a separate
argument.
Signed-off-by: David Ahern <dsahern@gmail.com>
---
drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 9 ++++-----
drivers/net/ethernet/mellanox/mlxsw/spectrum_kvdl.c | 6 +++---
include/net/devlink.h | 1 -
net/core/devlink.c | 4 +++-
4 files changed, 10 insertions(+), 10 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index 7884e8a2de35..180f49fbebe4 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -3880,8 +3880,7 @@ static int mlxsw_sp_resources_register(struct mlxsw_core *mlxsw_core)
kvd_size = MLXSW_CORE_RES_GET(mlxsw_core, KVD_SIZE);
err = devlink_resource_register(devlink, MLXSW_SP_RESOURCE_NAME_KVD,
- true, kvd_size,
- MLXSW_SP_RESOURCE_KVD,
+ kvd_size, MLXSW_SP_RESOURCE_KVD,
DEVLINK_RESOURCE_ID_PARENT_TOP,
&kvd_size_params,
NULL);
@@ -3890,7 +3889,7 @@ static int mlxsw_sp_resources_register(struct mlxsw_core *mlxsw_core)
linear_size = profile->kvd_linear_size;
err = devlink_resource_register(devlink, MLXSW_SP_RESOURCE_NAME_KVD_LINEAR,
- false, linear_size,
+ linear_size,
MLXSW_SP_RESOURCE_KVD_LINEAR,
MLXSW_SP_RESOURCE_KVD,
&linear_size_params,
@@ -3908,7 +3907,7 @@ static int mlxsw_sp_resources_register(struct mlxsw_core *mlxsw_core)
profile->kvd_hash_single_parts;
double_size = rounddown(double_size, profile->kvd_hash_granularity);
err = devlink_resource_register(devlink, MLXSW_SP_RESOURCE_NAME_KVD_HASH_DOUBLE,
- false, double_size,
+ double_size,
MLXSW_SP_RESOURCE_KVD_HASH_DOUBLE,
MLXSW_SP_RESOURCE_KVD,
&hash_double_size_params,
@@ -3918,7 +3917,7 @@ static int mlxsw_sp_resources_register(struct mlxsw_core *mlxsw_core)
single_size = kvd_size - double_size - linear_size;
err = devlink_resource_register(devlink, MLXSW_SP_RESOURCE_NAME_KVD_HASH_SINGLE,
- false, single_size,
+ single_size,
MLXSW_SP_RESOURCE_KVD_HASH_SINGLE,
MLXSW_SP_RESOURCE_KVD,
&hash_single_size_params,
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_kvdl.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_kvdl.c
index 4c9bff2fa055..85503e93b93f 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_kvdl.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_kvdl.c
@@ -459,7 +459,7 @@ int mlxsw_sp_kvdl_resources_register(struct devlink *devlink)
mlxsw_sp_kvdl_resource_size_params_prepare(devlink);
err = devlink_resource_register(devlink, MLXSW_SP_RESOURCE_NAME_KVD_LINEAR_SINGLES,
- false, MLXSW_SP_KVDL_SINGLE_SIZE,
+ MLXSW_SP_KVDL_SINGLE_SIZE,
MLXSW_SP_RESOURCE_KVD_LINEAR_SINGLE,
MLXSW_SP_RESOURCE_KVD_LINEAR,
&mlxsw_sp_kvdl_single_size_params,
@@ -468,7 +468,7 @@ int mlxsw_sp_kvdl_resources_register(struct devlink *devlink)
return err;
err = devlink_resource_register(devlink, MLXSW_SP_RESOURCE_NAME_KVD_LINEAR_CHUNKS,
- false, MLXSW_SP_KVDL_CHUNKS_SIZE,
+ MLXSW_SP_KVDL_CHUNKS_SIZE,
MLXSW_SP_RESOURCE_KVD_LINEAR_CHUNKS,
MLXSW_SP_RESOURCE_KVD_LINEAR,
&mlxsw_sp_kvdl_chunks_size_params,
@@ -477,7 +477,7 @@ int mlxsw_sp_kvdl_resources_register(struct devlink *devlink)
return err;
err = devlink_resource_register(devlink, MLXSW_SP_RESOURCE_NAME_KVD_LINEAR_LARGE_CHUNKS,
- false, MLXSW_SP_KVDL_LARGE_CHUNKS_SIZE,
+ MLXSW_SP_KVDL_LARGE_CHUNKS_SIZE,
MLXSW_SP_RESOURCE_KVD_LINEAR_LARGE_CHUNKS,
MLXSW_SP_RESOURCE_KVD_LINEAR,
&mlxsw_sp_kvdl_large_chunks_size_params,
diff --git a/include/net/devlink.h b/include/net/devlink.h
index c83125ad20ff..d5b707375e48 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -406,7 +406,6 @@ extern struct devlink_dpipe_header devlink_dpipe_header_ipv6;
int devlink_resource_register(struct devlink *devlink,
const char *resource_name,
- bool top_hierarchy,
u64 resource_size,
u64 resource_id,
u64 parent_resource_id,
diff --git a/net/core/devlink.c b/net/core/devlink.c
index f23e5ed7c90f..d03b96f87c25 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -3174,7 +3174,6 @@ EXPORT_SYMBOL_GPL(devlink_dpipe_table_unregister);
*/
int devlink_resource_register(struct devlink *devlink,
const char *resource_name,
- bool top_hierarchy,
u64 resource_size,
u64 resource_id,
u64 parent_resource_id,
@@ -3183,8 +3182,11 @@ int devlink_resource_register(struct devlink *devlink,
{
struct devlink_resource *resource;
struct list_head *resource_list;
+ bool top_hierarchy;
int err = 0;
+ top_hierarchy = parent_resource_id == DEVLINK_RESOURCE_ID_PARENT_TOP;
+
mutex_lock(&devlink->lock);
resource = devlink_resource_find(devlink, NULL, resource_id);
if (resource) {
--
2.11.0
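The idea behind the patch above can be sketched in isolation: when a flag is fully determined by another argument, deriving it inside the callee removes the chance of callers passing an inconsistent pair. The constant's value and the helper name below are assumptions for illustration only; the real definition lives in include/net/devlink.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed placeholder value, not taken from the kernel headers. */
#define RESOURCE_ID_PARENT_TOP 0

/* Hypothetical helper: derive the flag from the parent id instead of
 * accepting it as a separate argument.
 */
bool is_top_hierarchy(uint64_t parent_resource_id)
{
	return parent_resource_id == RESOURCE_ID_PARENT_TOP;
}
```

With the flag derived this way, a call such as register(..., false, ..., PARENT_TOP, ...) that contradicts itself can no longer be written.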
* Re: [PATCH net-next v3 1/2] net: permit skb_segment on head_frag frag_list skb
From: Yonghong Song @ 2018-03-21 5:02 UTC
To: Alexander Duyck
Cc: Eric Dumazet, ast, Daniel Borkmann, diptanu, Netdev, Kernel Team
In-Reply-To: <CAKgT0Ud3=qh-7mAHGfKQfhF+Q3scmKfhVhZgtnMfQ5Sy6-Msag@mail.gmail.com>
On 3/20/18 4:50 PM, Alexander Duyck wrote:
> On Tue, Mar 20, 2018 at 4:21 PM, Yonghong Song <yhs@fb.com> wrote:
>> One of our in-house projects, bpf-based NAT, hits a kernel BUG_ON at
>> function skb_segment(), line 3667. The bpf program attaches to
>> clsact ingress, calls bpf_skb_change_proto to change protocol
>> from ipv4 to ipv6 or from ipv6 to ipv4, and then calls bpf_redirect
>> to send the changed packet out.
>>
>> 3472 struct sk_buff *skb_segment(struct sk_buff *head_skb,
>> 3473 netdev_features_t features)
>> 3474 {
>> 3475 struct sk_buff *segs = NULL;
>> 3476 struct sk_buff *tail = NULL;
>> ...
>> 3665 while (pos < offset + len) {
>> 3666 if (i >= nfrags) {
>> 3667 BUG_ON(skb_headlen(list_skb));
>> 3668
>> 3669 i = 0;
>> 3670 nfrags = skb_shinfo(list_skb)->nr_frags;
>> 3671 frag = skb_shinfo(list_skb)->frags;
>> 3672 frag_skb = list_skb;
>> ...
>>
>> call stack:
>> ...
>> #1 [ffff883ffef03558] __crash_kexec at ffffffff8110c525
>> #2 [ffff883ffef03620] crash_kexec at ffffffff8110d5cc
>> #3 [ffff883ffef03640] oops_end at ffffffff8101d7e7
>> #4 [ffff883ffef03668] die at ffffffff8101deb2
>> #5 [ffff883ffef03698] do_trap at ffffffff8101a700
>> #6 [ffff883ffef036e8] do_error_trap at ffffffff8101abfe
>> #7 [ffff883ffef037a0] do_invalid_op at ffffffff8101acd0
>> #8 [ffff883ffef037b0] invalid_op at ffffffff81a00bab
>> [exception RIP: skb_segment+3044]
>> RIP: ffffffff817e4dd4 RSP: ffff883ffef03860 RFLAGS: 00010216
>> RAX: 0000000000002bf6 RBX: ffff883feb7aaa00 RCX: 0000000000000011
>> RDX: ffff883fb87910c0 RSI: 0000000000000011 RDI: ffff883feb7ab500
>> RBP: ffff883ffef03928 R8: 0000000000002ce2 R9: 00000000000027da
>> R10: 000001ea00000000 R11: 0000000000002d82 R12: ffff883f90a1ee80
>> R13: ffff883fb8791120 R14: ffff883feb7abc00 R15: 0000000000002ce2
>> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
>> #9 [ffff883ffef03930] tcp_gso_segment at ffffffff818713e7
>> --- <IRQ stack> ---
>> ...
>>
>> The triggering input skb has the following properties:
>> list_skb = skb->frag_list;
>> skb->nfrags != NULL && skb_headlen(list_skb) != 0
>> and skb_segment() is not able to handle a frag_list skb
>> if its headlen (list_skb->len - list_skb->data_len) is not 0.
>>
>> This patch addressed the issue by handling skb_headlen(list_skb) != 0
>> case properly if list_skb->head_frag is true, which is expected in
>> most cases. The head frag is processed before list_skb->frags
>> are processed.
>>
>> Reported-by: Diptanu Gon Choudhury <diptanu@fb.com>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>> net/core/skbuff.c | 51 +++++++++++++++++++++++++++++++++++++--------------
>> 1 file changed, 37 insertions(+), 14 deletions(-)
>>
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index 715c134..59bbc06 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -3475,7 +3475,7 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>> struct sk_buff *segs = NULL;
>> struct sk_buff *tail = NULL;
>> struct sk_buff *list_skb = skb_shinfo(head_skb)->frag_list;
>> - skb_frag_t *frag = skb_shinfo(head_skb)->frags;
>> + skb_frag_t *frag = skb_shinfo(head_skb)->frags, *head_frag = NULL;
>
> I think you misunderstood me. I wasn't saying you allocate head_frag.
> I was saying you could move the declaration down.
Sorry for my misunderstanding. I did understand your intention of moving
the declaration down in order to save stack space. I thought that we
cannot really move declaration down (although it works in C, but
semantically it is not quite right, more later), so I moved on to
use runtime allocation. But indeed skb_frag_t is not big (16 bytes), it
could live on the stack.
>
>> unsigned int mss = skb_shinfo(head_skb)->gso_size;
>> unsigned int doffset = head_skb->data - skb_mac_header(head_skb);
>> struct sk_buff *frag_skb = head_skb;
>> @@ -3664,19 +3664,39 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>>
>> while (pos < offset + len) {
>
> So right here in the loop you could add a "skb_frag_t head_frag;" just
> so we declare it here and save ourselves the stack space.
I actually tried to move "skb_frag_t head_frag". The stack size remains
the same, 0xc0. This is related to how the C compiler allocates stack space.
Where a variable is declared does not decide the stack size, as long as
the declaration precedes its uses. The stack size is really determined by
liveness analysis.
Further, we have code like:
do {
....
while (pos < offset + len) {
if (i >= nfrags) {
...
head_frag = ...
}
... = head_frag; // head_frag access guaranteed after
// above definition, but it may not
// be in the same outer do-while loop.
}
...
} while ((offset += len) < head_skb->len);
So the use of head_frag may be in a different outer loop iteration
than its definition. So I feel the definition of head_frag should be
outside the outer do-while loop, at the main function scope. I will add
some comments here.
>
>> if (i >= nfrags) {
>> - BUG_ON(skb_headlen(list_skb));
>> -
>> i = 0;
>> + if (skb_headlen(list_skb)) {
>> + struct page *page;
>> +
>> + BUG_ON(!list_skb->head_frag);
>> +
>> + page = virt_to_head_page(list_skb->head);
>> + if (!head_frag) {
>> + head_frag = kmalloc(sizeof(skb_frag_t),
>> + GFP_KERNEL);
>> + if (!head_frag)
>> + goto err;
>> + }
>
> Please no memory allocation. I just meant you could allocate it on the
> stack later.
>
>> + head_frag->page.p = page;
>> + head_frag->page_offset = list_skb->data -
>> + (unsigned char *)page_address(page);
>> + head_frag->size = skb_headlen(list_skb);
>> + /* set i = -1 so we will pick head_frag
>> + * instead of skb_shinfo(list_skb)->frags
>> + * when i == -1.
>> + */
>> + i = -1;
>> + }
>
> So it took me a bit to pick up on the fact that line below wasn't
> removed. So we are basically trying to do this all in one pass now. Do
> I have that right?
>
> One thing you could look at doing to save yourself the extra "if"
> later would be to pull frag pointer before you go through skb_headlen
> check above. Then if you are going to use a head_frag you could just
> do a "i--; frag--;" combination just to rewind and make the room for
> the increment to come later. That way you don't have an invalid frag
> pointer floating around. That way you only have to do this once
> instead of having to do a conditional check per fragment.
Right. This indeed makes the code cleaner.
>
>> nfrags = skb_shinfo(list_skb)->nr_frags;
>> - frag = skb_shinfo(list_skb)->frags;
>
> This patch might be more readable if you were to just insert the
> skb_headlen() bits down here and left the i=0 through frag = .. in one
> piece.
Right. Will implement as suggested.
>
>> - frag_skb = list_skb;
>> -
>> - BUG_ON(!nfrags);
>> -
>> - if (skb_orphan_frags(frag_skb, GFP_ATOMIC) ||
>> - skb_zerocopy_clone(nskb, frag_skb,
>> - GFP_ATOMIC))
>> - goto err;
>> + if (nfrags) {
>> + frag = skb_shinfo(list_skb)->frags;
>> + frag_skb = list_skb;
>> +
>> + if (skb_orphan_frags(frag_skb, GFP_ATOMIC) ||
>> + skb_zerocopy_clone(nskb, frag_skb,
>> + GFP_ATOMIC))
>> + goto err;
>> + }
>>
>> list_skb = list_skb->next;
>> }
>> @@ -3689,7 +3709,7 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>> goto err;
>> }
>>
>> - *nskb_frag = *frag;
>> + *nskb_frag = (i == -1) ? *head_frag : *frag;
>
> So this would be better as "*nskb_frag = (i < 0) ? head_frag : *frag;".
Good suggestion. Will implement as suggested.
>
>> __skb_frag_ref(nskb_frag);
>> size = skb_frag_size(nskb_frag);
>>
>> @@ -3702,7 +3722,8 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>>
>> if (pos + size <= offset + len) {
>> i++;
>> - frag++;
>> + if (i != 0)
>> + frag++;
>> pos += size;
>> } else {
>> skb_frag_size_sub(nskb_frag, pos + size - (offset + len));
>> @@ -3774,10 +3795,12 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>> swap(tail->destructor, head_skb->destructor);
>> swap(tail->sk, head_skb->sk);
>> }
>> + kfree(head_frag);
>> return segs;
>>
>> err:
>> kfree_skb_list(segs);
>> + kfree(head_frag);
>> return ERR_PTR(err);
>> }
>> EXPORT_SYMBOL_GPL(skb_segment);
>> --
>> 2.9.5
>>
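The scoping argument in the reply above (state set inside the inner loop may be consumed in a later iteration of the outer do-while loop) can be sketched in plain userspace C. This is only an illustration, not the skb_segment() code: the function name and the byte-counting model are hypothetical, and source lengths are assumed non-negative.

```c
#include <stddef.h>

/* Cut the concatenation of the sources in src_len[] into segments of
 * seg_len bytes. Return how many segments start while a source opened
 * by an earlier segment is still partly unread, or -1 on bad input.
 */
int count_carried_segments(const int *src_len, size_t nsrc, int seg_len)
{
	int remaining = 0;	/* function scope on purpose, see below */
	size_t next = 0, k;
	int total = 0, offset = 0, carried = 0;

	if (seg_len <= 0)
		return -1;
	for (k = 0; k < nsrc; k++)
		total += src_len[k];

	do {
		int pos = offset;
		int end = offset + seg_len;

		if (end > total)
			end = total;
		/* This read sees a value written by a previous outer
		 * iteration. Declaring 'remaining' inside this block
		 * would leave it indeterminate here.
		 */
		if (remaining > 0)
			carried++;
		while (pos < end) {
			int take;

			/* Open the next source once the current one is used up. */
			while (remaining == 0)
				remaining = src_len[next++];
			take = remaining < end - pos ? remaining : end - pos;
			pos += take;
			remaining -= take;
		}
		offset = end;
	} while (offset < total);

	return carried;
}
```

For two 10-byte sources cut into 4-byte segments, every segment but the first starts inside a source opened earlier, which is the situation head_frag has to survive.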
* Re: [PATCH net-next v3 2/2] net: bpf: add a test for skb_segment in test_bpf module
From: Yonghong Song @ 2018-03-21 5:15 UTC
To: Eric Dumazet, edumazet, ast, daniel, diptanu, netdev,
alexander.duyck
Cc: kernel-team
In-Reply-To: <890b597a-85b6-a3fc-3419-8cace6d0f2b7@gmail.com>
On 3/20/18 5:44 PM, Eric Dumazet wrote:
>
>
> On 03/20/2018 04:21 PM, Yonghong Song wrote:
>> Without the previous commit,
>> "modprobe test_bpf" will have the following errors:
>> ...
>> [ 98.149165] ------------[ cut here ]------------
>> [ 98.159362] kernel BUG at net/core/skbuff.c:3667!
>> [ 98.169756] invalid opcode: 0000 [#1] SMP PTI
>> [ 98.179370] Modules linked in:
>> [ 98.179371] test_bpf(+)
>> ...
>> which triggers the bug the previous commit intends to fix.
>>
>> The skbs are constructed to mimic what mlx5 may generate.
>> The packet size/header may not mimic real cases in production. But
>> the processing flow is similar.
>>
>> Signed-off-by: Yonghong Song <yhs@fb.com>
>> ---
>> lib/test_bpf.c | 71 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>> 1 file changed, 70 insertions(+), 1 deletion(-)
>>
>> diff --git a/lib/test_bpf.c b/lib/test_bpf.c
>> index 2efb213..045d7d3 100644
>> --- a/lib/test_bpf.c
>> +++ b/lib/test_bpf.c
>> @@ -6574,6 +6574,72 @@ static bool exclude_test(int test_id)
>> return test_id < test_range[0] || test_id > test_range[1];
>> }
>>
>> +static struct sk_buff *build_test_skb(void *page)
>> +{
>> + u32 headroom = NET_SKB_PAD + NET_IP_ALIGN + ETH_HLEN;
>> + struct sk_buff *skb[2];
>> + int i, data_size = 8;
>> +
>> + for (i = 0; i < 2; i++) {
>> + /* this will set skb[i]->head_frag */
>> + skb[i] = build_skb(page, headroom);
>> + if (!skb[i])
>> + return NULL;
>
> You are using the same virtual address (page) for both skb ?
>
> So we have 2 skbs having skb->head pointing to the same location ?
Thanks, Eric. This is purely due to my 'laziness' to make it work as I
know that skb_segment does not really enforce this. I will address
all of your comments in the next revision.
>
> This is illegal.
>
> Please use instead : skb = dev_alloc_skb(headroom + data_size)
>
>> +
>> + skb_reserve(skb[i], headroom);
>> + skb_put(skb[i], data_size);
>> + skb[i]->protocol = htons(ETH_P_IP);
>> + skb_reset_network_header(skb[i]);
>> + skb_set_mac_header(skb[i], -ETH_HLEN);
>> +
>> + skb_add_rx_frag(skb[i],
>
> skb_shinfo(skb[i])->nr_frags,
>
> 0 ?
>
>> + page, 0, 64, 64);
>
> get_page(page) ?
>
>> + // skb: skb_headlen(skb[i]): 8, skb[i]->head_frag = 1
>> + }
>> +
>> + /* setup shinfo */
>> + skb_shinfo(skb[0])->gso_size = 1448;
>> + skb_shinfo(skb[0])->gso_type = SKB_GSO_TCPV4;
>> + skb_shinfo(skb[0])->gso_type |= SKB_GSO_DODGY;
>> + skb_shinfo(skb[0])->gso_segs = 0;
>> + skb_shinfo(skb[0])->frag_list = skb[1];
>> +
>> + /* adjust skb[0]'s len */
>> + skb[0]->len += skb[1]->len;
>> + skb[0]->data_len += skb[1]->data_len;
>> + skb[0]->truesize += skb[1]->truesize;
>> +
>> + return skb[0];
>> +}
>> +
>> +static __init int test_skb_segment(void)
>> +{
>> + netdev_features_t features;
>> + struct sk_buff *skb;
>> + void *page;
>> + int ret = -1;
>> +
>> + page = (void *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
>> + if (!page) {
>> + pr_info("%s: failed to get_free_page!", __func__);
>> + return ret;
>> + }
>> +
>> + features = NETIF_F_SG | NETIF_F_GSO_PARTIAL | NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM;
>> + features |= NETIF_F_RXCSUM;
>> + skb = build_test_skb(page);
>> + if (!skb) {
>> + pr_info("%s: failed to build_test_skb", __func__);
>> + } else if (skb_segment(skb, features)) {
>> + ret = 0;
>> + pr_info("%s: success in skb_segment!", __func__);
>> + } else {
>> + pr_info("%s: failed in skb_segment!", __func__);
>> + }
>> + free_page((unsigned long)page);
>
>
> Where are the skbs freed ?
>
>
>> + return ret;
>> +}
>> +
>> static __init int test_bpf(void)
>> {
>> int i, err_cnt = 0, pass_cnt = 0;
>> @@ -6632,8 +6698,11 @@ static int __init test_bpf_init(void)
>> return ret;
>>
>> ret = test_bpf();
>> -
>> destroy_bpf_tests();
>> + if (ret)
>> + return ret;
>> +
>> + ret = test_skb_segment();
>> return ret;
>> }
>>
>>
* Re: [PATCH net-next 5/6] tls: RX path for ktls
From: Boris Pismenny @ 2018-03-21 5:20 UTC
To: Dave Watson, David S. Miller, Tom Herbert, Alexei Starovoitov,
herbert, linux-crypto, netdev
Cc: Atul Gupta, Vakul Garg, Hannes Frederic Sowa, Steffen Klassert,
John Fastabend, Daniel Borkmann
In-Reply-To: <20180320175434.GA23938@davejwatson-mba.local>
On 3/20/2018 7:54 PM, Dave Watson wrote:
> Add rx path for tls software implementation.
>
> recvmsg, splice_read, and poll implemented.
>
> An additional sockopt TLS_RX is added, with the same interface as
> TLS_TX. Either TLS_RX or TLS_TX may be provided separately, or
> together (with two different setsockopt calls with appropriate keys).
>
> Control messages are passed via CMSG in a similar way to transmit.
> If no cmsg buffer is passed, then only application data records
> will be passed to userspace, and EIO is returned for other types of
> alerts.
>
> EBADMSG is passed for decryption errors, and EMSGSIZE is passed for
> framing errors (either framing too big *or* too small with crypto
> overhead). EINVAL is returned for TLS versions that do not match the
> original setsockopt call. All are unrecoverable.
>
> strparser is used to parse TLS framing. Decryption is done directly
> in to userspace buffers if they are large enough to support it, otherwise
> sk_cow_data is called (similar to ipsec), and buffers are decrypted in
> place and copied. splice_read always decrypts in place, since no
> buffers are provided to decrypt in to.
>
> sk_poll is overridden, and only returns POLLIN if a full TLS message is
> received. Otherwise we wait for strparser to finish reading a full frame.
> Actual decryption is only done during recvmsg or splice_read calls.
>
> Signed-off-by: Dave Watson <davejwatson@fb.com>
> ---
...
> +
> +static int tls_read_size(struct strparser *strp, struct sk_buff *skb)
> +{
> + struct tls_context *tls_ctx = tls_get_ctx(strp->sk);
> + struct tls_sw_context *ctx = tls_sw_ctx(tls_ctx);
> + char header[tls_ctx->rx.prepend_size];
> + struct strp_msg *rxm = strp_msg(skb);
> + size_t cipher_overhead;
> + size_t data_len = 0;
> + int ret;
> +
> + /* Verify that we have a full TLS header, or wait for more data */
> + if (rxm->offset + tls_ctx->rx.prepend_size > skb->len)
> + return 0;
> +
> + /* Linearize header to local buffer */
> + ret = skb_copy_bits(skb, rxm->offset, header, tls_ctx->rx.prepend_size);
> +
> + if (ret < 0)
> + goto read_failure;
> +
> + ctx->control = header[0];
> +
> + data_len = ((header[4] & 0xFF) | (header[3] << 8));
> +
> + cipher_overhead = tls_ctx->rx.tag_size + tls_ctx->rx.iv_size;
> +
> + if (data_len > TLS_MAX_PAYLOAD_SIZE + cipher_overhead) {
> + ret = -EMSGSIZE;
> + goto read_failure;
> + }
> + if (data_len < cipher_overhead) {
> + ret = -EMSGSIZE;
I think this should be considered EBADMSG, because this error is cipher
dependent. At least, that's what happens within OpenSSL. Also, EMSGSIZE
is usually used only for messages that are too long.
> + goto read_failure;
> + }
> +
> + if (header[1] != TLS_VERSION_MINOR(tls_ctx->crypto_recv.version) ||
> + header[2] != TLS_VERSION_MAJOR(tls_ctx->crypto_recv.version)) {
> + ret = -EINVAL;
> + goto read_failure;
> + }
> +
> + return data_len + TLS_HEADER_SIZE;
> +
> +read_failure:
> + tls_err_abort(strp->sk, ret);
> +
> + return ret;
> +}
> +
...
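To make the error-code discussion above concrete, here is a userspace sketch of the same header checks with the suggested EBADMSG for records shorter than the cipher overhead. It is not the kernel function: the name, the parameters, and the constants are assumptions for illustration, and the header is read as unsigned bytes in wire order (type, major version, minor version, 16-bit big-endian length).

```c
#include <errno.h>
#include <stddef.h>

#define REC_HEADER_SIZE 5
#define REC_MAX_PAYLOAD 16384	/* assumed to mirror TLS_MAX_PAYLOAD_SIZE */

/* Hypothetical helper. Returns the full record size (header plus body),
 * 0 if more data is needed, or a negative errno value.
 */
long parse_record_header(const unsigned char *hdr, size_t avail,
			 size_t cipher_overhead,
			 unsigned char ver_major, unsigned char ver_minor)
{
	size_t data_len;

	/* Wait until a full header is available. */
	if (avail < REC_HEADER_SIZE)
		return 0;

	/* Unsigned bytes avoid sign extension when shifting. */
	data_len = ((size_t)hdr[3] << 8) | hdr[4];

	if (data_len > REC_MAX_PAYLOAD + cipher_overhead)
		return -EMSGSIZE;	/* framing too big */
	if (data_len < cipher_overhead)
		return -EBADMSG;	/* too small, as suggested in review */
	if (hdr[1] != ver_major || hdr[2] != ver_minor)
		return -EINVAL;

	return (long)(data_len + REC_HEADER_SIZE);
}
```

A caller would pass the tag size plus IV size as cipher_overhead, matching the cipher_overhead computation in the quoted patch.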
* Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access
From: Ingo Molnar @ 2018-03-21 6:32 UTC
To: Linus Torvalds
Cc: Thomas Gleixner, David Laight, Rahul Lakkireddy, x86@kernel.org,
linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
mingo@redhat.com, hpa@zytor.com, davem@davemloft.net,
akpm@linux-foundation.org, ganeshgr@chelsio.com,
nirranjan@chelsio.com, indranil@chelsio.com, Andy Lutomirski,
Peter Zijlstra, Fenghua Yu, Eric Biggers
In-Reply-To: <CA+55aFybTvLz47mw=AG21jCJv_hE2vaVyiJP_F-4vAD-3Gnc7Q@mail.gmail.com>
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> And even if you ignore that "maintenance problems down the line" issue
> ("we can fix them when they happen") I don't want to see games like
> this, because I'm pretty sure it breaks the optimized xsave by tagging
> the state as being dirty.
That's true - and it would penalize the context switch cost of the affected task
for the rest of its lifetime, as I don't think there's much that clears XINUSE
other than a FINIT, which is rarely done by user-space.
> So no. Don't use vector stuff in the kernel. It's not worth the pain.
I agree, but:
> The *only* valid use is pretty much crypto, and even there it has had issues.
> Benchmarks use big arrays and/or dense working sets etc to "prove" how good the
> vector version is, and then you end up in situations where it's used once per
> fairly small packet for an interrupt, and it's actually much worse than doing it
> by hand.
That's mainly because the XSAVE/XRSTOR done by kernel_fpu_begin()/end() is so
expensive, so this argument is somewhat circular.
IFF it was safe to just use the vector unit then vector unit based crypto would be
very fast for small buffers as well, and would be even faster for larger buffer
sizes. Saving and restoring up to ~1.5K of context is not cheap.
Thanks,
Ingo
* [PATCH net-next v4 0/2] net: permit skb_segment on head_frag frag_list skb
From: Yonghong Song @ 2018-03-21 6:47 UTC
To: edumazet, ast, daniel, diptanu, netdev; +Cc: kernel-team
One of our in-house projects, bpf-based NAT, hits a kernel BUG_ON at
function skb_segment(), line 3667. The bpf program attaches to
clsact ingress, calls bpf_skb_change_proto to change protocol
from ipv4 to ipv6 or from ipv6 to ipv4, and then calls bpf_redirect
to send the changed packet out.
...
3665 while (pos < offset + len) {
3666 if (i >= nfrags) {
3667 BUG_ON(skb_headlen(list_skb));
...
The triggering input skb has the following properties:
list_skb = skb->frag_list;
skb->nfrags != NULL && skb_headlen(list_skb) != 0
and skb_segment() is not able to handle a frag_list skb
if its headlen (list_skb->len - list_skb->data_len) is not 0.
Patch #1 provides a simple solution to avoid BUG_ON. If
list_skb->head_frag is true, its page-backed frag will
be processed before the list_skb->frags.
Patch #2 provides a test case in test_bpf module which
constructs a skb and calls skb_segment() directly. The test
case is able to trigger the BUG_ON without Patch #1.
The patch has been tested in the following setup:
ipv6_host <-> nat_server <-> ipv4_host
where nat_server has a bpf program doing ipv4<->ipv6
translation and forwarding through clsact hook
bpf_skb_change_proto.
Changelog:
v3 -> v4:
. Remove dynamic memory allocation and use rewinding
for both index and frag to remove one branch in fast path,
from Alexander.
. Fix a bunch of issues in test_bpf skb_segment() test,
including the proper way to allocate skb, the proper function
argument for skb_add_rx_frag, and not freeing skb, etc.,
from Eric.
v2 -> v3:
. Use starting frag index -1 (instead of 0) to
special process head_frag before other frags in the skb,
from Alexander Duyck.
v1 -> v2:
. Removed never-hit BUG_ON, spotted by Linyu Yuan.
Yonghong Song (2):
net: permit skb_segment on head_frag frag_list skb
net: bpf: add a test for skb_segment in test_bpf module
lib/test_bpf.c | 91 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
net/core/skbuff.c | 36 +++++++++++++++-------
2 files changed, 114 insertions(+), 13 deletions(-)
--
2.9.5
* [PATCH net-next v4 1/2] net: permit skb_segment on head_frag frag_list skb
From: Yonghong Song @ 2018-03-21 6:47 UTC
To: edumazet, ast, daniel, diptanu, netdev; +Cc: kernel-team
In-Reply-To: <20180321064722.1411857-1-yhs@fb.com>
One of our in-house projects, bpf-based NAT, hits a kernel BUG_ON at
function skb_segment(), line 3667. The bpf program attaches to
clsact ingress, calls bpf_skb_change_proto to change protocol
from ipv4 to ipv6 or from ipv6 to ipv4, and then calls bpf_redirect
to send the changed packet out.
3472 struct sk_buff *skb_segment(struct sk_buff *head_skb,
3473 netdev_features_t features)
3474 {
3475 struct sk_buff *segs = NULL;
3476 struct sk_buff *tail = NULL;
...
3665 while (pos < offset + len) {
3666 if (i >= nfrags) {
3667 BUG_ON(skb_headlen(list_skb));
3668
3669 i = 0;
3670 nfrags = skb_shinfo(list_skb)->nr_frags;
3671 frag = skb_shinfo(list_skb)->frags;
3672 frag_skb = list_skb;
...
call stack:
...
#1 [ffff883ffef03558] __crash_kexec at ffffffff8110c525
#2 [ffff883ffef03620] crash_kexec at ffffffff8110d5cc
#3 [ffff883ffef03640] oops_end at ffffffff8101d7e7
#4 [ffff883ffef03668] die at ffffffff8101deb2
#5 [ffff883ffef03698] do_trap at ffffffff8101a700
#6 [ffff883ffef036e8] do_error_trap at ffffffff8101abfe
#7 [ffff883ffef037a0] do_invalid_op at ffffffff8101acd0
#8 [ffff883ffef037b0] invalid_op at ffffffff81a00bab
[exception RIP: skb_segment+3044]
RIP: ffffffff817e4dd4 RSP: ffff883ffef03860 RFLAGS: 00010216
RAX: 0000000000002bf6 RBX: ffff883feb7aaa00 RCX: 0000000000000011
RDX: ffff883fb87910c0 RSI: 0000000000000011 RDI: ffff883feb7ab500
RBP: ffff883ffef03928 R8: 0000000000002ce2 R9: 00000000000027da
R10: 000001ea00000000 R11: 0000000000002d82 R12: ffff883f90a1ee80
R13: ffff883fb8791120 R14: ffff883feb7abc00 R15: 0000000000002ce2
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#9 [ffff883ffef03930] tcp_gso_segment at ffffffff818713e7
--- <IRQ stack> ---
...
The triggering input skb has the following properties:
list_skb = skb->frag_list;
skb->nfrags != NULL && skb_headlen(list_skb) != 0
and skb_segment() is not able to handle a frag_list skb
if its headlen (list_skb->len - list_skb->data_len) is not 0.
This patch addresses the issue by handling the skb_headlen(list_skb) != 0
case properly if list_skb->head_frag is true, which is expected in
most cases. The head frag is processed before list_skb->frags
are processed.
Reported-by: Diptanu Gon Choudhury <diptanu@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
---
net/core/skbuff.c | 36 +++++++++++++++++++++++++-----------
1 file changed, 25 insertions(+), 11 deletions(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 715c134..09f4c24 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3475,7 +3475,7 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
struct sk_buff *segs = NULL;
struct sk_buff *tail = NULL;
struct sk_buff *list_skb = skb_shinfo(head_skb)->frag_list;
- skb_frag_t *frag = skb_shinfo(head_skb)->frags;
+ skb_frag_t *frag = skb_shinfo(head_skb)->frags, head_frag;
unsigned int mss = skb_shinfo(head_skb)->gso_size;
unsigned int doffset = head_skb->data - skb_mac_header(head_skb);
struct sk_buff *frag_skb = head_skb;
@@ -3664,19 +3664,30 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
while (pos < offset + len) {
if (i >= nfrags) {
- BUG_ON(skb_headlen(list_skb));
-
i = 0;
nfrags = skb_shinfo(list_skb)->nr_frags;
frag = skb_shinfo(list_skb)->frags;
- frag_skb = list_skb;
-
- BUG_ON(!nfrags);
+ if (skb_headlen(list_skb)) {
+ struct page *page;
+
+ BUG_ON(!list_skb->head_frag);
+
+ page = virt_to_head_page(list_skb->head);
+ head_frag.page.p = page;
+ head_frag.page_offset = list_skb->data -
+ (unsigned char *)page_address(page);
+ head_frag.size = skb_headlen(list_skb);
+ /* to make room for head_frag. */
+ i--; frag--;
+ }
+ if (nfrags) {
+ frag_skb = list_skb;
- if (skb_orphan_frags(frag_skb, GFP_ATOMIC) ||
- skb_zerocopy_clone(nskb, frag_skb,
- GFP_ATOMIC))
- goto err;
+ if (skb_orphan_frags(frag_skb, GFP_ATOMIC) ||
+ skb_zerocopy_clone(nskb, frag_skb,
+ GFP_ATOMIC))
+ goto err;
+ }
list_skb = list_skb->next;
}
@@ -3689,7 +3700,10 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
goto err;
}
- *nskb_frag = *frag;
+ /* head_frag could be defined in previous outer do/while
+ * loop iterations.
+ */
+ *nskb_frag = (i < 0) ? head_frag : *frag;
__skb_frag_ref(nskb_frag);
size = skb_frag_size(nskb_frag);
--
2.9.5
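The rewind used in this patch (start the index at -1 so a synthetic head fragment is consumed before the regular fragments) can be sketched outside the kernel. The types and the function below are hypothetical simplifications. Unlike the patch, the sketch rewinds only the index and indexes the array directly, so that portable C never forms a pointer before the start of an array.

```c
#include <stddef.h>

struct frag {
	int size;
};

struct buf {
	int headlen;			/* bytes in the linear head, 0 if none */
	const struct frag *frags;	/* page fragments, may be NULL if none */
	int nr_frags;
};

/* Write the size of every piece (head first, then frags) of every buffer
 * to out[], in order. Returns the number of entries written.
 */
size_t flatten_sizes(const struct buf *bufs, size_t nbufs,
		     int *out, size_t max_out)
{
	struct frag head_frag = { 0 };
	const struct frag *frags = NULL;
	int i = 0, nfrags = 0;
	size_t b = 0, n = 0;

	for (;;) {
		/* Current buffer exhausted: open the next one. */
		while (i >= nfrags) {
			if (b == nbufs)
				return n;
			i = 0;
			nfrags = bufs[b].nr_frags;
			frags = bufs[b].frags;
			if (bufs[b].headlen) {
				head_frag.size = bufs[b].headlen;
				i--;	/* make room for head_frag at index -1 */
			}
			b++;
		}
		if (n == max_out)
			return n;
		/* One selection per piece, with no extra branch on the increment. */
		out[n++] = (i < 0) ? head_frag.size : frags[i].size;
		i++;
	}
}
```

A buffer with a head but no fragments still works: the index goes from -1 to 0, which already equals nfrags, so the next buffer is opened.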
* [PATCH net-next v4 2/2] net: bpf: add a test for skb_segment in test_bpf module
From: Yonghong Song @ 2018-03-21 6:47 UTC
To: edumazet, ast, daniel, diptanu, netdev; +Cc: kernel-team
In-Reply-To: <20180321064722.1411857-1-yhs@fb.com>
Without the previous commit,
"modprobe test_bpf" will have the following errors:
...
[ 98.149165] ------------[ cut here ]------------
[ 98.159362] kernel BUG at net/core/skbuff.c:3667!
[ 98.169756] invalid opcode: 0000 [#1] SMP PTI
[ 98.179370] Modules linked in:
[ 98.179371] test_bpf(+)
...
which triggers the bug the previous commit intends to fix.
The skbs are constructed to mimic what mlx5 may generate.
The packet size/header may not mimic real cases in production. But
the processing flow is similar.
Signed-off-by: Yonghong Song <yhs@fb.com>
---
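For readers following the flow this test exercises, here is a minimal
user-space sketch (not kernel code, and not part of the patch) of how a
non-empty linear area can be described as one extra fragment and visited
at index -1 ahead of the page frags. The struct and function names
(model_skb, model_frag, model_head_frag, model_total_payload) are
hypothetical stand-ins chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u

/* Hypothetical stand-in for a page fragment descriptor. */
struct model_frag {
	const unsigned char *page;	/* start of the backing page */
	unsigned int page_offset;	/* offset of the data inside the page */
	unsigned int size;		/* number of bytes in this fragment */
};

/* Hypothetical stand-in for an skb with a page-backed linear area. */
struct model_skb {
	const unsigned char *head_page;	/* page backing the linear area */
	const unsigned char *data;	/* start of linear data in head_page */
	unsigned int headlen;		/* bytes of linear data */
	unsigned int nr_frags;
	struct model_frag frags[4];
};

/* Describe the linear area as if it were one more page fragment. */
static struct model_frag model_head_frag(const struct model_skb *skb)
{
	struct model_frag f;

	f.page = skb->head_page;
	f.page_offset = (unsigned int)(skb->data - skb->head_page);
	f.size = skb->headlen;
	assert(f.page_offset + f.size <= MODEL_PAGE_SIZE);
	return f;
}

/* Walk all fragments; index -1 stands for the linear area when present. */
static unsigned int model_total_payload(const struct model_skb *skb)
{
	struct model_frag head_frag = model_head_frag(skb);
	int i = skb->headlen ? -1 : 0;
	unsigned int total = 0;

	for (; i < (int)skb->nr_frags; i++) {
		const struct model_frag *frag =
			(i < 0) ? &head_frag : &skb->frags[i];

		total += frag->size;
	}
	return total;
}
```

The sketch uses a signed index rather than stepping a pointer before the
start of the array, which keeps the user-space version within defined C
behavior.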
lib/test_bpf.c | 91 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 89 insertions(+), 2 deletions(-)
diff --git a/lib/test_bpf.c b/lib/test_bpf.c
index 2efb213..086a231 100644
--- a/lib/test_bpf.c
+++ b/lib/test_bpf.c
@@ -6574,6 +6574,91 @@ static bool exclude_test(int test_id)
return test_id < test_range[0] || test_id > test_range[1];
}
+static __init struct sk_buff *build_test_skb(void)
+{
+ u32 headroom = NET_SKB_PAD + NET_IP_ALIGN + ETH_HLEN;
+ struct sk_buff *skb[2];
+ struct page *page[2];
+ int i, data_size = 8;
+
+ for (i = 0; i < 2; i++) {
+ page[i] = alloc_page(GFP_KERNEL);
+ if (!page[i]) {
+ if (i == 0)
+ goto err_page0;
+ else
+ goto err_page1;
+ }
+
+ /* this will set skb[i]->head_frag */
+ skb[i] = dev_alloc_skb(headroom + data_size);
+ if (!skb[i]) {
+ if (i == 0)
+ goto err_skb0;
+ else
+ goto err_skb1;
+ }
+
+ skb_reserve(skb[i], headroom);
+ skb_put(skb[i], data_size);
+ skb[i]->protocol = htons(ETH_P_IP);
+ skb_reset_network_header(skb[i]);
+ skb_set_mac_header(skb[i], -ETH_HLEN);
+
+ skb_add_rx_frag(skb[i], 0, page[i], 0, 64, 64);
+ // skb_headlen(skb[i]): 8, skb[i]->head_frag = 1
+ }
+
+ /* setup shinfo */
+ skb_shinfo(skb[0])->gso_size = 1448;
+ skb_shinfo(skb[0])->gso_type = SKB_GSO_TCPV4;
+ skb_shinfo(skb[0])->gso_type |= SKB_GSO_DODGY;
+ skb_shinfo(skb[0])->gso_segs = 0;
+ skb_shinfo(skb[0])->frag_list = skb[1];
+
+ /* adjust skb[0]'s len */
+ skb[0]->len += skb[1]->len;
+ skb[0]->data_len += skb[1]->data_len;
+ skb[0]->truesize += skb[1]->truesize;
+
+ return skb[0];
+
+err_skb1:
+ __free_page(page[1]);
+err_page1:
+ kfree_skb(skb[0]);
+err_skb0:
+ __free_page(page[0]);
+err_page0:
+ return NULL;
+}
+
+static __init int test_skb_segment(void)
+{
+ netdev_features_t features;
+ struct sk_buff *skb;
+ int ret = -1;
+
+ features = NETIF_F_SG | NETIF_F_GSO_PARTIAL | NETIF_F_IP_CSUM |
+ NETIF_F_IPV6_CSUM;
+ features |= NETIF_F_RXCSUM;
+ skb = build_test_skb();
+ if (!skb) {
+ pr_info("%s: failed to build_test_skb", __func__);
+ goto done;
+ }
+
+ if (skb_segment(skb, features)) {
+ ret = 0;
+ pr_info("%s: success in skb_segment!", __func__);
+ } else {
+ pr_info("%s: failed in skb_segment!", __func__);
+ }
+ kfree_skb(skb);
+done:
+ return ret;
+}
+
static __init int test_bpf(void)
{
int i, err_cnt = 0, pass_cnt = 0;
@@ -6632,9 +6717,11 @@ static int __init test_bpf_init(void)
return ret;
ret = test_bpf();
-
destroy_bpf_tests();
- return ret;
+ if (ret)
+ return ret;
+
+ return test_skb_segment();
}
static void __exit test_bpf_exit(void)
--
2.9.5
^ permalink raw reply related
* Re: [PATCH RFC 2/2] virtio_ring: support packed ring
From: Tiwei Bie @ 2018-03-21 7:30 UTC (permalink / raw)
To: Jason Wang; +Cc: mst, netdev, linux-kernel, virtualization, wexu
In-Reply-To: <094ca28b-d8af-bf7a-ea7e-0d0bf7518bda@redhat.com>
On Fri, Mar 16, 2018 at 07:36:47PM +0800, Jason Wang wrote:
> On 2018-03-16 18:04, Tiwei Bie wrote:
> > On Fri, Mar 16, 2018 at 04:34:28PM +0800, Jason Wang wrote:
> > > On 2018-03-16 15:40, Tiwei Bie wrote:
> > > > On Fri, Mar 16, 2018 at 02:44:12PM +0800, Jason Wang wrote:
> > > > > On 2018-03-16 14:10, Tiwei Bie wrote:
> > > > > > On Fri, Mar 16, 2018 at 12:03:25PM +0800, Jason Wang wrote:
> > > > > > > On 2018-02-23 19:18, Tiwei Bie wrote:
> > > > > > > > Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
> > > > > > > > ---
> > > > > > > > drivers/virtio/virtio_ring.c | 699 +++++++++++++++++++++++++++++++++++++------
> > > > > > > > include/linux/virtio_ring.h | 8 +-
> > > > > > > > 2 files changed, 618 insertions(+), 89 deletions(-)
[...]
> > > @@ -1096,17 +1599,21 @@ struct virtqueue *vring_create_virtqueue(
> > > > > > > > if (!queue) {
> > > > > > > > /* Try to get a single page. You are my only hope! */
> > > > > > > > - queue = vring_alloc_queue(vdev, vring_size(num, vring_align),
> > > > > > > > + queue = vring_alloc_queue(vdev, __vring_size(num, vring_align,
> > > > > > > > + packed),
> > > > > > > > &dma_addr, GFP_KERNEL|__GFP_ZERO);
> > > > > > > > }
> > > > > > > > if (!queue)
> > > > > > > > return NULL;
> > > > > > > > - queue_size_in_bytes = vring_size(num, vring_align);
> > > > > > > > - vring_init(&vring, num, queue, vring_align);
> > > > > > > > + queue_size_in_bytes = __vring_size(num, vring_align, packed);
> > > > > > > > + if (packed)
> > > > > > > > + vring_packed_init(&vring.vring_packed, num, queue, vring_align);
> > > > > > > > + else
> > > > > > > > + vring_init(&vring.vring_split, num, queue, vring_align);
> > > > > > > Let's rename vring_init to vring_init_split() like other helpers?
> > > > > > The vring_init() is a public API in include/uapi/linux/virtio_ring.h.
> > > > > > I don't think we can rename it.
> > > > > I see, then this need more thoughts to unify the API.
> > > > My thought is to keep the old API as is, and introduce
> > > > new types and helpers for packed ring.
> > > I admit it's not a fault of this patch. But we'd better think of this in the
> > > future, consider we may have new kinds of ring.
> > >
> > > > More details can be found in this patch:
> > > > https://lkml.org/lkml/2018/2/23/243
> > > > (PS. The type which has bit fields is just for reference,
> > > > and will be changed in next version.)
> > > >
> > > > Do you have any other suggestions?
> > > No.
> > Hmm.. Sorry, I didn't describe my question well.
> > I mean do you have any suggestions about the API
> > design for packed ring in uapi header? Currently
> > I introduced below two new helpers:
> >
> > static inline void vring_packed_init(struct vring_packed *vr, unsigned int num,
> > void *p, unsigned long align);
> > static inline unsigned vring_packed_size(unsigned int num, unsigned long align);
> >
> > When new rings are introduced in the future, above
> > helpers can't be reused. Maybe we should make the
> > helpers be able to determine the ring type?
>
> Let's wait for Michael's comment here. Generally, I fail to understand why
> vring_init() become a part of uapi. Git grep shows the only use cases are
> virtio_test/vringh_test.
Thank you very much for the review on this patch!
I'll send out a new version ASAP to address these
comments. :)
Best regards,
Tiwei Bie
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* [PATCH net-next 0/2] mlxsw: Update supported firmware version
From: Ido Schimmel @ 2018-03-21 7:34 UTC (permalink / raw)
To: netdev; +Cc: davem, jiri, talb, dsahern, mlxsw, Ido Schimmel
Hi,
The first patch bumps the firmware version supported by the driver. The
second patch enables a feature introduced in the new version,
auto-negotiation disable.
Tal Bar (2):
mlxsw: spectrum: Update the supported firmware to version 13.1620.192
mlxsw: spectrum: Add support for auto-negotiation disable mode
drivers/net/ethernet/mellanox/mlxsw/reg.h | 11 ++++++++++-
drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 12 ++++++------
drivers/net/ethernet/mellanox/mlxsw/switchx2.c | 8 ++++----
3 files changed, 20 insertions(+), 11 deletions(-)
--
2.14.3
^ permalink raw reply
* [PATCH net-next 1/2] mlxsw: spectrum: Update the supported firmware to version 13.1620.192
From: Ido Schimmel @ 2018-03-21 7:34 UTC (permalink / raw)
To: netdev; +Cc: davem, jiri, talb, dsahern, mlxsw, Ido Schimmel
In-Reply-To: <20180321073406.23131-1-idosch@mellanox.com>
From: Tal Bar <talb@mellanox.com>
This new firmware contains:
- Support for auto-neg disable mode
Signed-off-by: Tal Bar <talb@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
---
drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index a120602bca26..da8aef7029c8 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -75,8 +75,8 @@
#include "../mlxfw/mlxfw.h"
#define MLXSW_FWREV_MAJOR 13
-#define MLXSW_FWREV_MINOR 1530
-#define MLXSW_FWREV_SUBMINOR 152
+#define MLXSW_FWREV_MINOR 1620
+#define MLXSW_FWREV_SUBMINOR 192
#define MLXSW_FWREV_MINOR_TO_BRANCH(minor) ((minor) / 100)
#define MLXSW_SP_FW_FILENAME \
--
2.14.3
^ permalink raw reply related
* [PATCH net-next 2/2] mlxsw: spectrum: Add support for auto-negotiation disable mode
From: Ido Schimmel @ 2018-03-21 7:34 UTC (permalink / raw)
To: netdev; +Cc: davem, jiri, talb, dsahern, mlxsw, Ido Schimmel
In-Reply-To: <20180321073406.23131-1-idosch@mellanox.com>
From: Tal Bar <talb@mellanox.com>
In 'auto-neg off' mode the device still sends AN (auto-negotiation)
frames with the forced speed. Fix this using the an_disable_admin field
in the Port Type and Speed (PTYS) register. This field indicates whether
speed negotiation frames are sent by the port.
Add the field and set it according to 'auto-neg on/off', so that the
port starts or stops sending AN frames. Note that for SwitchX2 the
behavior doesn't change (i.e. it supports only AN enabled with forced
speed).
Signed-off-by: Tal Bar <talb@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
---
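As an illustration only (not part of the patch), the following user-space
sketch shows what setting a 1-bit field at offset 0x00, bit 30 amounts
to. It assumes the register payload stores 32-bit words in big-endian
order; the helper names are hypothetical and are not the accessors that
MLXSW_ITEM32() generates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Field position taken from the patch: offset 0x00, bit 30, 1 bit wide. */
#define AN_DISABLE_OFFSET 0x00
#define AN_DISABLE_SHIFT  30
#define AN_DISABLE_MASK   (UINT32_C(1) << AN_DISABLE_SHIFT)

/* Read a 32-bit big-endian word (assumed payload byte order). */
static uint32_t payload_get_be32(const unsigned char *payload, size_t off)
{
	return ((uint32_t)payload[off] << 24) |
	       ((uint32_t)payload[off + 1] << 16) |
	       ((uint32_t)payload[off + 2] << 8) |
	       (uint32_t)payload[off + 3];
}

/* Write a 32-bit big-endian word (assumed payload byte order). */
static void payload_set_be32(unsigned char *payload, size_t off, uint32_t val)
{
	payload[off] = (unsigned char)(val >> 24);
	payload[off + 1] = (unsigned char)(val >> 16);
	payload[off + 2] = (unsigned char)(val >> 8);
	payload[off + 3] = (unsigned char)val;
}

/* Like the pack helper in the patch, the bit is the inverse of autoneg. */
static void an_disable_admin_pack(unsigned char *payload, bool autoneg)
{
	uint32_t word;

	assert(payload != NULL);
	word = payload_get_be32(payload, AN_DISABLE_OFFSET);
	word &= ~AN_DISABLE_MASK;
	if (!autoneg)
		word |= AN_DISABLE_MASK;
	payload_set_be32(payload, AN_DISABLE_OFFSET, word);
}

/* Return true when the AN-disable bit is set in the payload. */
static bool an_disable_admin_get(const unsigned char *payload)
{
	return (payload_get_be32(payload, AN_DISABLE_OFFSET) &
		AN_DISABLE_MASK) != 0;
}
```

Under the big-endian assumption, bit 30 of the first word lands in bit 6
of the first payload byte.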
drivers/net/ethernet/mellanox/mlxsw/reg.h | 11 ++++++++++-
drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 8 ++++----
drivers/net/ethernet/mellanox/mlxsw/switchx2.c | 8 ++++----
3 files changed, 18 insertions(+), 9 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlxsw/reg.h b/drivers/net/ethernet/mellanox/mlxsw/reg.h
index cb5f77f09f8e..e002398364c8 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/reg.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/reg.h
@@ -2872,6 +2872,14 @@ static inline void mlxsw_reg_pmtu_pack(char *payload, u8 local_port,
MLXSW_REG_DEFINE(ptys, MLXSW_REG_PTYS_ID, MLXSW_REG_PTYS_LEN);
+/* an_disable_admin
+ * Auto negotiation disable administrative configuration
+ * 0 - Device doesn't support AN disable.
+ * 1 - Device supports AN disable.
+ * Access: RW
+ */
+MLXSW_ITEM32(reg, ptys, an_disable_admin, 0x00, 30, 1);
+
/* reg_ptys_local_port
* Local port number.
* Access: Index
@@ -3000,12 +3008,13 @@ MLXSW_ITEM32(reg, ptys, ib_proto_oper, 0x28, 0, 16);
MLXSW_ITEM32(reg, ptys, eth_proto_lp_advertise, 0x30, 0, 32);
static inline void mlxsw_reg_ptys_eth_pack(char *payload, u8 local_port,
- u32 proto_admin)
+ u32 proto_admin, bool autoneg)
{
MLXSW_REG_ZERO(ptys, payload);
mlxsw_reg_ptys_local_port_set(payload, local_port);
mlxsw_reg_ptys_proto_mask_set(payload, MLXSW_REG_PTYS_PROTO_MASK_ETH);
mlxsw_reg_ptys_eth_proto_admin_set(payload, proto_admin);
+ mlxsw_reg_ptys_an_disable_admin_set(payload, !autoneg);
}
static inline void mlxsw_reg_ptys_eth_unpack(char *payload,
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index da8aef7029c8..3f2add1b218d 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -2390,7 +2390,7 @@ static int mlxsw_sp_port_get_link_ksettings(struct net_device *dev,
int err;
autoneg = mlxsw_sp_port->link.autoneg;
- mlxsw_reg_ptys_eth_pack(ptys_pl, mlxsw_sp_port->local_port, 0);
+ mlxsw_reg_ptys_eth_pack(ptys_pl, mlxsw_sp_port->local_port, 0, false);
err = mlxsw_reg_query(mlxsw_sp->core, MLXSW_REG(ptys), ptys_pl);
if (err)
return err;
@@ -2424,7 +2424,7 @@ mlxsw_sp_port_set_link_ksettings(struct net_device *dev,
bool autoneg;
int err;
- mlxsw_reg_ptys_eth_pack(ptys_pl, mlxsw_sp_port->local_port, 0);
+ mlxsw_reg_ptys_eth_pack(ptys_pl, mlxsw_sp_port->local_port, 0, false);
err = mlxsw_reg_query(mlxsw_sp->core, MLXSW_REG(ptys), ptys_pl);
if (err)
return err;
@@ -2442,7 +2442,7 @@ mlxsw_sp_port_set_link_ksettings(struct net_device *dev,
}
mlxsw_reg_ptys_eth_pack(ptys_pl, mlxsw_sp_port->local_port,
- eth_proto_new);
+ eth_proto_new, autoneg);
err = mlxsw_reg_write(mlxsw_sp->core, MLXSW_REG(ptys), ptys_pl);
if (err)
return err;
@@ -2653,7 +2653,7 @@ mlxsw_sp_port_speed_by_width_set(struct mlxsw_sp_port *mlxsw_sp_port, u8 width)
eth_proto_admin = mlxsw_sp_to_ptys_upper_speed(upper_speed);
mlxsw_reg_ptys_eth_pack(ptys_pl, mlxsw_sp_port->local_port,
- eth_proto_admin);
+ eth_proto_admin, mlxsw_sp_port->link.autoneg);
return mlxsw_reg_write(mlxsw_sp->core, MLXSW_REG(ptys), ptys_pl);
}
diff --git a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c
index f3c29bbf07e2..c87b0934a405 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/switchx2.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/switchx2.c
@@ -789,7 +789,7 @@ mlxsw_sx_port_get_link_ksettings(struct net_device *dev,
u32 supported, advertising, lp_advertising;
int err;
- mlxsw_reg_ptys_eth_pack(ptys_pl, mlxsw_sx_port->local_port, 0);
+ mlxsw_reg_ptys_eth_pack(ptys_pl, mlxsw_sx_port->local_port, 0, false);
err = mlxsw_reg_query(mlxsw_sx->core, MLXSW_REG(ptys), ptys_pl);
if (err) {
netdev_err(dev, "Failed to get proto");
@@ -879,7 +879,7 @@ mlxsw_sx_port_set_link_ksettings(struct net_device *dev,
mlxsw_sx_to_ptys_advert_link(advertising) :
mlxsw_sx_to_ptys_speed(speed);
- mlxsw_reg_ptys_eth_pack(ptys_pl, mlxsw_sx_port->local_port, 0);
+ mlxsw_reg_ptys_eth_pack(ptys_pl, mlxsw_sx_port->local_port, 0, false);
err = mlxsw_reg_query(mlxsw_sx->core, MLXSW_REG(ptys), ptys_pl);
if (err) {
netdev_err(dev, "Failed to get proto");
@@ -897,7 +897,7 @@ mlxsw_sx_port_set_link_ksettings(struct net_device *dev,
return 0;
mlxsw_reg_ptys_eth_pack(ptys_pl, mlxsw_sx_port->local_port,
- eth_proto_new);
+ eth_proto_new, true);
err = mlxsw_reg_write(mlxsw_sx->core, MLXSW_REG(ptys), ptys_pl);
if (err) {
netdev_err(dev, "Failed to set proto admin");
@@ -1029,7 +1029,7 @@ mlxsw_sx_port_speed_by_width_set(struct mlxsw_sx_port *mlxsw_sx_port, u8 width)
eth_proto_admin = mlxsw_sx_to_ptys_upper_speed(upper_speed);
mlxsw_reg_ptys_eth_pack(ptys_pl, mlxsw_sx_port->local_port,
- eth_proto_admin);
+ eth_proto_admin, true);
return mlxsw_reg_write(mlxsw_sx->core, MLXSW_REG(ptys), ptys_pl);
}
--
2.14.3
^ permalink raw reply related
* Re: [PATCH RFC 2/2] virtio_ring: support packed ring
From: Tiwei Bie @ 2018-03-21 7:35 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: netdev, linux-kernel, virtualization, wexu
In-Reply-To: <20180316145702-mutt-send-email-mst@kernel.org>
On Fri, Mar 16, 2018 at 04:30:02PM +0200, Michael S. Tsirkin wrote:
> On Fri, Mar 16, 2018 at 07:36:47PM +0800, Jason Wang wrote:
> > > > @@ -1096,17 +1599,21 @@ struct virtqueue *vring_create_virtqueue(
> > > > > > > > > if (!queue) {
> > > > > > > > > /* Try to get a single page. You are my only hope! */
> > > > > > > > > - queue = vring_alloc_queue(vdev, vring_size(num, vring_align),
> > > > > > > > > + queue = vring_alloc_queue(vdev, __vring_size(num, vring_align,
> > > > > > > > > + packed),
> > > > > > > > > &dma_addr, GFP_KERNEL|__GFP_ZERO);
> > > > > > > > > }
> > > > > > > > > if (!queue)
> > > > > > > > > return NULL;
> > > > > > > > > - queue_size_in_bytes = vring_size(num, vring_align);
> > > > > > > > > - vring_init(&vring, num, queue, vring_align);
> > > > > > > > > + queue_size_in_bytes = __vring_size(num, vring_align, packed);
> > > > > > > > > + if (packed)
> > > > > > > > > + vring_packed_init(&vring.vring_packed, num, queue, vring_align);
> > > > > > > > > + else
> > > > > > > > > + vring_init(&vring.vring_split, num, queue, vring_align);
> > > > > > > > Let's rename vring_init to vring_init_split() like other helpers?
> > > > > > > The vring_init() is a public API in include/uapi/linux/virtio_ring.h.
> > > > > > > I don't think we can rename it.
> > > > > > I see, then this need more thoughts to unify the API.
> > > > > My thought is to keep the old API as is, and introduce
> > > > > new types and helpers for packed ring.
> > > > I admit it's not a fault of this patch. But we'd better think of this in the
> > > > future, consider we may have new kinds of ring.
> > > >
> > > > > More details can be found in this patch:
> > > > > https://lkml.org/lkml/2018/2/23/243
> > > > > (PS. The type which has bit fields is just for reference,
> > > > > and will be changed in next version.)
> > > > >
> > > > > Do you have any other suggestions?
> > > > No.
> > > Hmm.. Sorry, I didn't describe my question well.
> > > I mean do you have any suggestions about the API
> > > design for packed ring in uapi header? Currently
> > > I introduced below two new helpers:
> > >
> > > static inline void vring_packed_init(struct vring_packed *vr, unsigned int num,
> > > void *p, unsigned long align);
> > > static inline unsigned vring_packed_size(unsigned int num, unsigned long align);
> > >
> > > When new rings are introduced in the future, above
> > > helpers can't be reused. Maybe we should make the
> > > helpers be able to determine the ring type?
> >
> > Let's wait for Michael's comment here. Generally, I fail to understand why
> > vring_init() become a part of uapi. Git grep shows the only use cases are
> > virtio_test/vringh_test.
> >
> > Thanks
>
> For init - I think it's a mistake that stems from lguest which sometimes
> made it less than obvious which code is where. I don't see a reason to
> add to it.
Got it! I'll move vring_packed_init() out of uapi. Many thanks! :)
Best regards,
Tiwei Bie
>
> --
> MST
^ permalink raw reply
* aio poll and a new in-kernel poll API V6
From: Christoph Hellwig @ 2018-03-21 7:40 UTC (permalink / raw)
To: viro; +Cc: Avi Kivity, linux-aio, linux-fsdevel, netdev, linux-api,
linux-kernel
Hi all,
this series adds support for the IOCB_CMD_POLL operation to poll for the
readiness of file descriptors using the aio subsystem. The API is based
on patches that existed in RHAS2.1 and RHEL3, which means it already is
supported by libaio. To implement the poll support efficiently new
methods to poll are introduced in struct file_operations: get_poll_head
and poll_mask. The first one returns a wait_queue_head to wait on
(lifetime is bound by the file), and the second does a non-blocking
check for the POLL* events. This allows aio poll to work without
any additional context switches, unlike epoll.
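The split described above can be sketched in user space as follows. This
is only a model under stated assumptions: the types, constants and
function names (model_file_ops, model_poll_once, MODEL_POLLIN, and so on)
are hypothetical, the signatures are simplified and not copied from the
series, and the case where ->get_poll_head returns NULL is not covered.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t model_poll_t;

#define MODEL_POLLIN  0x1u
#define MODEL_POLLOUT 0x4u

/* Hypothetical stand-in for a wait queue head: just counts waiters. */
struct model_wait_queue_head {
	int nr_waiters;
};

struct model_file;

/* The two methods: where to wait, and a non-blocking readiness check. */
struct model_file_ops {
	struct model_wait_queue_head *(*get_poll_head)(struct model_file *file,
						       model_poll_t events);
	model_poll_t (*poll_mask)(struct model_file *file, model_poll_t events);
};

struct model_file {
	const struct model_file_ops *ops;
	struct model_wait_queue_head wqh;	/* lives as long as the file */
	int readable_bytes;
};

static struct model_wait_queue_head *
model_pipe_get_poll_head(struct model_file *file, model_poll_t events)
{
	(void)events;
	return &file->wqh;
}

static model_poll_t model_pipe_poll_mask(struct model_file *file,
					 model_poll_t events)
{
	model_poll_t mask = MODEL_POLLOUT;	/* always writable in this model */

	(void)events;
	if (file->readable_bytes > 0)
		mask |= MODEL_POLLIN;
	return mask;
}

static const struct model_file_ops model_pipe_ops = {
	model_pipe_get_poll_head,
	model_pipe_poll_mask,
};

/* Register on the wait queue first, then check readiness without blocking. */
static model_poll_t model_poll_once(struct model_file *file,
				    model_poll_t events)
{
	struct model_wait_queue_head *head;
	model_poll_t ready;

	head = file->ops->get_poll_head(file, events);
	assert(head != NULL);	/* the NULL case is outside this model */
	head->nr_waiters++;
	ready = file->ops->poll_mask(file, events) & events;
	if (ready)
		head->nr_waiters--;	/* already ready: no need to stay queued */
	return ready;
}
```

Because the readiness check is a plain function call that never sleeps,
a caller such as aio can run it directly from a wakeup callback instead
of handing off to another thread.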
This series sits on top of the aio-fsync series that also includes
support for io_pgetevents.
The changes were sponsored by ScyllaDB, and improve performance
of the Seastar framework by up to 10%, while also removing the need
for a privileged SCHED_FIFO epoll listener thread.
git://git.infradead.org/users/hch/vfs.git aio-poll.6
Gitweb:
http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/aio-poll.6
Libaio changes:
https://pagure.io/libaio.git io-poll
Seastar changes (not updated for the new io_pgetevents ABI yet):
https://github.com/avikivity/seastar/commits/aio
Changes since V5:
- small changelog updates
- rebased on top of the aio-fsync changes
Changes since V4:
- rebased on top of Linux 4.16-rc4
Changes since V3:
- remove the pre-sleep ->poll_mask call in vfs_poll,
allow ->get_poll_head to return POLL* values.
Changes since V2:
- removed a double initialization
- new vfs_get_poll_head helper
- document that ->get_poll_head can return NULL
- call ->poll_mask before sleeping
- various ACKs
- add conversion of random to ->poll_mask
- add conversion of af_alg to ->poll_mask
- lacking ->poll_mask support now returns -EINVAL for IOCB_CMD_POLL
- reshuffled the series so that prep patches and everything not
requiring the new in-kernel poll API is in the beginning
Changes since V1:
- handle the NULL ->poll case in vfs_poll
- dropped the file argument to the ->poll_mask socket operation
- replace the ->pre_poll socket operation with ->get_poll_head as
in the file operations
--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org. For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>
^ permalink raw reply
* [PATCH 01/28] fs: unexport poll_schedule_timeout
From: Christoph Hellwig @ 2018-03-21 7:40 UTC (permalink / raw)
To: viro; +Cc: Avi Kivity, linux-aio, linux-fsdevel, netdev, linux-api,
linux-kernel
In-Reply-To: <20180321074032.14211-1-hch@lst.de>
No users outside of select.c.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
fs/select.c | 3 +--
include/linux/poll.h | 2 --
2 files changed, 1 insertion(+), 4 deletions(-)
diff --git a/fs/select.c b/fs/select.c
index b6c36254028a..686de7b3a1db 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -233,7 +233,7 @@ static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
add_wait_queue(wait_address, &entry->wait);
}
-int poll_schedule_timeout(struct poll_wqueues *pwq, int state,
+static int poll_schedule_timeout(struct poll_wqueues *pwq, int state,
ktime_t *expires, unsigned long slack)
{
int rc = -EINTR;
@@ -258,7 +258,6 @@ int poll_schedule_timeout(struct poll_wqueues *pwq, int state,
return rc;
}
-EXPORT_SYMBOL(poll_schedule_timeout);
/**
* poll_select_set_timeout - helper function to setup the timeout value
diff --git a/include/linux/poll.h b/include/linux/poll.h
index f45ebd017eaa..a3576da63377 100644
--- a/include/linux/poll.h
+++ b/include/linux/poll.h
@@ -96,8 +96,6 @@ struct poll_wqueues {
extern void poll_initwait(struct poll_wqueues *pwq);
extern void poll_freewait(struct poll_wqueues *pwq);
-extern int poll_schedule_timeout(struct poll_wqueues *pwq, int state,
- ktime_t *expires, unsigned long slack);
extern u64 select_estimate_accuracy(struct timespec64 *tv);
#define MAX_INT64_SECONDS (((s64)(~((u64)0)>>1)/HZ)-1)
--
2.14.2
^ permalink raw reply related
* [PATCH 02/28] fs: cleanup do_pollfd
From: Christoph Hellwig @ 2018-03-21 7:40 UTC (permalink / raw)
To: viro; +Cc: Avi Kivity, linux-aio, linux-fsdevel, netdev, linux-api,
linux-kernel
In-Reply-To: <20180321074032.14211-1-hch@lst.de>
Use straight-line code with failure handling gotos instead of a lot
of nested conditionals.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
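The shape of this cleanup can be shown with a small user-space sketch.
It is an illustration under assumptions, not the kernel function: the
fd table, the model_* names and the MODEL_POLL* constants are
hypothetical, and the busy-poll handling is left out.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_POLLIN   0x001u
#define MODEL_POLLERR  0x008u
#define MODEL_POLLHUP  0x010u
#define MODEL_POLLNVAL 0x020u

/* Hypothetical file: just a set of currently ready events. */
struct model_file {
	unsigned int ready;
};

struct model_pollfd {
	int fd;
	unsigned int events;
	unsigned int revents;
};

/* Hypothetical fd lookup: NULL when the slot is out of range or empty. */
static struct model_file *model_fdget(struct model_file **table, int nfiles,
				      int fd)
{
	if (fd >= nfiles)
		return NULL;
	return table[fd];
}

/* Straight-line flow: each early failure jumps to the common exit. */
static unsigned int model_do_pollfd(struct model_file **table, int nfiles,
				    struct model_pollfd *pfd)
{
	unsigned int mask = 0, filter;
	struct model_file *file;

	assert(pfd != NULL);
	if (pfd->fd < 0)
		goto out;		/* negative fds are ignored */
	mask = MODEL_POLLNVAL;
	file = model_fdget(table, nfiles, pfd->fd);
	if (!file)
		goto out;		/* not an open file */

	filter = pfd->events | MODEL_POLLERR | MODEL_POLLHUP;
	mask = file->ready & filter;	/* mask out unneeded events */
out:
	pfd->revents = mask;
	return mask;
}
```

Setting mask just before each step that can fail means the single exit
label always reports the right value without extra branches.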
fs/select.c | 48 +++++++++++++++++++++++-------------------------
1 file changed, 23 insertions(+), 25 deletions(-)
diff --git a/fs/select.c b/fs/select.c
index 686de7b3a1db..c6c504a814f9 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -806,34 +806,32 @@ static inline __poll_t do_pollfd(struct pollfd *pollfd, poll_table *pwait,
bool *can_busy_poll,
__poll_t busy_flag)
{
- __poll_t mask;
- int fd;
-
- mask = 0;
- fd = pollfd->fd;
- if (fd >= 0) {
- struct fd f = fdget(fd);
- mask = EPOLLNVAL;
- if (f.file) {
- /* userland u16 ->events contains POLL... bitmap */
- __poll_t filter = demangle_poll(pollfd->events) |
- EPOLLERR | EPOLLHUP;
- mask = DEFAULT_POLLMASK;
- if (f.file->f_op->poll) {
- pwait->_key = filter;
- pwait->_key |= busy_flag;
- mask = f.file->f_op->poll(f.file, pwait);
- if (mask & busy_flag)
- *can_busy_poll = true;
- }
- /* Mask out unneeded events. */
- mask &= filter;
- fdput(f);
- }
+ int fd = pollfd->fd;
+ __poll_t mask = 0, filter;
+ struct fd f;
+
+ if (fd < 0)
+ goto out;
+ mask = EPOLLNVAL;
+ f = fdget(fd);
+ if (!f.file)
+ goto out;
+
+ /* userland u16 ->events contains POLL... bitmap */
+ filter = demangle_poll(pollfd->events) | EPOLLERR | EPOLLHUP;
+ mask = DEFAULT_POLLMASK;
+ if (f.file->f_op->poll) {
+ pwait->_key = filter | busy_flag;
+ mask = f.file->f_op->poll(f.file, pwait);
+ if (mask & busy_flag)
+ *can_busy_poll = true;
}
+ mask &= filter; /* Mask out unneeded events. */
+ fdput(f);
+
+out:
/* ... and so does ->revents */
pollfd->revents = mangle_poll(mask);
-
return mask;
}
--
2.14.2
^ permalink raw reply related
* [PATCH 03/28] fs: update documentation to mention __poll_t
From: Christoph Hellwig @ 2018-03-21 7:40 UTC (permalink / raw)
To: viro; +Cc: Avi Kivity, linux-aio, linux-fsdevel, netdev, linux-api,
linux-kernel
In-Reply-To: <20180321074032.14211-1-hch@lst.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
Documentation/filesystems/Locking | 2 +-
Documentation/filesystems/vfs.txt | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 75d2d57e2c44..220bba28f72b 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -439,7 +439,7 @@ prototypes:
ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
int (*iterate) (struct file *, struct dir_context *);
- unsigned int (*poll) (struct file *, struct poll_table_struct *);
+ __poll_t (*poll) (struct file *, struct poll_table_struct *);
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
int (*mmap) (struct file *, struct vm_area_struct *);
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 5fd325df59e2..f608180ad59d 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -856,7 +856,7 @@ struct file_operations {
ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
int (*iterate) (struct file *, struct dir_context *);
- unsigned int (*poll) (struct file *, struct poll_table_struct *);
+ __poll_t (*poll) (struct file *, struct poll_table_struct *);
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
int (*mmap) (struct file *, struct vm_area_struct *);
--
2.14.2
^ permalink raw reply related
* [PATCH 04/28] fs: add new vfs_poll and file_can_poll helpers
From: Christoph Hellwig @ 2018-03-21 7:40 UTC (permalink / raw)
To: viro; +Cc: Avi Kivity, linux-aio, linux-fsdevel, netdev, linux-api,
linux-kernel
In-Reply-To: <20180321074032.14211-1-hch@lst.de>
These abstract out calls to the poll method in preparation for changes
in how we poll.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
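To make the intent of the two helpers concrete, here is a user-space
model of the pattern: one predicate for "does this file implement poll"
and one wrapper that falls back to a default mask when it does not. The
model_* names and MODEL_* constants are hypothetical stand-ins, and the
poll table argument is reduced to an opaque pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_POLLIN     0x001u
#define MODEL_POLLOUT    0x004u
#define MODEL_POLLRDNORM 0x040u
#define MODEL_POLLWRNORM 0x100u

/* Files without a poll method are treated as always readable and writable. */
#define MODEL_DEFAULT_POLLMASK \
	(MODEL_POLLIN | MODEL_POLLOUT | MODEL_POLLRDNORM | MODEL_POLLWRNORM)

struct model_file;

struct model_file_ops {
	unsigned int (*poll)(struct model_file *file, void *pt);
};

struct model_file {
	const struct model_file_ops *f_op;
};

/* True when the file provides its own poll method. */
static bool model_file_can_poll(const struct model_file *file)
{
	assert(file != NULL && file->f_op != NULL);
	return file->f_op->poll != NULL;
}

/* Single entry point: callers no longer open-code the NULL check. */
static unsigned int model_vfs_poll(struct model_file *file, void *pt)
{
	if (!model_file_can_poll(file))
		return MODEL_DEFAULT_POLLMASK;
	return file->f_op->poll(file, pt);
}

/* Example poll method: reports the file as readable only. */
static unsigned int model_readable_poll(struct model_file *file, void *pt)
{
	(void)file;
	(void)pt;
	return MODEL_POLLIN;
}

static const struct model_file_ops model_no_poll_ops = { NULL };
static const struct model_file_ops model_pollable_ops = { model_readable_poll };
```

Funnelling every caller through one wrapper means a later change to how
polling is dispatched only has to touch that wrapper.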
drivers/staging/comedi/drivers/serial2002.c | 4 ++--
drivers/vfio/virqfd.c | 2 +-
drivers/vhost/vhost.c | 2 +-
fs/eventpoll.c | 5 ++---
fs/select.c | 23 ++++++++---------------
include/linux/poll.h | 12 ++++++++++++
mm/memcontrol.c | 2 +-
net/9p/trans_fd.c | 18 ++++--------------
virt/kvm/eventfd.c | 2 +-
9 files changed, 32 insertions(+), 38 deletions(-)
diff --git a/drivers/staging/comedi/drivers/serial2002.c b/drivers/staging/comedi/drivers/serial2002.c
index b3f3b4a201af..5471b2212a62 100644
--- a/drivers/staging/comedi/drivers/serial2002.c
+++ b/drivers/staging/comedi/drivers/serial2002.c
@@ -113,7 +113,7 @@ static void serial2002_tty_read_poll_wait(struct file *f, int timeout)
long elapsed;
__poll_t mask;
- mask = f->f_op->poll(f, &table.pt);
+ mask = vfs_poll(f, &table.pt);
if (mask & (EPOLLRDNORM | EPOLLRDBAND | EPOLLIN |
EPOLLHUP | EPOLLERR)) {
break;
@@ -136,7 +136,7 @@ static int serial2002_tty_read(struct file *f, int timeout)
result = -1;
if (!IS_ERR(f)) {
- if (f->f_op->poll) {
+ if (file_can_poll(f)) {
serial2002_tty_read_poll_wait(f, timeout);
if (kernel_read(f, &ch, 1, &pos) == 1)
diff --git a/drivers/vfio/virqfd.c b/drivers/vfio/virqfd.c
index 085700f1be10..2a1be859ee71 100644
--- a/drivers/vfio/virqfd.c
+++ b/drivers/vfio/virqfd.c
@@ -166,7 +166,7 @@ int vfio_virqfd_enable(void *opaque,
init_waitqueue_func_entry(&virqfd->wait, virqfd_wakeup);
init_poll_funcptr(&virqfd->pt, virqfd_ptable_queue_proc);
- events = irqfd.file->f_op->poll(irqfd.file, &virqfd->pt);
+ events = vfs_poll(irqfd.file, &virqfd->pt);
/*
* Check if there was an event already pending on the eventfd
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 1b3e8d2d5c8b..4d27e288bb1d 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -208,7 +208,7 @@ int vhost_poll_start(struct vhost_poll *poll, struct file *file)
if (poll->wqh)
return 0;
- mask = file->f_op->poll(file, &poll->table);
+ mask = vfs_poll(file, &poll->table);
if (mask)
vhost_poll_wakeup(&poll->wait, 0, 0, poll_to_key(mask));
if (mask & EPOLLERR) {
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 0f3494ed3ed0..2bebae5a38cf 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -884,8 +884,7 @@ static __poll_t ep_item_poll(const struct epitem *epi, poll_table *pt,
pt->_key = epi->event.events;
if (!is_file_epoll(epi->ffd.file))
- return epi->ffd.file->f_op->poll(epi->ffd.file, pt) &
- epi->event.events;
+ return vfs_poll(epi->ffd.file, pt) & epi->event.events;
ep = epi->ffd.file->private_data;
poll_wait(epi->ffd.file, &ep->poll_wait, pt);
@@ -2020,7 +2019,7 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
/* The target file descriptor must support poll */
error = -EPERM;
- if (!tf.file->f_op->poll)
+ if (!file_can_poll(tf.file))
goto error_tgt_fput;
/* Check if EPOLLWAKEUP is allowed */
diff --git a/fs/select.c b/fs/select.c
index c6c504a814f9..ba91103707ea 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -502,14 +502,10 @@ static int do_select(int n, fd_set_bits *fds, struct timespec64 *end_time)
continue;
f = fdget(i);
if (f.file) {
- const struct file_operations *f_op;
- f_op = f.file->f_op;
- mask = DEFAULT_POLLMASK;
- if (f_op->poll) {
- wait_key_set(wait, in, out,
- bit, busy_flag);
- mask = (*f_op->poll)(f.file, wait);
- }
+ wait_key_set(wait, in, out, bit,
+ busy_flag);
+ mask = vfs_poll(f.file, wait);
+
fdput(f);
if ((mask & POLLIN_SET) && (in & bit)) {
res_in |= bit;
@@ -819,13 +815,10 @@ static inline __poll_t do_pollfd(struct pollfd *pollfd, poll_table *pwait,
/* userland u16 ->events contains POLL... bitmap */
filter = demangle_poll(pollfd->events) | EPOLLERR | EPOLLHUP;
- mask = DEFAULT_POLLMASK;
- if (f.file->f_op->poll) {
- pwait->_key = filter | busy_flag;
- mask = f.file->f_op->poll(f.file, pwait);
- if (mask & busy_flag)
- *can_busy_poll = true;
- }
+ pwait->_key = filter | busy_flag;
+ mask = vfs_poll(f.file, pwait);
+ if (mask & busy_flag)
+ *can_busy_poll = true;
mask &= filter; /* Mask out unneeded events. */
fdput(f);
diff --git a/include/linux/poll.h b/include/linux/poll.h
index a3576da63377..7e0fdcf905d2 100644
--- a/include/linux/poll.h
+++ b/include/linux/poll.h
@@ -74,6 +74,18 @@ static inline void init_poll_funcptr(poll_table *pt, poll_queue_proc qproc)
pt->_key = ~(__poll_t)0; /* all events enabled */
}
+static inline bool file_can_poll(struct file *file)
+{
+ return file->f_op->poll;
+}
+
+static inline __poll_t vfs_poll(struct file *file, struct poll_table_struct *pt)
+{
+ if (unlikely(!file->f_op->poll))
+ return DEFAULT_POLLMASK;
+ return file->f_op->poll(file, pt);
+}
+
struct poll_table_entry {
struct file *filp;
__poll_t key;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 670e99b68aa6..8774ece5c3c3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3849,7 +3849,7 @@ static ssize_t memcg_write_event_control(struct kernfs_open_file *of,
if (ret)
goto out_put_css;
- efile.file->f_op->poll(efile.file, &event->pt);
+ vfs_poll(efile.file, &event->pt);
spin_lock(&memcg->event_list_lock);
list_add(&event->list, &memcg->event_list);
diff --git a/net/9p/trans_fd.c b/net/9p/trans_fd.c
index 0cfba919d167..3811775692d0 100644
--- a/net/9p/trans_fd.c
+++ b/net/9p/trans_fd.c
@@ -231,7 +231,7 @@ static void p9_conn_cancel(struct p9_conn *m, int err)
static __poll_t
p9_fd_poll(struct p9_client *client, struct poll_table_struct *pt, int *err)
{
- __poll_t ret, n;
+ __poll_t ret;
struct p9_trans_fd *ts = NULL;
if (client && client->status == Connected)
@@ -243,19 +243,9 @@ p9_fd_poll(struct p9_client *client, struct poll_table_struct *pt, int *err)
return EPOLLERR;
}
- if (!ts->rd->f_op->poll)
- ret = DEFAULT_POLLMASK;
- else
- ret = ts->rd->f_op->poll(ts->rd, pt);
-
- if (ts->rd != ts->wr) {
- if (!ts->wr->f_op->poll)
- n = DEFAULT_POLLMASK;
- else
- n = ts->wr->f_op->poll(ts->wr, pt);
- ret = (ret & ~EPOLLOUT) | (n & ~EPOLLIN);
- }
-
+ ret = vfs_poll(ts->rd, pt);
+ if (ts->rd != ts->wr)
+ ret = (ret & ~EPOLLOUT) | (vfs_poll(ts->wr, pt) & ~EPOLLIN);
return ret;
}
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 6e865e8b5b10..90d30fbe95ae 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -397,7 +397,7 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
* Check if there was an event already pending on the eventfd
* before we registered, and trigger it as if we didn't miss it.
*/
- events = f.file->f_op->poll(f.file, &irqfd->pt);
+ events = vfs_poll(f.file, &irqfd->pt);
if (events & EPOLLIN)
schedule_work(&irqfd->inject);
--
2.14.2
--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org. For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: aart@kvack.org
* [PATCH 05/28] fs: introduce new ->get_poll_head and ->poll_mask methods
From: Christoph Hellwig @ 2018-03-21 7:40 UTC (permalink / raw)
To: viro; +Cc: Avi Kivity, linux-aio, linux-fsdevel, netdev, linux-api,
linux-kernel
In-Reply-To: <20180321074032.14211-1-hch@lst.de>
->get_poll_head returns the waitqueue that the poll operation is going
to sleep on. Note that this means we can only use a single waitqueue
for the poll, unlike some current drivers that use two waitqueues for
different events. But now that we have keyed wakeups and heavily use
those for poll, there aren't that many good reasons left to keep the
multiple waitqueues, and if there are any, ->poll is still around; the
driver just won't support aio poll.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
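Editorial note, not part of the patch: a minimal userspace sketch of the
dispatch order described above, assuming simplified stand-in types. Every
name prefixed with SK_, model_ or demo_ is hypothetical, the default mask is
reduced to two bits, and a counter stands in for the real waitqueue
registration. It only illustrates the three outcomes of ->get_poll_head
(NULL, an error-encoded mask, or a usable waitqueue).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef unsigned int poll_mask_t;	/* stand-in for __poll_t */
#define SK_POLLIN	0x0001u
#define SK_POLLOUT	0x0004u
#define SK_POLLERR	0x0008u
#define SK_DEFAULT_POLLMASK (SK_POLLIN | SK_POLLOUT)	/* simplified */
#define SK_MAX_ERRNO	4095

struct wait_queue_head { int nr_waiters; };
struct file;

struct file_operations {
	poll_mask_t (*poll)(struct file *, void *pt);
	struct wait_queue_head *(*get_poll_head)(struct file *, poll_mask_t);
	poll_mask_t (*poll_mask)(struct file *, poll_mask_t);
};

struct file {
	const struct file_operations *f_op;
	struct wait_queue_head wq;
	poll_mask_t ready;	/* events currently pending */
	int broken;		/* simulate a grave error */
};

/* Model of POLL_TO_PTR: store the negated mask in the error-pointer range. */
static void *poll_to_ptr(poll_mask_t mask)
{
	assert(mask != 0 && mask <= SK_MAX_ERRNO);
	return (void *)-(intptr_t)mask;
}

/* Model of PTR_TO_POLL: undo the negation. */
static poll_mask_t ptr_to_poll(const void *ptr)
{
	return (poll_mask_t)-(intptr_t)ptr;
}

/* Model of IS_ERR: the top SK_MAX_ERRNO addresses encode errors. */
static int is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-SK_MAX_ERRNO;
}

/* Model of the new vfs_poll(): ->poll wins, then the split methods. */
static poll_mask_t model_vfs_poll(struct file *file, int will_sleep,
				  poll_mask_t events)
{
	const struct file_operations *op = file->f_op;

	if (op->poll)
		return op->poll(file, NULL);
	if (!op->get_poll_head || !op->poll_mask)
		return SK_DEFAULT_POLLMASK;

	if (will_sleep) {
		struct wait_queue_head *head = op->get_poll_head(file, events);

		if (!head)
			return SK_DEFAULT_POLLMASK;
		if (is_err(head))
			return ptr_to_poll(head);
		head->nr_waiters++;	/* stands in for pt->_qproc() */
	}
	return op->poll_mask(file, events);
}

/* Hypothetical driver with a single waitqueue. */
static struct wait_queue_head *demo_get_poll_head(struct file *file,
						  poll_mask_t events)
{
	(void)events;
	if (file->broken)
		return poll_to_ptr(SK_POLLERR);
	return &file->wq;
}

static poll_mask_t demo_poll_mask(struct file *file, poll_mask_t events)
{
	(void)events;
	return file->ready;
}

static const struct file_operations demo_fops = {
	.get_poll_head	= demo_get_poll_head,
	.poll_mask	= demo_poll_mask,
};
```

In this model a mask of zero cannot be encoded as an error pointer, because
it would be indistinguishable from NULL; the assertion in poll_to_ptr makes
that assumption explicit.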
Documentation/filesystems/Locking | 7 ++++++-
Documentation/filesystems/vfs.txt | 13 +++++++++++++
fs/select.c | 28 ++++++++++++++++++++++++++++
include/linux/fs.h | 2 ++
include/linux/poll.h | 27 +++++++++++++++++++++++----
5 files changed, 72 insertions(+), 5 deletions(-)
diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 220bba28f72b..6d227f9d7bd9 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -440,6 +440,8 @@ prototypes:
ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
int (*iterate) (struct file *, struct dir_context *);
__poll_t (*poll) (struct file *, struct poll_table_struct *);
+ struct wait_queue_head * (*get_poll_head)(struct file *, __poll_t);
+ __poll_t (*poll_mask) (struct file *, __poll_t);
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
int (*mmap) (struct file *, struct vm_area_struct *);
@@ -470,7 +472,7 @@ prototypes:
};
locking rules:
- All may block.
+ All except for ->poll_mask may block.
->llseek() locking has moved from llseek to the individual llseek
implementations. If your fs is not using generic_file_llseek, you
@@ -498,6 +500,9 @@ in sys_read() and friends.
the lease within the individual filesystem to record the result of the
operation
+->poll_mask can be called with or without the waitqueue lock for the waitqueue
+returned from ->get_poll_head.
+
--------------------------- dquot_operations -------------------------------
prototypes:
int (*write_dquot) (struct dquot *);
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index f608180ad59d..50ee13563271 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -857,6 +857,8 @@ struct file_operations {
ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
int (*iterate) (struct file *, struct dir_context *);
__poll_t (*poll) (struct file *, struct poll_table_struct *);
+ struct wait_queue_head * (*get_poll_head)(struct file *, __poll_t);
+ __poll_t (*poll_mask) (struct file *, __poll_t);
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
int (*mmap) (struct file *, struct vm_area_struct *);
@@ -901,6 +903,17 @@ otherwise noted.
activity on this file and (optionally) go to sleep until there
is activity. Called by the select(2) and poll(2) system calls
+ get_poll_head: Returns the struct wait_queue_head that poll, select,
+ epoll or aio poll should wait on in case this instance only has a single
+ waitqueue. Can return NULL to indicate polling is not supported,
+ or a POLL* value using the POLL_TO_PTR helper in case a grave error
+ occurred and ->poll_mask shall not be called.
+
+ poll_mask: return the mask of POLL* values describing the file descriptor
+ state. Called either before going to sleep on the waitqueue returned by
+ get_poll_head, or after it has been woken. If ->get_poll_head and
+ ->poll_mask are implemented ->poll does not need to be implemented.
+
unlocked_ioctl: called by the ioctl(2) system call.
compat_ioctl: called by the ioctl(2) system call when 32 bit system calls
diff --git a/fs/select.c b/fs/select.c
index ba91103707ea..cc270d7f6192 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -34,6 +34,34 @@
#include <linux/uaccess.h>
+__poll_t vfs_poll(struct file *file, struct poll_table_struct *pt)
+{
+ __poll_t events = poll_requested_events(pt);
+ struct wait_queue_head *head;
+
+ if (unlikely(!file_can_poll(file)))
+ return DEFAULT_POLLMASK;
+
+ if (file->f_op->poll)
+ return file->f_op->poll(file, pt);
+
+ /*
+ * Only get the poll head and do the first mask check if we are actually
+ * going to sleep on this file:
+ */
+ if (pt && pt->_qproc) {
+ head = vfs_get_poll_head(file, events);
+ if (!head)
+ return DEFAULT_POLLMASK;
+ if (IS_ERR(head))
+ return PTR_TO_POLL(head);
+
+ pt->_qproc(file, head, pt);
+ }
+
+ return file->f_op->poll_mask(file, events);
+}
+EXPORT_SYMBOL_GPL(vfs_poll);
/*
* Estimate expected accuracy in ns from a timeval.
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 79c413985305..6ea2c0843bb1 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1708,6 +1708,8 @@ struct file_operations {
int (*iterate) (struct file *, struct dir_context *);
int (*iterate_shared) (struct file *, struct dir_context *);
__poll_t (*poll) (struct file *, struct poll_table_struct *);
+ struct wait_queue_head * (*get_poll_head)(struct file *, __poll_t);
+ __poll_t (*poll_mask) (struct file *, __poll_t);
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
int (*mmap) (struct file *, struct vm_area_struct *);
diff --git a/include/linux/poll.h b/include/linux/poll.h
index 7e0fdcf905d2..42e8e8665fb0 100644
--- a/include/linux/poll.h
+++ b/include/linux/poll.h
@@ -74,18 +74,37 @@ static inline void init_poll_funcptr(poll_table *pt, poll_queue_proc qproc)
pt->_key = ~(__poll_t)0; /* all events enabled */
}
+/*
+ * ->get_poll_head can return a __poll_t in the PTR_ERR, use these macros
+ * to return the value and recover it. They take care of the negation as
+ * well as of the annotations.
+ */
+#define POLL_TO_PTR(mask) (ERR_PTR(-(__force int)(mask)))
+#define PTR_TO_POLL(ptr) ((__force __poll_t)-PTR_ERR((ptr)))
+
static inline bool file_can_poll(struct file *file)
{
- return file->f_op->poll;
+ return file->f_op->poll ||
+ (file->f_op->get_poll_head && file->f_op->poll_mask);
}
-static inline __poll_t vfs_poll(struct file *file, struct poll_table_struct *pt)
+static inline struct wait_queue_head *vfs_get_poll_head(struct file *file,
+ __poll_t events)
{
- if (unlikely(!file->f_op->poll))
+ if (unlikely(!file->f_op->get_poll_head || !file->f_op->poll_mask))
+ return NULL;
+ return file->f_op->get_poll_head(file, events);
+}
+
+static inline __poll_t vfs_poll_mask(struct file *file, __poll_t events)
+{
+ if (unlikely(!file->f_op->poll_mask))
return DEFAULT_POLLMASK;
- return file->f_op->poll(file, pt);
+ return file->f_op->poll_mask(file, events) & events;
}
+__poll_t vfs_poll(struct file *file, struct poll_table_struct *pt);
+
struct poll_table_entry {
struct file *filp;
__poll_t key;
--
2.14.2
* [PATCH 06/28] aio: implement IOCB_CMD_POLL
From: Christoph Hellwig @ 2018-03-21 7:40 UTC (permalink / raw)
To: viro; +Cc: Avi Kivity, linux-aio, linux-fsdevel, netdev, linux-api,
linux-kernel
In-Reply-To: <20180321074032.14211-1-hch@lst.de>
Simple one-shot poll through the io_submit() interface. To poll for
a file descriptor the application should submit an iocb of type
IOCB_CMD_POLL. It will poll the fd for the events specified in the
first 32 bits of the aio_buf field of the iocb.
Unlike poll or epoll without EPOLLONESHOT, this interface always works
in one-shot mode: once the iocb is completed, it will have to be
resubmitted.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
---
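Editorial note, not part of the patch: a hedged userspace sketch of how an
application might submit a one-shot poll through io_submit(), using raw
syscalls. The helper name aio_poll_once and the constant
SKETCH_IOCB_CMD_POLL are hypothetical; the opcode value 5 is taken from the
aio_abi.h hunk in this patch. It assumes a kernel that accepts this opcode,
and returns a negative value when the kernel rejects the request or the
timeout expires.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/aio_abi.h>
#include <poll.h>
#include <string.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Opcode value from the patch; defined locally in case headers lack it. */
enum { SKETCH_IOCB_CMD_POLL = 5 };

/*
 * Submit one poll request for 'fd' and wait up to 'timeout_ms' for it.
 * Returns the reported event mask, or a negative value on failure/timeout.
 */
static long aio_poll_once(int fd, short events, long timeout_ms)
{
	aio_context_t ctx = 0;	/* must be zero before io_setup() */
	struct iocb cb;
	struct iocb *list[1] = { &cb };
	struct io_event ev;
	struct timespec ts = {
		.tv_sec = timeout_ms / 1000,
		.tv_nsec = (timeout_ms % 1000) * 1000000L,
	};
	long ret = -1;

	assert(timeout_ms >= 0);

	if (syscall(SYS_io_setup, 1, &ctx) < 0)
		return -1;

	/* Fields other than these must stay zero for a poll request. */
	memset(&cb, 0, sizeof(cb));
	cb.aio_lio_opcode = SKETCH_IOCB_CMD_POLL;
	cb.aio_fildes = fd;
	cb.aio_buf = (unsigned short)events;

	if (syscall(SYS_io_submit, ctx, 1L, list) == 1 &&
	    syscall(SYS_io_getevents, ctx, 1L, 1L, &ev, &ts) == 1)
		ret = (long)ev.res;	/* completed: res carries the events */

	/* Tears down the context; a still-pending request is cancelled. */
	syscall(SYS_io_destroy, ctx);
	return ret;
}
```

Because the interface is one-shot, a caller that wants continuous
notification would call a helper like this again after each completion.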
fs/aio.c | 102 ++++++++++++++++++++++++++++++++++++++++++-
include/uapi/linux/aio_abi.h | 6 +--
2 files changed, 103 insertions(+), 5 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c
index 79d3eb3d2dd9..38b408129697 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -5,6 +5,7 @@
* Implements an efficient asynchronous io interface.
*
* Copyright 2000, 2001, 2002 Red Hat, Inc. All Rights Reserved.
+ * Copyright 2018 Christoph Hellwig.
*
* See ../COPYING for licensing terms.
*/
@@ -162,10 +163,18 @@ struct fsync_iocb {
bool datasync;
};
+struct poll_iocb {
+ struct file *file;
+ __poll_t events;
+ struct wait_queue_head *head;
+ struct wait_queue_entry wait;
+};
+
struct aio_kiocb {
union {
struct kiocb rw;
struct fsync_iocb fsync;
+ struct poll_iocb poll;
};
struct kioctx *ki_ctx;
@@ -1590,7 +1599,6 @@ static int aio_fsync(struct fsync_iocb *req, struct iocb *iocb, bool datasync)
return -EINVAL;
if (iocb->aio_offset || iocb->aio_nbytes || iocb->aio_rw_flags)
return -EINVAL;
-
req->file = fget(iocb->aio_fildes);
if (unlikely(!req->file))
return -EBADF;
@@ -1609,6 +1617,96 @@ static int aio_fsync(struct fsync_iocb *req, struct iocb *iocb, bool datasync)
return ret;
}
+static void __aio_complete_poll(struct poll_iocb *req, __poll_t mask)
+{
+ fput(req->file);
+ aio_complete(container_of(req, struct aio_kiocb, poll),
+ mangle_poll(mask), 0);
+}
+
+static void aio_complete_poll(struct poll_iocb *req, __poll_t mask)
+{
+ struct aio_kiocb *iocb = container_of(req, struct aio_kiocb, poll);
+
+ if (!(iocb->flags & AIO_IOCB_CANCELLED))
+ __aio_complete_poll(req, mask);
+}
+
+static int aio_poll_cancel(struct kiocb *rw)
+{
+ struct aio_kiocb *iocb = container_of(rw, struct aio_kiocb, rw);
+
+ remove_wait_queue(iocb->poll.head, &iocb->poll.wait);
+ __aio_complete_poll(&iocb->poll, 0); /* no events to report */
+ return 0;
+}
+
+static int aio_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
+ void *key)
+{
+ struct poll_iocb *req = container_of(wait, struct poll_iocb, wait);
+ struct file *file = req->file;
+ __poll_t mask = key_to_poll(key);
+
+ assert_spin_locked(&req->head->lock);
+
+ /* for instances that support it check for an event match first: */
+ if (mask && !(mask & req->events))
+ return 0;
+
+ mask = vfs_poll_mask(file, req->events);
+ if (!mask)
+ return 0;
+
+ __remove_wait_queue(req->head, &req->wait);
+ aio_complete_poll(req, mask);
+ return 1;
+}
+
+static ssize_t aio_poll(struct aio_kiocb *aiocb, struct iocb *iocb)
+{
+ struct poll_iocb *req = &aiocb->poll;
+ unsigned long flags;
+ __poll_t mask;
+
+ /* reject any unknown events outside the normal event mask. */
+ if ((u16)iocb->aio_buf != iocb->aio_buf)
+ return -EINVAL;
+ /* reject fields that are not defined for poll */
+ if (iocb->aio_offset || iocb->aio_nbytes || iocb->aio_rw_flags)
+ return -EINVAL;
+
+ req->events = demangle_poll(iocb->aio_buf) | EPOLLERR | EPOLLHUP;
+ req->file = fget(iocb->aio_fildes);
+ if (unlikely(!req->file))
+ return -EBADF;
+
+ req->head = vfs_get_poll_head(req->file, req->events);
+ if (!req->head) {
+ fput(req->file);
+ return -EINVAL; /* same as no support for IOCB_CMD_POLL */
+ }
+ if (IS_ERR(req->head)) {
+ mask = PTR_TO_POLL(req->head);
+ goto done;
+ }
+
+ init_waitqueue_func_entry(&req->wait, aio_poll_wake);
+
+ spin_lock_irqsave(&req->head->lock, flags);
+ mask = vfs_poll_mask(req->file, req->events);
+ if (!mask) {
+ __kiocb_set_cancel_fn(aiocb, aio_poll_cancel,
+ AIO_IOCB_DELAYED_CANCEL);
+ __add_wait_queue(req->head, &req->wait);
+ }
+ spin_unlock_irqrestore(&req->head->lock, flags);
+done:
+ if (mask)
+ aio_complete_poll(req, mask);
+ return -EIOCBQUEUED;
+}
+
static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
struct iocb *iocb, bool compat)
{
@@ -1677,6 +1775,9 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
break;
case IOCB_CMD_FDSYNC:
ret = aio_fsync(&req->fsync, iocb, true);
+ break;
+ case IOCB_CMD_POLL:
+ ret = aio_poll(req, iocb);
break;
default:
pr_debug("invalid aio operation %d\n", iocb->aio_lio_opcode);
diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
index 2c0a3415beee..ed0185945bb2 100644
--- a/include/uapi/linux/aio_abi.h
+++ b/include/uapi/linux/aio_abi.h
@@ -39,10 +39,8 @@ enum {
IOCB_CMD_PWRITE = 1,
IOCB_CMD_FSYNC = 2,
IOCB_CMD_FDSYNC = 3,
- /* These two are experimental.
- * IOCB_CMD_PREADX = 4,
- * IOCB_CMD_POLL = 5,
- */
+ /* 4 was the experimental IOCB_CMD_PREADX */
+ IOCB_CMD_POLL = 5,
IOCB_CMD_NOOP = 6,
IOCB_CMD_PREADV = 7,
IOCB_CMD_PWRITEV = 8,
--
2.14.2
* [PATCH 07/28] net: refactor socket_poll
From: Christoph Hellwig @ 2018-03-21 7:40 UTC (permalink / raw)
To: viro; +Cc: Avi Kivity, linux-aio, linux-fsdevel, netdev, linux-api,
linux-kernel
In-Reply-To: <20180321074032.14211-1-hch@lst.de>
Factor out two busy poll related helpers for later reuse, and remove
a comment that isn't very helpful, especially with the __poll_t
annotations in place.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
include/net/busy_poll.h | 15 +++++++++++++++
net/socket.c | 21 ++++-----------------
2 files changed, 19 insertions(+), 17 deletions(-)
diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index 71c72a939bf8..c5187438af38 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -121,6 +121,21 @@ static inline void sk_busy_loop(struct sock *sk, int nonblock)
#endif
}
+static inline void sock_poll_busy_loop(struct socket *sock, __poll_t events)
+{
+ if (sk_can_busy_loop(sock->sk) &&
+ events && (events & POLL_BUSY_LOOP)) {
+ /* once, only if requested by syscall */
+ sk_busy_loop(sock->sk, 1);
+ }
+}
+
+/* if this socket can poll_ll, tell the system call */
+static inline __poll_t sock_poll_busy_flag(struct socket *sock)
+{
+ return sk_can_busy_loop(sock->sk) ? POLL_BUSY_LOOP : 0;
+}
+
/* used in the NIC receive handler to mark the skb */
static inline void skb_mark_napi_id(struct sk_buff *skb,
struct napi_struct *napi)
diff --git a/net/socket.c b/net/socket.c
index a93c99b518ca..3f859a07641a 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1117,24 +1117,11 @@ EXPORT_SYMBOL(sock_create_lite);
/* No kernel lock held - perfect */
static __poll_t sock_poll(struct file *file, poll_table *wait)
{
- __poll_t busy_flag = 0;
- struct socket *sock;
-
- /*
- * We can't return errors to poll, so it's either yes or no.
- */
- sock = file->private_data;
-
- if (sk_can_busy_loop(sock->sk)) {
- /* this socket can poll_ll so tell the system call */
- busy_flag = POLL_BUSY_LOOP;
-
- /* once, only if requested by syscall */
- if (wait && (wait->_key & POLL_BUSY_LOOP))
- sk_busy_loop(sock->sk, 1);
- }
+ struct socket *sock = file->private_data;
+ __poll_t events = poll_requested_events(wait);
- return busy_flag | sock->ops->poll(file, sock, wait);
+ sock_poll_busy_loop(sock, events);
+ return sock->ops->poll(file, sock, wait) | sock_poll_busy_flag(sock);
}
static int sock_mmap(struct file *file, struct vm_area_struct *vma)
--
2.14.2
* [PATCH 08/28] net: add support for ->poll_mask in proto_ops
From: Christoph Hellwig @ 2018-03-21 7:40 UTC (permalink / raw)
To: viro; +Cc: Avi Kivity, linux-aio, linux-fsdevel, netdev, linux-api,
linux-kernel
In-Reply-To: <20180321074032.14211-1-hch@lst.de>
The socket file operations still implement ->poll until all protocols are
switched over.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
include/linux/net.h | 3 +++
net/socket.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++-----
2 files changed, 49 insertions(+), 5 deletions(-)
diff --git a/include/linux/net.h b/include/linux/net.h
index 91216b16feb7..ce3d4dacb51e 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -147,6 +147,9 @@ struct proto_ops {
int (*getname) (struct socket *sock,
struct sockaddr *addr,
int *sockaddr_len, int peer);
+ struct wait_queue_head *(*get_poll_head)(struct socket *sock,
+ __poll_t events);
+ __poll_t (*poll_mask) (struct socket *sock, __poll_t events);
__poll_t (*poll) (struct file *file, struct socket *sock,
struct poll_table_struct *wait);
int (*ioctl) (struct socket *sock, unsigned int cmd,
diff --git a/net/socket.c b/net/socket.c
index 3f859a07641a..ceb69ddcd7bd 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -118,8 +118,10 @@ static ssize_t sock_write_iter(struct kiocb *iocb, struct iov_iter *from);
static int sock_mmap(struct file *file, struct vm_area_struct *vma);
static int sock_close(struct inode *inode, struct file *file);
-static __poll_t sock_poll(struct file *file,
- struct poll_table_struct *wait);
+static struct wait_queue_head *sock_get_poll_head(struct file *file,
+ __poll_t events);
+static __poll_t sock_poll_mask(struct file *file, __poll_t);
+static __poll_t sock_poll(struct file *file, struct poll_table_struct *wait);
static long sock_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
#ifdef CONFIG_COMPAT
static long compat_sock_ioctl(struct file *file,
@@ -142,6 +144,8 @@ static const struct file_operations socket_file_ops = {
.llseek = no_llseek,
.read_iter = sock_read_iter,
.write_iter = sock_write_iter,
+ .get_poll_head = sock_get_poll_head,
+ .poll_mask = sock_poll_mask,
.poll = sock_poll,
.unlocked_ioctl = sock_ioctl,
#ifdef CONFIG_COMPAT
@@ -1114,14 +1118,51 @@ int sock_create_lite(int family, int type, int protocol, struct socket **res)
}
EXPORT_SYMBOL(sock_create_lite);
+static struct wait_queue_head *sock_get_poll_head(struct file *file,
+ __poll_t events)
+{
+ struct socket *sock = file->private_data;
+
+ if (!sock->ops->poll_mask)
+ return NULL;
+ if (sock->ops->get_poll_head)
+ return sock->ops->get_poll_head(sock, events);
+
+ sock_poll_busy_loop(sock, events);
+ return sk_sleep(sock->sk);
+}
+
+static __poll_t sock_poll_mask(struct file *file, __poll_t events)
+{
+ struct socket *sock = file->private_data;
+
+ /*
+ * We need to be sure we are in sync with the socket flags modification.
+ *
+ * This memory barrier is paired in the wq_has_sleeper.
+ */
+ smp_mb();
+
+ /* this socket can poll_ll so tell the system call */
+ return sock->ops->poll_mask(sock, events) |
+ (sk_can_busy_loop(sock->sk) ? POLL_BUSY_LOOP : 0);
+}
+
/* No kernel lock held - perfect */
static __poll_t sock_poll(struct file *file, poll_table *wait)
{
struct socket *sock = file->private_data;
- __poll_t events = poll_requested_events(wait);
+ __poll_t events = poll_requested_events(wait), mask = 0;
- sock_poll_busy_loop(sock, events);
- return sock->ops->poll(file, sock, wait) | sock_poll_busy_flag(sock);
+ if (sock->ops->poll) {
+ sock_poll_busy_loop(sock, events);
+ mask = sock->ops->poll(file, sock, wait);
+ } else if (sock->ops->poll_mask) {
+ sock_poll_wait(file, sock_get_poll_head(file, events), wait);
+ mask = sock->ops->poll_mask(sock, events);
+ }
+
+ return mask | sock_poll_busy_flag(sock);
}
static int sock_mmap(struct file *file, struct vm_area_struct *vma)
--
2.14.2
* [PATCH 09/28] net: remove sock_no_poll
From: Christoph Hellwig @ 2018-03-21 7:40 UTC (permalink / raw)
To: viro; +Cc: Avi Kivity, linux-aio, linux-fsdevel, netdev, linux-api,
linux-kernel
In-Reply-To: <20180321074032.14211-1-hch@lst.de>
Now that sock_poll handles a NULL ->poll or ->poll_mask there is no need
for a stub.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
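Editorial note, not part of the patch: a small userspace model of why the
stub becomes redundant, assuming simplified stand-in types. The names
model_sock_poll, old_sock_no_poll and always_readable are hypothetical, and
the busy-poll flag that the real sock_poll ORs into the result is left out.
A protocol with neither method set yields the same empty mask that the stub
used to return.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int poll_mask_t;	/* stand-in for __poll_t */
#define SK_POLLIN 0x0001u

struct socket;

struct proto_ops {
	poll_mask_t (*poll)(struct socket *sock);
	poll_mask_t (*poll_mask)(struct socket *sock, poll_mask_t events);
};

struct socket {
	const struct proto_ops *ops;
};

/* Model of sock_poll(): try ->poll, then ->poll_mask, else report nothing. */
static poll_mask_t model_sock_poll(struct socket *sock, poll_mask_t events)
{
	poll_mask_t mask = 0;

	assert(sock && sock->ops);

	if (sock->ops->poll)
		mask = sock->ops->poll(sock);
	else if (sock->ops->poll_mask)
		mask = sock->ops->poll_mask(sock, events);

	return mask;
}

/* What the removed stub did: always report no events. */
static poll_mask_t old_sock_no_poll(struct socket *sock)
{
	(void)sock;
	return 0;
}

/* A converted protocol that is always readable, for contrast. */
static poll_mask_t always_readable(struct socket *sock, poll_mask_t events)
{
	(void)sock;
	(void)events;
	return SK_POLLIN;
}

static const struct proto_ops with_stub = { .poll = old_sock_no_poll };
static const struct proto_ops without_stub = { 0 };
static const struct proto_ops converted = { .poll_mask = always_readable };
```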
crypto/af_alg.c | 1 -
crypto/algif_hash.c | 2 --
crypto/algif_rng.c | 1 -
drivers/isdn/mISDN/socket.c | 1 -
drivers/net/ppp/pptp.c | 1 -
include/net/sock.h | 2 --
net/bluetooth/bnep/sock.c | 1 -
net/bluetooth/cmtp/sock.c | 1 -
net/bluetooth/hidp/sock.c | 1 -
net/core/sock.c | 6 ------
10 files changed, 17 deletions(-)
diff --git a/crypto/af_alg.c b/crypto/af_alg.c
index c49766b03165..50d75de539f5 100644
--- a/crypto/af_alg.c
+++ b/crypto/af_alg.c
@@ -347,7 +347,6 @@ static const struct proto_ops alg_proto_ops = {
.sendpage = sock_no_sendpage,
.sendmsg = sock_no_sendmsg,
.recvmsg = sock_no_recvmsg,
- .poll = sock_no_poll,
.bind = alg_bind,
.release = af_alg_release,
diff --git a/crypto/algif_hash.c b/crypto/algif_hash.c
index 6c9b1927a520..bfcf595fd8f9 100644
--- a/crypto/algif_hash.c
+++ b/crypto/algif_hash.c
@@ -288,7 +288,6 @@ static struct proto_ops algif_hash_ops = {
.mmap = sock_no_mmap,
.bind = sock_no_bind,
.setsockopt = sock_no_setsockopt,
- .poll = sock_no_poll,
.release = af_alg_release,
.sendmsg = hash_sendmsg,
@@ -396,7 +395,6 @@ static struct proto_ops algif_hash_ops_nokey = {
.mmap = sock_no_mmap,
.bind = sock_no_bind,
.setsockopt = sock_no_setsockopt,
- .poll = sock_no_poll,
.release = af_alg_release,
.sendmsg = hash_sendmsg_nokey,
diff --git a/crypto/algif_rng.c b/crypto/algif_rng.c
index 150c2b6480ed..22df3799a17b 100644
--- a/crypto/algif_rng.c
+++ b/crypto/algif_rng.c
@@ -106,7 +106,6 @@ static struct proto_ops algif_rng_ops = {
.bind = sock_no_bind,
.accept = sock_no_accept,
.setsockopt = sock_no_setsockopt,
- .poll = sock_no_poll,
.sendmsg = sock_no_sendmsg,
.sendpage = sock_no_sendpage,
diff --git a/drivers/isdn/mISDN/socket.c b/drivers/isdn/mISDN/socket.c
index c5603d1a07d6..c84270e16bdd 100644
--- a/drivers/isdn/mISDN/socket.c
+++ b/drivers/isdn/mISDN/socket.c
@@ -746,7 +746,6 @@ static const struct proto_ops base_sock_ops = {
.getname = sock_no_getname,
.sendmsg = sock_no_sendmsg,
.recvmsg = sock_no_recvmsg,
- .poll = sock_no_poll,
.listen = sock_no_listen,
.shutdown = sock_no_shutdown,
.setsockopt = sock_no_setsockopt,
diff --git a/drivers/net/ppp/pptp.c b/drivers/net/ppp/pptp.c
index 6dde9a0cfe76..87f892f1d0fe 100644
--- a/drivers/net/ppp/pptp.c
+++ b/drivers/net/ppp/pptp.c
@@ -627,7 +627,6 @@ static const struct proto_ops pptp_ops = {
.socketpair = sock_no_socketpair,
.accept = sock_no_accept,
.getname = pptp_getname,
- .poll = sock_no_poll,
.listen = sock_no_listen,
.shutdown = sock_no_shutdown,
.setsockopt = sock_no_setsockopt,
diff --git a/include/net/sock.h b/include/net/sock.h
index 169c92afcafa..d9249fe65859 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1585,8 +1585,6 @@ int sock_no_connect(struct socket *, struct sockaddr *, int, int);
int sock_no_socketpair(struct socket *, struct socket *);
int sock_no_accept(struct socket *, struct socket *, int, bool);
int sock_no_getname(struct socket *, struct sockaddr *, int *, int);
-__poll_t sock_no_poll(struct file *, struct socket *,
- struct poll_table_struct *);
int sock_no_ioctl(struct socket *, unsigned int, unsigned long);
int sock_no_listen(struct socket *, int);
int sock_no_shutdown(struct socket *, int);
diff --git a/net/bluetooth/bnep/sock.c b/net/bluetooth/bnep/sock.c
index b5116fa9835e..00deacdcb51c 100644
--- a/net/bluetooth/bnep/sock.c
+++ b/net/bluetooth/bnep/sock.c
@@ -175,7 +175,6 @@ static const struct proto_ops bnep_sock_ops = {
.getname = sock_no_getname,
.sendmsg = sock_no_sendmsg,
.recvmsg = sock_no_recvmsg,
- .poll = sock_no_poll,
.listen = sock_no_listen,
.shutdown = sock_no_shutdown,
.setsockopt = sock_no_setsockopt,
diff --git a/net/bluetooth/cmtp/sock.c b/net/bluetooth/cmtp/sock.c
index ce86a7bae844..e08f28fadd65 100644
--- a/net/bluetooth/cmtp/sock.c
+++ b/net/bluetooth/cmtp/sock.c
@@ -178,7 +178,6 @@ static const struct proto_ops cmtp_sock_ops = {
.getname = sock_no_getname,
.sendmsg = sock_no_sendmsg,
.recvmsg = sock_no_recvmsg,
- .poll = sock_no_poll,
.listen = sock_no_listen,
.shutdown = sock_no_shutdown,
.setsockopt = sock_no_setsockopt,
diff --git a/net/bluetooth/hidp/sock.c b/net/bluetooth/hidp/sock.c
index 008ba439bd62..1eaac01f85de 100644
--- a/net/bluetooth/hidp/sock.c
+++ b/net/bluetooth/hidp/sock.c
@@ -208,7 +208,6 @@ static const struct proto_ops hidp_sock_ops = {
.getname = sock_no_getname,
.sendmsg = sock_no_sendmsg,
.recvmsg = sock_no_recvmsg,
- .poll = sock_no_poll,
.listen = sock_no_listen,
.shutdown = sock_no_shutdown,
.setsockopt = sock_no_setsockopt,
diff --git a/net/core/sock.c b/net/core/sock.c
index c501499a04fe..b72b6ad050e4 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2503,12 +2503,6 @@ int sock_no_getname(struct socket *sock, struct sockaddr *saddr,
}
EXPORT_SYMBOL(sock_no_getname);
-__poll_t sock_no_poll(struct file *file, struct socket *sock, poll_table *pt)
-{
- return 0;
-}
-EXPORT_SYMBOL(sock_no_poll);
-
int sock_no_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg)
{
return -EOPNOTSUPP;
--
2.14.2
* [PATCH 10/28] net/tcp: convert to ->poll_mask
From: Christoph Hellwig @ 2018-03-21 7:40 UTC (permalink / raw)
To: viro; +Cc: Avi Kivity, linux-aio, linux-fsdevel, netdev, linux-api,
linux-kernel
In-Reply-To: <20180321074032.14211-1-hch@lst.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
include/net/tcp.h | 4 ++--
net/ipv4/af_inet.c | 3 ++-
net/ipv4/tcp.c | 31 ++++++++++++++-----------------
net/ipv6/af_inet6.c | 3 ++-
4 files changed, 20 insertions(+), 21 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index e3fc667f9ac2..fb52f93d556c 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -387,8 +387,8 @@ bool tcp_peer_is_proven(struct request_sock *req, struct dst_entry *dst);
void tcp_close(struct sock *sk, long timeout);
void tcp_init_sock(struct sock *sk);
void tcp_init_transfer(struct sock *sk, int bpf_op);
-__poll_t tcp_poll(struct file *file, struct socket *sock,
- struct poll_table_struct *wait);
+struct wait_queue_head *tcp_get_poll_head(struct socket *sock, __poll_t events);
+__poll_t tcp_poll_mask(struct socket *sock, __poll_t events);
int tcp_getsockopt(struct sock *sk, int level, int optname,
char __user *optval, int __user *optlen);
int tcp_setsockopt(struct sock *sk, int level, int optname,
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index e4329e161943..ec32cc263b18 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -952,7 +952,8 @@ const struct proto_ops inet_stream_ops = {
.socketpair = sock_no_socketpair,
.accept = inet_accept,
.getname = inet_getname,
- .poll = tcp_poll,
+ .get_poll_head = tcp_get_poll_head,
+ .poll_mask = tcp_poll_mask,
.ioctl = inet_ioctl,
.listen = inet_listen,
.shutdown = inet_shutdown,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 48636aee23c3..ad8e281066a0 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -484,33 +484,30 @@ static void tcp_tx_timestamp(struct sock *sk, u16 tsflags)
}
}
+struct wait_queue_head *tcp_get_poll_head(struct socket *sock, __poll_t events)
+{
+ sock_poll_busy_loop(sock, events);
+ sock_rps_record_flow(sock->sk);
+ return sk_sleep(sock->sk);
+}
+EXPORT_SYMBOL(tcp_get_poll_head);
+
/*
- * Wait for a TCP event.
- *
- * Note that we don't need to lock the socket, as the upper poll layers
- * take care of normal races (between the test and the event) and we don't
- * go look at any of the socket buffers directly.
+ * Socket is not locked. We are protected from async events by poll logic and
+ * correct handling of state changes made by other threads is impossible in
+ * any case.
*/
-__poll_t tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
+__poll_t tcp_poll_mask(struct socket *sock, __poll_t events)
{
- __poll_t mask;
struct sock *sk = sock->sk;
const struct tcp_sock *tp = tcp_sk(sk);
+ __poll_t mask = 0;
int state;
- sock_poll_wait(file, sk_sleep(sk), wait);
-
state = inet_sk_state_load(sk);
if (state == TCP_LISTEN)
return inet_csk_listen_poll(sk);
- /* Socket is not locked. We are protected from async events
- * by poll logic and correct handling of state changes
- * made by other threads is impossible in any case.
- */
-
- mask = 0;
-
/*
* EPOLLHUP is certainly not done right. But poll() doesn't
* have a notion of HUP in just one direction, and for a
@@ -591,7 +588,7 @@ __poll_t tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
return mask;
}
-EXPORT_SYMBOL(tcp_poll);
+EXPORT_SYMBOL(tcp_poll_mask);
int tcp_ioctl(struct sock *sk, int cmd, unsigned long arg)
{
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 416917719a6f..c470549d6ef9 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -547,7 +547,8 @@ const struct proto_ops inet6_stream_ops = {
.socketpair = sock_no_socketpair, /* a do nothing */
.accept = inet_accept, /* ok */
.getname = inet6_getname,
- .poll = tcp_poll, /* ok */
+ .get_poll_head = tcp_get_poll_head,
+ .poll_mask = tcp_poll_mask, /* ok */
.ioctl = inet6_ioctl, /* must change */
.listen = inet_listen, /* ok */
.shutdown = inet_shutdown, /* ok */
--
2.14.2