Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH v4 1/2] kernel.h: Introduce const_max() for VLA removal
From: Kees Cook @ 2018-03-15 23:31 UTC (permalink / raw)
  To: Miguel Ojeda
  Cc: Linus Torvalds, Andrew Morton, Josh Poimboeuf, Rasmus Villemoes,
	Randy Dunlap, Ingo Molnar, David Laight, Ian Abbott, linux-input,
	linux-btrfs, Network Development, Linux Kernel Mailing List,
	Kernel Hardening
In-Reply-To: <CANiq72m7GEZS=Wkzg5KLkTnzW8vYz8X90OwVs4r5vcCCAp1-Pg@mail.gmail.com>

On Thu, Mar 15, 2018 at 4:17 PM, Miguel Ojeda
<miguel.ojeda.sandonis@gmail.com> wrote:
>> The full one, using your naming convention:
>>
>> #define const_max(x, y)                                          \
>>     ({                                                           \
>>         if (!__builtin_constant_p(x))                            \
>>             __error_not_const_arg();                             \
>>         if (!__builtin_constant_p(y))                            \
>>             __error_not_const_arg();                             \
>>         if (!__builtin_types_compatible_p(typeof(x), typeof(y))) \
>>             __error_incompatible_types();                        \
>>         if ((x) < 0)                                             \
>>             __error_not_positive_arg();                          \
>>         if ((y) < 0)                                             \
>>             __error_not_positive_arg();                          \
>>         __builtin_choose_expr((x) > (y), (x), (y));              \
>>     })
>>
>
> Nevermind... gcc doesn't take that as a constant expr, even if it
> compiles as one at -O0.

Yeah, unfortunately. :(

-Kees

-- 
Kees Cook
Pixel Security

^ permalink raw reply

* Re: [PATCH 6/7] e1000: eliminate duplicate barriers on weakly-ordered archs
From: Sinan Kaya @ 2018-03-15 23:30 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Netdev, Timur Tabi, sulrich, linux-arm-msm, linux-arm-kernel,
	Jeff Kirsher, intel-wired-lan, LKML
In-Reply-To: <CAKgT0UfJwYFy5g_2TE5W4MgdEvyH2CvBAS2imdw9UUiumuUOhQ@mail.gmail.com>

On 3/14/2018 9:41 PM, Alexander Duyck wrote:
>>  }
>>
> So you missed the writel in e1000_xmit_frame. You should probably get
> that one too while you are doing these updates. The wmb() is in
> e1000_tx_queue().
> 

I brought wmb() outside along with the next descriptor assignment to be
similar to the rest of the other code.

if wmb() and writel() are not visible in the same function, let's not touch
the code.

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply

* Re: [PATCH v4 1/2] kernel.h: Introduce const_max() for VLA removal
From: Miguel Ojeda @ 2018-03-15 23:17 UTC (permalink / raw)
  To: Kees Cook
  Cc: Linus Torvalds, Andrew Morton, Josh Poimboeuf, Rasmus Villemoes,
	Randy Dunlap, Ingo Molnar, David Laight, Ian Abbott, linux-input,
	linux-btrfs, Network Development, Linux Kernel Mailing List,
	Kernel Hardening
In-Reply-To: <CANiq72nEsno3JH=c_Xaf9gn1pJdM=ni6Z0ZUDqcEF07FT+SBSw@mail.gmail.com>

On Fri, Mar 16, 2018 at 12:08 AM, Miguel Ojeda
<miguel.ojeda.sandonis@gmail.com> wrote:
> On Thu, Mar 15, 2018 at 11:58 PM, Miguel Ojeda
> <miguel.ojeda.sandonis@gmail.com> wrote:
>> On Thu, Mar 15, 2018 at 11:46 PM, Kees Cook <keescook@chromium.org> wrote:
>>>
>>> By using this eye-bleed:
>>>
>>> size_t __error_not_const_arg(void) \
>>> __compiletime_error("const_max() used with non-compile-time constant arg");
>>> size_t __error_not_positive_arg(void) \
>>> __compiletime_error("const_max() used with negative arg");
>>> #define const_max(x, y)                                                 \
>>>         __builtin_choose_expr(__builtin_constant_p(x) &&                \
>>>                               __builtin_constant_p(y),                  \
>>>                 __builtin_choose_expr((x) >= 0 && (y) >= 0,             \
>>>                                       (typeof(x))(x) > (typeof(y))(y) ? \
>>>                                         (x) : (y),                      \
>>>                                       __error_not_positive_arg()),      \
>>>                 __error_not_const_arg())
>>>
>>
>> I was writing it like this:
>>
>> #define const_max(a, b) \
>>     ({ \
>>         if ((a) < 0) \
>>             __const_max_called_with_negative_value(); \
>>         if ((b) < 0) \
>>             __const_max_called_with_negative_value(); \
>>         if (!__builtin_types_compatible_p(typeof(a), typeof(b))) \
>>             __const_max_called_with_incompatible_types(); \
>>         __builtin_choose_expr((a) > (b), (a), (b)); \
>> })
>
> The full one, using your naming convention:
>
> #define const_max(x, y)                                          \
>     ({                                                           \
>         if (!__builtin_constant_p(x))                            \
>             __error_not_const_arg();                             \
>         if (!__builtin_constant_p(y))                            \
>             __error_not_const_arg();                             \
>         if (!__builtin_types_compatible_p(typeof(x), typeof(y))) \
>             __error_incompatible_types();                        \
>         if ((x) < 0)                                             \
>             __error_not_positive_arg();                          \
>         if ((y) < 0)                                             \
>             __error_not_positive_arg();                          \
>         __builtin_choose_expr((x) > (y), (x), (y));              \
>     })
>

Nevermind... gcc doesn't take that as a constant expr, even if it
compiles as one at -O0.

^ permalink raw reply

* Re: [PATCH v4 1/2] kernel.h: Introduce const_max() for VLA removal
From: Miguel Ojeda @ 2018-03-15 23:08 UTC (permalink / raw)
  To: Kees Cook
  Cc: Linus Torvalds, Andrew Morton, Josh Poimboeuf, Rasmus Villemoes,
	Randy Dunlap, Ingo Molnar, David Laight, Ian Abbott, linux-input,
	linux-btrfs, Network Development, Linux Kernel Mailing List,
	Kernel Hardening
In-Reply-To: <CANiq72k4qvXBy-VbFc5uOh-wAMx0yui5JokzX=NXtgZJ6F_NEg@mail.gmail.com>

On Thu, Mar 15, 2018 at 11:58 PM, Miguel Ojeda
<miguel.ojeda.sandonis@gmail.com> wrote:
> On Thu, Mar 15, 2018 at 11:46 PM, Kees Cook <keescook@chromium.org> wrote:
>>
>> By using this eye-bleed:
>>
>> size_t __error_not_const_arg(void) \
>> __compiletime_error("const_max() used with non-compile-time constant arg");
>> size_t __error_not_positive_arg(void) \
>> __compiletime_error("const_max() used with negative arg");
>> #define const_max(x, y)                                                 \
>>         __builtin_choose_expr(__builtin_constant_p(x) &&                \
>>                               __builtin_constant_p(y),                  \
>>                 __builtin_choose_expr((x) >= 0 && (y) >= 0,             \
>>                                       (typeof(x))(x) > (typeof(y))(y) ? \
>>                                         (x) : (y),                      \
>>                                       __error_not_positive_arg()),      \
>>                 __error_not_const_arg())
>>
>
> I was writing it like this:
>
> #define const_max(a, b) \
>     ({ \
>         if ((a) < 0) \
>             __const_max_called_with_negative_value(); \
>         if ((b) < 0) \
>             __const_max_called_with_negative_value(); \
>         if (!__builtin_types_compatible_p(typeof(a), typeof(b))) \
>             __const_max_called_with_incompatible_types(); \
>         __builtin_choose_expr((a) > (b), (a), (b)); \
> })

The full one, using your naming convention:

#define const_max(x, y)                                          \
    ({                                                           \
        if (!__builtin_constant_p(x))                            \
            __error_not_const_arg();                             \
        if (!__builtin_constant_p(y))                            \
            __error_not_const_arg();                             \
        if (!__builtin_types_compatible_p(typeof(x), typeof(y))) \
            __error_incompatible_types();                        \
        if ((x) < 0)                                             \
            __error_not_positive_arg();                          \
        if ((y) < 0)                                             \
            __error_not_positive_arg();                          \
        __builtin_choose_expr((x) > (y), (x), (y));              \
    })

Miguel

^ permalink raw reply

* Announce: Netdev 0x12 Conference
From: Jamal Hadi Salim @ 2018-03-15 23:08 UTC (permalink / raw)
  To: netdev@vger.kernel.org, board

The NetDev Society is pleased to announce that Netdev 0x12
will take place July 11-13, 2018 in Montreal, Canada.

More details here:
https://www.netdevconf.org/0x12

For regular updates, please subscribe to people@lists.netdevconf.org
(more info at: https://lists.netdevconf.org/cgi-bin/mailman/listinfo/people)

If twitter is your thing then follow us: @netdev01
and use hashtag #netdevconf

cheers,
jamal(on behalf of NetDev Society)

^ permalink raw reply

* Re: [bpf-next PATCH v2 05/18] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data
From: Alexei Starovoitov @ 2018-03-15 23:06 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: John Fastabend, davem, ast, davejwatson, netdev
In-Reply-To: <e0e5622d-20d2-1aa0-e5dd-20f1417c875e@iogearbox.net>

On Thu, Mar 15, 2018 at 11:55:39PM +0100, Daniel Borkmann wrote:
> On 03/15/2018 11:20 PM, Alexei Starovoitov wrote:
> > On Thu, Mar 15, 2018 at 11:17:12PM +0100, Daniel Borkmann wrote:
> >> On 03/15/2018 10:59 PM, Alexei Starovoitov wrote:
> >>> On Mon, Mar 12, 2018 at 12:23:29PM -0700, John Fastabend wrote:
> >>>>  
> >>>> +/* User return codes for SK_MSG prog type. */
> >>>> +enum sk_msg_action {
> >>>> +	SK_MSG_DROP = 0,
> >>>> +	SK_MSG_PASS,
> >>>> +};
> >>>
> >>> do we really need new enum here?
> >>> It's the same as 'enum sk_action' and SK_DROP == SK_MSG_DROP
> >>> and there will be only drop/pass in both enums.
> >>> Also I don't see where these two new SK_MSG_* are used...
> >>>
> >>>> +
> >>>> +/* user accessible metadata for SK_MSG packet hook, new fields must
> >>>> + * be added to the end of this structure
> >>>> + */
> >>>> +struct sk_msg_md {
> >>>> +	__u32 data;
> >>>> +	__u32 data_end;
> >>>> +};
> >>>
> >>> I think it's time for me to ask for forgiveness :)
> >>
> >> :-)
> >>
> >>> I used __u32 for data and data_end only because all other fields
> >>> in __sk_buff were __u32 at the time and I couldn't easily figure out
> >>> how to teach verifier to recognize 8-byte rewrites.
> >>> Unfortunately my mistake stuck and was copied over into xdp.
> >>> Since this is new struct let's do it right and add
> >>> 'void *data, *data_end' here,
> >>> since bpf prog will use them as 'void *' pointers.
> >>> There are no compat issues here, since bpf is always 64-bit.
> >>
> >> But at least offset-wise when you do the ctx rewrite this would then
> >> be a bit more tricky when you have 64 bit kernel with 32 bit user
> >> space since void * members are in each cases at different offset. So
> >> unless I'm missing something, this still should either be __u32 or
> >> __u64 instead of void *, no?
> > 
> > there is no 32-bit user space. these structs are seen by bpf progs only
> > and bpf is 64-bit only too.
> > unless I'm missing your point.
> 
> Ok, so lets say you have 32 bit LLVM binary and compile the prog where
> you access md->data_end. Given the void * in the struct will that access
> end up being BPF_W at ctx offset 4 or BPF_DW at ctx offset 8 from clang
> perspective (iow, is the back end treating this special and always use
> fixed BPF_DW in such case)? If not and it would be the first case with
> offset 4, then we could have the case that underlying 64 bit kernel is
> expecting ctx offset 8 for doing the md ctx conversion.

i'm still not quite following.
Whether llvm itself is 32-bit binary or it's arm32 or sprac32 binary
doesn't matter. It will produce the same 64-bit bpf code.
It will see 'void *' deref from this struct and will emit DW.
May be confusion is from newly added -mattr=+alu32 flag?
That option doesn't change that sizeof(void*)==8.
It only allows backend to emit 32-bit alu insns.

^ permalink raw reply

* [PATCH net 5/5] net/sched: fix NULL dereference on the error path of tcf_skbmod_init()
From: Davide Caratti @ 2018-03-15 23:00 UTC (permalink / raw)
  To: Cong Wang, Jiri Pirko, David S. Miller; +Cc: Roman Mashak, Manish Kurup, netdev
In-Reply-To: <cover.1521154629.git.dcaratti@redhat.com>

when the following command

 # tc action replace action skbmod swap mac index 100

is run for the first time, and tcf_skbmod_init() fails to allocate struct
tcf_skbmod_params, tcf_skbmod_cleanup() calls kfree_rcu(NULL), thus
causing the following error:

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
 IP: __call_rcu+0x23/0x2b0
 PGD 8000000034057067 P4D 8000000034057067 PUD 74937067 PMD 0
 Oops: 0002 [#1] SMP PTI
 Modules linked in: act_skbmod(E) psample ip6table_filter ip6_tables iptable_filter binfmt_misc ext4 snd_hda_codec_generic snd_hda_intel snd_hda_codec crct10dif_pclmul mbcache jbd2 crc32_pclmul snd_hda_core ghash_clmulni_intel snd_hwdep pcbc snd_seq snd_seq_device snd_pcm aesni_intel snd_timer crypto_simd glue_helper snd cryptd virtio_balloon joydev soundcore pcspkr i2c_piix4 nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c ata_generic pata_acpi qxl drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm virtio_console virtio_net virtio_blk ata_piix libata crc32c_intel virtio_pci serio_raw virtio_ring virtio i2c_core floppy dm_mirror dm_region_hash dm_log dm_mod [last unloaded: act_skbmod]
 CPU: 3 PID: 3144 Comm: tc Tainted: G            E    4.16.0-rc4.act_vlan.orig+ #403
 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
 RIP: 0010:__call_rcu+0x23/0x2b0
 RSP: 0018:ffffbd2e403e7798 EFLAGS: 00010246
 RAX: ffffffffc0872080 RBX: ffff981d34bff780 RCX: 00000000ffffffff
 RDX: ffffffff922a5f00 RSI: 0000000000000000 RDI: 0000000000000000
 RBP: 0000000000000000 R08: 0000000000000001 R09: 000000000000021f
 R10: 000000003d003000 R11: 0000000000aaaaaa R12: 0000000000000000
 R13: ffffffff922a5f00 R14: 0000000000000001 R15: ffff981d3b698c2c
 FS:  00007f3678292740(0000) GS:ffff981d3fd80000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000008 CR3: 000000007c57a006 CR4: 00000000001606e0
 Call Trace:
  __tcf_idr_release+0x79/0xf0
  tcf_skbmod_init+0x1d1/0x210 [act_skbmod]
  tcf_action_init_1+0x2cc/0x430
  tcf_action_init+0xd3/0x1b0
  tc_ctl_action+0x18b/0x240
  rtnetlink_rcv_msg+0x29c/0x310
  ? _cond_resched+0x15/0x30
  ? __kmalloc_node_track_caller+0x1b9/0x270
  ? rtnl_calcit.isra.28+0x100/0x100
  netlink_rcv_skb+0xd2/0x110
  netlink_unicast+0x17c/0x230
  netlink_sendmsg+0x2cd/0x3c0
  sock_sendmsg+0x30/0x40
  ___sys_sendmsg+0x27a/0x290
  ? filemap_map_pages+0x34a/0x3a0
  ? __handle_mm_fault+0xbfd/0xe20
  __sys_sendmsg+0x51/0x90
  do_syscall_64+0x6e/0x1a0
  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
 RIP: 0033:0x7f36776a3ba0
 RSP: 002b:00007fff4703b618 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
 RAX: ffffffffffffffda RBX: 00007fff4703b740 RCX: 00007f36776a3ba0
 RDX: 0000000000000000 RSI: 00007fff4703b690 RDI: 0000000000000003
 RBP: 000000005aaaba36 R08: 0000000000000002 R09: 0000000000000000
 R10: 00007fff4703b0a0 R11: 0000000000000246 R12: 0000000000000000
 R13: 00007fff4703b754 R14: 0000000000000001 R15: 0000000000669f60
 Code: 5d e9 42 da ff ff 66 90 0f 1f 44 00 00 41 57 41 56 41 55 49 89 d5 41 54 55 48 89 fd 53 48 83 ec 08 40 f6 c7 07 0f 85 19 02 00 00 <48> 89 75 08 48 c7 45 00 00 00 00 00 9c 58 0f 1f 44 00 00 49 89
 RIP: __call_rcu+0x23/0x2b0 RSP: ffffbd2e403e7798
 CR2: 0000000000000008

Fix it in tcf_skbmod_cleanup(), ensuring that kfree_rcu(p, ...) is called
only when p is not NULL.

Fixes: 86da71b57383 ("net_sched: Introduce skbmod action")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
---
 net/sched/act_skbmod.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/sched/act_skbmod.c b/net/sched/act_skbmod.c
index fa975262dbac..d09565d6433e 100644
--- a/net/sched/act_skbmod.c
+++ b/net/sched/act_skbmod.c
@@ -190,7 +190,8 @@ static void tcf_skbmod_cleanup(struct tc_action *a)
 	struct tcf_skbmod_params  *p;
 
 	p = rcu_dereference_protected(d->skbmod_p, 1);
-	kfree_rcu(p, rcu);
+	if (p)
+		kfree_rcu(p, rcu);
 }
 
 static int tcf_skbmod_dump(struct sk_buff *skb, struct tc_action *a,
-- 
2.14.3

^ permalink raw reply related

* [PATCH net 4/5] net/sched: fix NULL dereference in the error path of tcf_sample_init()
From: Davide Caratti @ 2018-03-15 23:00 UTC (permalink / raw)
  To: Cong Wang, Jiri Pirko, David S. Miller; +Cc: Roman Mashak, Manish Kurup, netdev
In-Reply-To: <cover.1521154629.git.dcaratti@redhat.com>

when the following command

 # tc action add action sample rate 100 group 100 index 100

is run for the first time, and psample_group_get(100) fails to create a
new group, tcf_sample_cleanup() calls psample_group_put(NULL), thus
causing the following error:

 BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
 IP: psample_group_put+0x15/0x71 [psample]
 PGD 8000000075775067 P4D 8000000075775067 PUD 7453c067 PMD 0
 Oops: 0002 [#1] SMP PTI
 Modules linked in: act_sample(E) psample ip6table_filter ip6_tables iptable_filter binfmt_misc ext4 snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core mbcache jbd2 crct10dif_pclmul snd_hwdep crc32_pclmul snd_seq ghash_clmulni_intel pcbc snd_seq_device snd_pcm aesni_intel crypto_simd snd_timer glue_helper snd cryptd joydev pcspkr i2c_piix4 soundcore virtio_balloon nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c ata_generic pata_acpi qxl drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm virtio_net ata_piix virtio_console virtio_blk libata serio_raw crc32c_intel virtio_pci i2c_core virtio_ring virtio floppy dm_mirror dm_region_hash dm_log dm_mod [last unloaded: act_tunnel_key]
 CPU: 2 PID: 5740 Comm: tc Tainted: G            E    4.16.0-rc4.act_vlan.orig+ #403
 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
 RIP: 0010:psample_group_put+0x15/0x71 [psample]
 RSP: 0018:ffffb8a80032f7d0 EFLAGS: 00010246
 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000024
 RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffffffc06d93c0
 RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000044
 R10: 00000000bd003000 R11: ffff979fba04aa59 R12: 0000000000000000
 R13: 0000000000000000 R14: 0000000000000000 R15: ffff979fbba3f22c
 FS:  00007f7638112740(0000) GS:ffff979fbfd00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 000000000000001c CR3: 00000000734ea001 CR4: 00000000001606e0
 Call Trace:
  __tcf_idr_release+0x79/0xf0
  tcf_sample_init+0x125/0x1d0 [act_sample]
  tcf_action_init_1+0x2cc/0x430
  tcf_action_init+0xd3/0x1b0
  tc_ctl_action+0x18b/0x240
  rtnetlink_rcv_msg+0x29c/0x310
  ? _cond_resched+0x15/0x30
  ? __kmalloc_node_track_caller+0x1b9/0x270
  ? rtnl_calcit.isra.28+0x100/0x100
  netlink_rcv_skb+0xd2/0x110
  netlink_unicast+0x17c/0x230
  netlink_sendmsg+0x2cd/0x3c0
  sock_sendmsg+0x30/0x40
  ___sys_sendmsg+0x27a/0x290
  ? filemap_map_pages+0x34a/0x3a0
  ? __handle_mm_fault+0xbfd/0xe20
  __sys_sendmsg+0x51/0x90
  do_syscall_64+0x6e/0x1a0
  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
 RIP: 0033:0x7f7637523ba0
 RSP: 002b:00007fff0473ef58 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
 RAX: ffffffffffffffda RBX: 00007fff0473f080 RCX: 00007f7637523ba0
 RDX: 0000000000000000 RSI: 00007fff0473efd0 RDI: 0000000000000003
 RBP: 000000005aaaac80 R08: 0000000000000002 R09: 0000000000000000
 R10: 00007fff0473e9e0 R11: 0000000000000246 R12: 0000000000000000
 R13: 00007fff0473f094 R14: 0000000000000001 R15: 0000000000669f60
 Code: be 02 00 00 00 48 89 df e8 a9 fe ff ff e9 7c ff ff ff 0f 1f 40 00 0f 1f 44 00 00 53 48 89 fb 48 c7 c7 c0 93 6d c0 e8 db 20 8c ef <83> 6b 1c 01 74 10 48 c7 c7 c0 93 6d c0 ff 14 25 e8 83 83 b0 5b
 RIP: psample_group_put+0x15/0x71 [psample] RSP: ffffb8a80032f7d0
 CR2: 000000000000001c

Fix it in tcf_sample_cleanup(), ensuring that calls to psample_group_put(p)
are done only when p is not NULL.

Fixes: cadb9c9fdbc6 ("net/sched: act_sample: Fix error path in init")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
---
 net/sched/act_sample.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/sched/act_sample.c b/net/sched/act_sample.c
index 1ba0df238756..74c5d7e6a0fa 100644
--- a/net/sched/act_sample.c
+++ b/net/sched/act_sample.c
@@ -103,7 +103,8 @@ static void tcf_sample_cleanup(struct tc_action *a)
 
 	psample_group = rtnl_dereference(s->psample_group);
 	RCU_INIT_POINTER(s->psample_group, NULL);
-	psample_group_put(psample_group);
+	if (psample_group)
+		psample_group_put(psample_group);
 }
 
 static bool tcf_sample_dev_ok_push(struct net_device *dev)
-- 
2.14.3

^ permalink raw reply related

* [PATCH net 3/5] net/sched: fix NULL dereference in the error path of tunnel_key_init()
From: Davide Caratti @ 2018-03-15 23:00 UTC (permalink / raw)
  To: Cong Wang, Jiri Pirko, David S. Miller; +Cc: Roman Mashak, Manish Kurup, netdev
In-Reply-To: <cover.1521154629.git.dcaratti@redhat.com>

when the following command

 # tc action add action tunnel_key unset index 100

is run for the first time, and tunnel_key_init() fails to allocate struct
tcf_tunnel_key_params, tunnel_key_release() dereferences NULL pointers.
This causes the following error:

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
 IP: tunnel_key_release+0xd/0x40 [act_tunnel_key]
 PGD 8000000033787067 P4D 8000000033787067 PUD 74646067 PMD 0
 Oops: 0000 [#1] SMP PTI
 Modules linked in: act_tunnel_key(E) act_csum ip6table_filter ip6_tables iptable_filter binfmt_misc ext4 mbcache jbd2 crct10dif_pclmul crc32_pclmul snd_hda_codec_generic ghash_clmulni_intel snd_hda_intel pcbc snd_hda_codec snd_hda_core snd_hwdep snd_seq aesni_intel snd_seq_device crypto_simd glue_helper snd_pcm cryptd joydev snd_timer pcspkr virtio_balloon snd i2c_piix4 soundcore nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c ata_generic pata_acpi qxl drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm virtio_net virtio_blk drm virtio_console crc32c_intel ata_piix serio_raw i2c_core virtio_pci libata virtio_ring virtio floppy dm_mirror dm_region_hash dm_log dm_mod
 CPU: 2 PID: 3101 Comm: tc Tainted: G            E    4.16.0-rc4.act_vlan.orig+ #403
 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
 RIP: 0010:tunnel_key_release+0xd/0x40 [act_tunnel_key]
 RSP: 0018:ffffba46803b7768 EFLAGS: 00010286
 RAX: ffffffffc09010a0 RBX: 0000000000000000 RCX: 0000000000000024
 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff99ee336d7480
 RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000044
 R10: 0000000000000220 R11: ffff99ee79d73131 R12: 0000000000000000
 R13: ffff99ee32d67610 R14: ffff99ee7671dc38 R15: 00000000fffffff4
 FS:  00007febcb2cd740(0000) GS:ffff99ee7fd00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000010 CR3: 000000007c8e4005 CR4: 00000000001606e0
 Call Trace:
  __tcf_idr_release+0x79/0xf0
  tunnel_key_init+0xd9/0x460 [act_tunnel_key]
  tcf_action_init_1+0x2cc/0x430
  tcf_action_init+0xd3/0x1b0
  tc_ctl_action+0x18b/0x240
  rtnetlink_rcv_msg+0x29c/0x310
  ? _cond_resched+0x15/0x30
  ? __kmalloc_node_track_caller+0x1b9/0x270
  ? rtnl_calcit.isra.28+0x100/0x100
  netlink_rcv_skb+0xd2/0x110
  netlink_unicast+0x17c/0x230
  netlink_sendmsg+0x2cd/0x3c0
  sock_sendmsg+0x30/0x40
  ___sys_sendmsg+0x27a/0x290
  __sys_sendmsg+0x51/0x90
  do_syscall_64+0x6e/0x1a0
  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
 RIP: 0033:0x7febca6deba0
 RSP: 002b:00007ffe7b0dd128 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
 RAX: ffffffffffffffda RBX: 00007ffe7b0dd250 RCX: 00007febca6deba0
 RDX: 0000000000000000 RSI: 00007ffe7b0dd1a0 RDI: 0000000000000003
 RBP: 000000005aaa90cb R08: 0000000000000002 R09: 0000000000000000
 R10: 00007ffe7b0dcba0 R11: 0000000000000246 R12: 0000000000000000
 R13: 00007ffe7b0dd264 R14: 0000000000000001 R15: 0000000000669f60
 Code: 44 00 00 8b 0d b5 23 00 00 48 8b 87 48 10 00 00 48 8b 3c c8 e9 a5 e5 d8 c3 0f 1f 44 00 00 0f 1f 44 00 00 53 48 8b 9f b0 00 00 00 <83> 7b 10 01 74 0b 48 89 df 31 f6 5b e9 f2 fa 7f c3 48 8b 7b 18
 RIP: tunnel_key_release+0xd/0x40 [act_tunnel_key] RSP: ffffba46803b7768
 CR2: 0000000000000010

Fix this in tunnel_key_release(), ensuring 'param' is not NULL before
dereferencing it.

Fixes: d0f6dd8a914f ("net/sched: Introduce act_tunnel_key")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
---
 net/sched/act_tunnel_key.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/net/sched/act_tunnel_key.c b/net/sched/act_tunnel_key.c
index 0e23aac09ad6..5dd819840feb 100644
--- a/net/sched/act_tunnel_key.c
+++ b/net/sched/act_tunnel_key.c
@@ -207,11 +207,12 @@ static void tunnel_key_release(struct tc_action *a)
 	struct tcf_tunnel_key_params *params;
 
 	params = rcu_dereference_protected(t->params, 1);
+	if (params) {
+		if (params->tcft_action == TCA_TUNNEL_KEY_ACT_SET)
+			dst_release(&params->tcft_enc_metadata->dst);
 
-	if (params->tcft_action == TCA_TUNNEL_KEY_ACT_SET)
-		dst_release(&params->tcft_enc_metadata->dst);
-
-	kfree_rcu(params, rcu);
+		kfree_rcu(params, rcu);
+	}
 }
 
 static int tunnel_key_dump_addresses(struct sk_buff *skb,
-- 
2.14.3

^ permalink raw reply related

* [PATCH net 2/5] net/sched: fix NULL dereference in the error path of tcf_csum_init()
From: Davide Caratti @ 2018-03-15 23:00 UTC (permalink / raw)
  To: Cong Wang, Jiri Pirko, David S. Miller; +Cc: Roman Mashak, Manish Kurup, netdev
In-Reply-To: <cover.1521154629.git.dcaratti@redhat.com>

when the following command

 # tc action add action csum udp continue index 100

is run for the first time, and tcf_csum_init() fails allocating struct
tcf_csum, tcf_csum_cleanup() calls kfree_rcu(NULL,...). This causes the
following error:

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
 IP: __call_rcu+0x23/0x2b0
 PGD 80000000740b4067 P4D 80000000740b4067 PUD 32e7f067 PMD 0
 Oops: 0002 [#1] SMP PTI
 Modules linked in: act_csum(E) act_vlan ip6table_filter ip6_tables iptable_filter binfmt_misc ext4 mbcache jbd2 crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec_generic pcbc snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer aesni_intel crypto_simd glue_helper cryptd snd joydev pcspkr virtio_balloon i2c_piix4 soundcore nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c ata_generic pata_acpi qxl drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm virtio_blk drm virtio_net virtio_console ata_piix crc32c_intel libata virtio_pci serio_raw i2c_core virtio_ring virtio floppy dm_mirror dm_region_hash dm_log dm_mod [last unloaded: act_vlan]
 CPU: 2 PID: 5763 Comm: tc Tainted: G            E    4.16.0-rc4.act_vlan.orig+ #403
 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
 RIP: 0010:__call_rcu+0x23/0x2b0
 RSP: 0018:ffffb275803e77c0 EFLAGS: 00010246
 RAX: ffffffffc057b080 RBX: ffff9674bc6f5240 RCX: 00000000ffffffff
 RDX: ffffffff928a5f00 RSI: 0000000000000008 RDI: 0000000000000008
 RBP: 0000000000000008 R08: 0000000000000001 R09: 0000000000000044
 R10: 0000000000000220 R11: ffff9674b9ab4821 R12: 0000000000000000
 R13: ffffffff928a5f00 R14: 0000000000000000 R15: 0000000000000001
 FS:  00007fa6368d8740(0000) GS:ffff9674bfd00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000010 CR3: 0000000073dec001 CR4: 00000000001606e0
 Call Trace:
  __tcf_idr_release+0x79/0xf0
  tcf_csum_init+0xfb/0x180 [act_csum]
  tcf_action_init_1+0x2cc/0x430
  tcf_action_init+0xd3/0x1b0
  tc_ctl_action+0x18b/0x240
  rtnetlink_rcv_msg+0x29c/0x310
  ? _cond_resched+0x15/0x30
  ? __kmalloc_node_track_caller+0x1b9/0x270
  ? rtnl_calcit.isra.28+0x100/0x100
  netlink_rcv_skb+0xd2/0x110
  netlink_unicast+0x17c/0x230
  netlink_sendmsg+0x2cd/0x3c0
  sock_sendmsg+0x30/0x40
  ___sys_sendmsg+0x27a/0x290
  ? filemap_map_pages+0x34a/0x3a0
  ? __handle_mm_fault+0xbfd/0xe20
  __sys_sendmsg+0x51/0x90
  do_syscall_64+0x6e/0x1a0
  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
 RIP: 0033:0x7fa635ce9ba0
 RSP: 002b:00007ffc185b0fc8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
 RAX: ffffffffffffffda RBX: 00007ffc185b10f0 RCX: 00007fa635ce9ba0
 RDX: 0000000000000000 RSI: 00007ffc185b1040 RDI: 0000000000000003
 RBP: 000000005aaa85e0 R08: 0000000000000002 R09: 0000000000000000
 R10: 00007ffc185b0a20 R11: 0000000000000246 R12: 0000000000000000
 R13: 00007ffc185b1104 R14: 0000000000000001 R15: 0000000000669f60
 Code: 5d e9 42 da ff ff 66 90 0f 1f 44 00 00 41 57 41 56 41 55 49 89 d5 41 54 55 48 89 fd 53 48 83 ec 08 40 f6 c7 07 0f 85 19 02 00 00 <48> 89 75 08 48 c7 45 00 00 00 00 00 9c 58 0f 1f 44 00 00 49 89
 RIP: __call_rcu+0x23/0x2b0 RSP: ffffb275803e77c0
 CR2: 0000000000000010

fix this in tcf_csum_cleanup(), ensuring that kfree_rcu(param, ...) is
called only when param is not NULL.

Fixes: 9c5f69bbd75a ("net/sched: act_csum: don't use spinlock in the fast path")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
---
 net/sched/act_csum.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/sched/act_csum.c b/net/sched/act_csum.c
index 24b2e8e681cf..2a5c8fd860cf 100644
--- a/net/sched/act_csum.c
+++ b/net/sched/act_csum.c
@@ -626,7 +626,8 @@ static void tcf_csum_cleanup(struct tc_action *a)
 	struct tcf_csum_params *params;
 
 	params = rcu_dereference_protected(p->params, 1);
-	kfree_rcu(params, rcu);
+	if (params)
+		kfree_rcu(params, rcu);
 }
 
 static int tcf_csum_walker(struct net *net, struct sk_buff *skb,
-- 
2.14.3

^ permalink raw reply related

* [PATCH net 1/5] net/sched: fix NULL dereference in the error path of tcf_vlan_init()
From: Davide Caratti @ 2018-03-15 23:00 UTC (permalink / raw)
  To: Cong Wang, Jiri Pirko, David S. Miller; +Cc: Roman Mashak, Manish Kurup, netdev
In-Reply-To: <cover.1521154629.git.dcaratti@redhat.com>

when the following command

 # tc actions replace action vlan pop index 100

is run for the first time, and tcf_vlan_init() fails allocating struct
tcf_vlan_params, tcf_vlan_cleanup() calls kfree_rcu(NULL, ...). This causes
the following error:

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
 IP: __call_rcu+0x23/0x2b0
 PGD 80000000760a2067 P4D 80000000760a2067 PUD 742c1067 PMD 0
 Oops: 0002 [#1] SMP PTI
 Modules linked in: act_vlan(E) ip6table_filter ip6_tables iptable_filter binfmt_misc ext4 snd_hda_codec_generic snd_hda_intel mbcache snd_hda_codec jbd2 snd_hda_core crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc snd_hwdep snd_seq snd_seq_device snd_pcm aesni_intel crypto_simd snd_timer glue_helper snd cryptd joydev soundcore virtio_balloon pcspkr i2c_piix4 nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c ata_generic pata_acpi qxl drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm virtio_console virtio_blk virtio_net ata_piix crc32c_intel libata virtio_pci i2c_core virtio_ring serio_raw virtio floppy dm_mirror dm_region_hash dm_log dm_mod [last unloaded: act_vlan]
 CPU: 3 PID: 3119 Comm: tc Tainted: G            E    4.16.0-rc4.act_vlan.orig+ #403
 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
 RIP: 0010:__call_rcu+0x23/0x2b0
 RSP: 0018:ffffaac3005fb798 EFLAGS: 00010246
 RAX: ffffffffc0704080 RBX: ffff97f2b4bbe900 RCX: 00000000ffffffff
 RDX: ffffffffabca5f00 RSI: 0000000000000010 RDI: 0000000000000010
 RBP: 0000000000000010 R08: 0000000000000001 R09: 0000000000000044
 R10: 00000000fd003000 R11: ffff97f2faab5b91 R12: 0000000000000000
 R13: ffffffffabca5f00 R14: ffff97f2fb80202c R15: 00000000fffffff4
 FS:  00007f68f75b4740(0000) GS:ffff97f2ffd80000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000018 CR3: 0000000072b52001 CR4: 00000000001606e0
 Call Trace:
  __tcf_idr_release+0x79/0xf0
  tcf_vlan_init+0x168/0x270 [act_vlan]
  tcf_action_init_1+0x2cc/0x430
  tcf_action_init+0xd3/0x1b0
  tc_ctl_action+0x18b/0x240
  rtnetlink_rcv_msg+0x29c/0x310
  ? _cond_resched+0x15/0x30
  ? __kmalloc_node_track_caller+0x1b9/0x270
  ? rtnl_calcit.isra.28+0x100/0x100
  netlink_rcv_skb+0xd2/0x110
  netlink_unicast+0x17c/0x230
  netlink_sendmsg+0x2cd/0x3c0
  sock_sendmsg+0x30/0x40
  ___sys_sendmsg+0x27a/0x290
  ? filemap_map_pages+0x34a/0x3a0
  ? __handle_mm_fault+0xbfd/0xe20
  __sys_sendmsg+0x51/0x90
  do_syscall_64+0x6e/0x1a0
  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
 RIP: 0033:0x7f68f69c5ba0
 RSP: 002b:00007fffd79c1118 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
 RAX: ffffffffffffffda RBX: 00007fffd79c1240 RCX: 00007f68f69c5ba0
 RDX: 0000000000000000 RSI: 00007fffd79c1190 RDI: 0000000000000003
 RBP: 000000005aaa708e R08: 0000000000000002 R09: 0000000000000000
 R10: 00007fffd79c0ba0 R11: 0000000000000246 R12: 0000000000000000
 R13: 00007fffd79c1254 R14: 0000000000000001 R15: 0000000000669f60
 Code: 5d e9 42 da ff ff 66 90 0f 1f 44 00 00 41 57 41 56 41 55 49 89 d5 41 54 55 48 89 fd 53 48 83 ec 08 40 f6 c7 07 0f 85 19 02 00 00 <48> 89 75 08 48 c7 45 00 00 00 00 00 9c 58 0f 1f 44 00 00 49 89
 RIP: __call_rcu+0x23/0x2b0 RSP: ffffaac3005fb798
 CR2: 0000000000000018

fix this in tcf_vlan_cleanup(), ensuring that kfree_rcu(p, ...) is called
only when p is not NULL.

Fixes: 4c5b9d9642c8 ("act_vlan: VLAN action rewrite to use RCU lock/unlock and update")
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Manish Kurup <manish.kurup@verizon.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
---
 net/sched/act_vlan.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/sched/act_vlan.c b/net/sched/act_vlan.c
index e1a1b3f3983a..c2914e9a4a6f 100644
--- a/net/sched/act_vlan.c
+++ b/net/sched/act_vlan.c
@@ -225,7 +225,8 @@ static void tcf_vlan_cleanup(struct tc_action *a)
 	struct tcf_vlan_params *p;
 
 	p = rcu_dereference_protected(v->vlan_p, 1);
-	kfree_rcu(p, rcu);
+	if (p)
+		kfree_rcu(p, rcu);
 }
 
 static int tcf_vlan_dump(struct sk_buff *skb, struct tc_action *a,
-- 
2.14.3

^ permalink raw reply related

* [PATCH net 0/5] net/sched: fix NULL dereference in the error path of .init()
From: Davide Caratti @ 2018-03-15 23:00 UTC (permalink / raw)
  To: Cong Wang, Jiri Pirko, David S. Miller; +Cc: Roman Mashak, Manish Kurup, netdev

with several TC actions it's possible to see NULL pointer dereference,
when the .init() function calls tcf_idr_alloc(), fails at some point and
then calls tcf_idr_release(): this series fixes all them introducing
non-NULL tests in the .cleanup() function.

Davide Caratti (5):
  net/sched: fix NULL dereference in the error path of tcf_vlan_init()
  net/sched: fix NULL dereference in the error path of tcf_csum_init()
  net/sched: fix NULL dereference in the error path of tunnel_key_init()
  net/sched: fix NULL dereference in the error path of tcf_sample_init()
  net/sched: fix NULL dereference on the error path of tcf_skbmod_init()

 net/sched/act_csum.c       | 3 ++-
 net/sched/act_sample.c     | 3 ++-
 net/sched/act_skbmod.c     | 3 ++-
 net/sched/act_tunnel_key.c | 9 +++++----
 net/sched/act_vlan.c       | 3 ++-
 5 files changed, 13 insertions(+), 8 deletions(-)

-- 
2.14.3

^ permalink raw reply

* Re: [PATCH v4 1/2] kernel.h: Introduce const_max() for VLA removal
From: Miguel Ojeda @ 2018-03-15 22:58 UTC (permalink / raw)
  To: Kees Cook
  Cc: Linus Torvalds, Andrew Morton, Josh Poimboeuf, Rasmus Villemoes,
	Randy Dunlap, Ingo Molnar, David Laight, Ian Abbott, linux-input,
	linux-btrfs, Network Development, Linux Kernel Mailing List,
	Kernel Hardening
In-Reply-To: <CAGXu5jLHam5kz18k9uvhsz2WYodkF2v1tsEOV4Sx0O7jir4B3A@mail.gmail.com>

On Thu, Mar 15, 2018 at 11:46 PM, Kees Cook <keescook@chromium.org> wrote:
> On Thu, Mar 15, 2018 at 3:23 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> On Thu, Mar 15, 2018 at 3:16 PM, Kees Cook <keescook@chromium.org> wrote:
>>>
>>> size_t __error_not_const_arg(void) \
>>> __compiletime_error("const_max() used with non-compile-time constant arg");
>>> #define const_max(x, y)                                         \
>>>         __builtin_choose_expr(__builtin_constant_p(x) &&        \
>>>                               __builtin_constant_p(y),          \
>>>                               (typeof(x))(x) > (typeof(y))(y) ? \
>>>                                         (x) : (y),              \
>>>                               __error_not_const_arg())
>>>
>>> Is typeof() forcing enums to int? Regardless, I'll put this through
>>> larger testing. How does that look?
>>
>> Ok, that alleviates my worry about one class of insane behavior, but
>> it does raise a few other questions:
>>
>>  - what drugs is gcc on where (typeof(x)(x)) makes a difference? Funky.
>
> Yeah, that's why I didn't even try that originally. But in looking
> back at max() again, it seemed to be the only thing missing that would
> handle the enum evaluation, which turned out to be true.
>
>>  - this does have the usual "what happen if you do
>>
>>      const_max(-1,sizeof(x))
>>
>> where the comparison will now be done in 'size_t', and -1 ends up
>> being a very very big unsigned integer.
>>
>> Is there no way to get that type checking inserted? Maybe now is a
>> good point for that __builtin_types_compatible(), and add it to the
>> constness checking (and change the name of that error case function)?
>
> So, AIUI, I can either get strict type checking, in which case, this
> is rejected (which I assume there is still a desire to have):
>
> int foo[const_max(6, sizeof(whatever))];

Is it that bad to just call it with (size_t)6?

>
> due to __builtin_types_compatible_p() rejecting it, or I can construct
> a "positive arguments only" test, in which the above is accepted, but
> this is rejected:
>
> int foo[const_max(-1, sizeof(whatever))];

Do we need this case?

>
> By using this eye-bleed:
>
> size_t __error_not_const_arg(void) \
> __compiletime_error("const_max() used with non-compile-time constant arg");
> size_t __error_not_positive_arg(void) \
> __compiletime_error("const_max() used with negative arg");
> #define const_max(x, y)                                                 \
>         __builtin_choose_expr(__builtin_constant_p(x) &&                \
>                               __builtin_constant_p(y),                  \
>                 __builtin_choose_expr((x) >= 0 && (y) >= 0,             \
>                                       (typeof(x))(x) > (typeof(y))(y) ? \
>                                         (x) : (y),                      \
>                                       __error_not_positive_arg()),      \
>                 __error_not_const_arg())
>

I was writing it like this:

#define const_max(a, b) \
    ({ \
        if ((a) < 0) \
            __const_max_called_with_negative_value(); \
        if ((b) < 0) \
            __const_max_called_with_negative_value(); \
        if (!__builtin_types_compatible_p(typeof(a), typeof(b))) \
            __const_max_called_with_incompatible_types(); \
        __builtin_choose_expr((a) > (b), (a), (b)); \
})

Cheers,
Miguel


> -Kees
>
> --
> Kees Cook
> Pixel Security

^ permalink raw reply

* Re: [bpf-next PATCH v2 05/18] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data
From: Daniel Borkmann @ 2018-03-15 22:55 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: John Fastabend, davem, ast, davejwatson, netdev
In-Reply-To: <20180315222010.kq5hcjsl33d3smho@ast-mbp>

On 03/15/2018 11:20 PM, Alexei Starovoitov wrote:
> On Thu, Mar 15, 2018 at 11:17:12PM +0100, Daniel Borkmann wrote:
>> On 03/15/2018 10:59 PM, Alexei Starovoitov wrote:
>>> On Mon, Mar 12, 2018 at 12:23:29PM -0700, John Fastabend wrote:
>>>>  
>>>> +/* User return codes for SK_MSG prog type. */
>>>> +enum sk_msg_action {
>>>> +	SK_MSG_DROP = 0,
>>>> +	SK_MSG_PASS,
>>>> +};
>>>
>>> do we really need new enum here?
>>> It's the same as 'enum sk_action' and SK_DROP == SK_MSG_DROP
>>> and there will be only drop/pass in both enums.
>>> Also I don't see where these two new SK_MSG_* are used...
>>>
>>>> +
>>>> +/* user accessible metadata for SK_MSG packet hook, new fields must
>>>> + * be added to the end of this structure
>>>> + */
>>>> +struct sk_msg_md {
>>>> +	__u32 data;
>>>> +	__u32 data_end;
>>>> +};
>>>
>>> I think it's time for me to ask for forgiveness :)
>>
>> :-)
>>
>>> I used __u32 for data and data_end only because all other fields
>>> in __sk_buff were __u32 at the time and I couldn't easily figure out
>>> how to teach verifier to recognize 8-byte rewrites.
>>> Unfortunately my mistake stuck and was copied over into xdp.
>>> Since this is new struct let's do it right and add
>>> 'void *data, *data_end' here,
>>> since bpf prog will use them as 'void *' pointers.
>>> There are no compat issues here, since bpf is always 64-bit.
>>
>> But at least offset-wise when you do the ctx rewrite this would then
>> be a bit more tricky when you have 64 bit kernel with 32 bit user
>> space since void * members are in each cases at different offset. So
>> unless I'm missing something, this still should either be __u32 or
>> __u64 instead of void *, no?
> 
> there is no 32-bit user space. these structs are seen by bpf progs only
> and bpf is 64-bit only too.
> unless I'm missing your point.

Ok, so lets say you have 32 bit LLVM binary and compile the prog where
you access md->data_end. Given the void * in the struct will that access
end up being BPF_W at ctx offset 4 or BPF_DW at ctx offset 8 from clang
perspective (iow, is the back end treating this special and always use
fixed BPF_DW in such case)? If not and it would be the first case with
offset 4, then we could have the case that underlying 64 bit kernel is
expecting ctx offset 8 for doing the md ctx conversion.

^ permalink raw reply

* Re: [PATCH v4 1/2] kernel.h: Introduce const_max() for VLA removal
From: Kees Cook @ 2018-03-15 22:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Josh Poimboeuf, Rasmus Villemoes, Randy Dunlap,
	Miguel Ojeda, Ingo Molnar, David Laight, Ian Abbott, linux-input,
	linux-btrfs, Network Development, Linux Kernel Mailing List,
	Kernel Hardening
In-Reply-To: <CA+55aFwDJ906oQ-98L2DNrjfKtb6cd5ykwMxpG942qxCFmAoEQ@mail.gmail.com>

On Thu, Mar 15, 2018 at 3:23 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Mar 15, 2018 at 3:16 PM, Kees Cook <keescook@chromium.org> wrote:
>>
>> size_t __error_not_const_arg(void) \
>> __compiletime_error("const_max() used with non-compile-time constant arg");
>> #define const_max(x, y)                                         \
>>         __builtin_choose_expr(__builtin_constant_p(x) &&        \
>>                               __builtin_constant_p(y),          \
>>                               (typeof(x))(x) > (typeof(y))(y) ? \
>>                                         (x) : (y),              \
>>                               __error_not_const_arg())
>>
>> Is typeof() forcing enums to int? Regardless, I'll put this through
>> larger testing. How does that look?
>
> Ok, that alleviates my worry about one class of insane behavior, but
> it does raise a few other questions:
>
>  - what drugs is gcc on where (typeof(x)(x)) makes a difference? Funky.

Yeah, that's why I didn't even try that originally. But in looking
back at max() again, it seemed to be the only thing missing that would
handle the enum evaluation, which turned out to be true.

>  - this does have the usual "what happen if you do
>
>      const_max(-1,sizeof(x))
>
> where the comparison will now be done in 'size_t', and -1 ends up
> being a very very big unsigned integer.
>
> Is there no way to get that type checking inserted? Maybe now is a
> good point for that __builtin_types_compatible(), and add it to the
> constness checking (and change the name of that error case function)?

So, AIUI, I can either get strict type checking, in which case, this
is rejected (which I assume there is still a desire to have):

int foo[const_max(6, sizeof(whatever))];

due to __builtin_types_compatible_p() rejecting it, or I can construct
a "positive arguments only" test, in which the above is accepted, but
this is rejected:

int foo[const_max(-1, sizeof(whatever))];

By using this eye-bleed:

size_t __error_not_const_arg(void) \
__compiletime_error("const_max() used with non-compile-time constant arg");
size_t __error_not_positive_arg(void) \
__compiletime_error("const_max() used with negative arg");
#define const_max(x, y)                                                 \
        __builtin_choose_expr(__builtin_constant_p(x) &&                \
                              __builtin_constant_p(y),                  \
                __builtin_choose_expr((x) >= 0 && (y) >= 0,             \
                                      (typeof(x))(x) > (typeof(y))(y) ? \
                                        (x) : (y),                      \
                                      __error_not_positive_arg()),      \
                __error_not_const_arg())

-Kees

-- 
Kees Cook
Pixel Security

^ permalink raw reply

* Re: [bug, bisected] pfifo_fast causes packet reordering
From: John Fastabend @ 2018-03-15 22:30 UTC (permalink / raw)
  To: Jakob Unterwurzacher, Dave Taht
  Cc: netdev, linux-kernel, David S. Miller, linux-can@vger.kernel.org,
	Martin Elshuber
In-Reply-To: <3a959e50-8656-5d9c-97b9-227d733948f8@theobroma-systems.com>

On 03/15/2018 11:08 AM, Jakob Unterwurzacher wrote:
> On 14.03.18 05:03, John Fastabend wrote:
>> On 03/13/2018 11:35 AM, Dave Taht wrote:
>>> On Tue, Mar 13, 2018 at 11:24 AM, Jakob Unterwurzacher
>>> <jakob.unterwurzacher@theobroma-systems.com> wrote:
>>>> During stress-testing our "ucan" USB/CAN adapter SocketCAN driver on Linux
>>>> v4.16-rc4-383-ged58d66f60b3 we observed that a small fraction of packets are
>>>> delivered out-of-order.
>>>>
>>
>> Is the stress-testing tool available somewhere? What type of packets
>> are being sent?
> 
> 
> I have reproduced it using two USB network cards connected to each other. The test tool sends UDP packets containing a counter and listens on the other interface, it is available at
> https://github.com/jakob-tsd/pfifo_stress/blob/master/pfifo_stress.py
> 

Great thanks, can you also run this with taskset to bind to
a single CPU,

 # taskset 0x1 ./pifof_stress.py

And let me know if you still see the OOO.

> Here is what I get:
> 
> root@rk3399-q7:~# ./pfifo_stress.py
> [...]
> expected ctr 0xcdc, received 0xcdd
> expected ctr 0xcde, received 0xcdc
> expected ctr 0xcdd, received 0xcde
> expected ctr 0xe3c, received 0xe3d
> expected ctr 0xe3e, received 0xe3c
> expected ctr 0xe3d, received 0xe3e
> expected ctr 0x1097, received 0x1098
> expected ctr 0x1099, received 0x1097
> expected ctr 0x1098, received 0x1099
> expected ctr 0x17c0, received 0x17c1
> expected ctr 0x17c2, received 0x17c0
> [...]
> 
> Best regards,
> Jakob

^ permalink raw reply

* Re: [PATCH 3/7] RDMA/qedr: eliminate duplicate barriers on weakly-ordered archs
From: Jason Gunthorpe @ 2018-03-15 22:23 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: netdev, timur, sulrich, linux-arm-msm, linux-arm-kernel,
	Michal Kalderon, Ariel Elior, Doug Ledford, linux-rdma,
	linux-kernel
In-Reply-To: <1520997629-17361-3-git-send-email-okaya@codeaurora.org>

On Tue, Mar 13, 2018 at 11:20:24PM -0400, Sinan Kaya wrote:
> Code includes wmb() followed by writel() in multiple places. writel()
> already has a barrier on some architectures like arm64.
> 
> This ends up CPU observing two barriers back to back before executing the
> register write.
> 
> Since code already has an explicit barrier call, changing writel() to
> writel_relaxed().
> 
> Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
> Acked-by: Jason Gunthorpe <jgg@mellanox.com>
>  drivers/infiniband/hw/qedr/verbs.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Applied to RDMA for-next

Thanks,
Jason

^ permalink raw reply

* Re: [PATCH v4 1/2] kernel.h: Introduce const_max() for VLA removal
From: Linus Torvalds @ 2018-03-15 22:23 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andrew Morton, Josh Poimboeuf, Rasmus Villemoes, Randy Dunlap,
	Miguel Ojeda, Ingo Molnar, David Laight, Ian Abbott, linux-input,
	linux-btrfs, Network Development, Linux Kernel Mailing List,
	Kernel Hardening
In-Reply-To: <CAGXu5jKtv=a8qTy1-AfbzNRB=Azb8V7Pt1M4QMVYNKg6+Ci7=Q@mail.gmail.com>

On Thu, Mar 15, 2018 at 3:16 PM, Kees Cook <keescook@chromium.org> wrote:
>
> size_t __error_not_const_arg(void) \
> __compiletime_error("const_max() used with non-compile-time constant arg");
> #define const_max(x, y)                                         \
>         __builtin_choose_expr(__builtin_constant_p(x) &&        \
>                               __builtin_constant_p(y),          \
>                               (typeof(x))(x) > (typeof(y))(y) ? \
>                                         (x) : (y),              \
>                               __error_not_const_arg())
>
> Is typeof() forcing enums to int? Regardless, I'll put this through
> larger testing. How does that look?

Ok, that alleviates my worry about one class of insane behavior, but
it does raise a few other questions:

 - what drugs is gcc on where (typeof(x)(x)) makes a difference? Funky.

 - this does have the usual "what happen if you do

     const_max(-1,sizeof(x))

where the comparison will now be done in 'size_t', and -1 ends up
being a very very big unsigned integer.

Is there no way to get that type checking inserted? Maybe now is a
good point for that __builtin_types_compatible(), and add it to the
constness checking (and change the name of that error case function)?

          Linus

^ permalink raw reply

* Re: [PATCH net] net/sched: fix NULL dereference in the error path of tcf_vlan_init()
From: Davide Caratti @ 2018-03-15 22:21 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: Cong Wang, David S. Miller, netdev, Manish Kurup, Roman Mashak
In-Reply-To: <1521124149.2797.11.camel@redhat.com>

On Thu, 2018-03-15 at 15:29 +0100, Davide Caratti wrote:
> On Thu, 2018-03-15 at 15:21 +0100, Jiri Pirko wrote:
> ...
> 
> > Acked-by: Jiri Pirko <jiri@mellanox.com>
> 
> thank you for reviewing!
> 
> apparently, also act_tunnel_key seem and act_csum have a similar problem.
> I will check and eventually do a followup series this afternoon.
> 
> thank you,
> regards

hello David,

please drop this patch: after some tests, the following TC actions are
affected by the same problem:

act_vlan
act_csum
act_tunnel_key
act_skbmod
act_sample

so, I'm posting right now a series that fixes all of them.

In act_ife and act_bpf, the problem is potentially there, but we don't see
it crashing yet because we don't call tcf_idr_release() on the error
path. 
This is causing the leak of 'index', and will be fixed in another series
tomorrow.

thank you in advance,
regards
-- 
davide

^ permalink raw reply

* Re: [bpf-next PATCH v2 05/18] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data
From: Alexei Starovoitov @ 2018-03-15 22:20 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: John Fastabend, davem, ast, davejwatson, netdev
In-Reply-To: <31bf7017-7456-1637-3a99-9630fc4d7009@iogearbox.net>

On Thu, Mar 15, 2018 at 11:17:12PM +0100, Daniel Borkmann wrote:
> On 03/15/2018 10:59 PM, Alexei Starovoitov wrote:
> > On Mon, Mar 12, 2018 at 12:23:29PM -0700, John Fastabend wrote:
> >>  
> >> +/* User return codes for SK_MSG prog type. */
> >> +enum sk_msg_action {
> >> +	SK_MSG_DROP = 0,
> >> +	SK_MSG_PASS,
> >> +};
> > 
> > do we really need new enum here?
> > It's the same as 'enum sk_action' and SK_DROP == SK_MSG_DROP
> > and there will be only drop/pass in both enums.
> > Also I don't see where these two new SK_MSG_* are used...
> > 
> >> +
> >> +/* user accessible metadata for SK_MSG packet hook, new fields must
> >> + * be added to the end of this structure
> >> + */
> >> +struct sk_msg_md {
> >> +	__u32 data;
> >> +	__u32 data_end;
> >> +};
> > 
> > I think it's time for me to ask for forgiveness :)
> 
> :-)
> 
> > I used __u32 for data and data_end only because all other fields
> > in __sk_buff were __u32 at the time and I couldn't easily figure out
> > how to teach verifier to recognize 8-byte rewrites.
> > Unfortunately my mistake stuck and was copied over into xdp.
> > Since this is new struct let's do it right and add
> > 'void *data, *data_end' here,
> > since bpf prog will use them as 'void *' pointers.
> > There are no compat issues here, since bpf is always 64-bit.
> 
> But at least offset-wise when you do the ctx rewrite this would then
> be a bit more tricky when you have 64 bit kernel with 32 bit user
> space since void * members are in each cases at different offset. So
> unless I'm missing something, this still should either be __u32 or
> __u64 instead of void *, no?

there is no 32-bit user space. these structs are seen by bpf progs only
and bpf is 64-bit only too.
unless I'm missing your point.

^ permalink raw reply

* Re: [bpf-next PATCH v2 05/18] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data
From: Daniel Borkmann @ 2018-03-15 22:17 UTC (permalink / raw)
  To: Alexei Starovoitov, John Fastabend; +Cc: davem, ast, davejwatson, netdev
In-Reply-To: <20180315215954.wufvwdhcjpntdxbb@ast-mbp>

On 03/15/2018 10:59 PM, Alexei Starovoitov wrote:
> On Mon, Mar 12, 2018 at 12:23:29PM -0700, John Fastabend wrote:
>>  
>> +/* User return codes for SK_MSG prog type. */
>> +enum sk_msg_action {
>> +	SK_MSG_DROP = 0,
>> +	SK_MSG_PASS,
>> +};
> 
> do we really need new enum here?
> It's the same as 'enum sk_action' and SK_DROP == SK_MSG_DROP
> and there will be only drop/pass in both enums.
> Also I don't see where these two new SK_MSG_* are used...
> 
>> +
>> +/* user accessible metadata for SK_MSG packet hook, new fields must
>> + * be added to the end of this structure
>> + */
>> +struct sk_msg_md {
>> +	__u32 data;
>> +	__u32 data_end;
>> +};
> 
> I think it's time for me to ask for forgiveness :)

:-)

> I used __u32 for data and data_end only because all other fields
> in __sk_buff were __u32 at the time and I couldn't easily figure out
> how to teach verifier to recognize 8-byte rewrites.
> Unfortunately my mistake stuck and was copied over into xdp.
> Since this is new struct let's do it right and add
> 'void *data, *data_end' here,
> since bpf prog will use them as 'void *' pointers.
> There are no compat issues here, since bpf is always 64-bit.

But at least offset-wise when you do the ctx rewrite this would then
be a bit more tricky when you have 64 bit kernel with 32 bit user
space since void * members are in each cases at different offset. So
unless I'm missing something, this still should either be __u32 or
__u64 instead of void *, no?

>> +static int bpf_map_msg_verdict(int _rc, struct sk_msg_buff *md)
>> +{
>> +	return ((_rc == SK_PASS) ?
>> +	       (md->map ? __SK_REDIRECT : __SK_PASS) :
>> +	       __SK_DROP);
> 
> you're using old SK_PASS here too ;)
> that's to my point of not adding SK_MSG_PASS...
> 
> Overall the patch set looks absolutely great.
> Thank you for working on it.

+1

^ permalink raw reply

* Re: [PATCH v4 1/2] kernel.h: Introduce const_max() for VLA removal
From: Kees Cook @ 2018-03-15 22:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Josh Poimboeuf, Rasmus Villemoes, Randy Dunlap,
	Miguel Ojeda, Ingo Molnar, David Laight, Ian Abbott, linux-input,
	linux-btrfs, Network Development, Linux Kernel Mailing List,
	Kernel Hardening
In-Reply-To: <CA+55aFzu+VU_FVZdbLCZttUspZdJCvBhkuo4V69H=XnqmLrpLA@mail.gmail.com>

On Thu, Mar 15, 2018 at 2:42 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Mar 15, 2018 at 12:47 PM, Kees Cook <keescook@chromium.org> wrote:
>>
>> To gain the ability to compare differing types, the arguments are
>> explicitly cast to size_t.
>
> Ugh, I really hate this.
>
> It silently does insane things if you do
>
>    const_max(-1,6)
>
> and there is nothing in the name that implies that you can't use
> negative constants.

Yeah, I didn't like that effect either. I was seeing this:

./include/linux/kernel.h:836:14: warning: comparison between ‘enum
<anonymous>’ and ‘enum <anonymous>’ [-Wenum-compare]
          (x) > (y) ? \
              ^
./include/linux/kernel.h:838:7: note: in definition of macro ‘const_max’
      (y),  \
       ^
net/ipv6/proc.c:34:11: note: in expansion of macro ‘const_max’
           const_max(IPSTATS_MIB_MAX, ICMP_MIB_MAX))
           ^~~~~~~~~

But it turns out that just doing a typeof() fixes this, and there's no
need for the hard cast to size_t:

size_t __error_not_const_arg(void) \
__compiletime_error("const_max() used with non-compile-time constant arg");
#define const_max(x, y)                                         \
        __builtin_choose_expr(__builtin_constant_p(x) &&        \
                              __builtin_constant_p(y),          \
                              (typeof(x))(x) > (typeof(y))(y) ? \
                                        (x) : (y),              \
                              __error_not_const_arg())

Is typeof() forcing enums to int? Regardless, I'll put this through
larger testing. How does that look?

-Kees

-- 
Kees Cook
Pixel Security

^ permalink raw reply

* Re: [bpf-next PATCH v2 05/18] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data
From: John Fastabend @ 2018-03-15 22:08 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: davem, ast, daniel, davejwatson, netdev
In-Reply-To: <20180315215954.wufvwdhcjpntdxbb@ast-mbp>

On 03/15/2018 02:59 PM, Alexei Starovoitov wrote:
> On Mon, Mar 12, 2018 at 12:23:29PM -0700, John Fastabend wrote:
>>  
>> +/* User return codes for SK_MSG prog type. */
>> +enum sk_msg_action {
>> +	SK_MSG_DROP = 0,
>> +	SK_MSG_PASS,
>> +};
> 
> do we really need new enum here?

Nope and as you noticed the actual code uses the
SK_{DROP|PASS} enum. Will remove this.

> It's the same as 'enum sk_action' and SK_DROP == SK_MSG_DROP
> and there will be only drop/pass in both enums.
> Also I don't see where these two new SK_MSG_* are used...
> 
>> +
>> +/* user accessible metadata for SK_MSG packet hook, new fields must
>> + * be added to the end of this structure
>> + */
>> +struct sk_msg_md {
>> +	__u32 data;
>> +	__u32 data_end;
>> +};
> 
> I think it's time for me to ask for forgiveness :)
> I used __u32 for data and data_end only because all other fields
> in __sk_buff were __u32 at the time and I couldn't easily figure out
> how to teach verifier to recognize 8-byte rewrites.
> Unfortunately my mistake stuck and was copied over into xdp.
> Since this is new struct let's do it right and add
> 'void *data, *data_end' here,
> since bpf prog will use them as 'void *' pointers.
> There are no compat issues here, since bpf is always 64-bit.
> 

aha nice catch. Yep lets use 'void*' here. I had forgot about
that discussion and copied them here as well.

>> +static int bpf_map_msg_verdict(int _rc, struct sk_msg_buff *md)
>> +{
>> +	return ((_rc == SK_PASS) ?
>> +	       (md->map ? __SK_REDIRECT : __SK_PASS) :
>> +	       __SK_DROP);
> 
> you're using old SK_PASS here too ;)
> that's to my point of not adding SK_MSG_PASS...
> 

+1

> Overall the patch set looks absolutely great.
> Thank you for working on it.
> 

I'll fixup a few of these small things now and should have
a v3 shortly.

^ permalink raw reply

* Re: [bpf-next PATCH v2 15/18] bpf: sockmap sample support for bpf_msg_cork_bytes()
From: John Fastabend @ 2018-03-15 22:04 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: davem, ast, daniel, davejwatson, netdev
In-Reply-To: <20180315201555.4osbqupt62fnvkvq@ast-mbp.dhcp.thefacebook.com>

On 03/15/2018 01:15 PM, Alexei Starovoitov wrote:
> On Mon, Mar 12, 2018 at 12:24:21PM -0700, John Fastabend wrote:
>> Add sample application support for the bpf_msg_cork_bytes helper. This
>> lets the user specify how many bytes each verdict should apply to.
>>
>> Similar to apply_bytes() tests these can be run as a stand-alone test
>> when used without other options or inline with other tests by using
>> the txmsg_cork option along with any of the basic tests txmsg,
>> txmsg_redir, txmsg_drop.
>>
>> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
>> ---
>>  include/uapi/linux/bpf_common.h           |    7 ++--
>>  samples/sockmap/sockmap_kern.c            |   53 +++++++++++++++++++++++++----
>>  samples/sockmap/sockmap_user.c            |   19 ++++++++++
>>  tools/include/uapi/linux/bpf.h            |    3 +-
>>  tools/testing/selftests/bpf/bpf_helpers.h |    2 +
>>  5 files changed, 71 insertions(+), 13 deletions(-)
>>
>> diff --git a/include/uapi/linux/bpf_common.h b/include/uapi/linux/bpf_common.h
>> index ee97668..18be907 100644
>> --- a/include/uapi/linux/bpf_common.h
>> +++ b/include/uapi/linux/bpf_common.h
>> @@ -15,10 +15,9 @@
>>  
>>  /* ld/ldx fields */
>>  #define BPF_SIZE(code)  ((code) & 0x18)
>> -#define		BPF_W		0x00 /* 32-bit */
>> -#define		BPF_H		0x08 /* 16-bit */
>> -#define		BPF_B		0x10 /*  8-bit */
>> -/* eBPF		BPF_DW		0x18    64-bit */
>> +#define		BPF_W		0x00
>> +#define		BPF_H		0x08
>> +#define		BPF_B		0x10
> 
> this hunk seems wrong here. Botched rebase?
> 

Yep this hunk has nothing to do with my work so will
remove this hunk.

^ permalink raw reply

* Re: [Intel-wired-lan] [next-queue 4/4] ixgbe: enable tso with ipsec offload
From: Alexander Duyck @ 2018-03-15 22:03 UTC (permalink / raw)
  To: Shannon Nelson; +Cc: intel-wired-lan, Steffen Klassert, Netdev
In-Reply-To: <1521149003-1433-5-git-send-email-shannon.nelson@oracle.com>

On Thu, Mar 15, 2018 at 2:23 PM, Shannon Nelson
<shannon.nelson@oracle.com> wrote:
> Fix things up to support TSO offload in conjunction
> with IPsec hw offload.  This raises throughput with
> IPsec offload on to nearly line rate.
>
> Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c |  7 +++++--
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  | 25 +++++++++++++++++++------
>  2 files changed, 24 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
> index 5ddea43..bfbcfc2 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
> @@ -896,6 +896,7 @@ void ixgbe_ipsec_rx(struct ixgbe_ring *rx_ring,
>  void ixgbe_init_ipsec_offload(struct ixgbe_adapter *adapter)
>  {
>         struct ixgbe_ipsec *ipsec;
> +       netdev_features_t features;
>         size_t size;
>
>         if (adapter->hw.mac.type == ixgbe_mac_82598EB)
> @@ -929,8 +930,10 @@ void ixgbe_init_ipsec_offload(struct ixgbe_adapter *adapter)
>         ixgbe_ipsec_clear_hw_tables(adapter);
>
>         adapter->netdev->xfrmdev_ops = &ixgbe_xfrmdev_ops;
> -       adapter->netdev->features |= NETIF_F_HW_ESP;
> -       adapter->netdev->hw_enc_features |= NETIF_F_HW_ESP;
> +
> +       features = NETIF_F_HW_ESP | NETIF_F_HW_ESP_TX_CSUM | NETIF_F_GSO_ESP;
> +       adapter->netdev->features |= features;
> +       adapter->netdev->hw_enc_features |= features;

Instead of adding the local variable you might just create a new
define that includes these 3 feature flags and then use that here. You
could use the way I did IXGBE_GSO_PARTIAL_FEATURES as an example.

>         return;
>
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> index a54f3d8..6022666 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> @@ -7721,9 +7721,11 @@ static void ixgbe_service_task(struct work_struct *work)
>
>  static int ixgbe_tso(struct ixgbe_ring *tx_ring,
>                      struct ixgbe_tx_buffer *first,
> -                    u8 *hdr_len)
> +                    u8 *hdr_len,
> +                    struct ixgbe_ipsec_tx_data *itd)
>  {
>         u32 vlan_macip_lens, type_tucmd, mss_l4len_idx;
> +       u32 fceof_saidx = 0;
>         struct sk_buff *skb = first->skb;

Reverse xmas tree this. It should probably be moved down to just past
the declaration of paylen and l4_offset.

>         union {
>                 struct iphdr *v4;
> @@ -7762,9 +7764,13 @@ static int ixgbe_tso(struct ixgbe_ring *tx_ring,
>                 unsigned char *trans_start = ip.hdr + (ip.v4->ihl * 4);
>
>                 /* IP header will have to cancel out any data that
> -                * is not a part of the outer IP header
> +                * is not a part of the outer IP header, except for
> +                * IPsec where we want the IP+ESP header.
>                  */
> -               ip.v4->check = csum_fold(csum_partial(trans_start,
> +               if (first->tx_flags & IXGBE_TX_FLAGS_IPSEC)
> +                       ip.v4->check = 0;
> +               else
> +                       ip.v4->check = csum_fold(csum_partial(trans_start,
>                                                       csum_start - trans_start,
>                                                       0));
>                 type_tucmd |= IXGBE_ADVTXD_TUCMD_IPV4;

I would say this should be flipped like so:
ip.v4->check =  (skb_shinfo(skb)->gso_type & SKB_GSO_PARTIAL) ?
                         csum_fold(csum_partial(trans_start,
csum_start - trans_start, 0) : 0;

> @@ -7797,12 +7803,15 @@ static int ixgbe_tso(struct ixgbe_ring *tx_ring,
>         mss_l4len_idx = (*hdr_len - l4_offset) << IXGBE_ADVTXD_L4LEN_SHIFT;
>         mss_l4len_idx |= skb_shinfo(skb)->gso_size << IXGBE_ADVTXD_MSS_SHIFT;
>
> +       fceof_saidx |= itd->sa_idx;
> +       type_tucmd |= itd->flags | itd->trailer_len;
> +
>         /* vlan_macip_lens: HEADLEN, MACLEN, VLAN tag */
>         vlan_macip_lens = l4.hdr - ip.hdr;
>         vlan_macip_lens |= (ip.hdr - skb->data) << IXGBE_ADVTXD_MACLEN_SHIFT;
>         vlan_macip_lens |= first->tx_flags & IXGBE_TX_FLAGS_VLAN_MASK;
>
> -       ixgbe_tx_ctxtdesc(tx_ring, vlan_macip_lens, 0, type_tucmd,
> +       ixgbe_tx_ctxtdesc(tx_ring, vlan_macip_lens, fceof_saidx, type_tucmd,
>                           mss_l4len_idx);
>
>         return 1;
> @@ -8493,7 +8502,8 @@ netdev_tx_t ixgbe_xmit_frame_ring(struct sk_buff *skb,
>         if (skb->sp && !ixgbe_ipsec_tx(tx_ring, first, &ipsec_tx))
>                 goto out_drop;
>  #endif
> -       tso = ixgbe_tso(tx_ring, first, &hdr_len);
> +
> +       tso = ixgbe_tso(tx_ring, first, &hdr_len, &ipsec_tx);
>         if (tso < 0)
>                 goto out_drop;
>         else if (!tso)

No need for the extra blank line. I would say just leave it as is and
add your extra argument.

> @@ -9902,8 +9912,11 @@ ixgbe_features_check(struct sk_buff *skb, struct net_device *dev,
>
>         /* We can only support IPV4 TSO in tunnels if we can mangle the
>          * inner IP ID field, so strip TSO if MANGLEID is not supported.
> +        * IPsec offoad sets skb->encapsulation but still can handle
> +        * the TSO, so it's the exception.
>          */
> -       if (skb->encapsulation && !(features & NETIF_F_TSO_MANGLEID))
> +       if (skb->encapsulation && !(features & NETIF_F_TSO_MANGLEID) &&
> +           !skb->sp)
>                 features &= ~NETIF_F_TSO;
>
>         return features;
> --
> 2.7.4
>
> _______________________________________________
> Intel-wired-lan mailing list
> Intel-wired-lan@osuosl.org
> https://lists.osuosl.org/mailman/listinfo/intel-wired-lan

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox