* [PATCH net-next] fix unsafe set_memory_rw from softirq
@ 2013-10-03 3:50 Alexei Starovoitov
2013-10-03 4:23 ` Eric Dumazet
0 siblings, 1 reply; 6+ messages in thread
From: Alexei Starovoitov @ 2013-10-03 3:50 UTC (permalink / raw)
To: David S. Miller
Cc: netdev, Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
Patrick McHardy, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
Daniel Borkmann, Paul E. McKenney, Xi Wang, x86, Eric Dumazet,
linux-kernel
on x86 system with net.core.bpf_jit_enable = 1
sudo tcpdump -i eth1 'tcp port 22'
causes the warning:
[ 56.766097] Possible unsafe locking scenario:
[ 56.766097]
[ 56.780146] CPU0
[ 56.786807] ----
[ 56.793188] lock(&(&vb->lock)->rlock);
[ 56.799593] <Interrupt>
[ 56.805889] lock(&(&vb->lock)->rlock);
[ 56.812266]
[ 56.812266] *** DEADLOCK ***
[ 56.812266]
[ 56.830670] 1 lock held by ksoftirqd/1/13:
[ 56.836838] #0: (rcu_read_lock){.+.+..}, at: [<ffffffff8118f44c>] vm_unmap_aliases+0x8c/0x380
[ 56.849757]
[ 56.849757] stack backtrace:
[ 56.862194] CPU: 1 PID: 13 Comm: ksoftirqd/1 Not tainted 3.12.0-rc3+ #45
[ 56.868721] Hardware name: System manufacturer System Product Name/P8Z77 WS, BIOS 3007 07/26/2012
[ 56.882004] ffffffff821944c0 ffff88080bbdb8c8 ffffffff8175a145 0000000000000007
[ 56.895630] ffff88080bbd5f40 ffff88080bbdb928 ffffffff81755b14 0000000000000001
[ 56.909313] ffff880800000001 ffff880800000000 ffffffff8101178f 0000000000000001
[ 56.923006] Call Trace:
[ 56.929532] [<ffffffff8175a145>] dump_stack+0x55/0x76
[ 56.936067] [<ffffffff81755b14>] print_usage_bug+0x1f7/0x208
[ 56.942445] [<ffffffff8101178f>] ? save_stack_trace+0x2f/0x50
[ 56.948932] [<ffffffff810cc0a0>] ? check_usage_backwards+0x150/0x150
[ 56.955470] [<ffffffff810ccb52>] mark_lock+0x282/0x2c0
[ 56.961945] [<ffffffff810ccfed>] __lock_acquire+0x45d/0x1d50
[ 56.968474] [<ffffffff810cce6e>] ? __lock_acquire+0x2de/0x1d50
[ 56.975140] [<ffffffff81393bf5>] ? cpumask_next_and+0x55/0x90
[ 56.981942] [<ffffffff810cef72>] lock_acquire+0x92/0x1d0
[ 56.988745] [<ffffffff8118f52a>] ? vm_unmap_aliases+0x16a/0x380
[ 56.995619] [<ffffffff817628f1>] _raw_spin_lock+0x41/0x50
[ 57.002493] [<ffffffff8118f52a>] ? vm_unmap_aliases+0x16a/0x380
[ 57.009447] [<ffffffff8118f52a>] vm_unmap_aliases+0x16a/0x380
[ 57.016477] [<ffffffff8118f44c>] ? vm_unmap_aliases+0x8c/0x380
[ 57.023607] [<ffffffff810436b0>] change_page_attr_set_clr+0xc0/0x460
[ 57.030818] [<ffffffff810cfb8d>] ? trace_hardirqs_on+0xd/0x10
[ 57.037896] [<ffffffff811a8330>] ? kmem_cache_free+0xb0/0x2b0
[ 57.044789] [<ffffffff811b59c3>] ? free_object_rcu+0x93/0xa0
[ 57.051720] [<ffffffff81043d9f>] set_memory_rw+0x2f/0x40
[ 57.058727] [<ffffffff8104e17c>] bpf_jit_free+0x2c/0x40
[ 57.065577] [<ffffffff81642cba>] sk_filter_release_rcu+0x1a/0x30
[ 57.072338] [<ffffffff811108e2>] rcu_process_callbacks+0x202/0x7c0
[ 57.078962] [<ffffffff81057f17>] __do_softirq+0xf7/0x3f0
[ 57.085373] [<ffffffff81058245>] run_ksoftirqd+0x35/0x70
cannot reuse filter memory, since it's readonly, so have to
extend sk_filter with work_struct
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
arch/x86/net/bpf_jit_comp.c | 17 ++++++++++++-----
include/linux/filter.h | 1 +
2 files changed, 13 insertions(+), 5 deletions(-)
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 79c216a..89a43df 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -772,13 +772,20 @@ out:
return;
}
+static void bpf_jit_free_deferred(struct work_struct *work)
+{
+ struct sk_filter *fp = container_of(work, struct sk_filter, work);
+ unsigned long addr = (unsigned long)fp->bpf_func & PAGE_MASK;
+ struct bpf_binary_header *header = (void *)addr;
+
+ set_memory_rw(addr, header->pages);
+ module_free(NULL, header);
+}
+
void bpf_jit_free(struct sk_filter *fp)
{
if (fp->bpf_func != sk_run_filter) {
- unsigned long addr = (unsigned long)fp->bpf_func & PAGE_MASK;
- struct bpf_binary_header *header = (void *)addr;
-
- set_memory_rw(addr, header->pages);
- module_free(NULL, header);
+ INIT_WORK(&fp->work, bpf_jit_free_deferred);
+ schedule_work(&fp->work);
}
}
diff --git a/include/linux/filter.h b/include/linux/filter.h
index a6ac848..378fa03 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -27,6 +27,7 @@ struct sk_filter
unsigned int len; /* Number of filter blocks */
unsigned int (*bpf_func)(const struct sk_buff *skb,
const struct sock_filter *filter);
+ struct work_struct work;
struct rcu_head rcu;
struct sock_filter insns[0];
};
--
1.7.9.5
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH net-next] fix unsafe set_memory_rw from softirq
2013-10-03 3:50 [PATCH net-next] fix unsafe set_memory_rw from softirq Alexei Starovoitov
@ 2013-10-03 4:23 ` Eric Dumazet
2013-10-03 4:44 ` Alexei Starovoitov
0 siblings, 1 reply; 6+ messages in thread
From: Eric Dumazet @ 2013-10-03 4:23 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: David S. Miller, netdev, Alexey Kuznetsov, James Morris,
Hideaki YOSHIFUJI, Patrick McHardy, Thomas Gleixner, Ingo Molnar,
H. Peter Anvin, Daniel Borkmann, Paul E. McKenney, Xi Wang, x86,
Eric Dumazet, linux-kernel
On Wed, 2013-10-02 at 20:50 -0700, Alexei Starovoitov wrote:
> on x86 system with net.core.bpf_jit_enable = 1
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index a6ac848..378fa03 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -27,6 +27,7 @@ struct sk_filter
> unsigned int len; /* Number of filter blocks */
> unsigned int (*bpf_func)(const struct sk_buff *skb,
> const struct sock_filter *filter);
> + struct work_struct work;
> struct rcu_head rcu;
> struct sock_filter insns[0];
> };
Nice catch !
It seems only x86 and s390 needs this work_struct.
(and you might CC Heiko Carstens <heiko.carstens@de.ibm.com> to ask him
to make the s390 part, of Ack it if you plan to do it)
Thanks
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH net-next] fix unsafe set_memory_rw from softirq
2013-10-03 4:23 ` Eric Dumazet
@ 2013-10-03 4:44 ` Alexei Starovoitov
2013-10-03 4:53 ` Eric Dumazet
0 siblings, 1 reply; 6+ messages in thread
From: Alexei Starovoitov @ 2013-10-03 4:44 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S. Miller, netdev, Alexey Kuznetsov, James Morris,
Hideaki YOSHIFUJI, Patrick McHardy, Thomas Gleixner, Ingo Molnar,
H. Peter Anvin, Daniel Borkmann, Paul E. McKenney, Xi Wang, x86,
Eric Dumazet, linux-kernel, Heiko Carstens
On Wed, Oct 2, 2013 at 9:23 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2013-10-02 at 20:50 -0700, Alexei Starovoitov wrote:
>> on x86 system with net.core.bpf_jit_enable = 1
>
>> diff --git a/include/linux/filter.h b/include/linux/filter.h
>> index a6ac848..378fa03 100644
>> --- a/include/linux/filter.h
>> +++ b/include/linux/filter.h
>> @@ -27,6 +27,7 @@ struct sk_filter
>> unsigned int len; /* Number of filter blocks */
>> unsigned int (*bpf_func)(const struct sk_buff *skb,
>> const struct sock_filter *filter);
>> + struct work_struct work;
>> struct rcu_head rcu;
>> struct sock_filter insns[0];
>> };
>
> Nice catch !
>
> It seems only x86 and s390 needs this work_struct.
I think ifdef config_x86 is a bit ugly inside struct sk_filter, but
don't mind whichever way.
> (and you might CC Heiko Carstens <heiko.carstens@de.ibm.com> to ask him
> to make the s390 part, of Ack it if you plan to do it)
set_memory_rw() on s390 is a simple page table walker that doesn't do
any IPI unlike x86
Heiko, please confirm that it's not an issue there.
Thanks
Alexei
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH net-next] fix unsafe set_memory_rw from softirq
2013-10-03 4:44 ` Alexei Starovoitov
@ 2013-10-03 4:53 ` Eric Dumazet
2013-10-03 4:57 ` Eric Dumazet
0 siblings, 1 reply; 6+ messages in thread
From: Eric Dumazet @ 2013-10-03 4:53 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: David S. Miller, netdev, Alexey Kuznetsov, James Morris,
Hideaki YOSHIFUJI, Patrick McHardy, Thomas Gleixner, Ingo Molnar,
H. Peter Anvin, Daniel Borkmann, Paul E. McKenney, Xi Wang, x86,
Eric Dumazet, linux-kernel, Heiko Carstens
On Wed, 2013-10-02 at 21:44 -0700, Alexei Starovoitov wrote:
> I think ifdef config_x86 is a bit ugly inside struct sk_filter, but
> don't mind whichever way.
Its not fair to make sk_filter bigger, because it means that simple (non
JIT) filter might need an extra cache line.
You could presumably use the following layout instead :
struct sk_filter
{
atomic_t refcnt;
struct rcu_head rcu;
struct work_struct work;
unsigned int len ____cacheline_aligned; /* Number of filter blocks */
unsigned int (*bpf_func)(const struct sk_buff *skb,
const struct sock_filter *filter);
struct sock_filter insns[0];
};
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH net-next] fix unsafe set_memory_rw from softirq
2013-10-03 4:53 ` Eric Dumazet
@ 2013-10-03 4:57 ` Eric Dumazet
2013-10-03 11:57 ` Alexei Starovoitov
0 siblings, 1 reply; 6+ messages in thread
From: Eric Dumazet @ 2013-10-03 4:57 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: David S. Miller, netdev, Alexey Kuznetsov, James Morris,
Hideaki YOSHIFUJI, Patrick McHardy, Thomas Gleixner, Ingo Molnar,
H. Peter Anvin, Daniel Borkmann, Paul E. McKenney, Xi Wang, x86,
Eric Dumazet, linux-kernel, Heiko Carstens
On Wed, 2013-10-02 at 21:53 -0700, Eric Dumazet wrote:
> On Wed, 2013-10-02 at 21:44 -0700, Alexei Starovoitov wrote:
>
> > I think ifdef config_x86 is a bit ugly inside struct sk_filter, but
> > don't mind whichever way.
>
> Its not fair to make sk_filter bigger, because it means that simple (non
> JIT) filter might need an extra cache line.
>
> You could presumably use the following layout instead :
>
> struct sk_filter
> {
> atomic_t refcnt;
> struct rcu_head rcu;
> struct work_struct work;
>
> unsigned int len ____cacheline_aligned; /* Number of filter blocks */
> unsigned int (*bpf_func)(const struct sk_buff *skb,
> const struct sock_filter *filter);
> struct sock_filter insns[0];
> };
And since @len is not used by sk_run_filter() use :
struct sk_filter {
atomic_t refcnt;
int len; /* number of filter blocks */
struct rcu_head rcu;
struct work_struct work;
unsigned int (*bpf_func)(const struct sk_buff *skb,
const struct sock_filter *filter) ____cacheline_aligned;
struct sock_filter insns[0];
};
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH net-next] fix unsafe set_memory_rw from softirq
2013-10-03 4:57 ` Eric Dumazet
@ 2013-10-03 11:57 ` Alexei Starovoitov
0 siblings, 0 replies; 6+ messages in thread
From: Alexei Starovoitov @ 2013-10-03 11:57 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S. Miller, netdev, Alexey Kuznetsov, James Morris,
Hideaki YOSHIFUJI, Patrick McHardy, Thomas Gleixner, Ingo Molnar,
H. Peter Anvin, Daniel Borkmann, Paul E. McKenney, Xi Wang, x86,
Eric Dumazet, linux-kernel, Heiko Carstens
On Wed, Oct 2, 2013 at 9:57 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2013-10-02 at 21:53 -0700, Eric Dumazet wrote:
>> On Wed, 2013-10-02 at 21:44 -0700, Alexei Starovoitov wrote:
>>
>> > I think ifdef config_x86 is a bit ugly inside struct sk_filter, but
>> > don't mind whichever way.
>>
>> Its not fair to make sk_filter bigger, because it means that simple (non
>> JIT) filter might need an extra cache line.
>>
>> You could presumably use the following layout instead :
>>
>> struct sk_filter
>> {
>> atomic_t refcnt;
>> struct rcu_head rcu;
>> struct work_struct work;
>>
>> unsigned int len ____cacheline_aligned; /* Number of filter blocks */
>> unsigned int (*bpf_func)(const struct sk_buff *skb,
>> const struct sock_filter *filter);
>> struct sock_filter insns[0];
>> };
>
> And since @len is not used by sk_run_filter() use :
>
> struct sk_filter {
> atomic_t refcnt;
> int len; /* number of filter blocks */
> struct rcu_head rcu;
> struct work_struct work;
>
> unsigned int (*bpf_func)(const struct sk_buff *skb,
> const struct sock_filter *filter) ____cacheline_aligned;
> struct sock_filter insns[0];
> };
yes. make sense to avoid first insn cache miss inside sk_run_filter()
at the expense
of 8-byte gap between work and bpf_func (on x86_64 w/o lockdep)
Probably even better to overlap work and insns fields.
Pro: sk_filter size the same, no impact on non-jit case
Con: would be harder to understand the code
another problem is that kfree(sk_filter) inside
sk_filter_release_rcu() needs to move inside bpf_jit_free().
so self nack. Let me fix these issues and respin
Thanks
Alexei
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2013-10-03 11:57 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-10-03 3:50 [PATCH net-next] fix unsafe set_memory_rw from softirq Alexei Starovoitov
2013-10-03 4:23 ` Eric Dumazet
2013-10-03 4:44 ` Alexei Starovoitov
2013-10-03 4:53 ` Eric Dumazet
2013-10-03 4:57 ` Eric Dumazet
2013-10-03 11:57 ` Alexei Starovoitov
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox