From: Linkui Xiao <xiaolinkui@gmail.com>
To: Jozsef Kadlecsik <kadlec@netfilter.org>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>,
Florian Westphal <fw@strlen.de>,
David Miller <davem@davemloft.net>,
edumazet@google.com, kuba@kernel.org, pabeni@redhat.com,
justinstitt@google.com, kuniyu@amazon.com,
netfilter-devel@vger.kernel.org,
Linkui Xiao <xiaolinkui@kylinos.cn>
Subject: Re: [PATCH 2/2] netfilter: ipset: fix race condition in ipset swap, destroy and test
Date: Tue, 17 Oct 2023 14:29:14 +0800 [thread overview]
Message-ID: <96163a1d-81ed-1739-96da-f1a7b5f63dd7@gmail.com> (raw)
In-Reply-To: <3ad078fa-fa61-95a8-dbd2-33d5faa2a8b@netfilter.org>
Hi,
Thanks for your reply.
On 10/17/23 03:52, Jozsef Kadlecsik wrote:
> Hello,
>
> On Mon, 16 Oct 2023, xiaolinkui wrote:
>
>> From: Linkui Xiao <xiaolinkui@kylinos.cn>
>>
>> There is a race condition which can be demonstrated by the following
>> script:
>>
>> ipset create hash_ip1 hash:net family inet hashsize 1024 maxelem 1048576
>> ipset add hash_ip1 172.20.0.0/16
>> ipset add hash_ip1 192.168.0.0/16
>> iptables -A INPUT -m set --match-set hash_ip1 src -j ACCEPT
>> while [ 1 ]
>> do
>> ipset create hash_ip2 hash:net family inet hashsize 1024 maxelem 1048576
>> ipset add hash_ip2 172.20.0.0/16
>> ipset swap hash_ip1 hash_ip2
>> ipset destroy hash_ip2
>> sleep 0.05
>> done
> I have been running the script above for more than two hours and nothing
> happened. What else is needed to trigger the bug? (I have been
> continuously sending ping to the tested host.)
This is an extremely low probability event. In our k8s cluster, hundreds
of virtual machines ran for several months before experiencing a crash.
Perhaps due to some special reasons, the ip_set_test run time has been
increased, during this period, other CPU completed the swap and destroy
operations on the ipset.
In order to quickly trigger this bug, we can add a delay and destroy
operations
to the test kernel to simulate the actual environment, such as:
@@ -733,11 +733,13 @@ ip_set_unlock(struct ip_set *set)
spin_unlock_bh(&set->lock);
}
+static void ip_set_destroy_set(struct ip_set *set);
int
ip_set_test(ip_set_id_t index, const struct sk_buff *skb,
const struct xt_action_param *par, struct ip_set_adt_opt *opt)
{
struct ip_set *set = ip_set_rcu_get(xt_net(par), index);
+ struct ip_set_net *inst = ip_set_pernet(xt_net(par));
int ret = 0;
BUG_ON(!set);
@@ -747,6 +749,17 @@ ip_set_test(ip_set_id_t index, const struct sk_buff
*skb,
!(opt->family == set->family || set->family ==
NFPROTO_UNSPEC))
return 0;
+ mdelay(100); // Waiting for swap to complete
+
+ read_lock_bh(&ip_set_ref_lock);
+ if (set->ref == 0 && set->ref_netlink == 0) { // set can be
destroyed with ref = 0
+ pr_warn("set %s, index %u, ref %u\n", set->name, index,
set->ref);
+ read_unlock_bh(&ip_set_ref_lock);
+ ip_set(inst, index) = NULL;
+ ip_set_destroy_set(set);
+ } else
+ read_unlock_bh(&ip_set_ref_lock);
+
ret = set->variant->kadt(set, skb, par, IPSET_TEST, opt);
if (ret == -EAGAIN) {
Then run the above script on the test kernel, crash will appear on
hash_net4_kadt:
(need sending ping to the tested host.)
[ 93.021792] BUG: kernel NULL pointer dereference, address:
000000000000009c
[ 93.021796] #PF: supervisor read access in kernel mode
[ 93.021798] #PF: error_code(0x0000) - not-present page
[ 93.021800] PGD 0 P4D 0
[ 93.021804] Oops: 0000 [#1] PREEMPT SMP PTI
[ 93.021807] CPU: 4 PID: 0 Comm: swapper/4 Kdump: loaded Not tainted
6.6.0-rc5 #29
[ 93.021811] Hardware name: VMware, Inc. VMware Virtual Platform/440BX
Desktop Reference Platform, BIOS 6.00 11/12/2020
[ 93.021813] RIP: 0010:hash_net4_kadt+0x5f/0x1b0 [ip_set_hash_net]
[ 93.021825] Code: 00 48 89 44 24 48 48 8b 87 80 00 00 00 45 8b 60 30
c7 44 24 08 00 00 00 00 4c 8b 54 ca 10 31 d2 c6 44 24 0e 00 66 89 54 24
0c <0f> b6 b0 9c 00 00 00 40 84 f6 0f 84 bf 00 00 00 48 8d 54 24 10 83
[ 93.021827] RSP: 0018:ffffb0180058c9c0 EFLAGS: 00010246
[ 93.021830] RAX: 0000000000000000 RBX: 0000000000000002 RCX:
0000000000000002
[ 93.021832] RDX: 0000000000000000 RSI: ffff956d747cdd00 RDI:
ffff956d9348ba80
[ 93.021835] RBP: ffffb0180058ca30 R08: ffffb0180058ca88 R09:
ffff956d9348ba80
[ 93.021837] R10: ffffffffc102f4f0 R11: ffff956d747cdd00 R12:
00000000ffffffff
[ 93.021839] R13: 0000000000000054 R14: ffffb0180058ca88 R15:
ffff956d747cdd00
[ 93.021862] FS: 0000000000000000(0000) GS:ffff956d9b100000(0000)
knlGS:0000000000000000
[ 93.021870] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 93.021873] CR2: 000000000000009c CR3: 0000000031e3a006 CR4:
00000000003706e0
[ 93.021887] Call Trace:
[ 93.021891] <IRQ>
[ 93.021893] ? __die+0x24/0x70
[ 93.021900] ? page_fault_oops+0x82/0x150
[ 93.021905] ? exc_page_fault+0x69/0x150
[ 93.021911] ? asm_exc_page_fault+0x26/0x30
[ 93.021915] ? __pfx_hash_net4_test+0x10/0x10 [ip_set_hash_net]
[ 93.021922] ? hash_net4_kadt+0x5f/0x1b0 [ip_set_hash_net]
[ 93.021931] ip_set_test+0x119/0x250 [ip_set]
[ 93.021943] set_match_v4+0xa2/0xe0 [xt_set]
[ 93.021973] nft_match_eval+0x65/0xb0 [nft_compat]
[ 93.021982] nft_do_chain+0xe1/0x430 [nf_tables]
[ 93.022008] ? resolve_normal_ct+0xc1/0x200 [nf_conntrack]
[ 93.022026] ? nf_conntrack_icmpv4_error+0x123/0x1a0 [nf_conntrack]
[ 93.022046] nft_do_chain_ipv4+0x6b/0x90 [nf_tables]
[ 93.022066] nf_hook_slow+0x40/0xc0
[ 93.022071] ip_local_deliver+0xdb/0x130
[ 93.022075] ? __pfx_ip_local_deliver_finish+0x10/0x10
[ 93.022078] __netif_receive_skb_core.constprop.0+0x6e1/0x10a0
[ 93.022086] ? find_busiest_group+0x43/0x240
[ 93.022091] __netif_receive_skb_list_core+0x136/0x2c0
[ 93.022096] netif_receive_skb_list_internal+0x1c9/0x300
[ 93.022101] ? e1000_clean_rx_irq+0x316/0x4c0 [e1000]
[ 93.022113] napi_complete_done+0x73/0x1b0
[ 93.022117] e1000_clean+0x7c/0x1e0 [e1000]
[ 93.022127] __napi_poll+0x29/0x1b0
[ 93.022131] net_rx_action+0x29f/0x370
[ 93.022136] __do_softirq+0xcc/0x2af
[ 93.022142] __irq_exit_rcu+0xa1/0xc0
[ 93.022147] sysvec_apic_timer_interrupt+0x76/0x90
>
>> Swap will exchange the values of ref so destroy will see ref = 0 instead of
>> ref = 1. So after running this script for a period of time, the following
>> race situations may occur:
>> CPU0: CPU1:
>> ipt_do_table
>> ->set_match_v4
>> ->ip_set_test
>> ipset swap hash_ip1 hash_ip2
>> ipset destroy hash_ip2
>> ->hash_net4_kadt
>>
>> CPU0 found ipset through the index, and at this time, hash_ip2 has been
>> destroyed by CPU1 through name search. So CPU0 will crash when accessing
>> set->data in the function hash_net4_kadt.
>>
>> With this fix in place swap will not succeed because ip_set_test still has
>> ref_swapping on the set.
>>
>> Both destroy and swap will error out if ref_swapping != 0 on the set.
>>
>> Signed-off-by: Linkui Xiao <xiaolinkui@kylinos.cn>
>> ---
>> net/netfilter/ipset/ip_set_core.c | 6 +++++-
>> 1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/netfilter/ipset/ip_set_core.c b/net/netfilter/ipset/ip_set_core.c
>> index e5d25df5c64c..d6bd37010bfb 100644
>> --- a/net/netfilter/ipset/ip_set_core.c
>> +++ b/net/netfilter/ipset/ip_set_core.c
>> @@ -741,11 +741,13 @@ ip_set_test(ip_set_id_t index, const struct sk_buff *skb,
>> int ret = 0;
>>
>> BUG_ON(!set);
>> +
>> + __ip_set_get_swapping(set);
>> pr_debug("set %s, index %u\n", set->name, index);
>>
>> if (opt->dim < set->type->dimension ||
>> !(opt->family == set->family || set->family == NFPROTO_UNSPEC))
>> - return 0;
>> + goto out;
>>
>> ret = set->variant->kadt(set, skb, par, IPSET_TEST, opt);
>>
>> @@ -764,6 +766,8 @@ ip_set_test(ip_set_id_t index, const struct sk_buff *skb,
>> ret = -ret;
>> }
>>
>> +out:
>> + __ip_set_put_swapping(set);
>> /* Convert error codes to nomatch */
>> return (ret < 0 ? 0 : ret);
>> }
>> --
>> 2.17.1
> The patch above alas is also not acceptable: it adds locking to a lockless
> area and thus slows down the execution unnecessarily.
Use seq of cpu in destroy can avoid this issue, but Florian Westphal says
it's not suitable:
https://lore.kernel.org/all/20231005123107.GB9350@breakpoint.cc/
>
> If there's a bug, then that must be handled in ip_set_swap() itself, like
> for example adding a quiescent time and waiting for the ongoing users of
> the swapped set to finish their job. You can make it slow there, because
> swapping is not performance critical.
Can we add read seq of cpu to swap to wait for the test to complete?
Such as:
@@ -1357,6 +1357,7 @@ static int ip_set_swap(struct sk_buff *skb, const
struct nfnl_info *info,
struct ip_set *from, *to;
ip_set_id_t from_id, to_id;
char from_name[IPSET_MAXNAMELEN];
+ unsigned int cpu;
if (unlikely(protocol_min_failed(attr) ||
!attr[IPSET_ATTR_SETNAME] ||
@@ -1395,8 +1396,21 @@ static int ip_set_swap(struct sk_buff *skb, const
struct nfnl_info *info,
swap(from->ref, to->ref);
ip_set(inst, from_id) = to;
ip_set(inst, to_id) = from;
- write_unlock_bh(&ip_set_ref_lock);
+ /* wait for even xt_recseq on all cpus */
+ for_each_possible_cpu(cpu) {
+ seqcount_t *s = &per_cpu(xt_recseq, cpu);
+ u32 seq = raw_read_seqcount(s);
+
+ if (seq & 1) {
+ do {
+ cond_resched();
+ cpu_relax();
+ } while (seq == raw_read_seqcount(s));
+ }
+ }
+
+ write_unlock_bh(&ip_set_ref_lock);
return 0;
}
Of course, this solution cannot be verified in the test kernel above.
>
> Best regards,
> Jozsef
Do you have a better solution for this race condition?
Looking forward to your reply.
Best regards,
Linkui Xiao
next prev parent reply other threads:[~2023-10-17 6:29 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-10-16 13:52 [PATCH 1/2] netfilter: ipset: rename ref_netlink to ref_swapping xiaolinkui
2023-10-16 13:52 ` [PATCH 2/2] netfilter: ipset: fix race condition in ipset swap, destroy and test xiaolinkui
2023-10-16 19:52 ` Jozsef Kadlecsik
2023-10-17 6:29 ` Linkui Xiao [this message]
2023-10-17 9:26 ` Jozsef Kadlecsik
2023-10-17 12:49 ` Linkui Xiao
2023-10-16 19:43 ` [PATCH 1/2] netfilter: ipset: rename ref_netlink to ref_swapping Jozsef Kadlecsik
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=96163a1d-81ed-1739-96da-f1a7b5f63dd7@gmail.com \
--to=xiaolinkui@gmail.com \
--cc=davem@davemloft.net \
--cc=edumazet@google.com \
--cc=fw@strlen.de \
--cc=justinstitt@google.com \
--cc=kadlec@netfilter.org \
--cc=kuba@kernel.org \
--cc=kuniyu@amazon.com \
--cc=netfilter-devel@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=pablo@netfilter.org \
--cc=xiaolinkui@kylinos.cn \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).