netfilter-devel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Linkui Xiao <xiaolinkui@gmail.com>
To: Jozsef Kadlecsik <kadlec@netfilter.org>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>,
	Florian Westphal <fw@strlen.de>,
	David Miller <davem@davemloft.net>,
	edumazet@google.com, kuba@kernel.org, pabeni@redhat.com,
	justinstitt@google.com, kuniyu@amazon.com,
	netfilter-devel@vger.kernel.org,
	Linkui Xiao <xiaolinkui@kylinos.cn>
Subject: Re: [PATCH 2/2] netfilter: ipset: fix race condition in ipset swap, destroy and test
Date: Tue, 17 Oct 2023 14:29:14 +0800	[thread overview]
Message-ID: <96163a1d-81ed-1739-96da-f1a7b5f63dd7@gmail.com> (raw)
In-Reply-To: <3ad078fa-fa61-95a8-dbd2-33d5faa2a8b@netfilter.org>

Hi,
Thanks for your reply.
On 10/17/23 03:52, Jozsef Kadlecsik wrote:
> Hello,
>
> On Mon, 16 Oct 2023, xiaolinkui wrote:
>
>> From: Linkui Xiao <xiaolinkui@kylinos.cn>
>>
>> There is a race condition which can be demonstrated by the following
>> script:
>>
>> ipset create hash_ip1 hash:net family inet hashsize 1024 maxelem 1048576
>> ipset add hash_ip1 172.20.0.0/16
>> ipset add hash_ip1 192.168.0.0/16
>> iptables -A INPUT -m set --match-set hash_ip1 src -j ACCEPT
>> while [ 1 ]
>> do
>> 	ipset create hash_ip2 hash:net family inet hashsize 1024 maxelem 1048576
>> 	ipset add hash_ip2 172.20.0.0/16
>> 	ipset swap hash_ip1 hash_ip2
>> 	ipset destroy hash_ip2
>> 	sleep 0.05
>> done
> I have been running the script above for more than two hours and nothing
> happened. What else is needed to trigger the bug? (I have been
> continuously sending ping to the tested host.)
This is an extremely low probability event. In our k8s cluster, hundreds
of virtual machines ran for several months before experiencing a crash.
Perhaps due to some special reasons, the ip_set_test run time has been
increased, during this period, other CPU completed the swap and destroy
operations on the ipset.
In order to quickly trigger this bug, we can add a delay and destroy 
operations
to the test kernel to simulate the actual environment, such as:

@@ -733,11 +733,13 @@ ip_set_unlock(struct ip_set *set)
                 spin_unlock_bh(&set->lock);
  }

+static void ip_set_destroy_set(struct ip_set *set);
  int
  ip_set_test(ip_set_id_t index, const struct sk_buff *skb,
             const struct xt_action_param *par, struct ip_set_adt_opt *opt)
  {
         struct ip_set *set = ip_set_rcu_get(xt_net(par), index);
+       struct ip_set_net *inst = ip_set_pernet(xt_net(par));
         int ret = 0;

         BUG_ON(!set);
@@ -747,6 +749,17 @@ ip_set_test(ip_set_id_t index, const struct sk_buff 
*skb,
             !(opt->family == set->family || set->family == 
NFPROTO_UNSPEC))
                 return 0;

+       mdelay(100); // Waiting for swap to complete
+
+       read_lock_bh(&ip_set_ref_lock);
+       if (set->ref == 0 && set->ref_netlink == 0) { // set can be 
destroyed with ref = 0
+               pr_warn("set %s, index %u, ref %u\n", set->name, index, 
set->ref);
+               read_unlock_bh(&ip_set_ref_lock);
+               ip_set(inst, index) = NULL;
+               ip_set_destroy_set(set);
+       } else
+               read_unlock_bh(&ip_set_ref_lock);
+
         ret = set->variant->kadt(set, skb, par, IPSET_TEST, opt);

         if (ret == -EAGAIN) {

Then run the above script on the test kernel, crash will appear on 
hash_net4_kadt:
(need sending ping to the tested host.)

[   93.021792] BUG: kernel NULL pointer dereference, address: 
000000000000009c
[   93.021796] #PF: supervisor read access in kernel mode
[   93.021798] #PF: error_code(0x0000) - not-present page
[   93.021800] PGD 0 P4D 0
[   93.021804] Oops: 0000 [#1] PREEMPT SMP PTI
[   93.021807] CPU: 4 PID: 0 Comm: swapper/4 Kdump: loaded Not tainted 
6.6.0-rc5 #29
[   93.021811] Hardware name: VMware, Inc. VMware Virtual Platform/440BX 
Desktop Reference Platform, BIOS 6.00 11/12/2020
[   93.021813] RIP: 0010:hash_net4_kadt+0x5f/0x1b0 [ip_set_hash_net]
[   93.021825] Code: 00 48 89 44 24 48 48 8b 87 80 00 00 00 45 8b 60 30 
c7 44 24 08 00 00 00 00 4c 8b 54 ca 10 31 d2 c6 44 24 0e 00 66 89 54 24 
0c <0f> b6 b0 9c 00 00 00 40 84 f6 0f 84 bf 00 00 00 48 8d 54 24 10 83
[   93.021827] RSP: 0018:ffffb0180058c9c0 EFLAGS: 00010246
[   93.021830] RAX: 0000000000000000 RBX: 0000000000000002 RCX: 
0000000000000002
[   93.021832] RDX: 0000000000000000 RSI: ffff956d747cdd00 RDI: 
ffff956d9348ba80
[   93.021835] RBP: ffffb0180058ca30 R08: ffffb0180058ca88 R09: 
ffff956d9348ba80
[   93.021837] R10: ffffffffc102f4f0 R11: ffff956d747cdd00 R12: 
00000000ffffffff
[   93.021839] R13: 0000000000000054 R14: ffffb0180058ca88 R15: 
ffff956d747cdd00
[   93.021862] FS:  0000000000000000(0000) GS:ffff956d9b100000(0000) 
knlGS:0000000000000000
[   93.021870] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   93.021873] CR2: 000000000000009c CR3: 0000000031e3a006 CR4: 
00000000003706e0
[   93.021887] Call Trace:
[   93.021891]  <IRQ>
[   93.021893]  ? __die+0x24/0x70
[   93.021900]  ? page_fault_oops+0x82/0x150
[   93.021905]  ? exc_page_fault+0x69/0x150
[   93.021911]  ? asm_exc_page_fault+0x26/0x30
[   93.021915]  ? __pfx_hash_net4_test+0x10/0x10 [ip_set_hash_net]
[   93.021922]  ? hash_net4_kadt+0x5f/0x1b0 [ip_set_hash_net]
[   93.021931]  ip_set_test+0x119/0x250 [ip_set]
[   93.021943]  set_match_v4+0xa2/0xe0 [xt_set]
[   93.021973]  nft_match_eval+0x65/0xb0 [nft_compat]
[   93.021982]  nft_do_chain+0xe1/0x430 [nf_tables]
[   93.022008]  ? resolve_normal_ct+0xc1/0x200 [nf_conntrack]
[   93.022026]  ? nf_conntrack_icmpv4_error+0x123/0x1a0 [nf_conntrack]
[   93.022046]  nft_do_chain_ipv4+0x6b/0x90 [nf_tables]
[   93.022066]  nf_hook_slow+0x40/0xc0
[   93.022071]  ip_local_deliver+0xdb/0x130
[   93.022075]  ? __pfx_ip_local_deliver_finish+0x10/0x10
[   93.022078]  __netif_receive_skb_core.constprop.0+0x6e1/0x10a0
[   93.022086]  ? find_busiest_group+0x43/0x240
[   93.022091]  __netif_receive_skb_list_core+0x136/0x2c0
[   93.022096]  netif_receive_skb_list_internal+0x1c9/0x300
[   93.022101]  ? e1000_clean_rx_irq+0x316/0x4c0 [e1000]
[   93.022113]  napi_complete_done+0x73/0x1b0
[   93.022117]  e1000_clean+0x7c/0x1e0 [e1000]
[   93.022127]  __napi_poll+0x29/0x1b0
[   93.022131]  net_rx_action+0x29f/0x370
[   93.022136]  __do_softirq+0xcc/0x2af
[   93.022142]  __irq_exit_rcu+0xa1/0xc0
[   93.022147]  sysvec_apic_timer_interrupt+0x76/0x90
>
>> Swap will exchange the values of ref so destroy will see ref = 0 instead of
>> ref = 1. So after running this script for a period of time, the following
>> race situations may occur:
>> 	CPU0:                CPU1:
>> 	ipt_do_table
>> 	->set_match_v4
>> 	->ip_set_test
>> 			ipset swap hash_ip1 hash_ip2
>> 			ipset destroy hash_ip2
>> 	->hash_net4_kadt
>>
>> CPU0 found ipset through the index, and at this time, hash_ip2 has been
>> destroyed by CPU1 through name search. So CPU0 will crash when accessing
>> set->data in the function hash_net4_kadt.
>>
>> With this fix in place swap will not succeed because ip_set_test still has
>> ref_swapping on the set.
>>
>> Both destroy and swap will error out if ref_swapping != 0 on the set.
>>
>> Signed-off-by: Linkui Xiao <xiaolinkui@kylinos.cn>
>> ---
>>   net/netfilter/ipset/ip_set_core.c | 6 +++++-
>>   1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/netfilter/ipset/ip_set_core.c b/net/netfilter/ipset/ip_set_core.c
>> index e5d25df5c64c..d6bd37010bfb 100644
>> --- a/net/netfilter/ipset/ip_set_core.c
>> +++ b/net/netfilter/ipset/ip_set_core.c
>> @@ -741,11 +741,13 @@ ip_set_test(ip_set_id_t index, const struct sk_buff *skb,
>>   	int ret = 0;
>>   
>>   	BUG_ON(!set);
>> +
>> +	__ip_set_get_swapping(set);
>>   	pr_debug("set %s, index %u\n", set->name, index);
>>   
>>   	if (opt->dim < set->type->dimension ||
>>   	    !(opt->family == set->family || set->family == NFPROTO_UNSPEC))
>> -		return 0;
>> +		goto out;
>>   
>>   	ret = set->variant->kadt(set, skb, par, IPSET_TEST, opt);
>>   
>> @@ -764,6 +766,8 @@ ip_set_test(ip_set_id_t index, const struct sk_buff *skb,
>>   			ret = -ret;
>>   	}
>>   
>> +out:
>> +	__ip_set_put_swapping(set);
>>   	/* Convert error codes to nomatch */
>>   	return (ret < 0 ? 0 : ret);
>>   }
>> -- 
>> 2.17.1
> The patch above alas is also not acceptable: it adds locking to a lockless
> area and thus slows down the execution unnecessarily.
Use seq of cpu in destroy can avoid this issue, but Florian Westphal says
it's not suitable: 
https://lore.kernel.org/all/20231005123107.GB9350@breakpoint.cc/
>
> If there's a bug, then that must be handled in ip_set_swap() itself, like
> for example adding a quiescent time and waiting for the ongoing users of
> the swapped set to finish their job. You can make it slow there, because
> swapping is not performance critical.
Can we add read seq of cpu to swap to wait for the test to complete? 
Such as:

@@ -1357,6 +1357,7 @@ static int ip_set_swap(struct sk_buff *skb, const 
struct nfnl_info *info,
         struct ip_set *from, *to;
         ip_set_id_t from_id, to_id;
         char from_name[IPSET_MAXNAMELEN];
+       unsigned int cpu;

         if (unlikely(protocol_min_failed(attr) ||
                      !attr[IPSET_ATTR_SETNAME] ||
@@ -1395,8 +1396,21 @@ static int ip_set_swap(struct sk_buff *skb, const 
struct nfnl_info *info,
         swap(from->ref, to->ref);
         ip_set(inst, from_id) = to;
         ip_set(inst, to_id) = from;
-       write_unlock_bh(&ip_set_ref_lock);

+       /* wait for even xt_recseq on all cpus */
+       for_each_possible_cpu(cpu) {
+               seqcount_t *s = &per_cpu(xt_recseq, cpu);
+               u32 seq = raw_read_seqcount(s);
+
+               if (seq & 1) {
+                       do {
+                               cond_resched();
+                               cpu_relax();
+                       } while (seq == raw_read_seqcount(s));
+               }
+       }
+
+       write_unlock_bh(&ip_set_ref_lock);
         return 0;
  }

Of course, this solution cannot be verified in the test kernel above.
>
> Best regards,
> Jozsef
Do you have a better solution for this race condition?
Looking forward to your reply.

Best regards,
Linkui Xiao

  reply	other threads:[~2023-10-17  6:29 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-10-16 13:52 [PATCH 1/2] netfilter: ipset: rename ref_netlink to ref_swapping xiaolinkui
2023-10-16 13:52 ` [PATCH 2/2] netfilter: ipset: fix race condition in ipset swap, destroy and test xiaolinkui
2023-10-16 19:52   ` Jozsef Kadlecsik
2023-10-17  6:29     ` Linkui Xiao [this message]
2023-10-17  9:26       ` Jozsef Kadlecsik
2023-10-17 12:49         ` Linkui Xiao
2023-10-16 19:43 ` [PATCH 1/2] netfilter: ipset: rename ref_netlink to ref_swapping Jozsef Kadlecsik

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=96163a1d-81ed-1739-96da-f1a7b5f63dd7@gmail.com \
    --to=xiaolinkui@gmail.com \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=fw@strlen.de \
    --cc=justinstitt@google.com \
    --cc=kadlec@netfilter.org \
    --cc=kuba@kernel.org \
    --cc=kuniyu@amazon.com \
    --cc=netfilter-devel@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=pablo@netfilter.org \
    --cc=xiaolinkui@kylinos.cn \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).