From: Linkui Xiao <xiaolinkui@gmail.com>
To: Jozsef Kadlecsik <kadlec@netfilter.org>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>,
Florian Westphal <fw@strlen.de>,
David Miller <davem@davemloft.net>,
edumazet@google.com, kuba@kernel.org, pabeni@redhat.com,
justinstitt@google.com, kuniyu@amazon.com,
netfilter-devel@vger.kernel.org,
Linkui Xiao <xiaolinkui@kylinos.cn>
Subject: Re: [PATCH 2/2] netfilter: ipset: fix race condition in ipset swap, destroy and test
Date: Tue, 17 Oct 2023 20:49:11 +0800 [thread overview]
Message-ID: <8586e54f-eb36-dad5-ca1e-8924bc5e23a8@gmail.com> (raw)
In-Reply-To: <69e7963b-e7f8-3ad0-210-7b86eebf7f78@netfilter.org>
Hi,
Thanks for your reply.
On 10/17/23 17:26, Jozsef Kadlecsik wrote:
> Hi,
>
> On Tue, 17 Oct 2023, Linkui Xiao wrote:
>
>> On 10/17/23 03:52, Jozsef Kadlecsik wrote:
>>> Hello,
>>>
>>> On Mon, 16 Oct 2023, xiaolinkui wrote:
>>>
>>>> From: Linkui Xiao <xiaolinkui@kylinos.cn>
>>>>
>>>> There is a race condition which can be demonstrated by the following
>>>> script:
>>>>
>>>> ipset create hash_ip1 hash:net family inet hashsize 1024 maxelem 1048576
>>>> ipset add hash_ip1 172.20.0.0/16
>>>> ipset add hash_ip1 192.168.0.0/16
>>>> iptables -A INPUT -m set --match-set hash_ip1 src -j ACCEPT
>>>> while [ 1 ]
>>>> do
>>>> ipset create hash_ip2 hash:net family inet hashsize 1024 maxelem
>>>> 1048576
>>>> ipset add hash_ip2 172.20.0.0/16
>>>> ipset swap hash_ip1 hash_ip2
>>>> ipset destroy hash_ip2
>>>> sleep 0.05
>>>> done
>>> I have been running the script above for more than two hours and nothing
>>> happened. What else is needed to trigger the bug? (I have been
>>> continuously sending ping to the tested host.)
>> This is an extremely low probability event. In our k8s cluster, hundreds
>> of virtual machines ran for several months before experiencing a crash.
>> Perhaps due to some special reasons, the ip_set_test run time has been
>> increased, during this period, other CPU completed the swap and destroy
>> operations on the ipset.
>> In order to quickly trigger this bug, we can add a delay and destroy
>> operations to the test kernel to simulate the actual environment, such
>> as:
>>
>> @@ -733,11 +733,13 @@ ip_set_unlock(struct ip_set *set)
>> spin_unlock_bh(&set->lock);
>> }
>>
>> +static void ip_set_destroy_set(struct ip_set *set);
>> int
>> ip_set_test(ip_set_id_t index, const struct sk_buff *skb,
>> const struct xt_action_param *par, struct ip_set_adt_opt *opt)
>> {
>> struct ip_set *set = ip_set_rcu_get(xt_net(par), index);
>> + struct ip_set_net *inst = ip_set_pernet(xt_net(par));
>> int ret = 0;
>>
>> BUG_ON(!set);
>> @@ -747,6 +749,17 @@ ip_set_test(ip_set_id_t index, const struct sk_buff *skb,
>> !(opt->family == set->family || set->family == NFPROTO_UNSPEC))
>> return 0;
>>
>> + mdelay(100); // Waiting for swap to complete
>> +
>> + read_lock_bh(&ip_set_ref_lock);
>> + if (set->ref == 0 && set->ref_netlink == 0) { // set can be destroyed
>> with ref = 0
>> + pr_warn("set %s, index %u, ref %u\n", set->name, index,
>> set->ref);
>> + read_unlock_bh(&ip_set_ref_lock);
>> + ip_set(inst, index) = NULL;
>> + ip_set_destroy_set(set);
>> + } else
>> + read_unlock_bh(&ip_set_ref_lock);
>> +
>> ret = set->variant->kadt(set, skb, par, IPSET_TEST, opt);
>>
>> if (ret == -EAGAIN) {
>>
>> Then run the above script on the test kernel, crash will appear on
>> hash_net4_kadt: (need sending ping to the tested host.)
> I understand the intent but that's a sure way to crash the kernel indeed.
>
>> [ 93.021792] BUG: kernel NULL pointer dereference, address: 000000000000009c
>> [ 93.021796] #PF: supervisor read access in kernel mode
>> [ 93.021798] #PF: error_code(0x0000) - not-present page
>> [ 93.021800] PGD 0 P4D 0
>> [ 93.021804] Oops: 0000 [#1] PREEMPT SMP PTI
>> [ 93.021807] CPU: 4 PID: 0 Comm: swapper/4 Kdump: loaded Not tainted
>> 6.6.0-rc5 #29
>> [ 93.021811] Hardware name: VMware, Inc. VMware Virtual Platform/440BX
>> Desktop Reference Platform, BIOS 6.00 11/12/2020
>> [ 93.021813] RIP: 0010:hash_net4_kadt+0x5f/0x1b0 [ip_set_hash_net]
>> [ 93.021825] Code: 00 48 89 44 24 48 48 8b 87 80 00 00 00 45 8b 60 30 c7 44
>> 24 08 00 00 00 00 4c 8b 54 ca 10 31 d2 c6 44 24 0e 00 66 89 54 24 0c <0f> b6
>> b0 9c 00 00 00 40 84 f6 0f 84 bf 00 00 00 48 8d 54 24 10 83
>> [ 93.021827] RSP: 0018:ffffb0180058c9c0 EFLAGS: 00010246
>> [ 93.021830] RAX: 0000000000000000 RBX: 0000000000000002 RCX:
>> 0000000000000002
>> [ 93.021832] RDX: 0000000000000000 RSI: ffff956d747cdd00 RDI:
>> ffff956d9348ba80
>> [ 93.021835] RBP: ffffb0180058ca30 R08: ffffb0180058ca88 R09:
>> ffff956d9348ba80
>> [ 93.021837] R10: ffffffffc102f4f0 R11: ffff956d747cdd00 R12:
>> 00000000ffffffff
>> [ 93.021839] R13: 0000000000000054 R14: ffffb0180058ca88 R15:
>> ffff956d747cdd00
>> [ 93.021862] FS: 0000000000000000(0000) GS:ffff956d9b100000(0000)
>> knlGS:0000000000000000
>> [ 93.021870] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 93.021873] CR2: 000000000000009c CR3: 0000000031e3a006 CR4:
>> 00000000003706e0
>> [ 93.021887] Call Trace:
>> [ 93.021891] <IRQ>
>> [ 93.021893] ? __die+0x24/0x70
>> [ 93.021900] ? page_fault_oops+0x82/0x150
>> [ 93.021905] ? exc_page_fault+0x69/0x150
>> [ 93.021911] ? asm_exc_page_fault+0x26/0x30
>> [ 93.021915] ? __pfx_hash_net4_test+0x10/0x10 [ip_set_hash_net]
>> [ 93.021922] ? hash_net4_kadt+0x5f/0x1b0 [ip_set_hash_net]
>> [ 93.021931] ip_set_test+0x119/0x250 [ip_set]
>> [ 93.021943] set_match_v4+0xa2/0xe0 [xt_set]
>> [ 93.021973] nft_match_eval+0x65/0xb0 [nft_compat]
>> [ 93.021982] nft_do_chain+0xe1/0x430 [nf_tables]
>> [ 93.022008] ? resolve_normal_ct+0xc1/0x200 [nf_conntrack]
>> [ 93.022026] ? nf_conntrack_icmpv4_error+0x123/0x1a0 [nf_conntrack]
>> [ 93.022046] nft_do_chain_ipv4+0x6b/0x90 [nf_tables]
>> [ 93.022066] nf_hook_slow+0x40/0xc0
>> [ 93.022071] ip_local_deliver+0xdb/0x130
>> [ 93.022075] ? __pfx_ip_local_deliver_finish+0x10/0x10
>> [ 93.022078] __netif_receive_skb_core.constprop.0+0x6e1/0x10a0
>> [ 93.022086] ? find_busiest_group+0x43/0x240
>> [ 93.022091] __netif_receive_skb_list_core+0x136/0x2c0
>> [ 93.022096] netif_receive_skb_list_internal+0x1c9/0x300
>> [ 93.022101] ? e1000_clean_rx_irq+0x316/0x4c0 [e1000]
>> [ 93.022113] napi_complete_done+0x73/0x1b0
>> [ 93.022117] e1000_clean+0x7c/0x1e0 [e1000]
>> [ 93.022127] __napi_poll+0x29/0x1b0
>> [ 93.022131] net_rx_action+0x29f/0x370
>> [ 93.022136] __do_softirq+0xcc/0x2af
>> [ 93.022142] __irq_exit_rcu+0xa1/0xc0
>> [ 93.022147] sysvec_apic_timer_interrupt+0x76/0x90
>>>> Swap will exchange the values of ref so destroy will see ref = 0 instead
>>>> of
>>>> ref = 1. So after running this script for a period of time, the following
>>>> race situations may occur:
>>>> CPU0: CPU1:
>>>> ipt_do_table
>>>> ->set_match_v4
>>>> ->ip_set_test
>>>> ipset swap hash_ip1 hash_ip2
>>>> ipset destroy hash_ip2
>>>> ->hash_net4_kadt
>>>>
>>>> CPU0 found ipset through the index, and at this time, hash_ip2 has been
>>>> destroyed by CPU1 through name search. So CPU0 will crash when accessing
>>>> set->data in the function hash_net4_kadt.
>>>>
>>>> With this fix in place swap will not succeed because ip_set_test still has
>>>> ref_swapping on the set.
>>>>
>>>> Both destroy and swap will error out if ref_swapping != 0 on the set.
>>>>
>>>> Signed-off-by: Linkui Xiao <xiaolinkui@kylinos.cn>
>>>> ---
>>>> net/netfilter/ipset/ip_set_core.c | 6 +++++-
>>>> 1 file changed, 5 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/net/netfilter/ipset/ip_set_core.c
>>>> b/net/netfilter/ipset/ip_set_core.c
>>>> index e5d25df5c64c..d6bd37010bfb 100644
>>>> --- a/net/netfilter/ipset/ip_set_core.c
>>>> +++ b/net/netfilter/ipset/ip_set_core.c
>>>> @@ -741,11 +741,13 @@ ip_set_test(ip_set_id_t index, const struct sk_buff
>>>> *skb,
>>>> int ret = 0;
>>>> BUG_ON(!set);
>>>> +
>>>> + __ip_set_get_swapping(set);
>>>> pr_debug("set %s, index %u\n", set->name, index);
>>>> if (opt->dim < set->type->dimension ||
>>>> !(opt->family == set->family || set->family == NFPROTO_UNSPEC))
>>>> - return 0;
>>>> + goto out;
>>>> ret = set->variant->kadt(set, skb, par, IPSET_TEST, opt);
>>>> @@ -764,6 +766,8 @@ ip_set_test(ip_set_id_t index, const struct sk_buff
>>>> *skb,
>>>> ret = -ret;
>>>> }
>>>> +out:
>>>> + __ip_set_put_swapping(set);
>>>> /* Convert error codes to nomatch */
>>>> return (ret < 0 ? 0 : ret);
>>>> }
>>>> --
>>>> 2.17.1
>>> The patch above alas is also not acceptable: it adds locking to a lockless
>>> area and thus slows down the execution unnecessarily.
>> Use seq of cpu in destroy can avoid this issue, but Florian Westphal says
>> it's not suitable:
>> https://lore.kernel.org/all/20231005123107.GB9350@breakpoint.cc/
> No, because there are other, non-iptables usage of ipset in the kernel
> too. And Florian is right that synchonize_rcu() is the proper way to solve
> the issue.
>
>>> If there's a bug, then that must be handled in ip_set_swap() itself, like
>>> for example adding a quiescent time and waiting for the ongoing users of
>>> the swapped set to finish their job. You can make it slow there, because
>>> swapping is not performance critical.
>> Can we add read seq of cpu to swap to wait for the test to complete? Such as:
>>
>> @@ -1357,6 +1357,7 @@ static int ip_set_swap(struct sk_buff *skb, const struct
>> nfnl_info *info,
>> struct ip_set *from, *to;
>> ip_set_id_t from_id, to_id;
>> char from_name[IPSET_MAXNAMELEN];
>> + unsigned int cpu;
>>
>> if (unlikely(protocol_min_failed(attr) ||
>> !attr[IPSET_ATTR_SETNAME] ||
>> @@ -1395,8 +1396,21 @@ static int ip_set_swap(struct sk_buff *skb, const
>> struct nfnl_info *info,
>> swap(from->ref, to->ref);
>> ip_set(inst, from_id) = to;
>> ip_set(inst, to_id) = from;
>> - write_unlock_bh(&ip_set_ref_lock);
>>
>> + /* wait for even xt_recseq on all cpus */
>> + for_each_possible_cpu(cpu) {
>> + seqcount_t *s = &per_cpu(xt_recseq, cpu);
>> + u32 seq = raw_read_seqcount(s);
>> +
>> + if (seq & 1) {
>> + do {
>> + cond_resched();
>> + cpu_relax();
>> + } while (seq == raw_read_seqcount(s));
>> + }
>> + }
>> +
>> + write_unlock_bh(&ip_set_ref_lock);
>> return 0;
>> }
>>
>> Of course, this solution cannot be verified in the test kernel above.
> Also, it does not solve all possible cases, as it was mentioned.
>
> I'd go with using synchonize_rcu() but in the swap function instead of the
> destroy one. The patch below should solve the problem. (The patching above
> is not suitable to verify it at all.):
>
> diff --git a/net/netfilter/ipset/ip_set_core.c b/net/netfilter/ipset/ip_set_core.c
> index 35d2f9c9ada0..35bd1c1e09de 100644
> --- a/net/netfilter/ipset/ip_set_core.c
> +++ b/net/netfilter/ipset/ip_set_core.c
> @@ -712,13 +712,18 @@ ip_set_rcu_get(struct net *net, ip_set_id_t index)
> struct ip_set_net *inst = ip_set_pernet(net);
>
> rcu_read_lock();
> - /* ip_set_list itself needs to be protected */
> + /* ip_set_list and the set pointer need to be protected */
> set = rcu_dereference(inst->ip_set_list)[index];
> - rcu_read_unlock();
>
> return set;
> }
>
> +static inline void
> +ip_set_rcu_put(struct ip_set *set __always_unused)
> +{
> + rcu_read_unlock();
> +}
> +
> static inline void
> ip_set_lock(struct ip_set *set)
> {
> @@ -744,8 +749,10 @@ ip_set_test(ip_set_id_t index, const struct sk_buff *skb,
> pr_debug("set %s, index %u\n", set->name, index);
>
> if (opt->dim < set->type->dimension ||
> - !(opt->family == set->family || set->family == NFPROTO_UNSPEC))
> + !(opt->family == set->family || set->family == NFPROTO_UNSPEC)) {
> + ip_set_rcu_put(set);
> return 0;
> + }
>
> ret = set->variant->kadt(set, skb, par, IPSET_TEST, opt);
>
> @@ -764,6 +771,7 @@ ip_set_test(ip_set_id_t index, const struct sk_buff *skb,
> ret = -ret;
> }
>
> + ip_set_rcu_put(set);
> /* Convert error codes to nomatch */
> return (ret < 0 ? 0 : ret);
> }
> @@ -780,12 +788,15 @@ ip_set_add(ip_set_id_t index, const struct sk_buff *skb,
> pr_debug("set %s, index %u\n", set->name, index);
>
> if (opt->dim < set->type->dimension ||
> - !(opt->family == set->family || set->family == NFPROTO_UNSPEC))
> + !(opt->family == set->family || set->family == NFPROTO_UNSPEC)) {
> + ip_set_rcu_put(set);
> return -IPSET_ERR_TYPE_MISMATCH;
> + }
>
> ip_set_lock(set);
> ret = set->variant->kadt(set, skb, par, IPSET_ADD, opt);
> ip_set_unlock(set);
> + ip_set_rcu_put(set);
>
> return ret;
> }
> @@ -802,12 +813,15 @@ ip_set_del(ip_set_id_t index, const struct sk_buff *skb,
> pr_debug("set %s, index %u\n", set->name, index);
>
> if (opt->dim < set->type->dimension ||
> - !(opt->family == set->family || set->family == NFPROTO_UNSPEC))
> + !(opt->family == set->family || set->family == NFPROTO_UNSPEC)) {
> + ip_set_rcu_put(set);
> return -IPSET_ERR_TYPE_MISMATCH;
> + }
>
> ip_set_lock(set);
> ret = set->variant->kadt(set, skb, par, IPSET_DEL, opt);
> ip_set_unlock(set);
> + ip_set_rcu_put(set);
>
> return ret;
> }
> @@ -882,6 +896,7 @@ ip_set_name_byindex(struct net *net, ip_set_id_t index, char *name)
> read_lock_bh(&ip_set_ref_lock);
> strscpy_pad(name, set->name, IPSET_MAXNAMELEN);
> read_unlock_bh(&ip_set_ref_lock);
> + ip_set_rcu_put(set);
> }
> EXPORT_SYMBOL_GPL(ip_set_name_byindex);
>
> @@ -1348,6 +1363,9 @@ static int ip_set_rename(struct sk_buff *skb, const struct nfnl_info *info,
> * protected by the ip_set_ref_lock. The kernel interfaces
> * do not hold the mutex but the pointer settings are atomic
> * so the ip_set_list always contains valid pointers to the sets.
> + * However after swapping, an userspace set destroy command could
> + * remove a set still processed by kernel side add/del/test.
> + * Therefore we need to wait for the ongoing operations to be finished.
> */
>
> static int ip_set_swap(struct sk_buff *skb, const struct nfnl_info *info,
> @@ -1397,6 +1415,9 @@ static int ip_set_swap(struct sk_buff *skb, const struct nfnl_info *info,
> ip_set(inst, to_id) = from;
> write_unlock_bh(&ip_set_ref_lock);
>
> + /* Make sure all readers of the old set pointers are completed. */
> + synchronize_rcu();
> +
> return 0;
> }
>
>
> Best regards,
> Jozsef
I have tested that swap and destroy cannot be executed simultaneously.
This patch works for me. Thank you again.
Best regards,
Linkui Xiao
next prev parent reply other threads:[~2023-10-17 12:49 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-10-16 13:52 [PATCH 1/2] netfilter: ipset: rename ref_netlink to ref_swapping xiaolinkui
2023-10-16 13:52 ` [PATCH 2/2] netfilter: ipset: fix race condition in ipset swap, destroy and test xiaolinkui
2023-10-16 19:52 ` Jozsef Kadlecsik
2023-10-17 6:29 ` Linkui Xiao
2023-10-17 9:26 ` Jozsef Kadlecsik
2023-10-17 12:49 ` Linkui Xiao [this message]
2023-10-16 19:43 ` [PATCH 1/2] netfilter: ipset: rename ref_netlink to ref_swapping Jozsef Kadlecsik
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=8586e54f-eb36-dad5-ca1e-8924bc5e23a8@gmail.com \
--to=xiaolinkui@gmail.com \
--cc=davem@davemloft.net \
--cc=edumazet@google.com \
--cc=fw@strlen.de \
--cc=justinstitt@google.com \
--cc=kadlec@netfilter.org \
--cc=kuba@kernel.org \
--cc=kuniyu@amazon.com \
--cc=netfilter-devel@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=pablo@netfilter.org \
--cc=xiaolinkui@kylinos.cn \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).