From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net [23.128.96.19]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id EE043200AD for ; Tue, 30 May 2023 19:46:38 +0000 (UTC) Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0FF2EC9; Tue, 30 May 2023 12:46:37 -0700 (PDT) From: Thomas Gleixner DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1685475995; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=0lqf16k56ICJR3f6afGNeInQg8eM22KZlAGnGhYDryg=; b=o6vZpXSfbD9F9sDMrCjUtZkQHJEhNpIRrQCrBSW6Iq8s4kwZS4LB+IMH34MON/awYWSJ75 rCNYEIhS8kMeB4cwK4YIdVcCzFWjWLROOsqxvuxzt4T7LuIasgW6UYg8IZpgKR4IWWSnkI cfaeFzvQ5DyTL+P77UgzfLrLaylkJ4pX9B6655stTFkCD+ZTlYf06soGLEn06X4P00KxmN 22ZiLuI/hWHoMIAk+CdWCKXf9rHuUmoJIgswkus2zWQZUP8VU9baKSymZLRdWEMpbAbUKC t7xcWC6UN0ANqlpeFMjLelUMGt4KLqm1uV/6z6WrPO0Z7CO+adyLcevcj1kQ7Q== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1685475995; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=0lqf16k56ICJR3f6afGNeInQg8eM22KZlAGnGhYDryg=; b=1ZLGV/9rlBhLnHdhKrGpeHGlteJDyZ0T1HJ2kfH59OnroTZzRc3FLQKliCW2RLbTxBkvvx xqFkAmEwC5LDrpAA== To: Chuck Lever III Cc: Eli Cohen , Leon Romanovsky , Saeed Mahameed , linux-rdma , "open list:NETWORKING [GENERAL]" Subject: Re: system hang on start-up (mlx5?) In-Reply-To: <0C0389AD-5DB9-42A8-993C-2C9DEDC958AC@oracle.com> References: <91176545-61D2-44BF-B736-513B78728DC7@oracle.com> <20230504072953.GP525452@unreal> <46EB453C-3CEB-43E8-BEE5-CD788162A3C9@oracle.com> <875y8altrq.ffs@tglx> <0C0389AD-5DB9-42A8-993C-2C9DEDC958AC@oracle.com> Date: Tue, 30 May 2023 21:46:34 +0200 Message-ID: <87o7m1iov9.ffs@tglx> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain X-Spam-Status: No, score=-4.4 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_MED,SPF_HELO_NONE, SPF_PASS,T_SCC_BODY_TEXT_LINE,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Chuck! On Tue, May 30 2023 at 13:09, Chuck Lever III wrote: >> On May 29, 2023, at 5:20 PM, Thomas Gleixner wrote: >> But if you look at the address: 0xffffffffb9ef3f80 >> >> That one is bogus: >> >> managed_map=ffff9a36efcf0f80 >> managed_map=ffff9a36efd30f80 >> managed_map=ffff9a3aefc30f80 >> managed_map=ffff9a3aefc70f80 >> managed_map=ffff9a3aefd30f80 >> managed_map=ffff9a3aefd70f80 >> managed_map=ffffffffb9ef3f80 >> >> Can you spot the fail? >> >> The first six are in the direct map and the last one is in module map, >> which makes no sense at all. > > Indeed. The reason for that is that the affinity mask has bits > set for CPU IDs that are not present on my system. Which I don't buy, but even if so then this should not make for_each_cpu() go south. See below. > After bbac70c74183 ("net/mlx5: Use newer affinity descriptor") > that mask is set up like this: > > struct mlx5_irq *mlx5_ctrl_irq_request(struct mlx5_core_dev *dev) > { > struct mlx5_irq_pool *pool = ctrl_irq_pool_get(dev); > - cpumask_var_t req_mask; > + struct irq_affinity_desc af_desc; That's daft. With NR_CPUS=8192 this is a whopping 1KB on stack... > struct mlx5_irq *irq; > - if (!zalloc_cpumask_var(&req_mask, GFP_KERNEL)) > - return ERR_PTR(-ENOMEM); > - cpumask_copy(req_mask, cpu_online_mask); > + cpumask_copy(&af_desc.mask, cpu_online_mask); > + af_desc.is_managed = false; > > Which normally works as you would expect. But for some historical > reason, I have CONFIG_NR_CPUS=32 on my system, and the > cpumask_copy() misbehaves. > > If I correct mlx5_ctrl_irq_request() to clear @af_desc before the > copy, this crash goes away. But mlx5_core crashes during a later > part of its init, in cpu_rmap_update(). cpu_rmap_update() does > exactly the same thing (for_each_cpu() on an affinity mask created > by copying), and crashes in a very similar fashion. > > If I set CONFIG_NR_CPUS to a larger value, like 512, the problem > vanishes entirely, and "modprobe mlx5_core" works as expected. > > Thus I think the problem is with cpumask_copy() or for_each_cpu() > when NR_CPUS is a small value (the default is 8192). I don't buy any of this. Here is why: cpumask_copy(d, s) bitmap_copy(d, s, nbits = 32) len = BITS_TO_LONGS(nbits) * sizeof(unsigned long); So it copies as many longs as required to cover nbits, i.e. it copies any clobbered bits beyond nbits too. While that looks odd at the first glance, that's just an optimization which is harmless. for_each_cpu() finds the next set bit in a mask and breaks the loop once bitnr >= small_cpumask_bits, which is nr_cpu_ids and should be 32 too. I just booted a kernel with NR_CPUS=32: [ 0.152988] smpboot: 56 Processors exceeds NR_CPUS limit of 32 [ 0.153606] smpboot: Allowing 32 CPUs, 0 hotplug CPUs ... [ 0.173854] setup_percpu: NR_CPUS:32 nr_cpumask_bits:32 nr_cpu_ids:32 nr_node_ids:1 and added a function which does: struct cpumask ma, mb; int cpu; memset(&ma, 0xaa, sizeof(ma); cpumask_copy(&mb, &ma); pr_info("MASKBITS: %016lx\n", mb.bits[0]); pr_info("CPUs:"); for_each_cpu(cpu, &mb) pr_cont(" %d", cpu); pr_cont("\n"); [ 2.165606] smp: MASKBITS: 0xaaaaaaaaaaaaaaaa [ 2.166574] smp: CPUs: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 and the same exercise with a 0x55 filled source bitmap. [ 2.167595] smp: MASKBITS: 0x5555555555555555 [ 2.168568] smp: CPUs: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 So while cpumask_copy copied beyond NR_CPUs bits, for_each_cpu() does the right thing simple because of this: nr_cpu_ids is 32, right? for_each_cpu(cpu, mask) for_each_set_bit(bit = cpu, addr = &mask, size = nr_cpu_ids) for ((bit) = 0; (bit) = find_next_bit((addr), (size), (bit)), (bit) < (size); (bit)++) So if find_next_bit() returns a bit after bit 31 the condition (bit) < (size) will terminate the loop, right? Also in the case of that driver the copy is _NOT_ copying set bits beyond bit 31 simply because the source mask is cpu_online_mask which definitely does not have a bit set which is greater than 31. As the copy copies longs the resulting mask in af_desc.mask cannot have any bit set past bit 31 either. If the above is not correct, then there is a bigger problem than that MLX5 driver crashing. So this: > If I correct mlx5_ctrl_irq_request() to clear @af_desc before the > copy, this crash goes away. does not make any sense to me. Can you please add after the cpumask_copy() in that mlx5 code: pr_info("ONLINEBITS: %016lx\n", cpu_online_mask->bits[0]); pr_info("MASKBITS: %016lx\n", af_desc.mask.bits[0]); Please print also in irq_matrix_reserve_managed(): - @mask->bits[0] - nr_cpu_ids - the CPU numbers inside the for_each_cpu() loop Thanks, tglx