From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from cmx-mtlrgo001.bell.net (mta-mtl-005.bell.net [209.71.208.25])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 385DE85626
	for <linux-parisc@vger.kernel.org>; Wed,  8 May 2024 20:52:57 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.71.208.25
ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1715201580; cv=none; b=Tg/B6As2L6h92FXYD3FeJjexWsST90kB5AqpFA5BnqM+9um/BKZVlHJ7bIUhAa7lSHKZFxNT3BFejvYiJG6P83ZR0QABT2QmwVkCZlmvqp8FjzE6ERu5MLqZNsJu0s4H7DhoiDQBLUKu23qtEf8go6MT7QHr8IyAoRaaW2IjihQ=
ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1715201580; c=relaxed/simple;
	bh=4KyPPddNbXg7qORbv4hRnWRMUacRtNqgYPCYI0GjtsA=;
	h=Message-ID:Date:MIME-Version:Subject:To:Cc:References:From:
	 In-Reply-To:Content-Type; b=t5sFF5siE7+0MGtneQweQlulfrMbTCs91v0FIvdwCsFRX6Cs47Zn2cLEmPrgzs7GTLK94FGtY/4bYVyfnvN3dj1XX1jREHRwd039sU6a03OxPL8783C+c+TafkaUk4HxDzG8CX0hVuGx6QVNKwpOKXMUlPxE/CVtm8HnlE80HQo=
ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=bell.net; spf=pass smtp.mailfrom=bell.net; dkim=pass (2048-bit key) header.d=bell.net header.i=@bell.net header.b=d++JLv4D; arc=none smtp.client-ip=209.71.208.25
Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=bell.net
Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=bell.net
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=bell.net header.i=@bell.net header.b="d++JLv4D"
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bell.net; s=selector1; t=1715201578; 
        bh=UbaWxZPWppXUbHS5E21ujnunatYd5z41TLiE/eZqtW8=;
        h=Message-ID:Date:MIME-Version:Subject:To:References:From:In-Reply-To:Content-Type;
        b=d++JLv4DQaK5zDYjGUG2vWUiYblFKTy7JY93uUTuSvcVU+3+j/EYyt9fsMJNT2aKjiiBf/3Eln6xcD4pzoYNvjgrJRttYMllCzTCZLsn4Gah4b4WXmtk4z54CCzBTPHqEokp0iQEvuV57jx5SCxO4sT70L/VUQd/xwmEQaiaCpFq88RV2D862SXwbimDZZIYgzMJvY72lmHdkE1hjHzMQVvfZ1tcghLPJNml+KKtgbMVxUpaGACzIHk1zLzM2qRPgOkfnzfjD8ODruP8HgOnZsxYOIK6Y/oT48J9oPUry2EF+M6vA6OfpwelI2hK/zjYsONmyvfmAWXanfE5MeWPWw==
X-RG-SOPHOS: Clean
X-RG-VADE-SC: 0
X-RG-VADE: Clean
X-RG-Env-Sender: dave.anglin@bell.net
X-RG-Rigid: 663B1068001825FA
X-RazorGate-Vade: gggruggvucftvghtrhhoucdtuddrgedvledrvdeftddgudehvdcutefuodetggdotefrodftvfcurfhrohhfihhlvgemuceugffnnfdpqfgfvfenuceurghilhhouhhtmecufedtudenucesvcftvggtihhpihgvnhhtshculddquddttddmnecujfgurhepkfffgggfuffvvehfhfgjtgfgsehtkeertddtvdejnecuhfhrohhmpeflohhhnhcuffgrvhhiugcutehnghhlihhnuceouggrvhgvrdgrnhhglhhinhessggvlhhlrdhnvghtqeenucggtffrrghtthgvrhhnpeejleffffejhefggfeuheelgeefgeeuieegtdekffegudeuteffgeffjedukefgueenucfkphepudegvddruddviedrudekkedrvdehudenucevlhhushhtvghrufhiiigvpedtnecurfgrrhgrmhephhgvlhhopegludelvddrudeikedrvddrgeelngdpihhnvghtpedugedvrdduvdeirddukeekrddvhedupdhmrghilhhfrhhomhepuggrvhgvrdgrnhhglhhinhessggvlhhlrdhnvghtpdhnsggprhgtphhtthhopeeipdhrtghpthhtohepgghiughrrgdrlfhonhgrshesshgviihnrghmrdgtiidprhgtphhtthhopegurghvvgdrrghnghhlihhnsegsvghllhdrnhgvthdprhgtphhtthhopegurghvvgesphgrrhhishgtqdhlihhnuhigrdhorhhgpdhrtghpthhtohepuggvlhhlvghrsehgmhigrdguvgdprhgtphhtthhopehlihhnuhigqdhprghrihhstgesvhhgvghrrdhkvghrnhgvlhdrohhrghdprhgtphhtthhopehmrghtohhrohgp
	mhgrihhlihhnghhlihhsthgpkhgvrhhnvghlsehmrghtohhrohdrthhk
X-RazorGate-Vade-Verdict: clean 0
X-RazorGate-Vade-Classification: clean
Received: from [192.168.2.49] (142.126.188.251) by cmx-mtlrgo001.bell.net (authenticated as dave.anglin@bell.net)
        id 663B1068001825FA; Wed, 8 May 2024 16:52:10 -0400
Message-ID: <e88cebf4-bec6-4247-93ae-39eff59cfc8e@bell.net>
Date: Wed, 8 May 2024 16:52:11 -0400
Precedence: bulk
X-Mailing-List: linux-parisc@vger.kernel.org
List-Id: <linux-parisc.vger.kernel.org>
List-Subscribe: <mailto:linux-parisc+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-parisc+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH] parisc: Try to fix random segmentation faults in package
 builds
To: matoro <matoro_mailinglist_kernel@matoro.tk>
Cc: Vidra.Jonas@seznam.cz, linux-parisc@vger.kernel.org,
 John David Anglin <dave@parisc-linux.org>, Helge Deller <deller@gmx.de>
References: <Zje6ywzNAltbG3R2@mx3210.localdomain>
 <C4u.NueN.39ikIzqu}iW.1cEpt7@seznam.cz>
 <91563ff7-349b-4815-bcfe-99f8f34b0b16@bell.net>
 <34fdf2250fe166372a15d74d28adc8d2@matoro.tk>
Content-Language: en-US
From: John David Anglin <dave.anglin@bell.net>
Autocrypt: addr=dave.anglin@bell.net; keydata=
 xsFNBFJfN1MBEACxBrfJ+5RdCO+UQOUARQLSsnVewkvmNlJRgykqJkkI5BjO2hhScE+MHoTK
 MoAeKwoLfBwltwoohH5RKxDSAIWajTY5BtkJBT23y0hm37fN2JXHGS4PwwgHTSz63cu5N1MK
 n8DZ3xbXFmqKtyaWRwdA40dy11UfI4xzX/qWR3llW5lp6ERdsDDGHm5u/xwXdjrAilPDk/av
 d9WmA4s7TvM/DY3/GCJyNp0aJPcLShU2+1JgBxC6NO6oImVwW07Ico89ETcyaQtlXuGeXYTK
 UoKdEHQsRf669vwcV5XbmQ6qhur7QYTlOOIdDT+8zmBSlqBLLe09soATDciJnyyXDO1Nf/hZ
 gcI3lFX86i8Fm7lQvp2oM5tLsODZUTWVT1qAFkHCOJknVwqRZ8MfOvaTE7L9hzQ9QKgIKrSE
 FRgf+gs1t1vQMRHkIxVWb730C0TGiMGNn2oRUV5O5QEdb/tnH0Te1l+hX540adKZ8/CWzzW9
 vcx+qD9IWLRyZMsM9JnmAIvYv06+YIcdpbRYOngWPd2BqvktzIs9mC4n9oU6WmUhBIaGOGnt
 t/49bTRtJznqm/lgqxtE2NliJN79dbZJuJWe5HkjVa7mP4xtsG59Rh2hat9ByUfROOfoZ0dS
 sVHF/N6NLWcf44trK9HZdT/wUeftEWtMV9WqxIwsA4cgSHFR2QARAQABzTdKb2huIERhdmlk
 IEFuZ2xpbiAoRGViaWFuIFBvcnRzKSA8ZGF2ZS5hbmdsaW5AYmVsbC5uZXQ+wsF3BBMBCAAh
 BQJSXzdTAhsDBQsJCAcDBRUKCQgLBRYCAwEAAh4BAheAAAoJEF2/za5fGU3xs/4P/15sNizR
 ukZLNYoeGAd6keRtNcEcVGEpRgzc/WYlXCRTEjRknMvmCu9z13z8qB9Y9N4JrPdp+NQj5HEs
 ODPI+1w1Mjj9R2VZ1v7suFwhjxMTUQUjCsgna1H+zW/UFsrL5ERX2G3aUKlVdYmSWapeGeFL
 xSMPzawPEDsbWzBzYLSHUOZexMAxoJYWnpN9JceEcGvK1SU2AaGkhomFoPfEf7Ql1u3Pgzie
 ClWEr2QHl+Ku1xW0qx5OLKHxntaQiu30wKHBcsF0Zx2uVGYoINJl/syazfZyKTdbmJnEYyNa
 Bdbn7B8jIkVCShLOWJ8AQGX/XiOoL/oE9pSZ60+MBO9qd18TGYByj0X2PvH+OyQGul5zYM7Q
 7lT97PEzh8xnib49zJVVrKDdJds/rxFwkcHdeppRkxJH0+4T0GnU2IZsEkvpRQNJAEDmEE8n
 uRfssr7RudZQQwaBugUGaoouVyFxzCxdpSYL6zWHA51VojvJYEBQDuFNlUCqet9LtNlLKx2z
 CAKmUPTaDwPcS3uOywOW7WZrAGva1kz9lzxZ+GAwgh38HAFqQT8DQvW8jnBBG4m4q7lbaum3
 znERv7kcfKWoWS7fzxLNTIitrbpYA3E7Zl9D2pDV3v55ZQcO/M35K9teRo6glrtFDU/HXM+r
 ABbh8u9UnADbPmJr9nb7J0tZUSS/zsFNBFJfN1MBEADBzhVn4XyGkPAaFbLPcMUfwcIgvvPF
 UsLi9Q53H/F00cf7BkMY40gLEXvsvdUjAFyfas6z89gzVoTUx3HXkJTIDTiPuUc1TOdUpGYP
 hlftgU+UqW5O8MMvKM8gx5qn64DU0UFcS+7/CQrKOJmzktr/72g98nVznf5VGysa44cgYeoA
 v1HuEoqGO9taA3Io1KcGrzr9cAZtlpwj/tcUJlc6H5mqPHn2EdWYmJeGvNnFtxd0qJDmxp5e
 YVe4HFNjUwsb3oJekIUopDksAP41RRV0FM/2XaPatkNlTZR2krIVq2YNr0dMU8MbMPxGHnI9
 b0GUI+T/EZYeFsbx3eRqjv1rnNg2A6kPRQpn8dN3BKhTR5CA7E/cs+4kTmV76aHpW8m/NmTc
 t7KNrkMKfi+luhU2P/sKh7Xqfbcs7txOWB2V4/sbco00PPxWr20JCA5hYidaKGyQxuXdPUlQ
 Qja4WJFnAtBhh3Oajgwhbvd6S79tz1acjNXZ89b8IN7yDm9sQ+4LhWoUQhB5EEUUUVQTrzYS
 yTGN1YTTO5IUU5UJHb5WGMnSPLLArASctOE01/FYnnOGeU+GFIeQp91p+Jhd07hUr6KWYeJY
 OgEmu+K8SyjfggCWdo8aGy0H3Yr0YzaHeK2HrfC3eZcUuo+yDW3tnrNwM1rd1i3F3+zJK18q
 GnBxEQARAQABwsFfBBgBCAAJBQJSXzdTAhsMAAoJEF2/za5fGU3xNDQP/ikzh1NK/UBrWtpN
 yXLbype4k5/zyQd9FIBxAOYEOogfKdkp+Yc66qNf36gO6vsokxsDXU9me1n8tFoB/DCdzKbQ
 /RjKQRMNNR4fT2Q9XV6GZYSL/P2A1wzDW06tEI+u+1dV40ciQULQ3ZH4idBW3LdN+nloQf/C
 qoYkOf4WoLyhSzW7xdNPZqiJCAdcz9djN79FOz8US+waBCJrL6q5dFSvvsYj6PoPJkCgXhiJ
 hI91/ERMuK9oA1oaBxCvuObBPiFlBDNXZCwmUk6qzLDjfZ3wdiZCxc5g7d2e2taBZw/MsKFc
 k+m6bN5+Hi1lkmZEP0L4MD6zcPuOjHmYYzX4XfQ61lQ8c4ztXp5cKkrvaMuN/bD57HJ6Y73Q
 Y+wVxs9x7srl4iRnbulCeiSOAqHmwBAoWaolthqe7EYL4d2+CjPCcfIuK7ezsEm8c3o3EqC4
 /UpL1nTi0rknRTGc0VmPef+IqQUj33GGj5JRzVJZPnYyCx8sCb35Lhs6X8ggpsafUkuKrH76
 XV2KRzaE359RgbM3pNEViXp3NclPYmeu+XI8Ls/y6tSq5e/o/egktdyJj+xvAj9ZS18b10Jp
 e67qK8wZC/+N7LGON05VcLrdZ+FXuEEojJWbabF6rJGN5X/UlH5OowVFEMhD9s31tciAvBwy
 T70V9SSrl2hiw38vRzsl
In-Reply-To: <34fdf2250fe166372a15d74d28adc8d2@matoro.tk>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit

On 2024-05-08 3:18 p.m., matoro wrote:
> On 2024-05-08 11:23, John David Anglin wrote:
>> On 2024-05-08 4:54 a.m., Vidra.Jonas@seznam.cz wrote:
>>> ---------- Original e-mail ----------
>>> From: John David Anglin
>>> To: linux-parisc@vger.kernel.org
>>> CC: Helge Deller
>>> Date: 5. 5. 2024 19:07:17
>>> Subject: [PATCH] parisc: Try to fix random segmentation faults in package builds
>>>
>>>> The majority of random segmentation faults that I have looked at
>>>> appear to be memory corruption in memory allocated using mmap and
>>>> malloc. This got me thinking that there might be issues with the
>>>> parisc implementation of flush_anon_page.
>>>>
>>>> [...]
>>>>
>>>> Lightly tested on rp3440 and c8000.
>>> Hello,
>>>
>>> thank you very much for working on the issue and for the patch! I tested
>>> it on my C8000 with the 6.8.9 kernel with Gentoo distribution patches.
>> Thanks for testing.  Trying to fix these faults is largely guess work.
>>
>> In my opinion, the 6.1.x branch is the most stable branch on parisc.  6.6.x and later
>> branches have folio changes and haven't had very much testing in build environments.
>> I did run 6.8.7 and 6.8.8 on rp3440 for some time but I have gone back to a slightly
>> modified 6.1.90.
>>>
>>> My machine is affected heavily by the segfaults – with some kernel
>>> configurations, I get several per hour when compiling Gentoo packages
>> That's more than normal although number seems to depend on package.
>> At this rate, you wouldn't be able to build gcc.
>>> on all four cores. This patch doesn't fix them, though. On the patched
>> Okay.  There are likely multiple problems.  The problem I was trying to address is null
>> objects in the hash tables used by ld and as.  The symptom is usually a null pointer
>> dereference after pointer has been loaded from null object. These occur in multiple
>> places in libbfd during hash table traversal.  Typically, a couple would occur in a gcc
>> testsuite run.  _objalloc_alloc uses malloc.  One can see the faults on the console and
>> in the gcc testsuite log.
>>
>> How these null objects are generated is not known.  It must be a kernel issue because
>> they don't occur with qemu.  I think the frequency of these faults is reduced with the
>> patch.  I suspect the objects are zeroed after they are initialized.  In some cases, ld can
>> successfully link by ignoring null objects.
>>
>> The next time I see a fault caused by a null object, I think it would be useful to see if
>> we have a full null page.  This might indicate a swap problem.
>>
>> random faults also occur during gcc compilations.  gcc uses mmap to allocate memory.
>>
>>> kernel, it happened after ~8h of uptime during installation of the
>>> perl-core/Test-Simple package. I got no error output from the running
>>> program, but an HPMC was logged to the serial console:
>>>
>>> [30007.186309] mm/pgtable-generic.c:54: bad pmd 539b0030.
>>> <Cpu3> 78000c6203e00000  a0e008c01100b009 CC_PAT_ENCODED_FIELD_WARNING
>>> <Cpu0> e800009800e00000  0000000041093be4 CC_ERR_CHECK_HPMC
>>> <Cpu1> e800009801e00000  00000000404ce130 CC_ERR_CHECK_HPMC
>>> <Cpu3> 76000c6803e00000  0000000000000520 CC_PAT_DATA_FIELD_WARNING
>>> <Cpu0> 37000f7300e00000  84000[30007.188321] Backtrace:
>>> [30007.188321]  [<00000000404eef9c>] pte_offset_map_nolock+0xe8/0x150
>>> [30007.188321]  [<00000000404d6784>] __handle_mm_fault+0x138/0x17e8
>>> [30007.188321]  [<00000000404d8004>] handle_mm_fault+0x1d0/0x3b0
>>> [30007.188321]  [<00000000401e4c98>] do_page_fault+0x1e4/0x8a0
>>> [30007.188321]  [<00000000401e95c0>] handle_interruption+0x330/0xe60
>>> [30007.188321]  [<0000000040295b44>] schedule_tail+0x78/0xe8
>>> [30007.188321]  [<00000000401e0f6c>] finish_child_return+0x0/0x58
>>>
>>> A longer excerpt of the logs is attached. The error happened at boot
>>> time 30007, the preceding unaligned accesses seem to be unrelated.
>> I doubt this HPMC is related to the patch.  In the above, the pmd table appears to have
>> become corrupted.
>>>
>>> The patch didn't apply cleanly, but all hunks succeeded with some
>>> offsets and fuzz. This may also be a part of it – I didn't check the
>>> code for merge conflicts manually.
>> Sorry, the patch was generated against 6.1.90.  This is likely the cause of the offsets
>> and fuzz.
>>>
>>> If you want me to provide you with more logs (such as the HPMC dumps)
>>> or run some experiments, let me know.
>>>
>>>
>>> Some speculation about the cause of the errors follows:
>>>
>>> I don't think it's a hardware error, as HP-UX 11i v1 works flawlessly on
>>> the same machine. The errors seem to be more frequent with a heavy IO
>>> load, so it might be system-bus or PCI-bus-related. Using X11 causes
>>> lockups rather quickly, but that could be caused by unrelated errors in
>>> the graphics subsystem and/or the Radeon drivers.
>> I am not using X11 on my c8000.  I have frame buffer support on. Radeon acceleration
>> is broken on parisc.
>>
>> Maybe there are more problems with debian kernels because of its use of X11.
>>>
>>> Limiting the machine to a single socket (2 cores) by disabling the other
>>> socket in firmware, or even booting on a single core using a maxcpus=1
>>> kernel cmdline option, decreases the error frequency, but doesn't
>>> prevent them completely, at least on an (unpatched) 6.1 kernel. So it's
>>> probably not an SMP bug. If it's related to cache coherency, it's
>>> coherency between the CPUs and bus IO.
>>>
>>> The errors typically manifest as a null page access to a very low
>>> address, so probably a null pointer dereference. I think the kernel
>>> accidentally maps a zeroed page in place of one that the program was
>>> using previously, making it load (and subsequently dereference) a null
>>> pointer instead of a valid one. There are two problems with this theory,
>>> though:
>>> 1. It would mean the program could also load zeroed /data/ instead of a
>>> zeroed /pointer/, causing data corruption. I never conclusively observed
>>> this, although I am getting GCC ICEs from time to time, which could
>>> be explained by data corruption.
>> GCC catches page faults and no core dump is generated when it ICEs. So, it's harder
>> to debug memory issues in gcc.
>>
>> I have observed zeroed data multiple times in ld faults.
>>> 2. The segfault is sometimes preceded by an unaligned access, which I
>>> believe is also caused by a corrupted machine state rather than by a
>>> coding error in the program – sometimes a bunch of unaligned accesses
>>> show up in the logs just prior to a segfault / lockup, even from
>>> unrelated programs such as random bash processes. Sometimes the machine
>>> keeps working afterwards (although I typically reboot it immediately
>>> to limit the consequences of potential kernel data structure damage),
>>> sometimes it HPMCs or LPMCs. This is difficult to explain by just a wild
>>> zeroed page appearance. But this typically happens when running X11, so
>>> again, it might be caused by another bug, such as the GPU randomly
>>> writing to memory via misconfigured DMA.
>> There was a bug in the unaligned handler for double word instructions (ldd) that was
>> recently fixed.  ldd/std are not used in userspace, so this problem didn't affect it.
>>
>> Kernel unaligned faults are not logged, so problems could occur internal to the kernel
>> and not be noticed till disaster.  Still, it seems unlikely that an unaligned fault would
>> corrupt more than a single word.
>>
>> We have observed that the faults appear SMP and memory size related.  A rp4440 with
>> 6 CPUs and 4 GB RAM faulted a lot.  It's mostly a PA8800/PA8900 issue.
>>
>> It's months since I had a HPMC or LPMC on rp3440 and c8000. Stalls still happen but they
>> are rare.
>>
>> Dave
>
> Hi, I also tested this patch on an rp3440 with PA8900. Unfortunately it seems to have exacerbated an existing issue which takes the whole 
> machine down.  Occasionally I would get a message:
>
> [ 7497.061892] Kernel panic - not syncing: Kernel Fault
>
> with no accompanying stack trace and then the BMC would restart the whole machine automatically.  These were infrequent enough that the 
> segfaults were the bigger problem, but after applying this patch on top of 6.8, this changed the dynamic.  It seems to occur during builds 
> with varying I/O loads.  For example, I was able to build gcc fine, with no segfaults, but I was unable to build perl, a much smaller build, 
> without crashing the machine.  I did not observe any segfaults over the day or 2 I ran this patch, but that's not an unheard-of stretch of 
> time even without it, and I am being forced to revert because of the panics.
Looks like there is a problem with 6.8.  I'll do some testing with it.

I haven't had any panics with 6.1 on rp3440 or c8000.

Trying a debian perl-5.38.2 build.

Dave

-- 
John David Anglin  dave.anglin@bell.net