From: Baoquan He <bhe@redhat.com>
To: Catalin Marinas <catalin.marinas@arm.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>,
	Zhen Lei <thunder.leizhen@huawei.com>,
	Ard Biesheuvel <ardb@kernel.org>,
	Mark Rutland <mark.rutland@arm.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	x86@kernel.org, "H . Peter Anvin" <hpa@zytor.com>,
	Eric Biederman <ebiederm@xmission.com>,
	Rob Herring <robh+dt@kernel.org>,
	Frank Rowand <frowand.list@gmail.com>,
	devicetree@vger.kernel.org, Dave Young <dyoung@redhat.com>,
	Vivek Goyal <vgoyal@redhat.com>,
	kexec@lists.infradead.org, linux-kernel@vger.kernel.org,
	Will Deacon <will@kernel.org>,
	linux-arm-kernel@lists.infradead.org,
	Jonathan Corbet <corbet@lwn.net>,
	linux-doc@vger.kernel.org, Randy Dunlap <rdunlap@infradead.org>,
	Feng Zhou <zhoufeng.zf@bytedance.com>,
	Chen Zhou <dingguo.cz@antgroup.com>,
	John Donnelly <John.p.donnelly@oracle.com>,
	Dave Kleikamp <dave.kleikamp@oracle.com>,
	liushixin <liushixin2@huawei.com>
Subject: Re: [PATCH 5/5] arm64: kdump: Don't defer the reservation of crash high memory
Date: Wed, 22 Jun 2022 16:35:16 +0800
Message-ID: <YrLUREAoBMSZo7RR@MiWiFi-R3L-srv>
In-Reply-To: <YrIIJkhKWSuAqkCx@arm.com>

Hi Catalin,

On 06/21/22 at 07:04pm, Catalin Marinas wrote:
> On Tue, Jun 21, 2022 at 02:24:01PM +0800, Kefeng Wang wrote:
> > On 2022/6/21 13:33, Baoquan He wrote:
> > > On 06/13/22 at 04:09pm, Zhen Lei wrote:
> > > > If the crashkernel has both high memory above DMA zones and low memory
> > > > in DMA zones, kexec always loads the content such as Image and dtb to the
> > > > high memory instead of the low memory. This means that only high memory
> > > > requires write protection based on page-level mapping. The allocation of
> > > > high memory does not depend on the DMA boundary. So we can reserve the
> > > > high memory first even if the crashkernel reservation is deferred.
> > > > 
> > > > This means that the block mapping can still be performed on other kernel
> > > > linear address spaces, the TLB miss rate can be reduced and the system
> > > > performance will be improved.
> > > 
> > > Ugh, this looks a little ugly, honestly.
> > > 
> > > If it's certain that arm64 can't split large page mappings of the
> > > linear region, this patch is one way to optimize the linear mapping.
> > > Given that a kdump setup is necessary on arm64 servers, the boot
> > > speed is truly impacted heavily.
> > 
> > Is there some conclusion or discussion that arm64 can't split large page
> > mapping?
> > 
> > Could the crashkernel reservation (and KFENCE pool) be split dynamically?
> > 
> > I found Mark's reply to "arm64: remove page granularity limitation from
> > KFENCE"[1],
> > 
> >   "We also avoid live changes from block<->table mappings, since the
> >   architecture gives us very weak guarantees there and generally requires
> >   a Break-Before-Make sequence (though IIRC this was tightened up
> >   somewhat, so maybe going one way is supposed to work). Unless it's
> >   really necessary, I'd rather not split these block mappings while
> >   they're live."
> 
> The problem with splitting is that you can end up with two entries in
> the TLB for the same VA->PA mapping (e.g. one for a 4KB page and another
> for a 2MB block). In the lucky case, the CPU will trigger a TLB conflict
> abort (but can be worse like loss of coherency).

Thanks for this explanation. Is this a drawback of the arm64 design? x86
code does the same thing without issue. Is there a way to overcome this
on arm64, from either the hardware or the software side?

I once had an arm64 server with huge memory, and its bootup time differed
with and without the crashkernel setting. The more frequent TLB misses
and flushes also cost performance. It would be a real pity to have a very
powerful arm64 CPU and system capacity, yet be bottlenecked by this
drawback.
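
For a rough sense of scale, the back-of-the-envelope sketch below counts
how many leaf translation entries are needed to cover a region of the
linear map at different granularities. It is an illustration only, not a
measurement: it assumes a 4KB translation granule (so block entries cover
2MB and 1GB), and the 64GB region size and the helper names are made up
for the example.

```python
# Back-of-the-envelope sketch: leaf entries needed to map a region of the
# linear map. Assumes a 4KB translation granule; names are hypothetical.
KB = 1024
MB = 1024 * KB
GB = 1024 * MB

GRANULARITIES = {
    "4KB page": 4 * KB,
    "2MB block": 2 * MB,
    "1GB block": 1 * GB,
}


def entries_needed(region_bytes, mapping_bytes):
    """Number of leaf entries needed to cover region_bytes, rounded up."""
    return -(-region_bytes // mapping_bytes)


def summarize(region_bytes):
    """Map each granularity name to the entry count for region_bytes."""
    return {
        name: entries_needed(region_bytes, size)
        for name, size in GRANULARITIES.items()
    }


if __name__ == "__main__":
    # 64GB is an arbitrary example size, not a figure from this thread.
    for name, count in summarize(64 * GB).items():
        print(f"{name:>10}: {count} entries")
```

Under those assumptions, 64GB takes 16777216 page entries but only 32768
2MB block entries, which gives a feel for how much more TLB pressure a
fully page-mapped linear region can create.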

> 
> Prior to FEAT_BBM (added in ARMv8.4), such scenario was not allowed at
> all, the software would have to unmap the range, TLBI, remap. With
> FEAT_BBM (level 2), we can do this without tearing the mapping down but
> we still need to handle the potential TLB conflict abort. The handler
> only needs a TLBI but if it touches the memory range being changed it
> risks faulting again. With vmap stacks and the kernel image mapped in
> the vmalloc space, we have a small window where this could be handled
> but we probably can't go into the C part of the exception handling
> (tracing etc. may access a kmalloc'ed object for example).
> 
> Another option is to do a stop_machine() (if multi-processor at that
> point), disable the MMUs, modify the page tables, re-enable the MMU but
> it's also complicated.
> 
> -- 
> Catalin
> 
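
The difference between splitting a block in place and the unmap, TLBI,
remap sequence described above can be shown with a toy model. This is a
simulation under stated assumptions, not kernel code: `ToyMMU`,
`split_in_place()` and `split_with_bbm()` are hypothetical names, the
hardware walker is reduced to an explicit `walk()` call, and a conflict
is modelled as an exception even though real hardware may behave worse.
FEAT_BBM is not modelled.

```python
# Toy model of why a live block->page split can leave two TLB entries for
# one VA, and how break-before-make avoids it. All names are hypothetical.
PAGE = 4 * 1024
BLOCK = 2 * 1024 * 1024


class TLBConflict(Exception):
    """Raised when more than one cached translation covers a VA."""


class ToyMMU:
    def __init__(self):
        self.page_table = {}  # base VA -> (size, PA) leaf entries
        self.tlb = set()      # cached (base VA, size, PA) translations

    def map(self, va, size, pa):
        self.page_table[va] = (size, pa)

    def unmap(self, va):
        self.page_table.pop(va, None)

    def tlbi(self):
        """Invalidate every cached translation."""
        self.tlb.clear()

    def walk(self, va):
        """Model a (possibly speculative) table walk that fills the TLB."""
        for base, (size, pa) in self.page_table.items():
            if base <= va < base + size:
                entry = (base, size, pa)
                self.tlb.add(entry)
                return entry
        raise LookupError(f"translation fault at {va:#x}")

    def access(self, va):
        """Translate va, raising TLBConflict on multiple cached hits."""
        hits = [e for e in self.tlb if e[0] <= va < e[0] + e[1]]
        if len(hits) > 1:
            raise TLBConflict(f"{len(hits)} TLB entries cover {va:#x}")
        base, _size, pa = hits[0] if hits else self.walk(va)
        return pa + (va - base)


def split_in_place(mmu, va):
    """Replace a block entry with page entries, with no invalidation."""
    size, pa = mmu.page_table.pop(va)
    for off in range(0, size, PAGE):
        mmu.map(va + off, PAGE, pa + off)


def split_with_bbm(mmu, va):
    """Break (unmap), invalidate the TLB, then make (map the pages)."""
    size, pa = mmu.page_table[va]
    mmu.unmap(va)
    mmu.tlbi()
    for off in range(0, size, PAGE):
        mmu.map(va + off, PAGE, pa + off)
```

In this model, calling `walk()` before and after `split_in_place()`
leaves both the stale 2MB entry and a new 4KB entry cached, so the next
`access()` raises `TLBConflict`. With `split_with_bbm()` only the page
entry is cached, but any access between the unmap and the remap would
fault, which is the difficulty for a live linear map.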


_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Thread overview:
2022-06-13  8:09 [PATCH 0/5] arm64: kdump: Function supplement and performance optimization Zhen Lei
2022-06-13  8:09 ` [PATCH 1/5] arm64: kdump: Provide default size when crashkernel=Y,low is not specified Zhen Lei
2022-06-17  2:40   ` Baoquan He
2022-06-17  7:39     ` Leizhen (ThunderTown)
2022-06-17  8:26   ` Baoquan He
2022-06-13  8:09 ` [PATCH 2/5] arm64: kdump: Support crashkernel=X fall back to reserve region above DMA zones Zhen Lei
2022-06-17  4:16   ` Baoquan He
2022-06-13  8:09 ` [PATCH 3/5] arm64: kdump: Remove some redundant checks in map_mem() Zhen Lei
2022-06-20  7:42   ` Baoquan He
2022-06-13  8:09 ` [PATCH 4/5] arm64: kdump: Decide when to reserve crash memory in reserve_crashkernel() Zhen Lei
2022-06-13  8:09 ` [PATCH 5/5] arm64: kdump: Don't defer the reservation of crash high memory Zhen Lei
2022-06-21  5:33   ` Baoquan He
2022-06-21  6:24     ` Kefeng Wang
2022-06-21  9:27       ` Baoquan He
2022-06-21 18:04       ` Catalin Marinas
2022-06-22  8:35         ` Baoquan He [this message]
2022-06-23 14:07           ` Catalin Marinas
2022-06-27  2:52             ` Baoquan He
2022-06-27  9:17               ` Leizhen (ThunderTown)
2022-06-27 10:17                 ` Baoquan He
2022-06-27 11:11                   ` Leizhen (ThunderTown)
2022-06-22 12:03         ` Kefeng Wang
2022-06-23 10:27           ` Catalin Marinas
2022-06-23 14:23             ` Kefeng Wang
2022-06-21  7:56     ` Leizhen (ThunderTown)
2022-06-21  9:35       ` Baoquan He
