* RE: [PATCH v1] hv_sock: Use consistent types for UUIDs
From: Dexuan Cui @ 2019-07-23 16:57 UTC (permalink / raw)
To: Andy Shevchenko, Haiyang Zhang, KY Srinivasan, Stephen Hemminger,
Sasha Levin, linux-hyperv@vger.kernel.org, David S. Miller,
netdev@vger.kernel.org
In-Reply-To: <20190723163943.65991-1-andriy.shevchenko@linux.intel.com>
> From: linux-hyperv-owner@vger.kernel.org
> <linux-hyperv-owner@vger.kernel.org> On Behalf Of Andy Shevchenko
> Sent: Tuesday, July 23, 2019 9:40 AM
>
> The rest of Hyper-V code is using new types for UUID handling.
> Convert hv_sock as well.
>
> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Looks good to me. Thanks, Andy!
Thanks,
-- Dexuan
^ permalink raw reply
* [PATCH v1] hv_sock: Use consistent types for UUIDs
From: Andy Shevchenko @ 2019-07-23 16:39 UTC (permalink / raw)
To: Haiyang Zhang, K. Y. Srinivasan, Stephen Hemminger, Sasha Levin,
linux-hyperv, David S. Miller, netdev
Cc: Andy Shevchenko
The rest of Hyper-V code is using new types for UUID handling.
Convert hv_sock as well.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
---
net/vmw_vsock/hyperv_transport.c | 24 ++++++++++++------------
1 file changed, 12 insertions(+), 12 deletions(-)
diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
index f2084e3f7aa4..2a1719c0f8d2 100644
--- a/net/vmw_vsock/hyperv_transport.c
+++ b/net/vmw_vsock/hyperv_transport.c
@@ -77,11 +77,11 @@ struct hvs_send_buf {
VMBUS_PKT_TRAILER_SIZE)
union hvs_service_id {
- uuid_le srv_id;
+ guid_t srv_id;
struct {
unsigned int svm_port;
- unsigned char b[sizeof(uuid_le) - sizeof(unsigned int)];
+ unsigned char b[sizeof(guid_t) - sizeof(unsigned int)];
};
};
@@ -89,8 +89,8 @@ union hvs_service_id {
struct hvsock {
struct vsock_sock *vsk;
- uuid_le vm_srv_id;
- uuid_le host_srv_id;
+ guid_t vm_srv_id;
+ guid_t host_srv_id;
struct vmbus_channel *chan;
struct vmpacket_descriptor *recv_desc;
@@ -159,21 +159,21 @@ struct hvsock {
#define MIN_HOST_EPHEMERAL_PORT (MAX_HOST_LISTEN_PORT + 1)
/* 00000000-facb-11e6-bd58-64006a7986d3 */
-static const uuid_le srv_id_template =
- UUID_LE(0x00000000, 0xfacb, 0x11e6, 0xbd, 0x58,
- 0x64, 0x00, 0x6a, 0x79, 0x86, 0xd3);
+static const guid_t srv_id_template =
+ GUID_INIT(0x00000000, 0xfacb, 0x11e6, 0xbd, 0x58,
+ 0x64, 0x00, 0x6a, 0x79, 0x86, 0xd3);
-static bool is_valid_srv_id(const uuid_le *id)
+static bool is_valid_srv_id(const guid_t *id)
{
- return !memcmp(&id->b[4], &srv_id_template.b[4], sizeof(uuid_le) - 4);
+ return !memcmp(&id->b[4], &srv_id_template.b[4], sizeof(guid_t) - 4);
}
-static unsigned int get_port_by_srv_id(const uuid_le *svr_id)
+static unsigned int get_port_by_srv_id(const guid_t *svr_id)
{
return *((unsigned int *)svr_id);
}
-static void hvs_addr_init(struct sockaddr_vm *addr, const uuid_le *svr_id)
+static void hvs_addr_init(struct sockaddr_vm *addr, const guid_t *svr_id)
{
unsigned int port = get_port_by_srv_id(svr_id);
@@ -316,7 +316,7 @@ static void hvs_close_connection(struct vmbus_channel *chan)
static void hvs_open_connection(struct vmbus_channel *chan)
{
- uuid_le *if_instance, *if_type;
+ guid_t *if_instance, *if_type;
unsigned char conn_from_host;
struct sockaddr_vm addr;
--
2.20.1
^ permalink raw reply related
* Re: [PATCH v3 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently
From: Peter Zijlstra @ 2019-07-22 19:32 UTC (permalink / raw)
To: Nadav Amit
Cc: Andy Lutomirski, Dave Hansen, the arch/x86 maintainers, LKML,
Thomas Gleixner, Ingo Molnar, K. Y. Srinivasan, Haiyang Zhang,
Stephen Hemminger, Sasha Levin, Borislav Petkov, Juergen Gross,
Paolo Bonzini, Boris Ostrovsky, linux-hyperv@vger.kernel.org,
virtualization@lists.linux-foundation.org, kvm@vger.kernel.org,
xen-devel@lists.xenproject.org
In-Reply-To: <58DA0841-33C2-4D16-A671-08064A15001C@vmware.com>
On Mon, Jul 22, 2019 at 07:27:09PM +0000, Nadav Amit wrote:
> > On Jul 22, 2019, at 12:14 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > But then we can still do something like the below, which doesn't change
> > things and still gets rid of that dual function crud, simplifying
> > smp_call_function_many again.
> Nice! I will add it on top, if you don’t mind (instead squashing it).
Not at all.
> The original decision to have local/remote functions was mostly to provide
> the generality.
>
> I would change the last argument of __smp_call_function_many() from “wait”
> to “flags” that would indicate whether to run the function locally, since I
> don’t want to change the semantics of smp_call_function_many() and decide
> whether to run the function locally purely based on the mask. Let me know if
> you disagree.
Agreed.
^ permalink raw reply
* Re: [PATCH v3 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently
From: Nadav Amit @ 2019-07-22 19:27 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Andy Lutomirski, Dave Hansen, the arch/x86 maintainers, LKML,
Thomas Gleixner, Ingo Molnar, K. Y. Srinivasan, Haiyang Zhang,
Stephen Hemminger, Sasha Levin, Borislav Petkov, Juergen Gross,
Paolo Bonzini, Boris Ostrovsky, linux-hyperv@vger.kernel.org,
virtualization@lists.linux-foundation.org, kvm@vger.kernel.org,
xen-devel@lists.xenproject.org
In-Reply-To: <20190722191433.GD6698@worktop.programming.kicks-ass.net>
> On Jul 22, 2019, at 12:14 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Jul 18, 2019 at 05:58:32PM -0700, Nadav Amit wrote:
>> @@ -709,8 +716,9 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
>> * doing a speculative memory access.
>> */
>> if (info->freed_tables) {
>> - smp_call_function_many(cpumask, flush_tlb_func_remote,
>> - (void *)info, 1);
>> + __smp_call_function_many(cpumask, flush_tlb_func_remote,
>> + flush_tlb_func_local,
>> + (void *)info, 1);
>> } else {
>> /*
>> * Although we could have used on_each_cpu_cond_mask(),
>> @@ -737,7 +745,8 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
>> if (tlb_is_not_lazy(cpu))
>> __cpumask_set_cpu(cpu, cond_cpumask);
>> }
>> - smp_call_function_many(cond_cpumask, flush_tlb_func_remote,
>> + __smp_call_function_many(cond_cpumask, flush_tlb_func_remote,
>> + flush_tlb_func_local,
>> (void *)info, 1);
>> }
>> }
>
> Do we really need that _local/_remote distinction? ISTR you had a patch
> that frobbed flush_tlb_info into the csd and that gave space
> constraints, but I'm not seeing that here (probably a wise, get stuff
> merged etc..).
>
> struct __call_single_data {
> struct llist_node llist; /* 0 8 */
> smp_call_func_t func; /* 8 8 */
> void * info; /* 16 8 */
> unsigned int flags; /* 24 4 */
>
> /* size: 32, cachelines: 1, members: 4 */
> /* padding: 4 */
> /* last cacheline: 32 bytes */
> };
>
> struct flush_tlb_info {
> struct mm_struct * mm; /* 0 8 */
> long unsigned int start; /* 8 8 */
> long unsigned int end; /* 16 8 */
> u64 new_tlb_gen; /* 24 8 */
> unsigned int stride_shift; /* 32 4 */
> bool freed_tables; /* 36 1 */
>
> /* size: 40, cachelines: 1, members: 6 */
> /* padding: 3 */
> /* last cacheline: 40 bytes */
> };
>
> IIRC what you did was make void *__call_single_data::info the last
> member and a union until the full cacheline size (64). Given the above
> that would get us 24 bytes for csd, leaving us 40 for that
> flush_tlb_info.
>
> But then we can still do something like the below, which doesn't change
> things and still gets rid of that dual function crud, simplifying
> smp_call_function_many again.
>
> Index: linux-2.6/arch/x86/include/asm/tlbflush.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/tlbflush.h
> +++ linux-2.6/arch/x86/include/asm/tlbflush.h
> @@ -546,8 +546,9 @@ struct flush_tlb_info {
> unsigned long start;
> unsigned long end;
> u64 new_tlb_gen;
> - unsigned int stride_shift;
> - bool freed_tables;
> + unsigned int cpu;
> + unsigned short stride_shift;
> + unsigned char freed_tables;
> };
>
> #define local_flush_tlb() __flush_tlb()
> Index: linux-2.6/arch/x86/mm/tlb.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/mm/tlb.c
> +++ linux-2.6/arch/x86/mm/tlb.c
> @@ -659,6 +659,27 @@ static void flush_tlb_func_remote(void *
> flush_tlb_func_common(f, false, TLB_REMOTE_SHOOTDOWN);
> }
>
> +static void flush_tlb_func(void *info)
> +{
> + const struct flush_tlb_info *f = info;
> + enum tlb_flush_reason reason = TLB_REMOTE_SHOOTDOWN;
> + bool local = false;
> +
> + if (f->cpu == smp_processor_id()) {
> + local = true;
> + reason = (f->mm == NULL) ? TLB_LOCAL_SHOOTDOWN : TLB_LOCAL_MM_SHOOTDOWN;
> + } else {
> + inc_irq_stat(irq_tlb_count);
> +
> + if (f->mm && f->mm != this_cpu_read(cpu_tlbstate.loaded_mm))
> + return;
> +
> + count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
> + }
> +
> + flush_tlb_func_common(f, local, reason);
> +}
> +
> static bool tlb_is_not_lazy(int cpu)
> {
> return !per_cpu(cpu_tlbstate_shared.is_lazy, cpu);
Nice! I will add it on top, if you don’t mind (instead squashing it).
The original decision to have local/remote functions was mostly to provide
the generality.
I would change the last argument of __smp_call_function_many() from “wait”
to “flags” that would indicate whether to run the function locally, since I
don’t want to change the semantics of smp_call_function_many() and decide
whether to run the function locally purely based on the mask. Let me know if
you disagree.
^ permalink raw reply
* Re: [PATCH v3 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently
From: Peter Zijlstra @ 2019-07-22 19:14 UTC (permalink / raw)
To: Nadav Amit
Cc: Andy Lutomirski, Dave Hansen, x86, linux-kernel, Thomas Gleixner,
Ingo Molnar, K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger,
Sasha Levin, Borislav Petkov, Juergen Gross, Paolo Bonzini,
Boris Ostrovsky, linux-hyperv, virtualization, kvm, xen-devel
In-Reply-To: <20190719005837.4150-5-namit@vmware.com>
On Thu, Jul 18, 2019 at 05:58:32PM -0700, Nadav Amit wrote:
> @@ -709,8 +716,9 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
> * doing a speculative memory access.
> */
> if (info->freed_tables) {
> - smp_call_function_many(cpumask, flush_tlb_func_remote,
> - (void *)info, 1);
> + __smp_call_function_many(cpumask, flush_tlb_func_remote,
> + flush_tlb_func_local,
> + (void *)info, 1);
> } else {
> /*
> * Although we could have used on_each_cpu_cond_mask(),
> @@ -737,7 +745,8 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
> if (tlb_is_not_lazy(cpu))
> __cpumask_set_cpu(cpu, cond_cpumask);
> }
> - smp_call_function_many(cond_cpumask, flush_tlb_func_remote,
> + __smp_call_function_many(cond_cpumask, flush_tlb_func_remote,
> + flush_tlb_func_local,
> (void *)info, 1);
> }
> }
Do we really need that _local/_remote distinction? ISTR you had a patch
that frobbed flush_tlb_info into the csd and that gave space
constraints, but I'm not seeing that here (probably a wise, get stuff
merged etc..).
struct __call_single_data {
struct llist_node llist; /* 0 8 */
smp_call_func_t func; /* 8 8 */
void * info; /* 16 8 */
unsigned int flags; /* 24 4 */
/* size: 32, cachelines: 1, members: 4 */
/* padding: 4 */
/* last cacheline: 32 bytes */
};
struct flush_tlb_info {
struct mm_struct * mm; /* 0 8 */
long unsigned int start; /* 8 8 */
long unsigned int end; /* 16 8 */
u64 new_tlb_gen; /* 24 8 */
unsigned int stride_shift; /* 32 4 */
bool freed_tables; /* 36 1 */
/* size: 40, cachelines: 1, members: 6 */
/* padding: 3 */
/* last cacheline: 40 bytes */
};
IIRC what you did was make void *__call_single_data::info the last
member and a union until the full cacheline size (64). Given the above
that would get us 24 bytes for csd, leaving us 40 for that
flush_tlb_info.
But then we can still do something like the below, which doesn't change
things and still gets rid of that dual function crud, simplifying
smp_call_function_many again.
Index: linux-2.6/arch/x86/include/asm/tlbflush.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/tlbflush.h
+++ linux-2.6/arch/x86/include/asm/tlbflush.h
@@ -546,8 +546,9 @@ struct flush_tlb_info {
unsigned long start;
unsigned long end;
u64 new_tlb_gen;
- unsigned int stride_shift;
- bool freed_tables;
+ unsigned int cpu;
+ unsigned short stride_shift;
+ unsigned char freed_tables;
};
#define local_flush_tlb() __flush_tlb()
Index: linux-2.6/arch/x86/mm/tlb.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/tlb.c
+++ linux-2.6/arch/x86/mm/tlb.c
@@ -659,6 +659,27 @@ static void flush_tlb_func_remote(void *
flush_tlb_func_common(f, false, TLB_REMOTE_SHOOTDOWN);
}
+static void flush_tlb_func(void *info)
+{
+ const struct flush_tlb_info *f = info;
+ enum tlb_flush_reason reason = TLB_REMOTE_SHOOTDOWN;
+ bool local = false;
+
+ if (f->cpu == smp_processor_id()) {
+ local = true;
+ reason = (f->mm == NULL) ? TLB_LOCAL_SHOOTDOWN : TLB_LOCAL_MM_SHOOTDOWN;
+ } else {
+ inc_irq_stat(irq_tlb_count);
+
+ if (f->mm && f->mm != this_cpu_read(cpu_tlbstate.loaded_mm))
+ return;
+
+ count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
+ }
+
+ flush_tlb_func_common(f, local, reason);
+}
+
static bool tlb_is_not_lazy(int cpu)
{
return !per_cpu(cpu_tlbstate_shared.is_lazy, cpu);
^ permalink raw reply
* Re: [PATCH] hv: Use the correct style for SPDX License Identifier
From: Greg Kroah-Hartman @ 2019-07-22 14:08 UTC (permalink / raw)
To: Nishad Kamdar
Cc: K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Sasha Levin,
Joe Perches, Uwe Kleine-König, linux-hyperv, linux-kernel
In-Reply-To: <20190722133112.GA7990@nishad>
On Mon, Jul 22, 2019 at 07:01:17PM +0530, Nishad Kamdar wrote:
> This patch corrects the SPDX License Identifier style
> in the trace header file related to Microsoft Hyper-V
> client drivers.
> For C header files Documentation/process/license-rules.rst
> mandates C-like comments (opposed to C source files where
> C++ style should be used)
>
> Changes made by using a script provided by Joe Perches here:
> https://lkml.org/lkml/2019/2/7/46
>
> Suggested-by: Joe Perches <joe@perches.com>
> Signed-off-by: Nishad Kamdar <nishadkamdar@gmail.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
^ permalink raw reply
* [PATCH] hv: Use the correct style for SPDX License Identifier
From: Nishad Kamdar @ 2019-07-22 13:31 UTC (permalink / raw)
To: K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger
Cc: Sasha Levin, Greg Kroah-Hartman, Joe Perches,
Uwe Kleine-König, linux-hyperv, linux-kernel
This patch corrects the SPDX License Identifier style
in the trace header file related to Microsoft Hyper-V
client drivers.
For C header files Documentation/process/license-rules.rst
mandates C-like comments (opposed to C source files where
C++ style should be used)
Changes made by using a script provided by Joe Perches here:
https://lkml.org/lkml/2019/2/7/46
Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Nishad Kamdar <nishadkamdar@gmail.com>
---
drivers/hv/hv_trace.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/hv/hv_trace.h b/drivers/hv/hv_trace.h
index 999f80a63bff..e70783e33680 100644
--- a/drivers/hv/hv_trace.h
+++ b/drivers/hv/hv_trace.h
@@ -1,4 +1,4 @@
-// SPDX-License-Identifier: GPL-2.0
+/* SPDX-License-Identifier: GPL-2.0 */
#undef TRACE_SYSTEM
#define TRACE_SYSTEM hyperv
--
2.17.1
^ permalink raw reply related
* Re: [PATCH v4 3/5] hv: vmbus: Replace page definition with Hyper-V specific one
From: Thomas Gleixner @ 2019-07-22 9:08 UTC (permalink / raw)
To: Sasha Levin
Cc: Maya Nakamura, mikelley, kys, haiyangz, sthemmin, x86,
linux-hyperv, linux-kernel
In-Reply-To: <20190718022228.GF3079@sasha-vm>
On Wed, 17 Jul 2019, Sasha Levin wrote:
> On Fri, Jul 12, 2019 at 08:25:18AM +0000, Maya Nakamura wrote:
> > Replace PAGE_SIZE with HV_HYP_PAGE_SIZE because the guest page size may
> > not be 4096 on all architectures and Hyper-V always runs with a page
> > size of 4096.
> >
> > Signed-off-by: Maya Nakamura <m.maya.nakamura@gmail.com>
> > Reviewed-by: Michael Kelley <mikelley@microsoft.com>
> > Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
>
> Thomas, if you're taking this series, could you grab this patch as well
> please (dependencies)?
>
> Acked-by: Sasha Levin <sashal@kernel.org>
Picked it up. The three commits are in
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86/hyperv
If you have patches dependend on those either send them my way or pull the
branch into your hyper-v tree.
Thanks,
tglx
^ permalink raw reply
* Re: [PATCH 2/8] scsi: take the DMA max mapping size into account
From: Damien Le Moal @ 2019-07-22 7:40 UTC (permalink / raw)
To: Ming Lei, Bart van Assche
Cc: Christoph Hellwig, Martin K . Petersen, Sagi Grimberg,
Max Gurtovoy, linux-rdma, Linux SCSI List,
megaraidlinux.pdl@broadcom.com, MPT-FusionLinux.pdl@broadcom.com,
linux-hyperv@vger.kernel.org, Linux Kernel Mailing List
In-Reply-To: <CACVXFVMWM3xg6EEyoyNjkLPv=8+wQKiHj6erMS_gGX25f-Ot4g@mail.gmail.com>
On 2019/07/22 15:01, Ming Lei wrote:
> On Tue, Jun 18, 2019 at 4:57 AM Bart Van Assche <bvanassche@acm.org> wrote:
>>
>> On 6/17/19 5:19 AM, Christoph Hellwig wrote:
>>> We need to limit the devices max_sectors to what the DMA mapping
>>> implementation can support. If not we risk running out of swiotlb
>>> buffers easily.
>>>
>>> Signed-off-by: Christoph Hellwig <hch@lst.de>
>>> ---
>>> drivers/scsi/scsi_lib.c | 2 ++
>>> 1 file changed, 2 insertions(+)
>>>
>>> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
>>> index d333bb6b1c59..f233bfd84cd7 100644
>>> --- a/drivers/scsi/scsi_lib.c
>>> +++ b/drivers/scsi/scsi_lib.c
>>> @@ -1768,6 +1768,8 @@ void __scsi_init_queue(struct Scsi_Host *shost, struct request_queue *q)
>>> blk_queue_max_integrity_segments(q, shost->sg_prot_tablesize);
>>> }
>>>
>>> + shost->max_sectors = min_t(unsigned int, shost->max_sectors,
>>> + dma_max_mapping_size(dev) << SECTOR_SHIFT);
>>> blk_queue_max_hw_sectors(q, shost->max_sectors);
>>> if (shost->unchecked_isa_dma)
>>> blk_queue_bounce_limit(q, BLK_BOUNCE_ISA);
>>
>> Does dma_max_mapping_size() return a value in bytes? Is
>> shost->max_sectors a number of sectors? If so, are you sure that "<<
>> SECTOR_SHIFT" is the proper conversion? Shouldn't that be ">>
>> SECTOR_SHIFT" instead?
>
> Now the patch has been committed, '<< SECTOR_SHIFT' needs to be fixed.
>
> Also the following kernel oops is triggered on qemu, and looks
> device->dma_mask is NULL.
Just hit the exact same problem using tcmu-runner (ZBC file handler) on bare
metal (no QEMU). dev->dma_mask is NULL. No problem with real disks though.
>
> [ 5.826483] scsi host0: Virtio SCSI HBA
> [ 5.829302] st: Version 20160209, fixed bufsize 32768, s/g segs 256
> [ 5.831042] SCSI Media Changer driver v0.25
> [ 5.832491] ==================================================================
> [ 5.833332] BUG: KASAN: null-ptr-deref in
> dma_direct_max_mapping_size+0x30/0x94
> [ 5.833332] Read of size 8 at addr 0000000000000000 by task kworker/u17:0/7
> [ 5.835506] nvme nvme0: pci function 0000:00:07.0
> [ 5.833332]
> [ 5.833332] CPU: 2 PID: 7 Comm: kworker/u17:0 Not tainted 5.3.0-rc1 #1328
> [ 5.836999] ahci 0000:00:1f.2: version 3.0
> [ 5.833332] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
> BIOS ?-20180724_192412-buildhw-07.phx4
> [ 5.833332] Workqueue: events_unbound async_run_entry_fn
> [ 5.833332] Call Trace:
> [ 5.833332] dump_stack+0x6f/0x9d
> [ 5.833332] ? dma_direct_max_mapping_size+0x30/0x94
> [ 5.833332] __kasan_report+0x161/0x189
> [ 5.833332] ? dma_direct_max_mapping_size+0x30/0x94
> [ 5.833332] kasan_report+0xe/0x12
> [ 5.833332] dma_direct_max_mapping_size+0x30/0x94
> [ 5.833332] __scsi_init_queue+0xd8/0x1f3
> [ 5.833332] scsi_mq_alloc_queue+0x62/0x89
> [ 5.833332] scsi_alloc_sdev+0x38c/0x479
> [ 5.833332] scsi_probe_and_add_lun+0x22d/0x1093
> [ 5.833332] ? kobject_set_name_vargs+0xa4/0xb2
> [ 5.833332] ? mutex_lock+0x88/0xc4
> [ 5.833332] ? scsi_free_host_dev+0x4a/0x4a
> [ 5.833332] ? _raw_spin_lock_irqsave+0x8c/0xde
> [ 5.833332] ? _raw_write_unlock_irqrestore+0x23/0x23
> [ 5.833332] ? ata_tdev_match+0x22/0x45
> [ 5.833332] ? attribute_container_add_device+0x160/0x17e
> [ 5.833332] ? rpm_resume+0x26a/0x7c0
> [ 5.833332] ? kobject_get+0x12/0x43
> [ 5.833332] ? rpm_put_suppliers+0x7e/0x7e
> [ 5.833332] ? _raw_spin_lock_irqsave+0x8c/0xde
> [ 5.833332] ? _raw_write_unlock_irqrestore+0x23/0x23
> [ 5.833332] ? scsi_target_destroy+0x135/0x135
> [ 5.833332] __scsi_scan_target+0x14b/0x6aa
> [ 5.833332] ? pvclock_clocksource_read+0xc0/0x14e
> [ 5.833332] ? scsi_add_device+0x20/0x20
> [ 5.833332] ? rpm_resume+0x1ae/0x7c0
> [ 5.833332] ? rpm_put_suppliers+0x7e/0x7e
> [ 5.833332] ? _raw_spin_lock_irqsave+0x8c/0xde
> [ 5.833332] ? _raw_write_unlock_irqrestore+0x23/0x23
> [ 5.833332] ? pick_next_task_fair+0x976/0xa3d
> [ 5.833332] ? mutex_lock+0x88/0xc4
> [ 5.833332] scsi_scan_channel+0x76/0x9e
> [ 5.833332] scsi_scan_host_selected+0x131/0x176
> [ 5.833332] ? scsi_scan_host+0x241/0x241
> [ 5.833332] do_scan_async+0x27/0x219
> [ 5.833332] ? scsi_scan_host+0x241/0x241
> [ 5.833332] async_run_entry_fn+0xdc/0x23d
> [ 5.833332] process_one_work+0x327/0x539
> [ 5.833332] worker_thread+0x330/0x492
> [ 5.833332] ? rescuer_thread+0x41f/0x41f
> [ 5.833332] kthread+0x1c6/0x1d5
> [ 5.833332] ? kthread_park+0xd3/0xd3
> [ 5.833332] ret_from_fork+0x1f/0x30
> [ 5.833332] ==================================================================
>
>
>
> Thanks,
> Ming Lei
>
--
Damien Le Moal
Western Digital Research
^ permalink raw reply
* Re: [PATCH 2/8] scsi: take the DMA max mapping size into account
From: Dexuan-Linux Cui @ 2019-07-22 6:18 UTC (permalink / raw)
To: Ming Lei
Cc: Bart Van Assche, Christoph Hellwig, Martin K . Petersen,
Sagi Grimberg, Max Gurtovoy, linux-rdma, Linux SCSI List,
megaraidlinux.pdl, MPT-FusionLinux.pdl, linux-hyperv,
Linux Kernel Mailing List, Dexuan Cui, v-lide
In-Reply-To: <CACVXFVMWM3xg6EEyoyNjkLPv=8+wQKiHj6erMS_gGX25f-Ot4g@mail.gmail.com>
On Sun, Jul 21, 2019 at 11:01 PM Ming Lei <tom.leiming@gmail.com> wrote:
>
> On Tue, Jun 18, 2019 at 4:57 AM Bart Van Assche <bvanassche@acm.org> wrote:
> >
> > On 6/17/19 5:19 AM, Christoph Hellwig wrote:
> > > We need to limit the devices max_sectors to what the DMA mapping
> > > implementation can support. If not we risk running out of swiotlb
> > > buffers easily.
> > >
> > > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > > ---
> > > drivers/scsi/scsi_lib.c | 2 ++
> > > 1 file changed, 2 insertions(+)
> > >
> > > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > > index d333bb6b1c59..f233bfd84cd7 100644
> > > --- a/drivers/scsi/scsi_lib.c
> > > +++ b/drivers/scsi/scsi_lib.c
> > > @@ -1768,6 +1768,8 @@ void __scsi_init_queue(struct Scsi_Host *shost, struct request_queue *q)
> > > blk_queue_max_integrity_segments(q, shost->sg_prot_tablesize);
> > > }
> > >
> > > + shost->max_sectors = min_t(unsigned int, shost->max_sectors,
> > > + dma_max_mapping_size(dev) << SECTOR_SHIFT);
> > > blk_queue_max_hw_sectors(q, shost->max_sectors);
> > > if (shost->unchecked_isa_dma)
> > > blk_queue_bounce_limit(q, BLK_BOUNCE_ISA);
> >
> > Does dma_max_mapping_size() return a value in bytes? Is
> > shost->max_sectors a number of sectors? If so, are you sure that "<<
> > SECTOR_SHIFT" is the proper conversion? Shouldn't that be ">>
> > SECTOR_SHIFT" instead?
>
> Now the patch has been committed, '<< SECTOR_SHIFT' needs to be fixed.
>
> Also the following kernel oops is triggered on qemu, and looks
> device->dma_mask is NULL.
>
> Ming Lei
FYI: we also see the panic with a Linux kernel 5.2.0-next-20190719
running on Hyper-V:
[ 7.429053] RIP: 0010:dma_direct_max_mapping_size+0x26/0x80
[ 7.429053] Code: 0f b6 c0 c3 0f 1f 44 00 00 55 48 89 e5 41 54 53
48 89 fb e8 4c 14 00 00 84 c0 74 45 48 8b 83 28 02 00 00 4c 8b a3 38
02 00 00 <48> 8b 00 48 85 c0 74 0c 4d 85 e4 74 36 49 39 c4 4c 0f 47 e0
48 89
[ 7.429053] RSP: 0018:ffffc1d5005efbc0 EFLAGS: 00010202
[ 7.429053] RAX: 0000000000000000 RBX: ffff9cf86d24c428 RCX: 0000000000000000
[ 7.429053] RDX: ffff9cf86d12dd00 RSI: 0000000000000200 RDI: ffff9cf86d24c428
[ 7.429053] RBP: ffffc1d5005efbd0 R08: ffff9cf86fcaf0e0 R09: ffff9cf86e0072c0
[ 7.429053] R10: ffffc1d5005efa70 R11: 00000000000301a0 R12: 0000000000000000
[ 7.429053] R13: ffff9cf86d24c428 R14: 0000000000000400 R15: ffff9cf825cff000
[ 7.429053] FS: 0000000000000000(0000) GS:ffff9cf86fc80000(0000)
knlGS:0000000000000000
[ 7.429053] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7.429053] CR2: 0000000000000000 CR3: 00000003c700a001 CR4: 00000000003606e0
[ 7.456569] NET: Registered protocol family 17
[ 7.429053] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 7.469803] Key type dns_resolver registered
[ 7.429053] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 7.429053] Call Trace:
[ 7.429053] dma_max_mapping_size+0x39/0x50
[ 7.429053] __scsi_init_queue+0x7f/0x140
[ 7.429053] scsi_mq_alloc_queue+0x38/0x60
[ 7.429053] scsi_alloc_sdev+0x1da/0x2b0
[ 7.429053] scsi_probe_and_add_lun+0x471/0xe60
[ 7.429053] __scsi_scan_target+0xfc/0x610
[ 7.429053] scsi_scan_channel+0x66/0xa0
[ 7.429053] scsi_scan_host_selected+0xf3/0x160
[ 7.429053] do_scsi_scan_host+0x93/0xa0
[ 7.429053] do_scan_async+0x1c/0x190
[ 7.429053] async_run_entry_fn+0x3c/0x150
[ 7.429053] process_one_work+0x1f7/0x3f0
[ 7.429053] worker_thread+0x34/0x400
[ 7.429053] kthread+0x121/0x140
[ 7.429053] ret_from_fork+0x35/0x40
[ 7.429053] Modules linked in:
[ 7.429053] CR2: 0000000000000000
[ 7.766122] BUG: kernel NULL pointer dereference, address: 0000000000000000
Thanks,
-- Dexuan
^ permalink raw reply
* Re: [PATCH 2/8] scsi: take the DMA max mapping size into account
From: Ming Lei @ 2019-07-22 6:00 UTC (permalink / raw)
To: Bart Van Assche
Cc: Christoph Hellwig, Martin K . Petersen, Sagi Grimberg,
Max Gurtovoy, linux-rdma, Linux SCSI List, megaraidlinux.pdl,
MPT-FusionLinux.pdl, linux-hyperv, Linux Kernel Mailing List
In-Reply-To: <5d143a03-edd5-5878-780b-45d87313a813@acm.org>
On Tue, Jun 18, 2019 at 4:57 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 6/17/19 5:19 AM, Christoph Hellwig wrote:
> > We need to limit the devices max_sectors to what the DMA mapping
> > implementation can support. If not we risk running out of swiotlb
> > buffers easily.
> >
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > ---
> > drivers/scsi/scsi_lib.c | 2 ++
> > 1 file changed, 2 insertions(+)
> >
> > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > index d333bb6b1c59..f233bfd84cd7 100644
> > --- a/drivers/scsi/scsi_lib.c
> > +++ b/drivers/scsi/scsi_lib.c
> > @@ -1768,6 +1768,8 @@ void __scsi_init_queue(struct Scsi_Host *shost, struct request_queue *q)
> > blk_queue_max_integrity_segments(q, shost->sg_prot_tablesize);
> > }
> >
> > + shost->max_sectors = min_t(unsigned int, shost->max_sectors,
> > + dma_max_mapping_size(dev) << SECTOR_SHIFT);
> > blk_queue_max_hw_sectors(q, shost->max_sectors);
> > if (shost->unchecked_isa_dma)
> > blk_queue_bounce_limit(q, BLK_BOUNCE_ISA);
>
> Does dma_max_mapping_size() return a value in bytes? Is
> shost->max_sectors a number of sectors? If so, are you sure that "<<
> SECTOR_SHIFT" is the proper conversion? Shouldn't that be ">>
> SECTOR_SHIFT" instead?
Now the patch has been committed, '<< SECTOR_SHIFT' needs to be fixed.
Also the following kernel oops is triggered on qemu, and looks
device->dma_mask is NULL.
[ 5.826483] scsi host0: Virtio SCSI HBA
[ 5.829302] st: Version 20160209, fixed bufsize 32768, s/g segs 256
[ 5.831042] SCSI Media Changer driver v0.25
[ 5.832491] ==================================================================
[ 5.833332] BUG: KASAN: null-ptr-deref in
dma_direct_max_mapping_size+0x30/0x94
[ 5.833332] Read of size 8 at addr 0000000000000000 by task kworker/u17:0/7
[ 5.835506] nvme nvme0: pci function 0000:00:07.0
[ 5.833332]
[ 5.833332] CPU: 2 PID: 7 Comm: kworker/u17:0 Not tainted 5.3.0-rc1 #1328
[ 5.836999] ahci 0000:00:1f.2: version 3.0
[ 5.833332] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
BIOS ?-20180724_192412-buildhw-07.phx4
[ 5.833332] Workqueue: events_unbound async_run_entry_fn
[ 5.833332] Call Trace:
[ 5.833332] dump_stack+0x6f/0x9d
[ 5.833332] ? dma_direct_max_mapping_size+0x30/0x94
[ 5.833332] __kasan_report+0x161/0x189
[ 5.833332] ? dma_direct_max_mapping_size+0x30/0x94
[ 5.833332] kasan_report+0xe/0x12
[ 5.833332] dma_direct_max_mapping_size+0x30/0x94
[ 5.833332] __scsi_init_queue+0xd8/0x1f3
[ 5.833332] scsi_mq_alloc_queue+0x62/0x89
[ 5.833332] scsi_alloc_sdev+0x38c/0x479
[ 5.833332] scsi_probe_and_add_lun+0x22d/0x1093
[ 5.833332] ? kobject_set_name_vargs+0xa4/0xb2
[ 5.833332] ? mutex_lock+0x88/0xc4
[ 5.833332] ? scsi_free_host_dev+0x4a/0x4a
[ 5.833332] ? _raw_spin_lock_irqsave+0x8c/0xde
[ 5.833332] ? _raw_write_unlock_irqrestore+0x23/0x23
[ 5.833332] ? ata_tdev_match+0x22/0x45
[ 5.833332] ? attribute_container_add_device+0x160/0x17e
[ 5.833332] ? rpm_resume+0x26a/0x7c0
[ 5.833332] ? kobject_get+0x12/0x43
[ 5.833332] ? rpm_put_suppliers+0x7e/0x7e
[ 5.833332] ? _raw_spin_lock_irqsave+0x8c/0xde
[ 5.833332] ? _raw_write_unlock_irqrestore+0x23/0x23
[ 5.833332] ? scsi_target_destroy+0x135/0x135
[ 5.833332] __scsi_scan_target+0x14b/0x6aa
[ 5.833332] ? pvclock_clocksource_read+0xc0/0x14e
[ 5.833332] ? scsi_add_device+0x20/0x20
[ 5.833332] ? rpm_resume+0x1ae/0x7c0
[ 5.833332] ? rpm_put_suppliers+0x7e/0x7e
[ 5.833332] ? _raw_spin_lock_irqsave+0x8c/0xde
[ 5.833332] ? _raw_write_unlock_irqrestore+0x23/0x23
[ 5.833332] ? pick_next_task_fair+0x976/0xa3d
[ 5.833332] ? mutex_lock+0x88/0xc4
[ 5.833332] scsi_scan_channel+0x76/0x9e
[ 5.833332] scsi_scan_host_selected+0x131/0x176
[ 5.833332] ? scsi_scan_host+0x241/0x241
[ 5.833332] do_scan_async+0x27/0x219
[ 5.833332] ? scsi_scan_host+0x241/0x241
[ 5.833332] async_run_entry_fn+0xdc/0x23d
[ 5.833332] process_one_work+0x327/0x539
[ 5.833332] worker_thread+0x330/0x492
[ 5.833332] ? rescuer_thread+0x41f/0x41f
[ 5.833332] kthread+0x1c6/0x1d5
[ 5.833332] ? kthread_park+0xd3/0xd3
[ 5.833332] ret_from_fork+0x1f/0x30
[ 5.833332] ==================================================================
Thanks,
Ming Lei
^ permalink raw reply
* Re: [PATCH net] hv_netvsc: Fix extra rcu_read_unlock in netvsc_recv_callback()
From: David Miller @ 2019-07-22 3:41 UTC (permalink / raw)
To: haiyangz
Cc: sashal, linux-hyperv, netdev, kys, sthemmin, olaf, vkuznets,
linux-kernel
In-Reply-To: <1563557581-17669-1-git-send-email-haiyangz@microsoft.com>
From: Haiyang Zhang <haiyangz@microsoft.com>
Date: Fri, 19 Jul 2019 17:33:51 +0000
> There is an extra rcu_read_unlock left in netvsc_recv_callback(),
> after a previous patch that removes RCU from this function.
> This patch removes the extra RCU unlock.
>
> Fixes: 345ac08990b8 ("hv_netvsc: pass netvsc_device to receive callback")
> Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Applied and queued up for -stable.
^ permalink raw reply
* Re: [PATCH v3 0/9] x86: Concurrent TLB flushes
From: Dave Hansen @ 2019-07-19 21:36 UTC (permalink / raw)
To: Nadav Amit, Andy Lutomirski, Dave Hansen
Cc: x86, linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Boris Ostrovsky, Haiyang Zhang, Josh Poimboeuf,
Juergen Gross, K. Y. Srinivasan, Paolo Bonzini, Rik van Riel,
Sasha Levin, Stephen Hemminger, kvm, linux-hyperv, virtualization,
xen-devel
In-Reply-To: <20190719005837.4150-1-namit@vmware.com>
Thanks for doing this, it's something I've been hoping someone would do
for a long time.
While I kinda wish we had performance data for each individual patch (at
least the behavior-changing ones), this does look like a good
improvement. That might, for instance, tell is a bit about how the
separating out "is_lazy" compares to the "check before setting"
optimization. But, they're both sane enough on their own that I'm not
too worried.
I had some nits that I hope get covered in later revisions, if sent.
But, overall looks fine. For the series:
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
^ permalink raw reply
* [PATCH net] hv_netvsc: Fix extra rcu_read_unlock in netvsc_recv_callback()
From: Haiyang Zhang @ 2019-07-19 17:33 UTC (permalink / raw)
To: sashal@kernel.org, linux-hyperv@vger.kernel.org,
netdev@vger.kernel.org
Cc: Haiyang Zhang, KY Srinivasan, Stephen Hemminger, olaf@aepfle.de,
vkuznets, davem@davemloft.net, linux-kernel@vger.kernel.org
There is an extra rcu_read_unlock left in netvsc_recv_callback(),
after a previous patch that removes RCU from this function.
This patch removes the extra RCU unlock.
Fixes: 345ac08990b8 ("hv_netvsc: pass netvsc_device to receive callback")
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
---
drivers/net/hyperv/netvsc_drv.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index afdcc56..3544e19 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -836,7 +836,6 @@ int netvsc_recv_callback(struct net_device *net,
if (unlikely(!skb)) {
++net_device_ctx->eth_stats.rx_no_memory;
- rcu_read_unlock();
return NVSP_STAT_FAIL;
}
--
1.8.3.1
^ permalink raw reply related
* [PATCH v2] x86/hyper-v: Zero out the VP ASSIST PAGE to fix CPU offlining
From: Dexuan Cui @ 2019-07-19 3:22 UTC (permalink / raw)
To: Thomas Gleixner, vkuznets, Haiyang Zhang, KY Srinivasan,
Stephen Hemminger, Sasha Levin, linux-hyperv@vger.kernel.org,
Michael Kelley, Long Li, x86@kernel.org
Cc: Ingo Molnar, Borislav Petkov, H. Peter Anvin, jasowang@redhat.com,
driverdev-devel@linuxdriverproject.org,
linux-kernel@vger.kernel.org, apw@canonical.com,
marcelo.cerri@canonical.com, olaf@aepfle.de
The VP ASSIST PAGE is an "overlay" page (see Hyper-V TLFS's Section
5.2.1 "GPA Overlay Pages" for the details) and here is an excerpt:
"
The hypervisor defines several special pages that "overlay" the guest's
Guest Physical Addresses (GPA) space. Overlays are addressed GPA but are
not included in the normal GPA map maintained internally by the hypervisor.
Conceptually, they exist in a separate map that overlays the GPA map.
If a page within the GPA space is overlaid, any SPA page mapped to the
GPA page is effectively "obscured" and generally unreachable by the
virtual processor through processor memory accesses.
If an overlay page is disabled, the underlying GPA page is "uncovered",
and an existing mapping becomes accessible to the guest.
"
SPA = System Physical Address = the final real physical address.
When a CPU (e.g. CPU1) is being onlined, in hv_cpu_init(), we allocate the
VP ASSIST PAGE and enable the EOI optimization for this CPU by writing the
MSR HV_X64_MSR_VP_ASSIST_PAGE. From now on, hvp->apic_assist belongs to the
special SPA page, and this CPU *always* uses hvp->apic_assist (which is
shared with the hypervisor) to decide if it needs to write the EOI MSR.
When a CPU (e.g. CPU1) is being offlined, on this CPU, we do:
1. in hv_cpu_die(), we disable the EOI optimizaton for this CPU, and from
now on hvp->apic_assist belongs to the original "normal" SPA page;
2. we finish the remaining work of stopping this CPU;
3. this CPU is completely stopped.
Between 1 and 3, this CPU can still receive interrupts (e.g. reschedule
IPIs from CPU0, and Local APIC timer interrupts), and this CPU *must* write
the EOI MSR for every interrupt received, otherwise the hypervisor may not
deliver further interrupts, which may be needed to completely stop the CPU.
So, after we disable the EOI optimization in hv_cpu_die(), we need to make
sure hvp->apic_assist's bit0 is zero. The easiest way is we just zero out
the page when it's allocated in hv_cpu_init().
Note 1: after the "normal" SPA page is allocted and zeroed out, neither the
hypervisor nor the guest writes into the page, so the page remains with
zeros.
Note 2: see Section 10.3.5 "EOI Assist" for the details of the EOI
optimization. When the optimization is enabled, the guest can still write
the EOI MSR register irrespective of the "No EOI required" value, though
by doing so we can't benefit from the optimization.
Fixes: ba696429d290 ("x86/hyper-v: Implement EOI assist")
Signed-off-by: Dexuan Cui <decui@microsoft.com>
---
v2: there is no code change. I just improved the comment and the changelog
according to the discussion with tglx:
https://lkml.org/lkml/2019/7/17/781
https://lkml.org/lkml/2019/7/18/91
arch/x86/hyperv/hv_init.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 0e033ef11a9f..d26832cb38bb 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -60,8 +60,16 @@ static int hv_cpu_init(unsigned int cpu)
if (!hv_vp_assist_page)
return 0;
+ /*
+ * The VP ASSIST PAGE is an "overlay" page (see Hyper-V TLFS's Section
+ * 5.2.1 "GPA Overlay Pages"). Here it must be zeroed out to make sure
+ * we always write the EOI MSR in hv_apic_eoi_write() *after* the
+ * EOI optimization is disabled in hv_cpu_die(), otherwise a CPU may
+ * not be stopped in the case of CPU offlining and the VM will hang.
+ */
if (!*hvp)
- *hvp = __vmalloc(PAGE_SIZE, GFP_KERNEL, PAGE_KERNEL);
+ *hvp = __vmalloc(PAGE_SIZE, GFP_KERNEL | __GFP_ZERO,
+ PAGE_KERNEL);
if (*hvp) {
u64 val;
--
2.19.1
^ permalink raw reply related
* [PATCH v3 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently
From: Nadav Amit @ 2019-07-19 0:58 UTC (permalink / raw)
To: Andy Lutomirski, Dave Hansen
Cc: x86, linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
Nadav Amit, K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger,
Sasha Levin, Borislav Petkov, Juergen Gross, Paolo Bonzini,
Boris Ostrovsky, linux-hyperv, virtualization, kvm, xen-devel
In-Reply-To: <20190719005837.4150-1-namit@vmware.com>
To improve TLB shootdown performance, flush the remote and local TLBs
concurrently. Introduce flush_tlb_multi() that does so. Introduce
paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen
and hyper-v are only compile-tested).
While the updated smp infrastructure is capable of running a function on
a single local core, it is not optimized for this case. The multiple
function calls and the indirect branch introduce some overhead, and
might make local TLB flushes slower than they were before the recent
changes.
Before calling the SMP infrastructure, check if only a local TLB flush
is needed to restore the lost performance in this common case. This
requires to check mm_cpumask() one more time, but unless this mask is
updated very frequently, this should impact performance negatively.
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: x86@kernel.org
Cc: Juergen Gross <jgross@suse.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: linux-hyperv@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: kvm@vger.kernel.org
Cc: xen-devel@lists.xenproject.org
Signed-off-by: Nadav Amit <namit@vmware.com>
---
arch/x86/hyperv/mmu.c | 10 +++---
arch/x86/include/asm/paravirt.h | 6 ++--
arch/x86/include/asm/paravirt_types.h | 4 +--
arch/x86/include/asm/tlbflush.h | 8 ++---
arch/x86/include/asm/trace/hyperv.h | 2 +-
arch/x86/kernel/kvm.c | 11 +++++--
arch/x86/kernel/paravirt.c | 2 +-
arch/x86/mm/tlb.c | 47 ++++++++++++++++++---------
arch/x86/xen/mmu_pv.c | 11 +++----
include/trace/events/xen.h | 2 +-
10 files changed, 62 insertions(+), 41 deletions(-)
diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index e65d7fe6489f..8740d8b21db3 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -50,8 +50,8 @@ static inline int fill_gva_list(u64 gva_list[], int offset,
return gva_n - offset;
}
-static void hyperv_flush_tlb_others(const struct cpumask *cpus,
- const struct flush_tlb_info *info)
+static void hyperv_flush_tlb_multi(const struct cpumask *cpus,
+ const struct flush_tlb_info *info)
{
int cpu, vcpu, gva_n, max_gvas;
struct hv_tlb_flush **flush_pcpu;
@@ -59,7 +59,7 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus,
u64 status = U64_MAX;
unsigned long flags;
- trace_hyperv_mmu_flush_tlb_others(cpus, info);
+ trace_hyperv_mmu_flush_tlb_multi(cpus, info);
if (!hv_hypercall_pg)
goto do_native;
@@ -156,7 +156,7 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus,
if (!(status & HV_HYPERCALL_RESULT_MASK))
return;
do_native:
- native_flush_tlb_others(cpus, info);
+ native_flush_tlb_multi(cpus, info);
}
static u64 hyperv_flush_tlb_others_ex(const struct cpumask *cpus,
@@ -231,6 +231,6 @@ void hyperv_setup_mmu_ops(void)
return;
pr_info("Using hypercall for remote TLB flush\n");
- pv_ops.mmu.flush_tlb_others = hyperv_flush_tlb_others;
+ pv_ops.mmu.flush_tlb_multi = hyperv_flush_tlb_multi;
pv_ops.mmu.tlb_remove_table = tlb_remove_table;
}
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index dce26f1d13e1..8c6c2394393b 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -62,10 +62,10 @@ static inline void __flush_tlb_one_user(unsigned long addr)
PVOP_VCALL1(mmu.flush_tlb_one_user, addr);
}
-static inline void flush_tlb_others(const struct cpumask *cpumask,
- const struct flush_tlb_info *info)
+static inline void flush_tlb_multi(const struct cpumask *cpumask,
+ const struct flush_tlb_info *info)
{
- PVOP_VCALL2(mmu.flush_tlb_others, cpumask, info);
+ PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info);
}
static inline void paravirt_tlb_remove_table(struct mmu_gather *tlb, void *table)
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 639b2df445ee..c82969f38845 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -211,8 +211,8 @@ struct pv_mmu_ops {
void (*flush_tlb_user)(void);
void (*flush_tlb_kernel)(void);
void (*flush_tlb_one_user)(unsigned long addr);
- void (*flush_tlb_others)(const struct cpumask *cpus,
- const struct flush_tlb_info *info);
+ void (*flush_tlb_multi)(const struct cpumask *cpus,
+ const struct flush_tlb_info *info);
void (*tlb_remove_table)(struct mmu_gather *tlb, void *table);
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index dee375831962..610e47dc66ef 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -517,7 +517,7 @@ static inline void __flush_tlb_one_kernel(unsigned long addr)
* - flush_tlb_page(vma, vmaddr) flushes one page
* - flush_tlb_range(vma, start, end) flushes a range of pages
* - flush_tlb_kernel_range(start, end) flushes a range of kernel pages
- * - flush_tlb_others(cpumask, info) flushes TLBs on other cpus
+ * - flush_tlb_multi(cpumask, info) flushes TLBs on multiple cpus
*
* ..but the i386 has somewhat limited tlb flushing capabilities,
* and page-granular flushes are available only on i486 and up.
@@ -569,7 +569,7 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false);
}
-void native_flush_tlb_others(const struct cpumask *cpumask,
+void native_flush_tlb_multi(const struct cpumask *cpumask,
const struct flush_tlb_info *info);
static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
@@ -593,8 +593,8 @@ static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
#ifndef CONFIG_PARAVIRT
-#define flush_tlb_others(mask, info) \
- native_flush_tlb_others(mask, info)
+#define flush_tlb_multi(mask, info) \
+ native_flush_tlb_multi(mask, info)
#define paravirt_tlb_remove_table(tlb, page) \
tlb_remove_page(tlb, (void *)(page))
diff --git a/arch/x86/include/asm/trace/hyperv.h b/arch/x86/include/asm/trace/hyperv.h
index ace464f09681..85ca8560c7f9 100644
--- a/arch/x86/include/asm/trace/hyperv.h
+++ b/arch/x86/include/asm/trace/hyperv.h
@@ -8,7 +8,7 @@
#if IS_ENABLED(CONFIG_HYPERV)
-TRACE_EVENT(hyperv_mmu_flush_tlb_others,
+TRACE_EVENT(hyperv_mmu_flush_tlb_multi,
TP_PROTO(const struct cpumask *cpus,
const struct flush_tlb_info *info),
TP_ARGS(cpus, info),
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index b7f34fe2171e..de40657d9025 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -595,7 +595,7 @@ static void __init kvm_apf_trap_init(void)
static DEFINE_PER_CPU(cpumask_var_t, __pv_tlb_mask);
-static void kvm_flush_tlb_others(const struct cpumask *cpumask,
+static void kvm_flush_tlb_multi(const struct cpumask *cpumask,
const struct flush_tlb_info *info)
{
u8 state;
@@ -609,6 +609,11 @@ static void kvm_flush_tlb_others(const struct cpumask *cpumask,
* queue flush_on_enter for pre-empted vCPUs
*/
for_each_cpu(cpu, flushmask) {
+ /*
+ * The local vCPU is never preempted, so we do not explicitly
+ * skip check for local vCPU - it will never be cleared from
+ * flushmask.
+ */
src = &per_cpu(steal_time, cpu);
state = READ_ONCE(src->preempted);
if ((state & KVM_VCPU_PREEMPTED)) {
@@ -618,7 +623,7 @@ static void kvm_flush_tlb_others(const struct cpumask *cpumask,
}
}
- native_flush_tlb_others(flushmask, info);
+ native_flush_tlb_multi(flushmask, info);
}
static void __init kvm_guest_init(void)
@@ -643,7 +648,7 @@ static void __init kvm_guest_init(void)
if (kvm_para_has_feature(KVM_FEATURE_PV_TLB_FLUSH) &&
!kvm_para_has_hint(KVM_HINTS_REALTIME) &&
kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
- pv_ops.mmu.flush_tlb_others = kvm_flush_tlb_others;
+ pv_ops.mmu.flush_tlb_multi = kvm_flush_tlb_multi;
pv_ops.mmu.tlb_remove_table = tlb_remove_table;
}
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 0aa6256eedd8..6af40844a730 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -363,7 +363,7 @@ struct paravirt_patch_template pv_ops = {
.mmu.flush_tlb_user = native_flush_tlb,
.mmu.flush_tlb_kernel = native_flush_tlb_global,
.mmu.flush_tlb_one_user = native_flush_tlb_one_user,
- .mmu.flush_tlb_others = native_flush_tlb_others,
+ .mmu.flush_tlb_multi = native_flush_tlb_multi,
.mmu.tlb_remove_table =
(void (*)(struct mmu_gather *, void *))tlb_remove_page,
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index abbf55fa8b81..63c00908bdd9 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -551,7 +551,7 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
* garbage into our TLB. Since switching to init_mm is barely
* slower than a minimal flush, just switch to init_mm.
*
- * This should be rare, with native_flush_tlb_others skipping
+ * This should be rare, with native_flush_tlb_multi() skipping
* IPIs to lazy TLB mode CPUs.
*/
switch_mm_irqs_off(NULL, &init_mm, NULL);
@@ -665,9 +665,14 @@ static bool tlb_is_not_lazy(int cpu)
static DEFINE_PER_CPU(cpumask_t, flush_tlb_mask);
-void native_flush_tlb_others(const struct cpumask *cpumask,
- const struct flush_tlb_info *info)
+void native_flush_tlb_multi(const struct cpumask *cpumask,
+ const struct flush_tlb_info *info)
{
+ /*
+ * Do accounting and tracing. Note that there are (and have always been)
+ * cases in which a remote TLB flush will be traced, but eventually
+ * would not happen.
+ */
count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
if (info->end == TLB_FLUSH_ALL)
trace_tlb_flush(TLB_REMOTE_SEND_IPI, TLB_FLUSH_ALL);
@@ -687,10 +692,12 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
* means that the percpu tlb_gen variables won't be updated
* and we'll do pointless flushes on future context switches.
*
- * Rather than hooking native_flush_tlb_others() here, I think
+ * Rather than hooking native_flush_tlb_multi() here, I think
* that UV should be updated so that smp_call_function_many(),
* etc, are optimal on UV.
*/
+ flush_tlb_func_local((void *)info);
+
cpumask = uv_flush_tlb_others(cpumask, info);
if (cpumask)
smp_call_function_many(cpumask, flush_tlb_func_remote,
@@ -709,8 +716,9 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
* doing a speculative memory access.
*/
if (info->freed_tables) {
- smp_call_function_many(cpumask, flush_tlb_func_remote,
- (void *)info, 1);
+ __smp_call_function_many(cpumask, flush_tlb_func_remote,
+ flush_tlb_func_local,
+ (void *)info, 1);
} else {
/*
* Although we could have used on_each_cpu_cond_mask(),
@@ -737,7 +745,8 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
if (tlb_is_not_lazy(cpu))
__cpumask_set_cpu(cpu, cond_cpumask);
}
- smp_call_function_many(cond_cpumask, flush_tlb_func_remote,
+ __smp_call_function_many(cond_cpumask, flush_tlb_func_remote,
+ flush_tlb_func_local,
(void *)info, 1);
}
}
@@ -818,16 +827,20 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
info = get_flush_tlb_info(mm, start, end, stride_shift, freed_tables,
new_tlb_gen);
- if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
+ /*
+ * flush_tlb_multi() is not optimized for the common case in which only
+ * a local TLB flush is needed. Optimize this use-case by calling
+ * flush_tlb_func_local() directly in this case.
+ */
+ if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
+ flush_tlb_multi(mm_cpumask(mm), info);
+ } else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
lockdep_assert_irqs_enabled();
local_irq_disable();
flush_tlb_func_local(info);
local_irq_enable();
}
- if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
- flush_tlb_others(mm_cpumask(mm), info);
-
put_flush_tlb_info();
put_cpu();
}
@@ -890,16 +903,20 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
{
int cpu = get_cpu();
- if (cpumask_test_cpu(cpu, &batch->cpumask)) {
+ /*
+ * flush_tlb_multi() is not optimized for the common case in which only
+ * a local TLB flush is needed. Optimize this use-case by calling
+ * flush_tlb_func_local() directly in this case.
+ */
+ if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
+ flush_tlb_multi(&batch->cpumask, &full_flush_tlb_info);
+ } else if (cpumask_test_cpu(cpu, &batch->cpumask)) {
lockdep_assert_irqs_enabled();
local_irq_disable();
flush_tlb_func_local((void *)&full_flush_tlb_info);
local_irq_enable();
}
- if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids)
- flush_tlb_others(&batch->cpumask, &full_flush_tlb_info);
-
cpumask_clear(&batch->cpumask);
put_cpu();
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 26e8b326966d..48f7c7eb4dbc 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -1345,8 +1345,8 @@ static void xen_flush_tlb_one_user(unsigned long addr)
preempt_enable();
}
-static void xen_flush_tlb_others(const struct cpumask *cpus,
- const struct flush_tlb_info *info)
+static void xen_flush_tlb_multi(const struct cpumask *cpus,
+ const struct flush_tlb_info *info)
{
struct {
struct mmuext_op op;
@@ -1356,7 +1356,7 @@ static void xen_flush_tlb_others(const struct cpumask *cpus,
const size_t mc_entry_size = sizeof(args->op) +
sizeof(args->mask[0]) * BITS_TO_LONGS(num_possible_cpus());
- trace_xen_mmu_flush_tlb_others(cpus, info->mm, info->start, info->end);
+ trace_xen_mmu_flush_tlb_multi(cpus, info->mm, info->start, info->end);
if (cpumask_empty(cpus))
return; /* nothing to do */
@@ -1365,9 +1365,8 @@ static void xen_flush_tlb_others(const struct cpumask *cpus,
args = mcs.args;
args->op.arg2.vcpumask = to_cpumask(args->mask);
- /* Remove us, and any offline CPUS. */
+ /* Remove any offline CPUs */
cpumask_and(to_cpumask(args->mask), cpus, cpu_online_mask);
- cpumask_clear_cpu(smp_processor_id(), to_cpumask(args->mask));
args->op.cmd = MMUEXT_TLB_FLUSH_MULTI;
if (info->end != TLB_FLUSH_ALL &&
@@ -2396,7 +2395,7 @@ static const struct pv_mmu_ops xen_mmu_ops __initconst = {
.flush_tlb_user = xen_flush_tlb,
.flush_tlb_kernel = xen_flush_tlb,
.flush_tlb_one_user = xen_flush_tlb_one_user,
- .flush_tlb_others = xen_flush_tlb_others,
+ .flush_tlb_multi = xen_flush_tlb_multi,
.tlb_remove_table = tlb_remove_table,
.pgd_alloc = xen_pgd_alloc,
diff --git a/include/trace/events/xen.h b/include/trace/events/xen.h
index 9a0e8af21310..546022acf160 100644
--- a/include/trace/events/xen.h
+++ b/include/trace/events/xen.h
@@ -362,7 +362,7 @@ TRACE_EVENT(xen_mmu_flush_tlb_one_user,
TP_printk("addr %lx", __entry->addr)
);
-TRACE_EVENT(xen_mmu_flush_tlb_others,
+TRACE_EVENT(xen_mmu_flush_tlb_multi,
TP_PROTO(const struct cpumask *cpus, struct mm_struct *mm,
unsigned long addr, unsigned long end),
TP_ARGS(cpus, mm, addr, end),
--
2.20.1
^ permalink raw reply related
* [PATCH v3 0/9] x86: Concurrent TLB flushes
From: Nadav Amit @ 2019-07-19 0:58 UTC (permalink / raw)
To: Andy Lutomirski, Dave Hansen
Cc: x86, linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
Nadav Amit, Borislav Petkov, Boris Ostrovsky, Haiyang Zhang,
Josh Poimboeuf, Juergen Gross, K. Y. Srinivasan, Paolo Bonzini,
Rik van Riel, Sasha Levin, Stephen Hemminger, kvm, linux-hyperv,
virtualization, xen-devel
[ Cover-letter is identical to v2, including benchmark results,
excluding the change log. ]
Currently, local and remote TLB flushes are not performed concurrently,
which introduces unnecessary overhead - each INVLPG can take 100s of
cycles. This patch-set allows TLB flushes to be run concurrently: first
request the remote CPUs to initiate the flush, then run it locally, and
finally wait for the remote CPUs to finish their work.
In addition, there are various small optimizations to avoid unwarranted
false-sharing and atomic operations.
The proposed changes should also improve the performance of other
invocations of on_each_cpu(). Hopefully, no one has relied on this
behavior of on_each_cpu() that invoked functions first remotely and only
then locally [Peter says he remembers someone might do so, but without
further information it is hard to know how to address it].
Running sysbench on dax/ext4 w/emulated-pmem, write-cache disabled on
2-socket, 48-logical-cores (24+SMT) Haswell-X, 5 repetitions:
sysbench fileio --file-total-size=3G --file-test-mode=rndwr \
--file-io-mode=mmap --threads=X --file-fsync-mode=fdatasync run
Th. tip-jun28 avg (stdev) +patch-set avg (stdev) change
--- --------------------- ---------------------- ------
1 1267765 (14146) 1299253 (5715) +2.4%
2 1734644 (11936) 1799225 (19577) +3.7%
4 2821268 (41184) 2919132 (40149) +3.4%
8 4171652 (31243) 4376925 (65416) +4.9%
16 5590729 (24160) 5829866 (8127) +4.2%
24 6250212 (24481) 6522303 (28044) +4.3%
32 3994314 (26606) 4077543 (10685) +2.0%
48 4345177 (28091) 4417821 (41337) +1.6%
(Note that on configurations with up to 24 threads numactl was used to
set all threads on socket 1, which explains the drop in performance when
going to 32 threads).
Running the same benchmark with security mitigations disabled (PTI,
Spectre, MDS):
Th. tip-jun28 avg (stdev) +patch-set avg (stdev) change
--- --------------------- ---------------------- ------
1 1598896 (5174) 1607903 (4091) +0.5%
2 2109472 (17827) 2224726 (4372) +5.4%
4 3448587 (11952) 3668551 (30219) +6.3%
8 5425778 (29641) 5606266 (33519) +3.3%
16 6931232 (34677) 7054052 (27873) +1.7%
24 7612473 (23482) 7783138 (13871) +2.2%
32 4296274 (18029) 4283279 (32323) -0.3%
48 4770029 (35541) 4764760 (13575) -0.1%
Presumably, PTI requires two invalidations of each mapping, which allows
to get higher benefits from concurrency when PTI is on. At the same
time, when mitigations are on, other overheads reduce the potential
speedup.
I tried to reduce the size of the code of the main patch, which required
restructuring of the series.
v2 -> v3:
* Open-code the remote/local-flush decision code [Andy]
* Fix hyper-v, Xen implementations [Andrew]
* Fix redundant TLB flushes.
v1 -> v2:
* Removing the patches that Thomas took [tglx]
* Adding hyper-v, Xen compile-tested implementations [Dave]
* Removing UV [Andy]
* Adding lazy optimization, removing inline keyword [Dave]
* Restructuring patch-set
RFCv2 -> v1:
* Fix comment on flush_tlb_multi [Juergen]
* Removing async invalidation optimizations [Andy]
* Adding KVM support [Paolo]
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm@vger.kernel.org
Cc: linux-hyperv@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: x86@kernel.org
Cc: xen-devel@lists.xenproject.org
Nadav Amit (9):
smp: Run functions concurrently in smp_call_function_many()
x86/mm/tlb: Remove reason as argument for flush_tlb_func_local()
x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()
x86/mm/tlb: Flush remote and local TLBs concurrently
x86/mm/tlb: Privatize cpu_tlbstate
x86/mm/tlb: Do not make is_lazy dirty for no reason
cpumask: Mark functions as pure
x86/mm/tlb: Remove UV special case
x86/mm/tlb: Remove unnecessary uses of the inline keyword
arch/x86/hyperv/mmu.c | 10 +-
arch/x86/include/asm/paravirt.h | 6 +-
arch/x86/include/asm/paravirt_types.h | 4 +-
arch/x86/include/asm/tlbflush.h | 47 ++++-----
arch/x86/include/asm/trace/hyperv.h | 2 +-
arch/x86/kernel/kvm.c | 11 ++-
arch/x86/kernel/paravirt.c | 2 +-
arch/x86/mm/init.c | 2 +-
arch/x86/mm/tlb.c | 133 ++++++++++++++++----------
arch/x86/xen/mmu_pv.c | 11 +--
include/linux/cpumask.h | 6 +-
include/linux/smp.h | 27 ++++--
include/trace/events/xen.h | 2 +-
kernel/smp.c | 133 ++++++++++++--------------
14 files changed, 218 insertions(+), 178 deletions(-)
--
2.20.1
^ permalink raw reply
* RE: [PATCH] x86/hyper-v: Zero out the VP assist page to fix CPU offlining
From: Dexuan Cui @ 2019-07-18 8:04 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Haiyang Zhang, KY Srinivasan, Stephen Hemminger, Sasha Levin,
linux-hyperv@vger.kernel.org, Michael Kelley, Long Li, vkuznets,
Ingo Molnar, H. Peter Anvin, Borislav Petkov, x86@kernel.org,
linux-kernel@vger.kernel.org, marcelo.cerri@canonical.com,
driverdev-devel@linuxdriverproject.org, olaf@aepfle.de,
apw@canonical.com, jasowang@redhat.com
In-Reply-To: <alpine.DEB.2.21.1907180955130.1778@nanos.tec.linutronix.de>
> From: Thomas Gleixner <tglx@linutronix.de>
> Sent: Thursday, July 18, 2019 12:56 AM
> To: Dexuan Cui <decui@microsoft.com>
>
> On Thu, 18 Jul 2019, Dexuan Cui wrote:
> >
> > The concept of the "overlay page" seems weird, and frankly speaking,
> > I don't really understand why the Hyper-V guys invented it, but as far
> > as this patch here is concerned, I think the patch is safe and it can
> > indeed fix the CPU offlining issue I described.
>
> Then this needs some really good explanation and in the change log because
> that's really obscure behaviour.
>
> tglx
Agreed. I'll combine my replies into the changelog and post a v2 of
the one-line patch.
Thanks,
-- Dexuan
^ permalink raw reply
* RE: [PATCH] x86/hyper-v: Zero out the VP assist page to fix CPU offlining
From: Thomas Gleixner @ 2019-07-18 7:56 UTC (permalink / raw)
To: Dexuan Cui
Cc: Haiyang Zhang, KY Srinivasan, Stephen Hemminger, Sasha Levin,
linux-hyperv@vger.kernel.org, Michael Kelley, Long Li, vkuznets,
Ingo Molnar, H. Peter Anvin, Borislav Petkov, x86@kernel.org,
linux-kernel@vger.kernel.org, marcelo.cerri@canonical.com,
driverdev-devel@linuxdriverproject.org, olaf@aepfle.de,
apw@canonical.com, jasowang@redhat.com
In-Reply-To: <PU1P153MB0169BE20761D77E7FD9A3D57BFC80@PU1P153MB0169.APCP153.PROD.OUTLOOK.COM>
On Thu, 18 Jul 2019, Dexuan Cui wrote:
>
> The concept of the "overlay page" seems weird, and frankly speaking,
> I don't really understand why the Hyper-V guys invented it, but as far
> as this patch here is concerned, I think the patch is safe and it can
> indeed fix the CPU offlining issue I described.
Then this needs some really good explanation and in the change log because
that's really obscure behaviour.
Thanks,
tglx
^ permalink raw reply
* RE: [PATCH] x86/hyper-v: Zero out the VP assist page to fix CPU offlining
From: Dexuan Cui @ 2019-07-18 7:52 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Haiyang Zhang, KY Srinivasan, Stephen Hemminger, Sasha Levin,
linux-hyperv@vger.kernel.org, Michael Kelley, Long Li, vkuznets,
Ingo Molnar, H. Peter Anvin, Borislav Petkov, x86@kernel.org,
linux-kernel@vger.kernel.org, marcelo.cerri@canonical.com,
driverdev-devel@linuxdriverproject.org, olaf@aepfle.de,
apw@canonical.com, jasowang@redhat.com
In-Reply-To: <alpine.DEB.2.21.1907180846290.1778@nanos.tec.linutronix.de>
> From: Thomas Gleixner <tglx@linutronix.de>
> Sent: Thursday, July 18, 2019 12:01 AM
> ...
> Those are two different things. The GPF_ZERO allocation makes sense on its
> own but it _cannot_ prevent the following scenario:
Hi tglx,
The scenario can be prevented.
The VP ASSIST PAGE is an "overlay" page (please see Hyper-V TLFS's Section
5.2.1 "GPA Overlay Pages", on page 38 of the spec).
The spec can be downloaded from
https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/reference/tlfs
(choose the v5.0c release)
Here is an excerpt of the section:
"
The hypervisor defines several special pages that "overlay" the guest's GPA
space. The hypercall code page is an example of an overlay page. Overlays
are addressed by Guest Physical Addresses (GPA) but are not included in the
normal GPA map maintained internally by the hypervisor. Conceptually,
they exist in a separate map that overlays the GPA map.
If a page within the GPA space is overlaid, any SPA page mapped to the
GPA page is effectively "obscured" and generally unreachable by the
virtual processor through processor memory accesses.
...
If an overlay page is disabled or is moved to a new location in the GPA
space, the underlying GPA page is "uncovered", and an existing
mapping becomes accessible to the guest.
"
Here, SPA = System Physical Address = the final real physical address.
> cpu_init()
> if (!hvp)
> hvp = vmalloc(...., GFP_ZERO);
> ...
>
> hvp->apic_assist |= 1;
When the VP ASSIST PAGE feature is enabled and the hypervisor sets
the bit in hvp->apic_assist, the bit belongs to the special SPA page.
> #1 cpu_die()
> if (....)
> wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, 0);
After the VP ASSIST PAGE is disabled, hvp->apic_assist belongs to
the "normal" SPA page mapped to the GPA.
> ---> IPI
> if (!(hvp->apic_assist & 1))
> wrmsr(APIC_EOI); <- PATH not taken
So, with the one-line patch or the three-line patch, here we're sure
vp->apic_assist.bit0 must be 0.
> #3 cpu is dead
>
> cpu_init()
> if (!hvp)
> hvp = vmalloc(....m, GFP_ZERO); <- NOT TAKEN because hvp !=
> NULL
It does not matter, because with the 1-line patch, the initial content of
the "normal" SPA page is filled with zeros; later, neither the hypervisor nor
the guest writes into the page, so the page always remains with zeros.
> So you have to come up with a better fairy tale why GFP_ZERO 'fixes' this.
>
> Allocating hvp with GFP_ZERO makes sense on it's own so the allocated
> memory has a defined state, but that's a different story.
>
> The 3 liner patch above makes way more sense and you can spare the
> local_irq_save/restore by moving the whole condition into the
> irq_save/restore region above.
>
> tglx
The concept of the "overlay page" seems weird, and frankly speaking,
I don't really understand why the Hyper-V guys invented it, but as far
as this patch here is concerned, I think the patch is safe and it can
indeed fix the CPU offlining issue I described.
Thanks,
-- Dexuan
^ permalink raw reply
* RE: [PATCH] x86/hyper-v: Zero out the VP assist page to fix CPU offlining
From: Thomas Gleixner @ 2019-07-18 7:00 UTC (permalink / raw)
To: Dexuan Cui
Cc: Haiyang Zhang, KY Srinivasan, Stephen Hemminger, Sasha Levin,
linux-hyperv@vger.kernel.org, Michael Kelley, Long Li, vkuznets,
Ingo Molnar, H. Peter Anvin, Borislav Petkov, x86@kernel.org,
linux-kernel@vger.kernel.org, marcelo.cerri@canonical.com,
driverdev-devel@linuxdriverproject.org, olaf@aepfle.de,
apw@canonical.com, jasowang@redhat.com
In-Reply-To: <PU1P153MB01693AB444C4A432FBA2507BBFC80@PU1P153MB0169.APCP153.PROD.OUTLOOK.COM>
On Thu, 18 Jul 2019, Dexuan Cui wrote:
> > On Thu, 4 Jul 2019, Dexuan Cui wrote:
> > This is the allocation when the CPU is brought online for the first
> > time. So what effect has zeroing at allocation time vs. offlining and
> > potentially receiving IPIs? That allocation is never freed.
> >
> > Neither the comment nor the changelog make any sense to me.
> > tglx
>
> That allocation was introduced by the commit
> a46d15cc1ae5 ("x86/hyper-v: allocate and use Virtual Processor Assist Pages").
>
> I think it's ok to not free the page when a CPU is offlined: every
> CPU uses only 1 page and CPU offlining is not really a very usual
> operation except for the scenario of hibernation (and suspend-to-memory),
> where the CPUs are quickly onlined again, when we resume from hibernation.
> IMO Vitaly intentionally decided to not free the page for simplicity of the
> code.
>
> When a CPU (e.g. CPU1) is being onlined, in hv_cpu_init(), we allocate the
> VP_ASSIST_PAGE page and enable the PV EOI optimization for this CPU by
> writing the MSR HV_X64_MSR_VP_ASSIST_PAGE. From now on, this CPU
> *always* uses hvp->apic_assist (which is updated by the hypervisor) to
> decide if it needs to write the EOI MSR:
>
> static void hv_apic_eoi_write(u32 reg, u32 val)
> {
> struct hv_vp_assist_page *hvp = hv_vp_assist_page[smp_processor_id()];
>
> if (hvp && (xchg(&hvp->apic_assist, 0) & 0x1))
> return;
>
> wrmsr(HV_X64_MSR_EOI, val, 0);
> }
>
> When a CPU (e.g. CPU1) is being offlined, on this CPU, we do:
> 1. in hv_cpu_die(), we disable the PV EOI optimizaton for this CPU;
> 2. we finish the remaining work of stopping this CPU;
> 3. this CPU is completed stopped.
>
> Between 1 and 3, this CPU can still receive interrupts (e.g. IPIs from CPU0,
> and Local APIC timer interrupts), and this CPU *must* write the EOI MSR for
> every interrupt received, otherwise the hypervisor may not deliver further
> interrupts, which may be needed to stop this CPU completely.
>
> So we need to make sure hvp->apic_assist.bit0 is zero, after we run the line
> "wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, 0);" in hv_cpu_die(). The easiest
> way is what I do in this patch. Alternatively, we can use the below patch:
>
> @@ -188,8 +188,12 @@ static int hv_cpu_die(unsigned int cpu)
> local_irq_restore(flags);
> free_page((unsigned long)input_pg);
>
> - if (hv_vp_assist_page && hv_vp_assist_page[cpu])
> + if (hv_vp_assist_page && hv_vp_assist_page[cpu]) {
> + local_irq_save(flags);
> wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, 0);
> + hvp->apic_assist &= ~1;
> + local_irq_restore(flags);
> + }
>
> if (hv_reenlightenment_cb == NULL)
> return 0;
>
> This second version needs 3+ lines, so I prefer the one-line version. :-)
Those are two different things. The GPF_ZERO allocation makes sense on it's
own but it _cannot_ prevent the following scenario:
cpu_init()
if (!hvp)
hvp = vmalloc(...., GFP_ZERO);
...
hvp->apic_assist |= 1;
#1 cpu_die()
if (....)
wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, 0);
---> IPI
if (!(hvp->apic_assist & 1))
wrmsr(APIC_EOI); <- PATH not taken
#3 cpu is dead
cpu_init()
if (!hvp)
hvp = vmalloc(....m, GFP_ZERO); <- NOT TAKEN because hvp != NULL
So you have to come up with a better fairy tale why GFP_ZERO 'fixes' this.
Allocating hvp with GFP_ZERO makes sense on it's own so the allocated
memory has a defined state, but that's a different story.
The 3 liner patch above makes way more sense and you can spare the
local_irq_save/restore by moving the whole condition into the
irq_save/restore region above.
Thanks,
tglx
^ permalink raw reply
* Re: [PATCH v4 3/5] hv: vmbus: Replace page definition with Hyper-V specific one
From: Sasha Levin @ 2019-07-18 2:22 UTC (permalink / raw)
To: Maya Nakamura
Cc: mikelley, kys, haiyangz, sthemmin, x86, linux-hyperv,
linux-kernel, tglx
In-Reply-To: <0d9e80ecabcc950dc279fdd2e39bea4060123ba4.1562916939.git.m.maya.nakamura@gmail.com>
On Fri, Jul 12, 2019 at 08:25:18AM +0000, Maya Nakamura wrote:
>Replace PAGE_SIZE with HV_HYP_PAGE_SIZE because the guest page size may
>not be 4096 on all architectures and Hyper-V always runs with a page
>size of 4096.
>
>Signed-off-by: Maya Nakamura <m.maya.nakamura@gmail.com>
>Reviewed-by: Michael Kelley <mikelley@microsoft.com>
>Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Thomas, if you're taking this series, could you grab this patch as well
please (dependencies)?
Acked-by: Sasha Levin <sashal@kernel.org>
--
Thanks,
Sasha
^ permalink raw reply
* RE: [PATCH] x86/hyper-v: Zero out the VP assist page to fix CPU offlining
From: Dexuan Cui @ 2019-07-18 1:22 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Haiyang Zhang, KY Srinivasan, Stephen Hemminger, Sasha Levin,
linux-hyperv@vger.kernel.org, Michael Kelley, Long Li, vkuznets,
Ingo Molnar, H. Peter Anvin, Borislav Petkov, x86@kernel.org,
linux-kernel@vger.kernel.org, marcelo.cerri@canonical.com,
driverdev-devel@linuxdriverproject.org, olaf@aepfle.de,
apw@canonical.com, jasowang@redhat.com
In-Reply-To: <alpine.DEB.2.21.1907180058210.1778@nanos.tec.linutronix.de>
> From: Thomas Gleixner <tglx@linutronix.de>
> Sent: Wednesday, July 17, 2019 4:04 PM
> To: Dexuan Cui <decui@microsoft.com>
> ...
> On Thu, 4 Jul 2019, Dexuan Cui wrote:
> > When a CPU is being offlined, the CPU usually still receives a few
> > interrupts (e.g. reschedule IPIs), after hv_cpu_die() disables the
> > HV_X64_MSR_VP_ASSIST_PAGE, so hv_apic_eoi_write() may not write
> > the EOI MSR, if the apic_assist field's bit0 happens to be 1; as a result,
> > Hyper-V may not be able to deliver all the interrupts to the CPU, and the
> > CPU may not be stopped, and the kernel will hang soon.
> >
> > The VP ASSIST PAGE is an "overlay" page (see Hyper-V TLFS's Section
> > 5.2.1 "GPA Overlay Pages"), so with this fix we're sure the apic_assist
> > field is still zero, after the VP ASSIST PAGE is disabled.
> >
> > Fixes: ba696429d290 ("x86/hyper-v: Implement EOI assist")
> > Signed-off-by: Dexuan Cui <decui@microsoft.com>
> >
> > diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> > index 0e033ef11a9f..db51a301f759 100644
> > --- a/arch/x86/hyperv/hv_init.c
> > +++ b/arch/x86/hyperv/hv_init.c
> > @@ -60,8 +60,14 @@ static int hv_cpu_init(unsigned int cpu)
> > if (!hv_vp_assist_page)
> > return 0;
> >
> > + /*
> > + * The ZERO flag is necessary, because in the case of CPU offlining
> > + * the page can still be used by hv_apic_eoi_write() for a while,
> > + * after the VP ASSIST PAGE is disabled in hv_cpu_die().
> > + */
> > if (!*hvp)
> > - *hvp = __vmalloc(PAGE_SIZE, GFP_KERNEL, PAGE_KERNEL);
> > + *hvp = __vmalloc(PAGE_SIZE, GFP_KERNEL | __GFP_ZERO,
> > + PAGE_KERNEL);
>
> This is the allocation when the CPU is brought online for the first
> time. So what effect has zeroing at allocation time vs. offlining and
> potentially receiving IPIs? That allocation is never freed.
>
> Neither the comment nor the changelog make any sense to me.
> tglx
That allocation was introduced by the commit
a46d15cc1ae5 ("x86/hyper-v: allocate and use Virtual Processor Assist Pages").
I think it's ok to not free the page when a CPU is offlined: every
CPU uses only 1 page and CPU offlining is not really a very usual
operation except for the scenario of hibernation (and suspend-to-memory),
where the CPUs are quickly onlined again, when we resume from hibernation.
IMO Vitaly intentionally decided to not free the page for simplicity of the
code.
When a CPU (e.g. CPU1) is being onlined, in hv_cpu_init(), we allocate the
VP_ASSIST_PAGE page and enable the PV EOI optimization for this CPU by
writing the MSR HV_X64_MSR_VP_ASSIST_PAGE. From now on, this CPU
*always* uses hvp->apic_assist (which is updated by the hypervisor) to
decide if it needs to write the EOI MSR:
static void hv_apic_eoi_write(u32 reg, u32 val)
{
struct hv_vp_assist_page *hvp = hv_vp_assist_page[smp_processor_id()];
if (hvp && (xchg(&hvp->apic_assist, 0) & 0x1))
return;
wrmsr(HV_X64_MSR_EOI, val, 0);
}
When a CPU (e.g. CPU1) is being offlined, on this CPU, we do:
1. in hv_cpu_die(), we disable the PV EOI optimizaton for this CPU;
2. we finish the remaining work of stopping this CPU;
3. this CPU is completed stopped.
Between 1 and 3, this CPU can still receive interrupts (e.g. IPIs from CPU0,
and Local APIC timer interrupts), and this CPU *must* write the EOI MSR for
every interrupt received, otherwise the hypervisor may not deliver further
interrupts, which may be needed to stop this CPU completely.
So we need to make sure hvp->apic_assist.bit0 is zero, after we run the line
"wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, 0);" in hv_cpu_die(). The easiest
way is what I do in this patch. Alternatively, we can use the below patch:
@@ -188,8 +188,12 @@ static int hv_cpu_die(unsigned int cpu)
local_irq_restore(flags);
free_page((unsigned long)input_pg);
- if (hv_vp_assist_page && hv_vp_assist_page[cpu])
+ if (hv_vp_assist_page && hv_vp_assist_page[cpu]) {
+ local_irq_save(flags);
wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, 0);
+ hvp->apic_assist &= ~1;
+ local_irq_restore(flags);
+ }
if (hv_reenlightenment_cb == NULL)
return 0;
This second version needs 3+ lines, so I prefer the one-line version. :-)
Thanks,
-- Dexuan
^ permalink raw reply
* Re: [PATCH] x86/hyper-v: Zero out the VP assist page to fix CPU offlining
From: Thomas Gleixner @ 2019-07-17 23:03 UTC (permalink / raw)
To: Dexuan Cui
Cc: Haiyang Zhang, KY Srinivasan, Stephen Hemminger, Sasha Levin,
linux-hyperv@vger.kernel.org, Michael Kelley, Long Li, vkuznets,
Ingo Molnar, H. Peter Anvin, Borislav Petkov, x86@kernel.org,
linux-kernel@vger.kernel.org, marcelo.cerri@canonical.com,
driverdev-devel@linuxdriverproject.org, olaf@aepfle.de,
apw@canonical.com, jasowang@redhat.com
In-Reply-To: <PU1P153MB01697CBE66649B4BA91D8B48BFFA0@PU1P153MB0169.APCP153.PROD.OUTLOOK.COM>
Dexuan,
On Thu, 4 Jul 2019, Dexuan Cui wrote:
> When a CPU is being offlined, the CPU usually still receives a few
> interrupts (e.g. reschedule IPIs), after hv_cpu_die() disables the
> HV_X64_MSR_VP_ASSIST_PAGE, so hv_apic_eoi_write() may not write the EOI
> MSR, if the apic_assist field's bit0 happens to be 1; as a result, Hyper-V
> may not be able to deliver all the interrupts to the CPU, and the CPU may
> not be stopped, and the kernel will hang soon.
>
> The VP ASSIST PAGE is an "overlay" page (see Hyper-V TLFS's Section
> 5.2.1 "GPA Overlay Pages"), so with this fix we're sure the apic_assist
> field is still zero, after the VP ASSIST PAGE is disabled.
>
> Fixes: ba696429d290 ("x86/hyper-v: Implement EOI assist")
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> ---
> arch/x86/hyperv/hv_init.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> index 0e033ef11a9f..db51a301f759 100644
> --- a/arch/x86/hyperv/hv_init.c
> +++ b/arch/x86/hyperv/hv_init.c
> @@ -60,8 +60,14 @@ static int hv_cpu_init(unsigned int cpu)
> if (!hv_vp_assist_page)
> return 0;
>
> + /*
> + * The ZERO flag is necessary, because in the case of CPU offlining
> + * the page can still be used by hv_apic_eoi_write() for a while,
> + * after the VP ASSIST PAGE is disabled in hv_cpu_die().
> + */
> if (!*hvp)
> - *hvp = __vmalloc(PAGE_SIZE, GFP_KERNEL, PAGE_KERNEL);
> + *hvp = __vmalloc(PAGE_SIZE, GFP_KERNEL | __GFP_ZERO,
> + PAGE_KERNEL);
This is the allocation when the CPU is brought online for the first
time. So what effect has zeroing at allocation time vs. offlining and
potentially receiving IPIs? That allocation is never freed.
Neither the comment nor the changelog make any sense to me.
Thanks,
tglx
^ permalink raw reply
* Re: properly communicate queue limits to the DMA layer v2
From: Martin K. Petersen @ 2019-07-17 3:07 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Martin K. Petersen, Sagi Grimberg, Max Gurtovoy, Bart Van Assche,
linux-rdma, linux-scsi, megaraidlinux.pdl, MPT-FusionLinux.pdl,
linux-hyperv, linux-kernel
In-Reply-To: <20190715174617.GA11094@lst.de>
Christoph,
> I think all the patches on the block side went into 5.2, but it's been
> a while, so I might misremember..
I checked my notes and the reason I held them back was that I was
waiting for a response from Broadcom wrt. the megaraid segment size
limitation. However, given that mpt3sas was acked, I assume it's the
same thing.
I'm not so keen on how big the last batch of patches for the merge
window is getting. But I queued your fixes up for 5.3.
--
Martin K. Petersen Oracle Linux Engineering
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox