Linux-HyperV List
 help / color / mirror / Atom feed
* Negocjacje-wskaźnik skuteczności
From:  Tomasz Wiewiór  @ 2023-06-07  8:11 UTC (permalink / raw)
  To: linux-hyperv

Szanowni Państwo,

jako specjaliści w dziedzinie badania potrzeb i projektowania programów szkoleniowych, wiemy, jak ważne jest zapewnienie odpowiednich narzędzi i wiedzy, aby skutecznie zwiększyć efektywność zespołu, szczególnie w dużych organizacjach.

Zamknięte szkolenia są najlepszym rozwiązaniem dla tych, którzy chcą skupić się na konkretnych potrzebach i zagadnieniach. 

Nasza metodyka opiera się na identyfikacji obszarów, które wymagają poprawy, a następnie dostarczaniu spersonalizowanych rozwiązań, które pomagają osiągnąć cele biznesowe.

Z powodzeniem prowadziliśmy projekty szkoleniowe dla renomowanych marek, takich jak: PZU, Bank Pekao S.A., PWC, Ronson Development, Gedeon Richter i wielu innych.

Możemy niezobowiązująco porozmawiać o możliwościach zorganizowania szkolenia dla Państwa firmy?


Pozdrawiam
Tomasz Wiewiór

^ permalink raw reply

* Re: [PATCH 4/9] drivers: hv: Mark shared pages unencrypted in SEV-SNP enlightened guest
From: Tianyu Lan @ 2023-06-07  8:16 UTC (permalink / raw)
  To: Vitaly Kuznetsov, kys, haiyangz, wei.liu, decui, tglx, mingo, bp,
	dave.hansen, x86, hpa, daniel.lezcano, arnd, michael.h.kelley
  Cc: Tianyu Lan, linux-arch, linux-hyperv, linux-kernel
In-Reply-To: <87zg5ejchp.fsf@redhat.com>

On 6/5/2023 8:54 PM, Vitaly Kuznetsov wrote:
>> @@ -402,7 +417,14 @@ int hv_common_cpu_die(unsigned int cpu)
>>   
>>   	local_irq_restore(flags);
>>   
>> -	kfree(mem);
>> +	if (hv_isolation_type_en_snp()) {
>> +		ret = set_memory_encrypted((unsigned long)mem, pgcount);
>> +		if (ret)
>> +			pr_warn("Hyper-V: Failed to encrypt input arg on cpu%d: %d\n",
>> +				cpu, ret);
>> +		/* It's unsafe to free 'mem'. */
>> +		return 0;
> Why is it unsafe to free 'mem' if ret == 0? Also, why don't we want to
> proparate non-zero 'ret' from here to fail CPU offlining?
> 

Based on Michael's patch the mem will not be freed during cpu offline.
https://lwn.net/ml/linux-kernel/87cz2j5zrc.fsf@redhat.com/
So I think it's unnessary to encrypt the mem again here.

^ permalink raw reply

* Re: [PATCH v6] hv_netvsc: Allocate rx indirection table size dynamically
From: Simon Horman @ 2023-06-07  7:49 UTC (permalink / raw)
  To: Shradha Gupta
  Cc: linux-kernel, linux-hyperv, netdev, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Long Li, Michael Kelley, David S. Miller, Steen Hegelund,
	Praveen Kumar
In-Reply-To: <ZH8Bv624GxCf1PKq@corigine.com>

On Tue, Jun 06, 2023 at 11:52:06AM +0200, Simon Horman wrote:
> + Praveen Kumar
> 
> On Mon, Jun 05, 2023 at 04:30:06AM -0700, Shradha Gupta wrote:
> > Allocate the size of rx indirection table dynamically in netvsc
> > from the value of size provided by OID_GEN_RECEIVE_SCALE_CAPABILITIES
> > query instead of using a constant value of ITAB_NUM.
> > 
> > Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> > Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> > Tested-on: Ubuntu22 (azure VM, SKU size: Standard_F72s_v2)
> > Testcases:
> > 1. ethtool -x eth0 output
> > 2. LISA testcase:PERF-NETWORK-TCP-THROUGHPUT-MULTICONNECTION-NTTTCP-Synthetic
> > 3. LISA testcase:PERF-NETWORK-TCP-THROUGHPUT-MULTICONNECTION-NTTTCP-SRIOV
> 
> Hi Praveen, all,
> 
> it seems that there has not been any non-trivial review of this
> patchset since v3. But at that point there was some feedback
> from you. So I'd like to ask if you have any feedback on v6 [1].
> 
> [1] https://lore.kernel.org/all/1685964606-24690-1-git-send-email-shradhagupta@linux.microsoft.com/
> 

For the record:

I see that Praveen has now provided a Reviewed-by for v3 [2], thanks!

I think this patch is clear now.

Reviewed-by: Simon Horman <simon.horman@corigine.com>

[2] https://lore.kernel.org/all/bb461c30-3eb0-74c0-d637-c4a3bdf84565@linux.microsoft.com/

^ permalink raw reply

* Re: [PATCH v3] hv_netvsc: Allocate rx indirection table size dynamically
From: Praveen Kumar @ 2023-06-07  7:19 UTC (permalink / raw)
  To: Shradha Gupta
  Cc: linux-kernel, linux-hyperv, netdev, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Long Li, Michael Kelley, David S. Miller, Steen Hegelund,
	Simon Horman
In-Reply-To: <20230529111733.GA21447@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net>

On 5/29/2023 4:47 PM, Shradha Gupta wrote:
> On Mon, May 29, 2023 at 03:09:49PM +0530, Praveen Kumar wrote:
>> On 5/26/2023 11:32 AM, Shradha Gupta wrote:
>>> Allocate the size of rx indirection table dynamically in netvsc
>>> from the value of size provided by OID_GEN_RECEIVE_SCALE_CAPABILITIES
>>> query instead of using a constant value of ITAB_NUM.
>>>
>>> Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
>>> Tested-on: Ubuntu22 (azure VM, SKU size: Standard_F72s_v2)
>>> Testcases:
>>> 1. ethtool -x eth0 output
>>> 2. LISA testcase:PERF-NETWORK-TCP-THROUGHPUT-MULTICONNECTION-NTTTCP-Synthetic
>>> 3. LISA testcase:PERF-NETWORK-TCP-THROUGHPUT-MULTICONNECTION-NTTTCP-SRIOV
>>>
>>> ---
>>> Changes in v3:
>>>  * Changed the data type of rx_table_sz to u32
>>>  * Moved the rx indirection table free to rndis_filter_device_remove()
>>>  * Device add will fail with error if not enough memory is available
>>>  * Changed kzmalloc to kcalloc as suggested in checkpatch script
>>>  * Removed redundant log if memory allocation failed.
>>> ---
>>>  drivers/net/hyperv/hyperv_net.h   |  5 ++++-
>>>  drivers/net/hyperv/netvsc_drv.c   | 10 ++++++----
>>>  drivers/net/hyperv/rndis_filter.c | 27 +++++++++++++++++++++++----
>>>  3 files changed, 33 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
>>> index dd5919ec408b..c40868f287a9 100644
>>> --- a/drivers/net/hyperv/hyperv_net.h
>>> +++ b/drivers/net/hyperv/hyperv_net.h
>>> @@ -74,6 +74,7 @@ struct ndis_recv_scale_cap { /* NDIS_RECEIVE_SCALE_CAPABILITIES */
>>>  #define NDIS_RSS_HASH_SECRET_KEY_MAX_SIZE_REVISION_2   40
>>>  
>>>  #define ITAB_NUM 128
>>> +#define ITAB_NUM_MAX 256
>>>  
>>>  struct ndis_recv_scale_param { /* NDIS_RECEIVE_SCALE_PARAMETERS */
>>>  	struct ndis_obj_header hdr;
>>> @@ -1034,7 +1035,9 @@ struct net_device_context {
>>>  
>>>  	u32 tx_table[VRSS_SEND_TAB_SIZE];
>>>  
>>> -	u16 rx_table[ITAB_NUM];
>>> +	u16 *rx_table;
>>> +
>>> +	u32 rx_table_sz;
>>>  
>>>  	/* Ethtool settings */
>>>  	u8 duplex;
>>> diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
>>> index 0103ff914024..3ba3c8fb28a5 100644
>>> --- a/drivers/net/hyperv/netvsc_drv.c
>>> +++ b/drivers/net/hyperv/netvsc_drv.c
>>> @@ -1747,7 +1747,9 @@ static u32 netvsc_get_rxfh_key_size(struct net_device *dev)
>>>  
>>>  static u32 netvsc_rss_indir_size(struct net_device *dev)
>>>  {
>>> -	return ITAB_NUM;
>>> +	struct net_device_context *ndc = netdev_priv(dev);
>>> +
>>> +	return ndc->rx_table_sz;
>>>  }
>>>  
>>>  static int netvsc_get_rxfh(struct net_device *dev, u32 *indir, u8 *key,
>>> @@ -1766,7 +1768,7 @@ static int netvsc_get_rxfh(struct net_device *dev, u32 *indir, u8 *key,
>>>  
>>>  	rndis_dev = ndev->extension;
>>>  	if (indir) {
>>> -		for (i = 0; i < ITAB_NUM; i++)
>>> +		for (i = 0; i < ndc->rx_table_sz; i++)
>>>  			indir[i] = ndc->rx_table[i];
>>>  	}
>>>  
>>> @@ -1792,11 +1794,11 @@ static int netvsc_set_rxfh(struct net_device *dev, const u32 *indir,
>>>  
>>>  	rndis_dev = ndev->extension;
>>>  	if (indir) {
>>> -		for (i = 0; i < ITAB_NUM; i++)
>>> +		for (i = 0; i < ndc->rx_table_sz; i++)
>>>  			if (indir[i] >= ndev->num_chn)
>>>  				return -EINVAL;
>>>  
>>> -		for (i = 0; i < ITAB_NUM; i++)
>>> +		for (i = 0; i < ndc->rx_table_sz; i++)
>>>  			ndc->rx_table[i] = indir[i];
>>>  	}
>>>  
>>> diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c
>>> index eea777ec2541..dc7b9b326690 100644
>>> --- a/drivers/net/hyperv/rndis_filter.c
>>> +++ b/drivers/net/hyperv/rndis_filter.c
>>> @@ -21,6 +21,7 @@
>>>  #include <linux/rtnetlink.h>
>>>  #include <linux/ucs2_string.h>
>>>  #include <linux/string.h>
>>> +#include <linux/slab.h>
>>>  
>>>  #include "hyperv_net.h"
>>>  #include "netvsc_trace.h"
>>> @@ -927,7 +928,7 @@ static int rndis_set_rss_param_msg(struct rndis_device *rdev,
>>>  	struct rndis_set_request *set;
>>>  	struct rndis_set_complete *set_complete;
>>>  	u32 extlen = sizeof(struct ndis_recv_scale_param) +
>>> -		     4 * ITAB_NUM + NETVSC_HASH_KEYLEN;
>>> +		     4 * ndc->rx_table_sz + NETVSC_HASH_KEYLEN;
>>>  	struct ndis_recv_scale_param *rssp;
>>>  	u32 *itab;
>>>  	u8 *keyp;
>>> @@ -953,7 +954,7 @@ static int rndis_set_rss_param_msg(struct rndis_device *rdev,
>>>  	rssp->hashinfo = NDIS_HASH_FUNC_TOEPLITZ | NDIS_HASH_IPV4 |
>>>  			 NDIS_HASH_TCP_IPV4 | NDIS_HASH_IPV6 |
>>>  			 NDIS_HASH_TCP_IPV6;
>>> -	rssp->indirect_tabsize = 4*ITAB_NUM;
>>> +	rssp->indirect_tabsize = 4 * ndc->rx_table_sz;
>>>  	rssp->indirect_taboffset = sizeof(struct ndis_recv_scale_param);
>>>  	rssp->hashkey_size = NETVSC_HASH_KEYLEN;
>>>  	rssp->hashkey_offset = rssp->indirect_taboffset +
>>> @@ -961,7 +962,7 @@ static int rndis_set_rss_param_msg(struct rndis_device *rdev,
>>>  
>>>  	/* Set indirection table entries */
>>>  	itab = (u32 *)(rssp + 1);
>>> -	for (i = 0; i < ITAB_NUM; i++)
>>> +	for (i = 0; i < ndc->rx_table_sz; i++)
>>>  		itab[i] = ndc->rx_table[i];
>>>  
>>>  	/* Set hask key values */
>>> @@ -1548,6 +1549,17 @@ struct netvsc_device *rndis_filter_device_add(struct hv_device *dev,
>>>  	if (ret || rsscap.num_recv_que < 2)
>>>  		goto out;
>>>  
>>> +	if (rsscap.num_indirect_tabent &&
>>> +	    rsscap.num_indirect_tabent <= ITAB_NUM_MAX)
>>> +		ndc->rx_table_sz = rsscap.num_indirect_tabent;
>>> +	else
>>> +		ndc->rx_table_sz = ITAB_NUM;
>>> +
>>> +	ndc->rx_table = kcalloc(ndc->rx_table_sz, sizeof(u16),
>>> +				GFP_KERNEL);
>>> +	if (!ndc->rx_table)
>>> +		goto err_dev_remv;
>>> +
>>>  	/* This guarantees that num_possible_rss_qs <= num_online_cpus */
>>>  	num_possible_rss_qs = min_t(u32, num_online_cpus(),
>>>  				    rsscap.num_recv_que);
>>> @@ -1558,7 +1570,7 @@ struct netvsc_device *rndis_filter_device_add(struct hv_device *dev,
>>>  	net_device->num_chn = min(net_device->max_chn, device_info->num_chn);
>>>  
>>>  	if (!netif_is_rxfh_configured(net)) {
>>> -		for (i = 0; i < ITAB_NUM; i++)
>>> +		for (i = 0; i < ndc->rx_table_sz; i++)
>>>  			ndc->rx_table[i] = ethtool_rxfh_indir_default(
>>>  						i, net_device->num_chn);
>>>  	}
>>> @@ -1596,11 +1608,18 @@ void rndis_filter_device_remove(struct hv_device *dev,
>>>  				struct netvsc_device *net_dev)
>>>  {
>>>  	struct rndis_device *rndis_dev = net_dev->extension;
>>> +	struct net_device *net = hv_get_drvdata(dev);
>>> +	struct net_device_context *ndc = netdev_priv(net);
>>>  
>>>  	/* Halt and release the rndis device */
>>>  	rndis_filter_halt_device(net_dev, rndis_dev);
>>>  
>>>  	netvsc_device_remove(dev);
>>
>> Shouldn't the netvsc_device_remove be called post table cleanup ? or better, the cleanup should happen as part of netvsc_device_remove operation ? This looks a bug to me as with remove operation, we already cleaned up the device and the association between context and device is removed.
> 
> The netvsc_device_remove() function is responsible for cleaning up/removing the netvsc_device structures upon events like remove/suspend. The net_device and net_device_context structures(where the rx indirection table exists) remain untouched in netvsc_device_remove(). They(net_device, net_device_context) only get cleaned up in netvsc_remove(). So, the netvsc_device_remove() should not affect the cleanup for the rx indirection table, that we did.
> 

Thanks.

Reviewed-by: Praveen Kumar <kumarpraveen@linux.microsoft.com>


^ permalink raw reply

* Re: [RFC PATCH 1/1] sched/fair: Fix SMT balance dependency on CPU numbering
From: Chen Yu @ 2023-06-07  4:06 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Michael Kelley (LINUX), mingo@redhat.com, peterz@infradead.org,
	juri.lelli@redhat.com, dietmar.eggemann@arm.com,
	rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
	bristot@redhat.com, vschneid@redhat.com,
	linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org
In-Reply-To: <CAKfTPtAeA1SxLD7VQ0sVc2G0AAKrNs9ZxoZPj2uR8x5DZQiomQ@mail.gmail.com>

On 2023-06-06 at 17:38:28 +0200, Vincent Guittot wrote:
> On Tue, 6 Jun 2023 at 16:08, Michael Kelley (LINUX)
> <mikelley@microsoft.com> wrote:
> >
> > From: Vincent Guittot <vincent.guittot@linaro.org> Sent: Monday, June 5, 2023 2:31 AM
> > >
> > > Hi Michael,
> > >
> > > On Wed, 31 May 2023 at 19:55, Michael Kelley <mikelley@microsoft.com> wrote:
> > > >
> > > > With some CPU numbering schemes, the function select_idle_cpu() currently
> > > > has a subtle bias to return the first hyper-thread in a core. As a result
> > > > work is not evenly balanced across hyper-threads in a core. The difference
> > > > is often as much as 15 to 20 percentage points -- i.e., the first
> > > > hyper-thread might run at 45% load while the second hyper-thread runs at
> > > > only 30% load or less.
> > > >
> > > > Two likely CPU numbering schemes make sense with today's typical case
> > > > of two hyper-threads per core:
> > > >
> > > > A. Enumerate all the first hyper-theads in a core, then all the second
> > > >    hyper-threads in a core.  If a system has 8 cores with 16 hyper-threads,
> > > >    CPUs #0 and #8 are in the same core, as are CPUs #1 and #9, etc.
> > > >
> > > > B. Enumerate all hyper-threads in a core, then all hyper-threads in the
> > > >    next core, etc.  Again with 8 cores and 16 hyper-threads, CPUs #0 and
> > > >    #1 are in the same core, as are CPUs #2 and #3, etc.
> > > >
> > > > Scheme A is used in most ACPI bare metal systems and in VMs running on
> > > > KVM.  The enumeration order is determined by the order of the processor
> > > > entries in the ACPI MADT, and the ACPI spec *recommends* that the MADT
> > > > be set up for scheme A.
> > > >
> > > > However, for reasons that pre-date the ACPI recommendation, Hyper-V
> > > > guests have an ACPI MADT that is set up for scheme B.  When using scheme B,
> > > > the subtle bias is evident in select_idle_cpu().  While having Hyper-V
> > > > conform to the ACPI spec recommendation would solve the Hyper-V problem,
> > > > it is also desirable for the fair scheduler code to be independent of the
> > > > CPU numbering scheme.  ACPI is not always the firmware configuration
> > > > mechanism, and CPU numbering schemes might vary more in architectures
> > > > other than x86/x64.
> > > >
> > > > The bias occurs with scheme B when "has_idle_cpu" is true and
> > >
> > > I assume that you mean has_idle_core as I can't find has_idle_cpu in the code
> >
> > Yes.  You are right.
> >
> > >
> > > > select_idle_core() is called in the for_each_cpu_wrap() loop. Regardless
> > > > of where the loop starts, it will almost immediately encountered a CPU
> > > > that is the first hyper-thread in a core. If that core is idle, the CPU
> > > > number of that first hyper-thread is returned. If that core is not idle,
> > > > both hyper-threads are removed from the cpus mask, and the loop iterates
> > > > to choose another CPU that is the first hyper-thread in a core.  As a
> > > > result, select_idle_core() almost always returns the first hyper-thread
> > > > in a core.
> > > >
> > > > The bias does not occur with scheme A because half of the CPU numbering
> > > > space is a series of CPUs that are the second hyper-thread in all the
> > > > cores. Assuming that the "target" CPU is evenly distributed throughout
> > > > the CPU numbering space, there's a 50/50 chance of starting in the portion
> > > > of the CPU numbering space that is all second hyper-threads.  If
> > > > select_idle_core() finds a idle core, it will likely return a CPU that
> > > > is the second hyper-thread in the core.  On average over the time,
> > > > both the first and second hyper-thread are equally likely to be
> > > > returned.
> > > >
> > > > Fix this bias by determining which hyper-thread in a core the "target"
> > > > CPU is -- i.e., the "smt index" of that CPU.  Then when select_idle_core()
> > > > finds an idle core, it returns the CPU in the core that has the same
> > > > smt index. If that CPU is not valid to be chosen, just return the CPU
> > > > that was passed into select_idle_core() and don't worry about bias.
> > > >
> > > > With scheme B, this fix solves the bias problem by making the chosen
> > > > CPU be roughly equally likely to either hyper-thread.  With scheme A
> > > > there's no real effect as the chosen CPU was already equally likely
> > > > to be either hyper-thread, and still is.
> > > >
> > > > The imbalance in hyper-thread loading was originally observed in a
> > > > customer workload, and then reproduced with a synthetic workload.
> > > > The change has been tested with the synthetic workload in a Hyper-V VM
> > > > running the normal scheme B CPU numbering, and then with the MADT
> > > > replaced with a scheme A version using Linux's ability to override
> > > > ACPI tables. The testing showed no change hyper-thread loading
> > > > balance with the scheme A CPU numbering, but the imbalance is
> > > > corrected if the CPU numbering is scheme B.
> > >
> > > You failed to explain why it's important to evenly select 1st or 2nd
> > > hyper-threads on the system.  I don't see any performance figures.
> > > What would be the benefit ?
> >
> > As I noted below, it's not completely clear to me whether this is a
> > problem.  I don't have a specific workload where overall runtime is
> > longer due to the SMT level imbalance.  I'm not a scheduler expert,
> > and wanted input from those who are.  Linux generally does a good
> > job of balancing load, and given the code in fair.c that is devoted to
> > balancing, I inferred at least some importance.  But maybe balancing
> > is more important for the higher-level domains, and less so for the
> > SMT domain.
> 
> As long as the hyper-threads on a core are the same, I don't see any
> need to add more code and complexity to evenly balance the number of
> time that we select each CPU of the same core when the whole core is
> idle.
>
In theory if a core is idle, either the 1st hyper thread or the 2nd hyper
thread is ok to run the wakee. Would there be a race condition between the
check of has_idle_core + idle core checking in select_idle_cpu() and the
task enqueue?  If the 2 hyper threads within 1 core are falling asleep
and wake up quickly, we have a false positive of has_idle_core, and
incorrectly think coreX is idle, and queue too many tasks on the first hyper
thread on coreX in B scheme, although coreX is not idle.

thanks,
Chenyu 

^ permalink raw reply

* [PATCH] Drivers: hv: vmbus: Add missing check for dma_set_mask
From: Jiasheng Jiang @ 2023-06-07  1:43 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, mikelley, nathan
  Cc: linux-hyperv, linux-kernel, Jiasheng Jiang

Add check for dma_set_mask() and return the error if it fails.

Fixes: 6bf625a4140f ("Drivers: hv: vmbus: Rework use of DMA_BIT_MASK(64)")
Signed-off-by: Jiasheng Jiang <jiasheng@iscas.ac.cn>
---
 drivers/hv/vmbus_drv.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 1c65a6dfb9fa..68b7be2762ea 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -1908,7 +1908,11 @@ int vmbus_device_register(struct hv_device *child_device_obj)
 
 	child_device_obj->device.dma_parms = &child_device_obj->dma_parms;
 	child_device_obj->device.dma_mask = &child_device_obj->dma_mask;
-	dma_set_mask(&child_device_obj->device, DMA_BIT_MASK(64));
+	ret = dma_set_mask(&child_device_obj->device, DMA_BIT_MASK(64));
+	if (ret) {
+		put_device(&child_device_obj->device);
+		return ret;
+	}
 
 	/*
 	 * Register with the LDM. This will kick off the driver/device
-- 
2.25.1


^ permalink raw reply related

* [PATCH] x86/hyperv: Trace hv_set_register()
From: Nischala Yelchuri @ 2023-06-07  1:35 UTC (permalink / raw)
  To: linux-hyperv, linux-kernel
  Cc: Tyler Hicks, boqun.feng, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Nischala Yelchuri,
	Nischala Yelchuri

Add a new trace point trace_hyperv_set_register()
to capture register address and its value in hv_set_register().

Signed-off-by: Nischala Yelchuri <niyelchu@linux.microsoft.com>
---
 arch/x86/include/asm/trace/hyperv.h | 14 ++++++++++++++
 arch/x86/kernel/cpu/mshyperv.c      |  3 +++
 2 files changed, 17 insertions(+)

diff --git a/arch/x86/include/asm/trace/hyperv.h b/arch/x86/include/asm/trace/hyperv.h
index a8e5a7a2b..54b2f69f5 100644
--- a/arch/x86/include/asm/trace/hyperv.h
+++ b/arch/x86/include/asm/trace/hyperv.h
@@ -86,6 +86,20 @@ TRACE_EVENT(hyperv_send_ipi_one,
 		      __entry->cpu, __entry->vector)
 	);
 
+TRACE_EVENT(hyperv_set_register,
+	    TP_PROTO(unsigned int reg,
+		     u64 value),
+	    TP_ARGS(reg, value),
+	    TP_STRUCT__entry(
+		    __field(unsigned int, reg)
+		    __field(u64, value)
+		    ),
+	    TP_fast_assign(__entry->reg = reg;
+		    __entry->value = value;
+		    ),
+	    TP_printk("reg %u value %llu",
+		    __entry->reg, __entry->value)
+	);
 #endif /* CONFIG_HYPERV */
 
 #undef TRACE_INCLUDE_PATH
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index c7969e806..d4ef63f4e 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -32,6 +32,7 @@
 #include <asm/nmi.h>
 #include <clocksource/hyperv_timer.h>
 #include <asm/numa.h>
+#include <asm/trace/hyperv.h>
 
 /* Is Linux running as the root partition? */
 bool hv_root_partition;
@@ -98,6 +99,8 @@ EXPORT_SYMBOL_GPL(hv_get_register);
 
 void hv_set_register(unsigned int reg, u64 value)
 {
+	trace_hyperv_set_register(reg, value);
+
 	if (hv_nested)
 		reg = hv_get_nested_reg(reg);
 
-- 
2.25.1


^ permalink raw reply related

* Re: [PATCH RFC net-next v3 0/8] virtio/vsock: support datagrams
From: Arseniy Krasnov @ 2023-06-06 18:15 UTC (permalink / raw)
  To: Bobby Eshleman
  Cc: Bobby Eshleman, Stefan Hajnoczi, Stefano Garzarella,
	Michael S. Tsirkin, Jason Wang, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Bryan Tan, kvm, virtualization, netdev,
	linux-kernel, linux-hyperv, Jiang Wang
In-Reply-To: <ZHe3v8PHcIdFk+R5@bullseye>



On 01.06.2023 00:10, Bobby Eshleman wrote:
> On Mon, Jun 05, 2023 at 11:42:06PM +0300, Arseniy Krasnov wrote:
>> Hello Bobby!
>>
>> Thanks for this patchset, really interesting!
>>
>> I applied it on head:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=d20dd0ea14072e8a90ff864b2c1603bd68920b4b
>>
>> And tried to run ./vsock_test (client in the guest, server in the host), I had the following crash:
>>
>> Control socket connected to 192.168.1.1:12345.                          
>> 0 - SOCK_STREAM connection reset...                                     
>> [    8.050215] BUG: kernel NULL pointer derefer                         
>> [    8.050960] #PF: supervisor read access in kernel mode               
>> [    8.050960] #PF: error_code(0x0000) - not-present page               
>> [    8.050960] PGD 0 P4D 0                                              
>> [    8.050960] Oops: 0000 [#1] PREEMPT SMP PTI                          
>> [    8.050960] CPU: 0 PID: 109 Comm: vsock_test Not tainted 6.4.0-rc3-gd707c220a700
>> [    8.050960] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14
>> [    8.050960] RIP: 0010:static_key_count+0x0/0x20                      
>> [    8.050960] Code: 04 4c 8b 46 08 49 29 c0 4c 01 c8 4c 89 47 08 89 0e 89 56 04 4f
>> [    8.050960] RSP: 0018:ffffa9a1c021bdc0 EFLAGS: 00010202              
>> [    8.050960] RAX: ffffffffac309880 RBX: ffffffffc02fc140 RCX: 0000000000000000
>> [    8.050960] RDX: ffff9a5eff944600 RSI: 0000000000000000 RDI: 0000000000000000
>> [    8.050960] RBP: ffff9a5ec2371900 R08: ffffa9a1c021bd30 R09: ffff9a5eff98e0c0
>> [    8.050960] R10: 0000000000001000 R11: 0000000000000000 R12: ffffa9a1c021be80
>> [    8.050960] R13: 0000000000000000 R14: 0000000000000002 R15: ffff9a5ec1cfca80
>> [    8.050960] FS:  00007fa9bf88c5c0(0000) GS:ffff9a5efe400000(0000) knlGS:00000000
>> [    8.050960] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033        
>> [    8.050960] CR2: 0000000000000000 CR3: 00000000023e0000 CR4: 00000000000006f0
>> [    8.050960] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [    8.050960] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> [    8.050960] Call Trace:                                              
>> [    8.050960]  <TASK>                                                  
>> [    8.050960]  once_deferred+0xd/0x30                                  
>> [    8.050960]  vsock_assign_transport+0xa2/0x1b0 [vsock]               
>> [    8.050960]  vsock_connect+0xb4/0x3a0 [vsock]                        
>> [    8.050960]  ? var_wake_function+0x60/0x60                           
>> [    8.050960]  __sys_connect+0x9e/0xd0                                 
>> [    8.050960]  ? _raw_spin_unlock_irq+0xe/0x30                         
>> [    8.050960]  ? do_setitimer+0x128/0x1f0                              
>> [    8.050960]  ? alarm_setitimer+0x4c/0x90                             
>> [    8.050960]  ? fpregs_assert_state_consistent+0x1d/0x50              
>> [    8.050960]  ? exit_to_user_mode_prepare+0x36/0x130                  
>> [    8.050960]  __x64_sys_connect+0x11/0x20                             
>> [    8.050960]  do_syscall_64+0x3b/0xc0                                 
>> [    8.050960]  entry_SYSCALL_64_after_hwframe+0x4b/0xb5                
>> [    8.050960] RIP: 0033:0x7fa9bf7c4d13                                 
>> [    8.050960] Code: 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 48
>> [    8.050960] RSP: 002b:00007ffdf2d96cc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000a
>> [    8.050960] RAX: ffffffffffffffda RBX: 0000560c305d0020 RCX: 00007fa9bf7c4d13
>> [    8.050960] RDX: 0000000000000010 RSI: 00007ffdf2d96ce0 RDI: 0000000000000004
>> [    8.050960] RBP: 0000000000000004 R08: 0000560c317dc018 R09: 0000000000000000
>> [    8.050960] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
>> [    8.050960] R13: 0000560c305ccc2d R14: 00007ffdf2d96ce0 R15: 00007ffdf2d96d70
>> [    8.050960]  </TASK>  
>>
>>
>> I guess crash is somewhere near:
>>
>> old_info->transport->release(vsk); in vsock_assign_transport(). May be my config is wrong...
>>
>> Thanks, Arseniy
> 
> Thanks Arseniy!
> 
> I now see I broke the tests, but did't break the stream/dgram socket
> utility I was using in development.
> 
> I'll track this down and include a fix in the next rev.

Great! Thanks!

Thanks, Arseniy

> 
> I should have warned this v3 is pretty under-tested. Being unsure if
> some of the design choices would be accepted at all, I didn't want to
> waste too much time until things were accepted at a high level.
> 
> Best,
> Bobby

^ permalink raw reply

* Re: [PATCH RFC net-next v3 0/8] virtio/vsock: support datagrams
From: Bobby Eshleman @ 2023-05-31 21:10 UTC (permalink / raw)
  To: Arseniy Krasnov
  Cc: Bobby Eshleman, Stefan Hajnoczi, Stefano Garzarella,
	Michael S. Tsirkin, Jason Wang, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Bryan Tan, kvm, virtualization, netdev,
	linux-kernel, linux-hyperv, Jiang Wang
In-Reply-To: <2830ac58-fd77-7e5f-5565-eb47dd027d81@sberdevices.ru>

On Mon, Jun 05, 2023 at 11:42:06PM +0300, Arseniy Krasnov wrote:
> Hello Bobby!
> 
> Thanks for this patchset, really interesting!
> 
> I applied it on head:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=d20dd0ea14072e8a90ff864b2c1603bd68920b4b
> 
> And tried to run ./vsock_test (client in the guest, server in the host), I had the following crash:
> 
> Control socket connected to 192.168.1.1:12345.                          
> 0 - SOCK_STREAM connection reset...                                     
> [    8.050215] BUG: kernel NULL pointer derefer                         
> [    8.050960] #PF: supervisor read access in kernel mode               
> [    8.050960] #PF: error_code(0x0000) - not-present page               
> [    8.050960] PGD 0 P4D 0                                              
> [    8.050960] Oops: 0000 [#1] PREEMPT SMP PTI                          
> [    8.050960] CPU: 0 PID: 109 Comm: vsock_test Not tainted 6.4.0-rc3-gd707c220a700
> [    8.050960] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14
> [    8.050960] RIP: 0010:static_key_count+0x0/0x20                      
> [    8.050960] Code: 04 4c 8b 46 08 49 29 c0 4c 01 c8 4c 89 47 08 89 0e 89 56 04 4f
> [    8.050960] RSP: 0018:ffffa9a1c021bdc0 EFLAGS: 00010202              
> [    8.050960] RAX: ffffffffac309880 RBX: ffffffffc02fc140 RCX: 0000000000000000
> [    8.050960] RDX: ffff9a5eff944600 RSI: 0000000000000000 RDI: 0000000000000000
> [    8.050960] RBP: ffff9a5ec2371900 R08: ffffa9a1c021bd30 R09: ffff9a5eff98e0c0
> [    8.050960] R10: 0000000000001000 R11: 0000000000000000 R12: ffffa9a1c021be80
> [    8.050960] R13: 0000000000000000 R14: 0000000000000002 R15: ffff9a5ec1cfca80
> [    8.050960] FS:  00007fa9bf88c5c0(0000) GS:ffff9a5efe400000(0000) knlGS:00000000
> [    8.050960] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033        
> [    8.050960] CR2: 0000000000000000 CR3: 00000000023e0000 CR4: 00000000000006f0
> [    8.050960] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [    8.050960] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [    8.050960] Call Trace:                                              
> [    8.050960]  <TASK>                                                  
> [    8.050960]  once_deferred+0xd/0x30                                  
> [    8.050960]  vsock_assign_transport+0xa2/0x1b0 [vsock]               
> [    8.050960]  vsock_connect+0xb4/0x3a0 [vsock]                        
> [    8.050960]  ? var_wake_function+0x60/0x60                           
> [    8.050960]  __sys_connect+0x9e/0xd0                                 
> [    8.050960]  ? _raw_spin_unlock_irq+0xe/0x30                         
> [    8.050960]  ? do_setitimer+0x128/0x1f0                              
> [    8.050960]  ? alarm_setitimer+0x4c/0x90                             
> [    8.050960]  ? fpregs_assert_state_consistent+0x1d/0x50              
> [    8.050960]  ? exit_to_user_mode_prepare+0x36/0x130                  
> [    8.050960]  __x64_sys_connect+0x11/0x20                             
> [    8.050960]  do_syscall_64+0x3b/0xc0                                 
> [    8.050960]  entry_SYSCALL_64_after_hwframe+0x4b/0xb5                
> [    8.050960] RIP: 0033:0x7fa9bf7c4d13                                 
> [    8.050960] Code: 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 48
> [    8.050960] RSP: 002b:00007ffdf2d96cc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000a
> [    8.050960] RAX: ffffffffffffffda RBX: 0000560c305d0020 RCX: 00007fa9bf7c4d13
> [    8.050960] RDX: 0000000000000010 RSI: 00007ffdf2d96ce0 RDI: 0000000000000004
> [    8.050960] RBP: 0000000000000004 R08: 0000560c317dc018 R09: 0000000000000000
> [    8.050960] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> [    8.050960] R13: 0000560c305ccc2d R14: 00007ffdf2d96ce0 R15: 00007ffdf2d96d70
> [    8.050960]  </TASK>  
> 
> 
> I guess crash is somewhere near:
> 
> old_info->transport->release(vsk); in vsock_assign_transport(). May be my config is wrong...
> 
> Thanks, Arseniy

Thanks Arseniy!

I now see I broke the tests, but did't break the stream/dgram socket
utility I was using in development.

I'll track this down and include a fix in the next rev.

I should have warned this v3 is pretty under-tested. Being unsure if
some of the design choices would be accepted at all, I didn't want to
waste too much time until things were accepted at a high level.

Best,
Bobby

^ permalink raw reply

* Re: [PATCH 3/9] x86/hyperv: Mark Hyper-V vp assist page unencrypted in SEV-SNP enlightened guest
From: Vitaly Kuznetsov @ 2023-06-06 15:49 UTC (permalink / raw)
  To: Tianyu Lan, kys, haiyangz, wei.liu, decui, tglx, mingo, bp,
	dave.hansen, x86, hpa, daniel.lezcano, arnd, michael.h.kelley
  Cc: Tianyu Lan, linux-arch, linux-hyperv, linux-kernel
In-Reply-To: <4103a70f-cc09-a966-3efa-5ab9273f5c55@gmail.com>

Tianyu Lan <ltykernel@gmail.com> writes:

> On 6/5/2023 8:13 PM, Vitaly Kuznetsov wrote:
>>> @@ -113,6 +114,11 @@ static int hv_cpu_init(unsigned int cpu)
>>>   
>>>   	}
>>>   	if (!WARN_ON(!(*hvp))) {
>>> +		if (hv_isolation_type_en_snp()) {
>>> +			WARN_ON_ONCE(set_memory_decrypted((unsigned long)(*hvp), 1));
>>> +			memset(*hvp, 0, PAGE_SIZE);
>>> +		}
>> Why do we need to set the page as decrypted here and not when we
>> allocate the page (a few lines above)?
>
> If Linux root partition boots in the SEV-SNP guest, the page still needs 
> to be decrypted.
>

I'd suggest we add a flag to indicate that VP assist page was actually
set (on the first invocation of hv_cpu_init() for guest partitions and
all invocations for root partition) and only call
set_memory_decrypted()/memset() then: that would both help with the
potential issue with KVM using enlightened vmcs and avoid the unneeded
hypercall.

-- 
Vitaly


^ permalink raw reply

* Re: [RFC PATCH 1/1] sched/fair: Fix SMT balance dependency on CPU numbering
From: Vincent Guittot @ 2023-06-06 15:38 UTC (permalink / raw)
  To: Michael Kelley (LINUX)
  Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com,
	dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com,
	mgorman@suse.de, bristot@redhat.com, vschneid@redhat.com,
	linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org
In-Reply-To: <BYAPR21MB1688C7D7309FEC2B01529C0FD752A@BYAPR21MB1688.namprd21.prod.outlook.com>

On Tue, 6 Jun 2023 at 16:08, Michael Kelley (LINUX)
<mikelley@microsoft.com> wrote:
>
> From: Vincent Guittot <vincent.guittot@linaro.org> Sent: Monday, June 5, 2023 2:31 AM
> >
> > Hi Michael,
> >
> > On Wed, 31 May 2023 at 19:55, Michael Kelley <mikelley@microsoft.com> wrote:
> > >
> > > With some CPU numbering schemes, the function select_idle_cpu() currently
> > > has a subtle bias to return the first hyper-thread in a core. As a result
> > > work is not evenly balanced across hyper-threads in a core. The difference
> > > is often as much as 15 to 20 percentage points -- i.e., the first
> > > hyper-thread might run at 45% load while the second hyper-thread runs at
> > > only 30% load or less.
> > >
> > > Two likely CPU numbering schemes make sense with today's typical case
> > > of two hyper-threads per core:
> > >
> > > A. Enumerate all the first hyper-theads in a core, then all the second
> > >    hyper-threads in a core.  If a system has 8 cores with 16 hyper-threads,
> > >    CPUs #0 and #8 are in the same core, as are CPUs #1 and #9, etc.
> > >
> > > B. Enumerate all hyper-threads in a core, then all hyper-threads in the
> > >    next core, etc.  Again with 8 cores and 16 hyper-threads, CPUs #0 and
> > >    #1 are in the same core, as are CPUs #2 and #3, etc.
> > >
> > > Scheme A is used in most ACPI bare metal systems and in VMs running on
> > > KVM.  The enumeration order is determined by the order of the processor
> > > entries in the ACPI MADT, and the ACPI spec *recommends* that the MADT
> > > be set up for scheme A.
> > >
> > > However, for reasons that pre-date the ACPI recommendation, Hyper-V
> > > guests have an ACPI MADT that is set up for scheme B.  When using scheme B,
> > > the subtle bias is evident in select_idle_cpu().  While having Hyper-V
> > > conform to the ACPI spec recommendation would solve the Hyper-V problem,
> > > it is also desirable for the fair scheduler code to be independent of the
> > > CPU numbering scheme.  ACPI is not always the firmware configuration
> > > mechanism, and CPU numbering schemes might vary more in architectures
> > > other than x86/x64.
> > >
> > > The bias occurs with scheme B when "has_idle_cpu" is true and
> >
> > I assume that you mean has_idle_core as I can't find has_idle_cpu in the code
>
> Yes.  You are right.
>
> >
> > > select_idle_core() is called in the for_each_cpu_wrap() loop. Regardless
> > > of where the loop starts, it will almost immediately encountered a CPU
> > > that is the first hyper-thread in a core. If that core is idle, the CPU
> > > number of that first hyper-thread is returned. If that core is not idle,
> > > both hyper-threads are removed from the cpus mask, and the loop iterates
> > > to choose another CPU that is the first hyper-thread in a core.  As a
> > > result, select_idle_core() almost always returns the first hyper-thread
> > > in a core.
> > >
> > > The bias does not occur with scheme A because half of the CPU numbering
> > > space is a series of CPUs that are the second hyper-thread in all the
> > > cores. Assuming that the "target" CPU is evenly distributed throughout
> > > the CPU numbering space, there's a 50/50 chance of starting in the portion
> > > of the CPU numbering space that is all second hyper-threads.  If
> > > select_idle_core() finds a idle core, it will likely return a CPU that
> > > is the second hyper-thread in the core.  On average over the time,
> > > both the first and second hyper-thread are equally likely to be
> > > returned.
> > >
> > > Fix this bias by determining which hyper-thread in a core the "target"
> > > CPU is -- i.e., the "smt index" of that CPU.  Then when select_idle_core()
> > > finds an idle core, it returns the CPU in the core that has the same
> > > smt index. If that CPU is not valid to be chosen, just return the CPU
> > > that was passed into select_idle_core() and don't worry about bias.
> > >
> > > With scheme B, this fix solves the bias problem by making the chosen
> > > CPU be roughly equally likely to either hyper-thread.  With scheme A
> > > there's no real effect as the chosen CPU was already equally likely
> > > to be either hyper-thread, and still is.
> > >
> > > The imbalance in hyper-thread loading was originally observed in a
> > > customer workload, and then reproduced with a synthetic workload.
> > > The change has been tested with the synthetic workload in a Hyper-V VM
> > > running the normal scheme B CPU numbering, and then with the MADT
> > > replaced with a scheme A version using Linux's ability to override
> > > ACPI tables. The testing showed no change hyper-thread loading
> > > balance with the scheme A CPU numbering, but the imbalance is
> > > corrected if the CPU numbering is scheme B.
> >
> > You failed to explain why it's important to evenly select 1st or 2nd
> > hyper-threads on the system.  I don't see any performance figures.
> > What would be the benefit ?
>
> As I noted below, it's not completely clear to me whether this is a
> problem.  I don't have a specific workload where overall runtime is
> longer due to the SMT level imbalance.  I'm not a scheduler expert,
> and wanted input from those who are.  Linux generally does a good
> job of balancing load, and given the code in fair.c that is devoted to
> balancing, I inferred at least some importance.  But maybe balancing
> is more important for the higher-level domains, and less so for the
> SMT domain.

As long as the hyper-threads on a core are the same, I don't see any
need to add more code and complexity to evenly balance the number of
time that we select each CPU of the same core when the whole core is
idle.

>
> From a slightly different perspective, sysadmins are accustomed
> to 'htop' or whatever tools showing balanced load.  This is a case
> where the load is persistently unbalanced, and as such it drew
> attention.  And at a slightly theoretical level, it *is* interesting that
> the CPU numbering scheme affects the outcome of scheduler
> decisions in a noticeable way.
>
> But if the sense is that an SMT-level imbalance really doesn't affect
> anything, we'll live with CPU numbering scheme B being a bit of an
> anomaly.  When necessary, we can explain that it doesn't have any
> negative effects.
>
> Michael
>
> >
> > >
> > > Signed-off-by: Michael Kelley <mikelley@microsoft.com>
> > > ---
> > >
> > > I haven't previously worked in Linux scheduler code, so I'm posting this
> > > as an RFC to point out the observed problem, and to suggest a solution.
> > > There may well be considerations in the design of a solution that I'm not
> > > aware of, so please educate me or suggest an alternative.
> > >
> > > It's also not completely clear whether an imbalance in hyper-thread
> > > loading is actually a problem. It looks weird, and causes customer
> > > concern when it is observed consistently across all cores in some
> > > production workload. The fair scheduler strives to balance load evenly, so
> > > I'm treating it as a problem that should be fixed, if for no other reason
> > > than general goodness. But again, I'm sure reviewers will feel free to
> > > tell me otherwise. :-) The fix takes relatively few CPU cycles, but it's
> > > still a non-zero cost.
> > >
> > > FWIW, the same imbalance has been observed with kernels as far back as
> > > 5.4, and the root cause in the code is essentially the same. So it's not
> > > a recently introduced issue. I haven't tried anything earlier than 5.4.
> > >
> > >  kernel/sched/fair.c | 36 ++++++++++++++++++++++++++++++------
> > >  1 file changed, 30 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 373ff5f..8b56e9d 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -6832,6 +6832,19 @@ static inline bool test_idle_cores(int cpu)
> > >         return false;
> > >  }
> > >
> > > +static inline int get_smt_index(int core)
> > > +{
> > > +       int cpu, n = 0;
> > > +
> > > +       for_each_cpu(cpu, cpu_smt_mask(core)) {
> > > +               if (cpu == core)
> > > +                       return n;
> > > +               n++;
> > > +       }
> > > +       /* If get here, cpu_smt_mask is set up incorrectly */
> > > +       return 0;
> > > +}
> > > +
> > >  /*
> > >   * Scans the local SMT mask to see if the entire core is idle, and records this
> > >   * information in sd_llc_shared->has_idle_cores.
> > > @@ -6866,10 +6879,11 @@ void __update_idle_core(struct rq *rq)
> > >   * there are no idle cores left in the system; tracked through
> > >   * sd_llc->shared->has_idle_cores and enabled through update_idle_core() above.
> > >   */
> > > -static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int
> > *idle_cpu)
> > > +static int select_idle_core(struct task_struct *p, int core, int smt_index,
> > > +                           struct cpumask *cpus, int *idle_cpu)
> > >  {
> > >         bool idle = true;
> > > -       int cpu;
> > > +       int cpu, index_cpu, n = 0;
> > >
> > >         for_each_cpu(cpu, cpu_smt_mask(core)) {
> > >                 if (!available_idle_cpu(cpu)) {
> > > @@ -6885,10 +6899,13 @@ static int select_idle_core(struct task_struct *p, int core,
> > struct cpumask *cpu
> > >                 }
> > >                 if (*idle_cpu == -1 && cpumask_test_cpu(cpu, p->cpus_ptr))
> > >                         *idle_cpu = cpu;
> > > +
> > > +               if (n++ == smt_index)
> > > +                       index_cpu = cpu;
> > >         }
> > >
> > >         if (idle)
> > > -               return core;
> > > +               return cpumask_test_cpu(index_cpu, p->cpus_ptr) ? index_cpu : core;
> > >
> > >         cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
> > >         return -1;
> > > @@ -6922,7 +6939,13 @@ static inline bool test_idle_cores(int cpu)
> > >         return false;
> > >  }
> > >
> > > -static inline int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus,
> > int *idle_cpu)
> > > +static inline int get_smt_index(int core)
> > > +{
> > > +       return 0;
> > > +}
> > > +
> > > +static inline int select_idle_core(struct task_struct *p, int core, int smt_index,
> > > +                                  struct cpumask *cpus, int *idle_cpu)
> > >  {
> > >         return __select_idle_cpu(core, p);
> > >  }
> > > @@ -6942,7 +6965,7 @@ static inline int select_idle_smt(struct task_struct *p, int
> > target)
> > >  static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> > has_idle_core, int target)
> > >  {
> > >         struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
> > > -       int i, cpu, idle_cpu = -1, nr = INT_MAX;
> > > +       int i, cpu, smt_index, idle_cpu = -1, nr = INT_MAX;
> > >         struct sched_domain_shared *sd_share;
> > >         struct rq *this_rq = this_rq();
> > >         int this = smp_processor_id();
> > > @@ -6994,9 +7017,10 @@ static int select_idle_cpu(struct task_struct *p, struct
> > sched_domain *sd, bool
> > >                 }
> > >         }
> > >
> > > +       smt_index = get_smt_index(target);
> > >         for_each_cpu_wrap(cpu, cpus, target + 1) {
> > >                 if (has_idle_core) {
> > > -                       i = select_idle_core(p, cpu, cpus, &idle_cpu);
> > > +                       i = select_idle_core(p, cpu, smt_index, cpus, &idle_cpu);
> > >                         if ((unsigned int)i < nr_cpumask_bits)
> > >                                 return i;
> > >
> > > --
> > > 1.8.3.1
> > >

^ permalink raw reply

* Re: [PATCH 3/9] x86/hyperv: Mark Hyper-V vp assist page unencrypted in SEV-SNP enlightened guest
From: Tianyu Lan @ 2023-06-06 15:22 UTC (permalink / raw)
  To: Vitaly Kuznetsov, kys, haiyangz, wei.liu, decui, tglx, mingo, bp,
	dave.hansen, x86, hpa, daniel.lezcano, arnd, michael.h.kelley
  Cc: Tianyu Lan, linux-arch, linux-hyperv, linux-kernel
In-Reply-To: <873536ksye.fsf@redhat.com>

On 6/5/2023 8:13 PM, Vitaly Kuznetsov wrote:
>> @@ -113,6 +114,11 @@ static int hv_cpu_init(unsigned int cpu)
>>   
>>   	}
>>   	if (!WARN_ON(!(*hvp))) {
>> +		if (hv_isolation_type_en_snp()) {
>> +			WARN_ON_ONCE(set_memory_decrypted((unsigned long)(*hvp), 1));
>> +			memset(*hvp, 0, PAGE_SIZE);
>> +		}
> Why do we need to set the page as decrypted here and not when we
> allocate the page (a few lines above)?

If Linux root partition boots in the SEV-SNP guest, the page still needs 
to be decrypted.

> And why do we need to clear it
> _after_  we made it decrypted? In case we care about not leaking the
> stale content to the hypervisor, we should've cleared it_before_, but
> the bigger problem I see is that memset() is problemmatic e.g. for KVM
> which uses enlightened VMCS. You put a CPU offline and then back online
> and this path will be taken. Clearing VP assist page will likely brake
> things. (AFAIU SEV-SNP Hyper-V guests don't expose SVM yet so the
> problem is likely theoretical only, but still).
> 

The page will be made dirt by hardware after decrypting operation and so 
memset the page after that.


^ permalink raw reply

* [PATCH v2 1/1] RDMA/mana_ib: Add EQ interrupt support to mana ib driver.
From: Wei Hu @ 2023-06-06 15:17 UTC (permalink / raw)
  To: netdev, linux-hyperv, linux-rdma, longli, sharmaajay, jgg, leon,
	kys, haiyangz, wei.liu, decui, davem, edumazet, kuba, pabeni,
	vkuznets, ssengar, shradhagupta, weh

Add EQ interrupt support for mana ib driver. Allocate EQs per ucontext
to receive interrupt. Attach EQ when CQ is created. Call CQ interrupt
handler when completion interrupt happens. EQs are destroyed when
ucontext is deallocated.

The change calls some public APIs in mana ethernet driver to
allocate EQs and other resources. Ehe EQ process routine is also shared
by mana ethernet and mana ib drivers.

Co-developed-by: Ajay Sharma <sharmaajay@microsoft.com>
Signed-off-by: Ajay Sharma <sharmaajay@microsoft.com>
Signed-off-by: Wei Hu <weh@microsoft.com>
---

v2: Use ibdev_dbg to print error messages and return -ENOMEN
    when kzalloc fails.

 drivers/infiniband/hw/mana/cq.c               |  32 ++++-
 drivers/infiniband/hw/mana/main.c             |  87 ++++++++++++
 drivers/infiniband/hw/mana/mana_ib.h          |   4 +
 drivers/infiniband/hw/mana/qp.c               |  90 +++++++++++-
 .../net/ethernet/microsoft/mana/gdma_main.c   | 131 ++++++++++--------
 drivers/net/ethernet/microsoft/mana/mana_en.c |   1 +
 include/net/mana/gdma.h                       |   9 +-
 7 files changed, 290 insertions(+), 64 deletions(-)

diff --git a/drivers/infiniband/hw/mana/cq.c b/drivers/infiniband/hw/mana/cq.c
index d141cab8a1e6..3cd680e0e753 100644
--- a/drivers/infiniband/hw/mana/cq.c
+++ b/drivers/infiniband/hw/mana/cq.c
@@ -12,13 +12,20 @@ int mana_ib_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
 	struct ib_device *ibdev = ibcq->device;
 	struct mana_ib_create_cq ucmd = {};
 	struct mana_ib_dev *mdev;
+	struct gdma_context *gc;
+	struct gdma_dev *gd;
 	int err;
 
 	mdev = container_of(ibdev, struct mana_ib_dev, ib_dev);
+	gd = mdev->gdma_dev;
+	gc = gd->gdma_context;
 
 	if (udata->inlen < sizeof(ucmd))
 		return -EINVAL;
 
+	cq->comp_vector = attr->comp_vector > gc->max_num_queues ?
+				0 : attr->comp_vector;
+
 	err = ib_copy_from_udata(&ucmd, udata, min(sizeof(ucmd), udata->inlen));
 	if (err) {
 		ibdev_dbg(ibdev,
@@ -69,11 +76,32 @@ int mana_ib_destroy_cq(struct ib_cq *ibcq, struct ib_udata *udata)
 	struct mana_ib_cq *cq = container_of(ibcq, struct mana_ib_cq, ibcq);
 	struct ib_device *ibdev = ibcq->device;
 	struct mana_ib_dev *mdev;
+	struct gdma_context *gc;
+	struct gdma_dev *gd;
+
 
 	mdev = container_of(ibdev, struct mana_ib_dev, ib_dev);
+	gd = mdev->gdma_dev;
+	gc = gd->gdma_context;
 
-	mana_ib_gd_destroy_dma_region(mdev, cq->gdma_region);
-	ib_umem_release(cq->umem);
+
+
+	if (atomic_read(&ibcq->usecnt) == 0) {
+		mana_ib_gd_destroy_dma_region(mdev, cq->gdma_region);
+		ibdev_dbg(ibdev, "freeing gdma cq %p\n", gc->cq_table[cq->id]);
+		kfree(gc->cq_table[cq->id]);
+		gc->cq_table[cq->id] = NULL;
+		ib_umem_release(cq->umem);
+	}
 
 	return 0;
 }
+
+void mana_ib_cq_handler(void *ctx, struct gdma_queue *gdma_cq)
+{
+	struct mana_ib_cq *cq = ctx;
+	struct ib_device *ibdev = cq->ibcq.device;
+
+	ibdev_dbg(ibdev, "Enter %s %d\n", __func__, __LINE__);
+	cq->ibcq.comp_handler(&cq->ibcq, cq->ibcq.cq_context);
+}
diff --git a/drivers/infiniband/hw/mana/main.c b/drivers/infiniband/hw/mana/main.c
index 7be4c3adb4e2..e4efbcaed10e 100644
--- a/drivers/infiniband/hw/mana/main.c
+++ b/drivers/infiniband/hw/mana/main.c
@@ -143,6 +143,81 @@ int mana_ib_dealloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
 	return err;
 }
 
+static void mana_ib_destroy_eq(struct mana_ib_ucontext *ucontext,
+			       struct mana_ib_dev *mdev)
+{
+	struct gdma_context *gc = mdev->gdma_dev->gdma_context;
+	struct ib_device *ibdev = ucontext->ibucontext.device;
+	struct gdma_queue *eq;
+	int i;
+
+	if (!ucontext->eqs)
+		return;
+
+	for (i = 0; i < gc->max_num_queues; i++) {
+		eq = ucontext->eqs[i].eq;
+		if (!eq)
+			continue;
+
+		mana_gd_destroy_queue(gc, eq);
+	}
+
+	kfree(ucontext->eqs);
+	ucontext->eqs = NULL;
+
+	ibdev_dbg(ibdev, "destroyed eq's count %d\n", gc->max_num_queues);
+}
+
+static int mana_ib_create_eq(struct mana_ib_ucontext *ucontext,
+			     struct mana_ib_dev *mdev)
+{
+	struct gdma_queue_spec spec = {};
+	struct gdma_queue *queue;
+	struct gdma_context *gc;
+	struct ib_device *ibdev;
+	struct gdma_dev *gd;
+	int err;
+	int i;
+
+	if (!ucontext || !mdev)
+		return -EINVAL;
+
+	ibdev = ucontext->ibucontext.device;
+	gd = mdev->gdma_dev;
+
+	gc = gd->gdma_context;
+
+	ucontext->eqs = kcalloc(gc->max_num_queues, sizeof(struct mana_eq),
+				GFP_KERNEL);
+	if (!ucontext->eqs)
+		return -ENOMEM;
+
+	spec.type = GDMA_EQ;
+	spec.monitor_avl_buf = false;
+	spec.queue_size = EQ_SIZE;
+	spec.eq.callback = NULL;
+	spec.eq.context = ucontext->eqs;
+	spec.eq.log2_throttle_limit = LOG2_EQ_THROTTLE;
+	spec.eq.msix_allocated = true;
+
+	for (i = 0; i < gc->max_num_queues; i++) {
+		spec.eq.msix_index = i;
+		err = mana_gd_create_mana_eq(gd, &spec, &queue);
+		if (err)
+			goto out;
+
+		queue->eq.disable_needed = true;
+		ucontext->eqs[i].eq = queue;
+	}
+
+	return 0;
+
+out:
+	ibdev_dbg(ibdev, "Failed to allocated eq err %d\n", err);
+	mana_ib_destroy_eq(ucontext, mdev);
+	return err;
+}
+
 static int mana_gd_destroy_doorbell_page(struct gdma_context *gc,
 					 int doorbell_page)
 {
@@ -225,7 +300,17 @@ int mana_ib_alloc_ucontext(struct ib_ucontext *ibcontext,
 
 	ucontext->doorbell = doorbell_page;
 
+	ret = mana_ib_create_eq(ucontext, mdev);
+	if (ret) {
+		ibdev_dbg(ibdev, "Failed to create eq's , ret %d\n", ret);
+		goto err;
+	}
+
 	return 0;
+
+err:
+	mana_gd_destroy_doorbell_page(gc, doorbell_page);
+	return ret;
 }
 
 void mana_ib_dealloc_ucontext(struct ib_ucontext *ibcontext)
@@ -240,6 +325,8 @@ void mana_ib_dealloc_ucontext(struct ib_ucontext *ibcontext)
 	mdev = container_of(ibdev, struct mana_ib_dev, ib_dev);
 	gc = mdev->gdma_dev->gdma_context;
 
+	mana_ib_destroy_eq(mana_ucontext, mdev);
+
 	ret = mana_gd_destroy_doorbell_page(gc, mana_ucontext->doorbell);
 	if (ret)
 		ibdev_dbg(ibdev, "Failed to destroy doorbell page %d\n", ret);
diff --git a/drivers/infiniband/hw/mana/mana_ib.h b/drivers/infiniband/hw/mana/mana_ib.h
index 502cc8672eef..9672fa1670a5 100644
--- a/drivers/infiniband/hw/mana/mana_ib.h
+++ b/drivers/infiniband/hw/mana/mana_ib.h
@@ -67,6 +67,7 @@ struct mana_ib_cq {
 	int cqe;
 	u64 gdma_region;
 	u64 id;
+	u32 comp_vector;
 };
 
 struct mana_ib_qp {
@@ -86,6 +87,7 @@ struct mana_ib_qp {
 struct mana_ib_ucontext {
 	struct ib_ucontext ibucontext;
 	u32 doorbell;
+	struct mana_eq *eqs;
 };
 
 struct mana_ib_rwq_ind_table {
@@ -159,4 +161,6 @@ int mana_ib_query_gid(struct ib_device *ibdev, u32 port, int index,
 
 void mana_ib_disassociate_ucontext(struct ib_ucontext *ibcontext);
 
+void mana_ib_cq_handler(void *ctx, struct gdma_queue *gdma_cq);
+
 #endif
diff --git a/drivers/infiniband/hw/mana/qp.c b/drivers/infiniband/hw/mana/qp.c
index 54b61930a7fd..e133d86c0875 100644
--- a/drivers/infiniband/hw/mana/qp.c
+++ b/drivers/infiniband/hw/mana/qp.c
@@ -96,16 +96,20 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
 	struct mana_ib_qp *qp = container_of(ibqp, struct mana_ib_qp, ibqp);
 	struct mana_ib_dev *mdev =
 		container_of(pd->device, struct mana_ib_dev, ib_dev);
+	struct ib_ucontext *ib_ucontext = pd->uobject->context;
 	struct ib_rwq_ind_table *ind_tbl = attr->rwq_ind_tbl;
 	struct mana_ib_create_qp_rss_resp resp = {};
 	struct mana_ib_create_qp_rss ucmd = {};
+	struct mana_ib_ucontext *mana_ucontext;
 	struct gdma_dev *gd = mdev->gdma_dev;
 	mana_handle_t *mana_ind_table;
 	struct mana_port_context *mpc;
+	struct gdma_queue *gdma_cq;
 	struct mana_context *mc;
 	struct net_device *ndev;
 	struct mana_ib_cq *cq;
 	struct mana_ib_wq *wq;
+	struct mana_eq *eq;
 	unsigned int ind_tbl_size;
 	struct ib_cq *ibcq;
 	struct ib_wq *ibwq;
@@ -114,6 +118,8 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
 	int ret;
 
 	mc = gd->driver_data;
+	mana_ucontext =
+		container_of(ib_ucontext, struct mana_ib_ucontext, ibucontext);
 
 	if (!udata || udata->inlen < sizeof(ucmd))
 		return -EINVAL;
@@ -180,6 +186,7 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
 	for (i = 0; i < ind_tbl_size; i++) {
 		struct mana_obj_spec wq_spec = {};
 		struct mana_obj_spec cq_spec = {};
+		unsigned int max_num_queues = gd->gdma_context->max_num_queues;
 
 		ibwq = ind_tbl->ind_tbl[i];
 		wq = container_of(ibwq, struct mana_ib_wq, ibwq);
@@ -193,7 +200,8 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
 		cq_spec.gdma_region = cq->gdma_region;
 		cq_spec.queue_size = cq->cqe * COMP_ENTRY_SIZE;
 		cq_spec.modr_ctx_id = 0;
-		cq_spec.attached_eq = GDMA_CQ_NO_EQ;
+		eq = &mana_ucontext->eqs[cq->comp_vector % max_num_queues];
+		cq_spec.attached_eq = eq->eq->id;
 
 		ret = mana_create_wq_obj(mpc, mpc->port_handle, GDMA_RQ,
 					 &wq_spec, &cq_spec, &wq->rx_object);
@@ -207,6 +215,9 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
 		wq->id = wq_spec.queue_index;
 		cq->id = cq_spec.queue_index;
 
+		ibdev_dbg(&mdev->ib_dev, "attached eq id %u  cq with id %llu\n",
+			eq->eq->id, cq->id);
+
 		ibdev_dbg(&mdev->ib_dev,
 			  "ret %d rx_object 0x%llx wq id %llu cq id %llu\n",
 			  ret, wq->rx_object, wq->id, cq->id);
@@ -215,6 +226,27 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
 		resp.entries[i].wqid = wq->id;
 
 		mana_ind_table[i] = wq->rx_object;
+
+		if (gd->gdma_context->cq_table[cq->id] == NULL) {
+
+			gdma_cq = kzalloc(sizeof(*gdma_cq), GFP_KERNEL);
+			if (!gdma_cq) {
+				ibdev_dbg(&mdev->ib_dev,
+					 "failed to allocate gdma_cq\n");
+				ret = -ENOMEM;
+				goto free_cq;
+			}
+
+			ibdev_dbg(&mdev->ib_dev, "gdma cq allocated %p\n",
+				  gdma_cq);
+
+			gdma_cq->cq.context = cq;
+			gdma_cq->type = GDMA_CQ;
+			gdma_cq->cq.callback = mana_ib_cq_handler;
+			gdma_cq->id = cq->id;
+			gd->gdma_context->cq_table[cq->id] = gdma_cq;
+		}
+
 	}
 	resp.num_entries = i;
 
@@ -224,7 +256,7 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
 					 ucmd.rx_hash_key_len,
 					 ucmd.rx_hash_key);
 	if (ret)
-		goto fail;
+		goto free_cq;
 
 	ret = ib_copy_to_udata(udata, &resp, sizeof(resp));
 	if (ret) {
@@ -238,6 +270,23 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
 
 	return 0;
 
+free_cq:
+	{
+		int j = i;
+		u64 cqid;
+
+		while (j-- > 0) {
+			cqid = resp.entries[j].cqid;
+			gdma_cq = gd->gdma_context->cq_table[cqid];
+			cq = gdma_cq->cq.context;
+			if (atomic_read(&cq->ibcq.usecnt) == 0) {
+				kfree(gd->gdma_context->cq_table[cqid]);
+				gd->gdma_context->cq_table[cqid] = NULL;
+			}
+		}
+
+	}
+
 fail:
 	while (i-- > 0) {
 		ibwq = ind_tbl->ind_tbl[i];
@@ -269,10 +318,12 @@ static int mana_ib_create_qp_raw(struct ib_qp *ibqp, struct ib_pd *ibpd,
 	struct mana_obj_spec wq_spec = {};
 	struct mana_obj_spec cq_spec = {};
 	struct mana_port_context *mpc;
+	struct gdma_queue *gdma_cq;
 	struct mana_context *mc;
 	struct net_device *ndev;
 	struct ib_umem *umem;
-	int err;
+	struct mana_eq *eq;
+	int err, eq_vec;
 	u32 port;
 
 	mc = gd->driver_data;
@@ -350,7 +401,9 @@ static int mana_ib_create_qp_raw(struct ib_qp *ibqp, struct ib_pd *ibpd,
 	cq_spec.gdma_region = send_cq->gdma_region;
 	cq_spec.queue_size = send_cq->cqe * COMP_ENTRY_SIZE;
 	cq_spec.modr_ctx_id = 0;
-	cq_spec.attached_eq = GDMA_CQ_NO_EQ;
+	eq_vec = send_cq->comp_vector % gd->gdma_context->max_num_queues;
+	eq = &mana_ucontext->eqs[eq_vec];
+	cq_spec.attached_eq = eq->eq->id;
 
 	err = mana_create_wq_obj(mpc, mpc->port_handle, GDMA_SQ, &wq_spec,
 				 &cq_spec, &qp->tx_object);
@@ -368,6 +421,26 @@ static int mana_ib_create_qp_raw(struct ib_qp *ibqp, struct ib_pd *ibpd,
 	qp->sq_id = wq_spec.queue_index;
 	send_cq->id = cq_spec.queue_index;
 
+	if (gd->gdma_context->cq_table[send_cq->id] == NULL) {
+
+		gdma_cq = kzalloc(sizeof(*gdma_cq), GFP_KERNEL);
+		if (!gdma_cq) {
+			ibdev_dbg(&mdev->ib_dev,
+				  "failed to allocate gdma_cq\n");
+			err = -ENOMEM;
+			goto err_destroy_wqobj_and_cq;
+		}
+
+		pr_debug("gdma cq allocated %p\n", gdma_cq);
+		gdma_cq->cq.context = send_cq;
+		gdma_cq->type = GDMA_CQ;
+		gdma_cq->cq.callback = mana_ib_cq_handler;
+		gdma_cq->id = send_cq->id;
+		gd->gdma_context->cq_table[send_cq->id] = gdma_cq;
+	} else {
+		gdma_cq = gd->gdma_context->cq_table[send_cq->id];
+	}
+
 	ibdev_dbg(&mdev->ib_dev,
 		  "ret %d qp->tx_object 0x%llx sq id %llu cq id %llu\n", err,
 		  qp->tx_object, qp->sq_id, send_cq->id);
@@ -381,12 +454,17 @@ static int mana_ib_create_qp_raw(struct ib_qp *ibqp, struct ib_pd *ibpd,
 		ibdev_dbg(&mdev->ib_dev,
 			  "Failed copy udata for create qp-raw, %d\n",
 			  err);
-		goto err_destroy_wq_obj;
+		goto err_destroy_wqobj_and_cq;
 	}
 
 	return 0;
 
-err_destroy_wq_obj:
+err_destroy_wqobj_and_cq:
+	if (atomic_read(&send_cq->ibcq.usecnt) == 0) {
+		kfree(gdma_cq);
+		gd->gdma_context->cq_table[send_cq->id] = NULL;
+	}
+
 	mana_destroy_wq_obj(mpc, GDMA_SQ, qp->tx_object);
 
 err_destroy_dma_region:
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 8f3f78b68592..8231d77628d9 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -368,53 +368,57 @@ static void mana_gd_process_eqe(struct gdma_queue *eq)
 	}
 }
 
-static void mana_gd_process_eq_events(void *arg)
+static void mana_gd_process_eq_events(struct list_head *eq_list)
 {
 	u32 owner_bits, new_bits, old_bits;
 	union gdma_eqe_info eqe_info;
 	struct gdma_eqe *eq_eqe_ptr;
-	struct gdma_queue *eq = arg;
 	struct gdma_context *gc;
+	struct gdma_queue *eq;
 	struct gdma_eqe *eqe;
 	u32 head, num_eqe;
 	int i;
 
-	gc = eq->gdma_dev->gdma_context;
+	list_for_each_entry_rcu(eq, eq_list, entry) {
+		gc = eq->gdma_dev->gdma_context;
 
-	num_eqe = eq->queue_size / GDMA_EQE_SIZE;
-	eq_eqe_ptr = eq->queue_mem_ptr;
+		num_eqe = eq->queue_size / GDMA_EQE_SIZE;
+		eq_eqe_ptr = eq->queue_mem_ptr;
 
-	/* Process up to 5 EQEs at a time, and update the HW head. */
-	for (i = 0; i < 5; i++) {
-		eqe = &eq_eqe_ptr[eq->head % num_eqe];
-		eqe_info.as_uint32 = eqe->eqe_info;
-		owner_bits = eqe_info.owner_bits;
+		/* Process up to 5 EQEs at a time, and update the HW head. */
+		for (i = 0; i < 5; i++) {
+			eqe = &eq_eqe_ptr[eq->head % num_eqe];
+			eqe_info.as_uint32 = eqe->eqe_info;
+			owner_bits = eqe_info.owner_bits;
 
-		old_bits = (eq->head / num_eqe - 1) & GDMA_EQE_OWNER_MASK;
-		/* No more entries */
-		if (owner_bits == old_bits)
-			break;
+			old_bits =
+				(eq->head / num_eqe - 1) & GDMA_EQE_OWNER_MASK;
+			/* No more entries */
+			if (owner_bits == old_bits)
+				break;
 
-		new_bits = (eq->head / num_eqe) & GDMA_EQE_OWNER_MASK;
-		if (owner_bits != new_bits) {
-			dev_err(gc->dev, "EQ %d: overflow detected\n", eq->id);
-			break;
-		}
+			new_bits = (eq->head / num_eqe) & GDMA_EQE_OWNER_MASK;
+			if (owner_bits != new_bits) {
+				dev_err(gc->dev, "EQ %d: overflow detected\n",
+					eq->id);
+				break;
+			}
 
-		/* Per GDMA spec, rmb is necessary after checking owner_bits, before
-		 * reading eqe.
-		 */
-		rmb();
+			/* Per GDMA spec, rmb is necessary after checking
+			 * owner_bits, before reading eqe.
+			 */
+			rmb();
 
-		mana_gd_process_eqe(eq);
+			mana_gd_process_eqe(eq);
 
-		eq->head++;
-	}
+			eq->head++;
+		}
 
-	head = eq->head % (num_eqe << GDMA_EQE_OWNER_BITS);
+		head = eq->head % (num_eqe << GDMA_EQE_OWNER_BITS);
 
-	mana_gd_ring_doorbell(gc, eq->gdma_dev->doorbell, eq->type, eq->id,
-			      head, SET_ARM_BIT);
+		mana_gd_ring_doorbell(gc, eq->gdma_dev->doorbell, eq->type,
+				      eq->id, head, SET_ARM_BIT);
+	}
 }
 
 static int mana_gd_register_irq(struct gdma_queue *queue,
@@ -432,44 +436,49 @@ static int mana_gd_register_irq(struct gdma_queue *queue,
 	gc = gd->gdma_context;
 	r = &gc->msix_resource;
 	dev = gc->dev;
+	msi_index = spec->eq.msix_index;
 
 	spin_lock_irqsave(&r->lock, flags);
 
-	msi_index = find_first_zero_bit(r->map, r->size);
-	if (msi_index >= r->size || msi_index >= gc->num_msix_usable) {
-		err = -ENOSPC;
-	} else {
-		bitmap_set(r->map, msi_index, 1);
-		queue->eq.msix_index = msi_index;
-	}
-
-	spin_unlock_irqrestore(&r->lock, flags);
+	if (!spec->eq.msix_allocated) {
+		msi_index = find_first_zero_bit(r->map, r->size);
+		if (msi_index >= r->size || msi_index >= gc->num_msix_usable)
+			err = -ENOSPC;
+		else
+			bitmap_set(r->map, msi_index, 1);
 
-	if (err) {
-		dev_err(dev, "Register IRQ err:%d, msi:%u rsize:%u, nMSI:%u",
-			err, msi_index, r->size, gc->num_msix_usable);
+		if (err) {
+			dev_err(dev, "Register IRQ err:%d, msi:%u rsize:%u, nMSI:%u",
+				err, msi_index, r->size, gc->num_msix_usable);
 
-		return err;
+			goto out;
+		}
 	}
 
+	queue->eq.msix_index = msi_index;
 	gic = &gc->irq_contexts[msi_index];
 
-	WARN_ON(gic->handler || gic->arg);
+	list_add_rcu(&queue->entry, &gic->eq_list);
 
-	gic->arg = queue;
+	WARN_ON(gic->handler);
 
 	gic->handler = mana_gd_process_eq_events;
 
-	return 0;
+out:
+	spin_unlock_irqrestore(&r->lock, flags);
+
+	return err;
 }
 
-static void mana_gd_deregiser_irq(struct gdma_queue *queue)
+static void mana_gd_deregister_irq(struct gdma_queue *queue)
 {
 	struct gdma_dev *gd = queue->gdma_dev;
 	struct gdma_irq_context *gic;
 	struct gdma_context *gc;
 	struct gdma_resource *r;
 	unsigned int msix_index;
+	struct list_head *p, *n;
+	struct gdma_queue *eq;
 	unsigned long flags;
 
 	gc = gd->gdma_context;
@@ -480,13 +489,25 @@ static void mana_gd_deregiser_irq(struct gdma_queue *queue)
 	if (WARN_ON(msix_index >= gc->num_msix_usable))
 		return;
 
+	spin_lock_irqsave(&r->lock, flags);
+
 	gic = &gc->irq_contexts[msix_index];
-	gic->handler = NULL;
-	gic->arg = NULL;
 
-	spin_lock_irqsave(&r->lock, flags);
-	bitmap_clear(r->map, msix_index, 1);
+	list_for_each_safe(p, n, &gic->eq_list) {
+		eq = list_entry(p, struct gdma_queue, entry);
+		if (queue == eq) {
+			list_del_rcu(&eq->entry);
+			break;
+		}
+	}
+
+	if (list_empty(&gic->eq_list)) {
+		gic->handler = NULL;
+		bitmap_clear(r->map, msix_index, 1);
+	}
+
 	spin_unlock_irqrestore(&r->lock, flags);
+	synchronize_rcu();
 
 	queue->eq.msix_index = INVALID_PCI_MSIX_INDEX;
 }
@@ -550,7 +571,7 @@ static void mana_gd_destroy_eq(struct gdma_context *gc, bool flush_evenets,
 			dev_warn(gc->dev, "Failed to flush EQ: %d\n", err);
 	}
 
-	mana_gd_deregiser_irq(queue);
+	mana_gd_deregister_irq(queue);
 
 	if (queue->eq.disable_needed)
 		mana_gd_disable_queue(queue);
@@ -565,7 +586,7 @@ static int mana_gd_create_eq(struct gdma_dev *gd,
 	u32 log2_num_entries;
 	int err;
 
-	queue->eq.msix_index = INVALID_PCI_MSIX_INDEX;
+	queue->eq.msix_index = spec->eq.msix_index;
 
 	log2_num_entries = ilog2(queue->queue_size / GDMA_EQE_SIZE);
 
@@ -602,6 +623,7 @@ static int mana_gd_create_eq(struct gdma_dev *gd,
 	mana_gd_destroy_eq(gc, false, queue);
 	return err;
 }
+EXPORT_SYMBOL(mana_gd_create_mana_eq);
 
 static void mana_gd_create_cq(const struct gdma_queue_spec *spec,
 			      struct gdma_queue *queue)
@@ -873,6 +895,7 @@ void mana_gd_destroy_queue(struct gdma_context *gc, struct gdma_queue *queue)
 	mana_gd_free_memory(gmi);
 	kfree(queue);
 }
+EXPORT_SYMBOL(mana_gd_destroy_queue);
 
 int mana_gd_verify_vf_version(struct pci_dev *pdev)
 {
@@ -1188,7 +1211,7 @@ static irqreturn_t mana_gd_intr(int irq, void *arg)
 	struct gdma_irq_context *gic = arg;
 
 	if (gic->handler)
-		gic->handler(gic->arg);
+		gic->handler(&gic->eq_list);
 
 	return IRQ_HANDLED;
 }
@@ -1241,7 +1264,7 @@ static int mana_gd_setup_irqs(struct pci_dev *pdev)
 	for (i = 0; i < nvec; i++) {
 		gic = &gc->irq_contexts[i];
 		gic->handler = NULL;
-		gic->arg = NULL;
+		INIT_LIST_HEAD(&gic->eq_list);
 
 		if (!i)
 			snprintf(gic->name, MANA_IRQ_NAME_SZ, "mana_hwc@pci:%s",
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 06d6292e09b3..85345225813f 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1156,6 +1156,7 @@ static int mana_create_eq(struct mana_context *ac)
 	spec.eq.callback = NULL;
 	spec.eq.context = ac->eqs;
 	spec.eq.log2_throttle_limit = LOG2_EQ_THROTTLE;
+	spec.eq.msix_allocated = false;
 
 	for (i = 0; i < gc->max_num_queues; i++) {
 		err = mana_gd_create_mana_eq(gd, &spec, &ac->eqs[i].eq);
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index 96c120160f15..cc728fc42043 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -6,6 +6,7 @@
 
 #include <linux/dma-mapping.h>
 #include <linux/netdevice.h>
+#include <linux/list.h>
 
 #include "shm_channel.h"
 
@@ -291,6 +292,8 @@ struct gdma_queue {
 	u32 head;
 	u32 tail;
 
+	struct list_head entry;
+
 	/* Extra fields specific to EQ/CQ. */
 	union {
 		struct {
@@ -325,6 +328,8 @@ struct gdma_queue_spec {
 			void *context;
 
 			unsigned long log2_throttle_limit;
+			bool msix_allocated;
+			unsigned int msix_index;
 		} eq;
 
 		struct {
@@ -340,8 +345,8 @@ struct gdma_queue_spec {
 #define MANA_IRQ_NAME_SZ 32
 
 struct gdma_irq_context {
-	void (*handler)(void *arg);
-	void *arg;
+	void (*handler)(struct list_head *arg);
+	struct list_head eq_list;
 	char name[MANA_IRQ_NAME_SZ];
 };
 
-- 
2.25.1


^ permalink raw reply related

* RE: [RFC PATCH 1/1] sched/fair: Fix SMT balance dependency on CPU numbering
From: Michael Kelley (LINUX) @ 2023-06-06 14:08 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com,
	dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com,
	mgorman@suse.de, bristot@redhat.com, vschneid@redhat.com,
	linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org
In-Reply-To: <CAKfTPtAyFqG4W0OAc6pejKdEQ4yTRaoC+qiOZN8sRrwCENmVKA@mail.gmail.com>

From: Vincent Guittot <vincent.guittot@linaro.org> Sent: Monday, June 5, 2023 2:31 AM
> 
> Hi Michael,
> 
> On Wed, 31 May 2023 at 19:55, Michael Kelley <mikelley@microsoft.com> wrote:
> >
> > With some CPU numbering schemes, the function select_idle_cpu() currently
> > has a subtle bias to return the first hyper-thread in a core. As a result
> > work is not evenly balanced across hyper-threads in a core. The difference
> > is often as much as 15 to 20 percentage points -- i.e., the first
> > hyper-thread might run at 45% load while the second hyper-thread runs at
> > only 30% load or less.
> >
> > Two likely CPU numbering schemes make sense with today's typical case
> > of two hyper-threads per core:
> >
> > A. Enumerate all the first hyper-theads in a core, then all the second
> >    hyper-threads in a core.  If a system has 8 cores with 16 hyper-threads,
> >    CPUs #0 and #8 are in the same core, as are CPUs #1 and #9, etc.
> >
> > B. Enumerate all hyper-threads in a core, then all hyper-threads in the
> >    next core, etc.  Again with 8 cores and 16 hyper-threads, CPUs #0 and
> >    #1 are in the same core, as are CPUs #2 and #3, etc.
> >
> > Scheme A is used in most ACPI bare metal systems and in VMs running on
> > KVM.  The enumeration order is determined by the order of the processor
> > entries in the ACPI MADT, and the ACPI spec *recommends* that the MADT
> > be set up for scheme A.
> >
> > However, for reasons that pre-date the ACPI recommendation, Hyper-V
> > guests have an ACPI MADT that is set up for scheme B.  When using scheme B,
> > the subtle bias is evident in select_idle_cpu().  While having Hyper-V
> > conform to the ACPI spec recommendation would solve the Hyper-V problem,
> > it is also desirable for the fair scheduler code to be independent of the
> > CPU numbering scheme.  ACPI is not always the firmware configuration
> > mechanism, and CPU numbering schemes might vary more in architectures
> > other than x86/x64.
> >
> > The bias occurs with scheme B when "has_idle_cpu" is true and
> 
> I assume that you mean has_idle_core as I can't find has_idle_cpu in the code

Yes.  You are right.

> 
> > select_idle_core() is called in the for_each_cpu_wrap() loop. Regardless
> > of where the loop starts, it will almost immediately encountered a CPU
> > that is the first hyper-thread in a core. If that core is idle, the CPU
> > number of that first hyper-thread is returned. If that core is not idle,
> > both hyper-threads are removed from the cpus mask, and the loop iterates
> > to choose another CPU that is the first hyper-thread in a core.  As a
> > result, select_idle_core() almost always returns the first hyper-thread
> > in a core.
> >
> > The bias does not occur with scheme A because half of the CPU numbering
> > space is a series of CPUs that are the second hyper-thread in all the
> > cores. Assuming that the "target" CPU is evenly distributed throughout
> > the CPU numbering space, there's a 50/50 chance of starting in the portion
> > of the CPU numbering space that is all second hyper-threads.  If
> > select_idle_core() finds a idle core, it will likely return a CPU that
> > is the second hyper-thread in the core.  On average over the time,
> > both the first and second hyper-thread are equally likely to be
> > returned.
> >
> > Fix this bias by determining which hyper-thread in a core the "target"
> > CPU is -- i.e., the "smt index" of that CPU.  Then when select_idle_core()
> > finds an idle core, it returns the CPU in the core that has the same
> > smt index. If that CPU is not valid to be chosen, just return the CPU
> > that was passed into select_idle_core() and don't worry about bias.
> >
> > With scheme B, this fix solves the bias problem by making the chosen
> > CPU be roughly equally likely to either hyper-thread.  With scheme A
> > there's no real effect as the chosen CPU was already equally likely
> > to be either hyper-thread, and still is.
> >
> > The imbalance in hyper-thread loading was originally observed in a
> > customer workload, and then reproduced with a synthetic workload.
> > The change has been tested with the synthetic workload in a Hyper-V VM
> > running the normal scheme B CPU numbering, and then with the MADT
> > replaced with a scheme A version using Linux's ability to override
> > ACPI tables. The testing showed no change hyper-thread loading
> > balance with the scheme A CPU numbering, but the imbalance is
> > corrected if the CPU numbering is scheme B.
> 
> You failed to explain why it's important to evenly select 1st or 2nd
> hyper-threads on the system.  I don't see any performance figures.
> What would be the benefit ?

As I noted below, it's not completely clear to me whether this is a
problem.  I don't have a specific workload where overall runtime is
longer due to the SMT level imbalance.  I'm not a scheduler expert,
and wanted input from those who are.  Linux generally does a good
job of balancing load, and given the code in fair.c that is devoted to
balancing, I inferred at least some importance.  But maybe balancing
is more important for the higher-level domains, and less so for the
SMT domain.

From a slightly different perspective, sysadmins are accustomed
to 'htop' or whatever tools showing balanced load.  This is a case
where the load is persistently unbalanced, and as such it drew
attention.  And at a slightly theoretical level, it *is* interesting that
the CPU numbering scheme affects the outcome of scheduler
decisions in a noticeable way.

But if the sense is that an SMT-level imbalance really doesn't affect
anything, we'll live with CPU numbering scheme B being a bit of an
anomaly.  When necessary, we can explain that it doesn't have any
negative effects.

Michael

> 
> >
> > Signed-off-by: Michael Kelley <mikelley@microsoft.com>
> > ---
> >
> > I haven't previously worked in Linux scheduler code, so I'm posting this
> > as an RFC to point out the observed problem, and to suggest a solution.
> > There may well be considerations in the design of a solution that I'm not
> > aware of, so please educate me or suggest an alternative.
> >
> > It's also not completely clear whether an imbalance in hyper-thread
> > loading is actually a problem. It looks weird, and causes customer
> > concern when it is observed consistently across all cores in some
> > production workload. The fair scheduler strives to balance load evenly, so
> > I'm treating it as a problem that should be fixed, if for no other reason
> > than general goodness. But again, I'm sure reviewers will feel free to
> > tell me otherwise. :-) The fix takes relatively few CPU cycles, but it's
> > still a non-zero cost.
> >
> > FWIW, the same imbalance has been observed with kernels as far back as
> > 5.4, and the root cause in the code is essentially the same. So it's not
> > a recently introduced issue. I haven't tried anything earlier than 5.4.
> >
> >  kernel/sched/fair.c | 36 ++++++++++++++++++++++++++++++------
> >  1 file changed, 30 insertions(+), 6 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 373ff5f..8b56e9d 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6832,6 +6832,19 @@ static inline bool test_idle_cores(int cpu)
> >         return false;
> >  }
> >
> > +static inline int get_smt_index(int core)
> > +{
> > +       int cpu, n = 0;
> > +
> > +       for_each_cpu(cpu, cpu_smt_mask(core)) {
> > +               if (cpu == core)
> > +                       return n;
> > +               n++;
> > +       }
> > +       /* If get here, cpu_smt_mask is set up incorrectly */
> > +       return 0;
> > +}
> > +
> >  /*
> >   * Scans the local SMT mask to see if the entire core is idle, and records this
> >   * information in sd_llc_shared->has_idle_cores.
> > @@ -6866,10 +6879,11 @@ void __update_idle_core(struct rq *rq)
> >   * there are no idle cores left in the system; tracked through
> >   * sd_llc->shared->has_idle_cores and enabled through update_idle_core() above.
> >   */
> > -static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int
> *idle_cpu)
> > +static int select_idle_core(struct task_struct *p, int core, int smt_index,
> > +                           struct cpumask *cpus, int *idle_cpu)
> >  {
> >         bool idle = true;
> > -       int cpu;
> > +       int cpu, index_cpu, n = 0;
> >
> >         for_each_cpu(cpu, cpu_smt_mask(core)) {
> >                 if (!available_idle_cpu(cpu)) {
> > @@ -6885,10 +6899,13 @@ static int select_idle_core(struct task_struct *p, int core,
> struct cpumask *cpu
> >                 }
> >                 if (*idle_cpu == -1 && cpumask_test_cpu(cpu, p->cpus_ptr))
> >                         *idle_cpu = cpu;
> > +
> > +               if (n++ == smt_index)
> > +                       index_cpu = cpu;
> >         }
> >
> >         if (idle)
> > -               return core;
> > +               return cpumask_test_cpu(index_cpu, p->cpus_ptr) ? index_cpu : core;
> >
> >         cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
> >         return -1;
> > @@ -6922,7 +6939,13 @@ static inline bool test_idle_cores(int cpu)
> >         return false;
> >  }
> >
> > -static inline int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus,
> int *idle_cpu)
> > +static inline int get_smt_index(int core)
> > +{
> > +       return 0;
> > +}
> > +
> > +static inline int select_idle_core(struct task_struct *p, int core, int smt_index,
> > +                                  struct cpumask *cpus, int *idle_cpu)
> >  {
> >         return __select_idle_cpu(core, p);
> >  }
> > @@ -6942,7 +6965,7 @@ static inline int select_idle_smt(struct task_struct *p, int
> target)
> >  static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> has_idle_core, int target)
> >  {
> >         struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
> > -       int i, cpu, idle_cpu = -1, nr = INT_MAX;
> > +       int i, cpu, smt_index, idle_cpu = -1, nr = INT_MAX;
> >         struct sched_domain_shared *sd_share;
> >         struct rq *this_rq = this_rq();
> >         int this = smp_processor_id();
> > @@ -6994,9 +7017,10 @@ static int select_idle_cpu(struct task_struct *p, struct
> sched_domain *sd, bool
> >                 }
> >         }
> >
> > +       smt_index = get_smt_index(target);
> >         for_each_cpu_wrap(cpu, cpus, target + 1) {
> >                 if (has_idle_core) {
> > -                       i = select_idle_core(p, cpu, cpus, &idle_cpu);
> > +                       i = select_idle_core(p, cpu, smt_index, cpus, &idle_cpu);
> >                         if ((unsigned int)i < nr_cpumask_bits)
> >                                 return i;
> >
> > --
> > 1.8.3.1
> >

^ permalink raw reply

* Re: [PATCH 1/9] x86/hyperv: Add sev-snp enlightened guest static key
From: Tianyu Lan @ 2023-06-06 13:43 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Tianyu Lan, linux-arch, linux-hyperv, linux-kernel, kys, haiyangz,
	wei.liu, decui, tglx, mingo, bp, dave.hansen, x86, hpa,
	daniel.lezcano, arnd, michael.h.kelley
In-Reply-To: <874jnmkt4p.fsf@redhat.com>



On 6/5/2023 8:09 PM, Vitaly Kuznetsov wrote:
>> --- a/arch/x86/kernel/cpu/mshyperv.c
>> +++ b/arch/x86/kernel/cpu/mshyperv.c
>> @@ -402,8 +402,12 @@ static void __init ms_hyperv_init_platform(void)
>>   		pr_info("Hyper-V: Isolation Config: Group A 0x%x, Group B 0x%x\n",
>>   			ms_hyperv.isolation_config_a, ms_hyperv.isolation_config_b);
>>   
>> -		if (hv_get_isolation_type() == HV_ISOLATION_TYPE_SNP)
>> +
>> +		if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP)) {
>> +			static_branch_enable(&isolation_type_en_snp);
>> +		} else if (hv_get_isolation_type() == HV_ISOLATION_TYPE_SNP) {
>>   			static_branch_enable(&isolation_type_snp);
> Nitpick: In case 'isolation_type_snp' and 'isolation_type_en_snp' are
> mutually exclusive, I'd suggest we rename the former: it is quite
> un-intuitive that for an enlightened SNP guest '&isolation_type_snp' is
> NOT enabled. E.g. we can use
> 
> 'isol_type_snp_paravisor'
> and
> 'isol_type_snp_enlightened'
> 
> (I also don't like 'isolation_type_en_snp' name as 'en' normally stands
> for 'enabled')

Hi Vitaly:
	Thanks for your review. Agree. Will rename them.

> 
>> --- a/include/asm-generic/mshyperv.h
>> +++ b/include/asm-generic/mshyperv.h
>> @@ -36,15 +36,21 @@ struct ms_hyperv_info {
>>   	u32 nested_features;
>>   	u32 max_vp_index;
>>   	u32 max_lp_index;
>> -	u32 isolation_config_a;
>> +	union {
>> +		u32 isolation_config_a;
>> +		struct {
>> +			u32 paravisor_present : 1;
>> +			u32 reserved1 : 31;
>> +		};
>> +	};
>>   	union {
>>   		u32 isolation_config_b;
>>   		struct {
>>   			u32 cvm_type : 4;
>> -			u32 reserved1 : 1;
>> +			u32 reserved2 : 1;
>>   			u32 shared_gpa_boundary_active : 1;
>>   			u32 shared_gpa_boundary_bits : 6;
>> -			u32 reserved2 : 20;
>> +			u32 reserved3 : 20;
> Maybe use 'reserved_a1', 'reserved_b1', 'reserved_b2',... to avoid the
> need to rename in the future when more bits from isolation_config_a get
> used?
> 

Good suggestion. will update.

^ permalink raw reply

* RE: [PATCH 1/1] RDMA/mana_ib: Add EQ interrupt support to mana ib driver.
From: Wei Hu @ 2023-06-06 13:22 UTC (permalink / raw)
  To: Jason Gunthorpe, Simon Horman
  Cc: netdev@vger.kernel.org, linux-hyperv@vger.kernel.org,
	linux-rdma@vger.kernel.org, Long Li, Ajay Sharma, leon@kernel.org,
	KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org, Dexuan Cui,
	davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
	pabeni@redhat.com, vkuznets@redhat.com,
	ssengar@linux.microsoft.com, shradhagupta@linux.microsoft.com
In-Reply-To: <ZH3kjU7a2L7EkEQ2@ziepe.ca>



> -----Original Message-----
> From: Jason Gunthorpe <jgg@ziepe.ca>
> Sent: Monday, June 5, 2023 9:35 PM
> To: Simon Horman <simon.horman@corigine.com>
> Cc: Wei Hu <weh@microsoft.com>; netdev@vger.kernel.org; linux-
> hyperv@vger.kernel.org; linux-rdma@vger.kernel.org; Long Li
> <longli@microsoft.com>; Ajay Sharma <sharmaajay@microsoft.com>;
> leon@kernel.org; KY Srinivasan <kys@microsoft.com>; Haiyang Zhang
> <haiyangz@microsoft.com>; wei.liu@kernel.org; Dexuan Cui
> <decui@microsoft.com>; davem@davemloft.net; edumazet@google.com;
> kuba@kernel.org; pabeni@redhat.com; vkuznets@redhat.com;
> ssengar@linux.microsoft.com; shradhagupta@linux.microsoft.com
> Subject: Re: [PATCH 1/1] RDMA/mana_ib: Add EQ interrupt support to mana
> ib driver.
> 
> On Mon, Jun 05, 2023 at 03:15:05PM +0200, Simon Horman wrote:
> > On Mon, Jun 05, 2023 at 11:43:13AM +0000, Wei Hu wrote:
> > > Add EQ interrupt support for mana ib driver. Allocate EQs per
> > > ucontext to receive interrupt. Attach EQ when CQ is created. Call CQ
> > > interrupt handler when completion interrupt happens. EQs are
> > > destroyed when ucontext is deallocated.
> > >
> > > The change calls some public APIs in mana ethernet driver to
> > > allocate EQs and other resources. Ehe EQ process routine is also
> > > shared by mana ethernet and mana ib drivers.
> > >
> > > Co-developed-by: Ajay Sharma <sharmaajay@microsoft.com>
> > > Signed-off-by: Ajay Sharma <sharmaajay@microsoft.com>
> > > Signed-off-by: Wei Hu <weh@microsoft.com>
> >
> > ...
> >
> > > @@ -368,6 +420,24 @@ static int mana_ib_create_qp_raw(struct ib_qp
> *ibqp, struct ib_pd *ibpd,
> > >  	qp->sq_id = wq_spec.queue_index;
> > >  	send_cq->id = cq_spec.queue_index;
> > >
> > > +	if (gd->gdma_context->cq_table[send_cq->id] == NULL) {
> > > +
> > > +		gdma_cq = kzalloc(sizeof(*gdma_cq), GFP_KERNEL);
> > > +		if (!gdma_cq) {
> > > +			pr_err("failed to allocate gdma_cq\n");
> >
> > Hi wei Hu,
> >
> > I think 'err = -ENOMEM' is needed here.
> 
> And no prints like that in drivers.
> 
Thanks for your review, Simon and Jason. You are right. 
I have overlooked these. I will fix it and send a v2 shortly.

Wei


^ permalink raw reply

* Re: [PATCH v6] hv_netvsc: Allocate rx indirection table size dynamically
From: Simon Horman @ 2023-06-06  9:51 UTC (permalink / raw)
  To: Shradha Gupta
  Cc: linux-kernel, linux-hyperv, netdev, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Long Li, Michael Kelley, David S. Miller, Steen Hegelund,
	Praveen Kumar
In-Reply-To: <1685964606-24690-1-git-send-email-shradhagupta@linux.microsoft.com>

+ Praveen Kumar

On Mon, Jun 05, 2023 at 04:30:06AM -0700, Shradha Gupta wrote:
> Allocate the size of rx indirection table dynamically in netvsc
> from the value of size provided by OID_GEN_RECEIVE_SCALE_CAPABILITIES
> query instead of using a constant value of ITAB_NUM.
> 
> Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> Tested-on: Ubuntu22 (azure VM, SKU size: Standard_F72s_v2)
> Testcases:
> 1. ethtool -x eth0 output
> 2. LISA testcase:PERF-NETWORK-TCP-THROUGHPUT-MULTICONNECTION-NTTTCP-Synthetic
> 3. LISA testcase:PERF-NETWORK-TCP-THROUGHPUT-MULTICONNECTION-NTTTCP-SRIOV

Hi Praveen, all,

it seems that there has not been any non-trivial review of this
patchset since v3. But at that point there was some feedback
from you. So I'd like to ask if you have any feedback on v6 [1].

[1] https://lore.kernel.org/all/1685964606-24690-1-git-send-email-shradhagupta@linux.microsoft.com/


^ permalink raw reply

* Re: [PATCH RFC net-next v3 8/8] tests: add vsock dgram tests
From: Arseniy Krasnov @ 2023-06-06  9:34 UTC (permalink / raw)
  To: Bobby Eshleman, Jiang Wang
  Cc: Stefan Hajnoczi, Stefano Garzarella, Michael S. Tsirkin,
	Jason Wang, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Bryan Tan, Vishnu Dasa, VMware PV-Drivers Reviewers,
	Krasnov Arseniy, kvm, virtualization, netdev, linux-kernel,
	linux-hyperv
In-Reply-To: <7dbec78e-ea44-4684-6d02-5d6d5051187e@sberdevices.ru>

Sorry, CC mailing lists

On 06.06.2023 12:29, Arseniy Krasnov wrote:
> Hello Bobby and Jiang! Small remarks(sorry for this letter layout, I add multiple newline over comments):
> 
> diff --git a/tools/testing/vsock/util.c b/tools/testing/vsock/util.c
> index 01b636d3039a..45e35da48b40 100644
> --- a/tools/testing/vsock/util.c
> +++ b/tools/testing/vsock/util.c
> @@ -260,6 +260,57 @@ void send_byte(int fd, int expected_ret, int flags)
>  	}
>  }
>  
> +/* Transmit one byte and check the return value.
> + *
> + * expected_ret:
> + *  <0 Negative errno (for testing errors)
> + *   0 End-of-file
> + *   1 Success
> + */
> +void sendto_byte(int fd, const struct sockaddr *dest_addr, int len, int expected_ret,
> +		 int flags)
> +{
> +	const uint8_t byte = 'A';
> +	ssize_t nwritten;
> +
> +	timeout_begin(TIMEOUT);
> +	do {
> +		nwritten = sendto(fd, &byte, sizeof(byte), flags, dest_addr,
> +				  len);
> +		timeout_check("write");
> +	} while (nwritten < 0 && errno == EINTR);
> +	timeout_end();
> +
> +	if (expected_ret < 0) {
> +		if (nwritten != -1) {
> +			fprintf(stderr, "bogus sendto(2) return value %zd\n",
> +				nwritten);
> +			exit(EXIT_FAILURE);
> +		}
> +		if (errno != -expected_ret) {
> +			perror("write");
> +			exit(EXIT_FAILURE);
> +		}
> +		return;
> +	}
> +
> +	if (nwritten < 0) {
> +		perror("write");
> +		exit(EXIT_FAILURE);
> +	}
> +	if (nwritten == 0) {
> +		if (expected_ret == 0)
> +			return;
> +
> +		fprintf(stderr, "unexpected EOF while sending byte\n");
> +		exit(EXIT_FAILURE);
> +	}
> +	if (nwritten != sizeof(byte)) {
> +		fprintf(stderr, "bogus sendto(2) return value %zd\n", nwritten);
> +		exit(EXIT_FAILURE);
> +
> 	}
> 
> 
> 
> ^^^
> May be short check that 'nwritten' != 'expected_ret' will be enough? Then print expected and
> real value. Here and in 'recvfrom_byte()' below.
> 
> 
> 
> 
> +}
> +
>  /* Receive one byte and check the return value.
>   *
>   * expected_ret:
> @@ -313,6 +364,60 @@ void recv_byte(int fd, int expected_ret, int flags)
>  	}
>  }
>  
> +/* Receive one byte and check the return value.
> + *
> + * expected_ret:
> + *  <0 Negative errno (for testing errors)
> + *   0 End-of-file
> + *   1 Success
> + */
> +void recvfrom_byte(int fd, struct sockaddr *src_addr, socklen_t *addrlen,
> +		   int expected_ret, int flags)
> +{
> +	uint8_t byte;
> +	ssize_t nread;
> +
> +	timeout_begin(TIMEOUT);
> +	do {
> +		nread = recvfrom(fd, &byte, sizeof(byte), flags, src_addr, addrlen);
> +		timeout_check("read");
> +	} while (nread < 0 && errno == EINTR);
> +	timeout_end();
> +
> +	if (expected_ret < 0) {
> +		if (nread != -1) {
> +			fprintf(stderr, "bogus recvfrom(2) return value %zd\n",
> +				nread);
> +			exit(EXIT_FAILURE);
> +		}
> +		if (errno != -expected_ret) {
> +			perror("read");
> +			exit(EXIT_FAILURE);
> +		}
> +		return;
> +	}
> +
> +	if (nread < 0) {
> +		perror("read");
> +		exit(EXIT_FAILURE);
> +	}
> +	if (nread == 0) {
> +		if (expected_ret == 0)
> +			return;
> +
> +		fprintf(stderr, "unexpected EOF while receiving byte\n");
> +		exit(EXIT_FAILURE);
> +	}
> +	if (nread != sizeof(byte)) {
> +		fprintf(stderr, "bogus recvfrom(2) return value %zd\n", nread);
> +		exit(EXIT_FAILURE);
> +	}
> +	if (byte != 'A') {
> +		fprintf(stderr, "unexpected byte read %c\n", byte);
> +		exit(EXIT_FAILURE);
> +	}
> +}
> +
>  /* Run test cases.  The program terminates if a failure occurs. */
>  void run_tests(const struct test_case *test_cases,
>  	       const struct test_opts *opts)
> diff --git a/tools/testing/vsock/util.h b/tools/testing/vsock/util.h
> index fb99208a95ea..6e5cd610bf05 100644
> --- a/tools/testing/vsock/util.h
> +++ b/tools/testing/vsock/util.h
> @@ -43,7 +43,11 @@ int vsock_seqpacket_accept(unsigned int cid, unsigned int port,
>  			   struct sockaddr_vm *clientaddrp);
>  void vsock_wait_remote_close(int fd);
>  void send_byte(int fd, int expected_ret, int flags);
> +void sendto_byte(int fd, const struct sockaddr *dest_addr, int len, int expected_ret,
> +		 int flags);
>  void recv_byte(int fd, int expected_ret, int flags);
> +void recvfrom_byte(int fd, struct sockaddr *src_addr, socklen_t *addrlen,
> +		   int expected_ret, int flags);
>  void run_tests(const struct test_case *test_cases,
>  	       const struct test_opts *opts);
>  void list_tests(const struct test_case *test_cases);
> diff --git a/tools/testing/vsock/vsock_test.c b/tools/testing/vsock/vsock_test.c
> index ac1bd3ac1533..851c3d65178d 100644
> --- a/tools/testing/vsock/vsock_test.c
> +++ b/tools/testing/vsock/vsock_test.c
> @@ -202,6 +202,113 @@ static void test_stream_server_close_server(const struct test_opts *opts)
>  	close(fd);
>  }
>  
> +static void test_dgram_sendto_client(const struct test_opts *opts)
> +{
> +	union {
> +		struct sockaddr sa;
> +		struct sockaddr_vm svm;
> +	} addr = {
> +		.svm = {
> +			.svm_family = AF_VSOCK,
> +			.svm_port = 1234,
> +			.svm_cid = opts->peer_cid,
> +		},
> +	};
> +	int fd;
> +
> +	/* Wait for the server to be ready */
> +	control_expectln("BIND");
> +
> +	fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
> +	if (fd < 0) {
> +		perror("socket");
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	sendto_byte(fd, &addr.sa, sizeof(addr.svm), 1, 0);
> +
> +	/* Notify the server that the client has finished */
> +	control_writeln("DONE");
> +
> +	close(fd);
> +}
> +
> +static void test_dgram_sendto_server(const struct test_opts *opts)
> +{
> +	union {
> +		struct sockaddr sa;
> +		struct sockaddr_vm svm;
> +	} addr = {
> +		.svm = {
> +			.svm_family = AF_VSOCK,
> +			.svm_port = 1234,
> +			.svm_cid = VMADDR_CID_ANY,
> +		},
> +	};
> +	int fd;
> +	int len = sizeof(addr.sa);
> +
> +	fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
> 
> 
> 
> ^^^
> I think we can check 'socket()' return value;
> 
> 
> 
> 
> +
> +	if (bind(fd, &addr.sa, sizeof(addr.svm)) < 0) {
> +		perror("bind");
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	/* Notify the client that the server is ready */
> +	control_writeln("BIND");
> +
> +	recvfrom_byte(fd, &addr.sa, &len, 1, 0);
> +
> +	/* Wait for the client to finish */
> +	control_expectln("DONE");
> +
> +	close(fd);
> +}
> +
> +static void test_dgram_connect_client(const struct test_opts *opts)
> +{
> +	union {
> +		struct sockaddr sa;
> +		struct sockaddr_vm svm;
> +	} addr = {
> +		.svm = {
> +			.svm_family = AF_VSOCK,
> +			.svm_port = 1234,
> +			.svm_cid = opts->peer_cid,
> +		},
> +	};
> +	int fd;
> +	int ret;
> +
> +	/* Wait for the server to be ready */
> +	control_expectln("BIND");
> +
> +	fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
> +	if (fd < 0) {
> +		perror("bind");
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	ret = connect(fd, &addr.sa, sizeof(addr.svm));
> +	if (ret < 0) {
> +		perror("connect");
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	send_byte(fd, 1, 0);
> +
> +	/* Notify the server that the client has finished */
> +	control_writeln("DONE");
> +
> +	close(fd);
> +}
> +
> +static void test_dgram_connect_server(const struct test_opts *opts)
> +{
> +	test_dgram_sendto_server(opts);
> +}
> +
>  /* With the standard socket sizes, VMCI is able to support about 100
>   * concurrent stream connections.
>   */
> @@ -255,6 +362,77 @@ static void test_stream_multiconn_server(const struct test_opts *opts)
>  		close(fds[i]);
>  }
>  
> +static void test_dgram_multiconn_client(const struct test_opts *opts)
> +{
> +	int fds[MULTICONN_NFDS];
> +	int i;
> +	union {
> +		struct sockaddr sa;
> +		struct sockaddr_vm svm;
> +	} addr = {
> +		.svm = {
> +			.svm_family = AF_VSOCK,
> +			.svm_port = 1234,
> +			.svm_cid = opts->peer_cid,
> +		},
> +	};
> +
> +	/* Wait for the server to be ready */
> +	control_expectln("BIND");
> +
> +	for (i = 0; i < MULTICONN_NFDS; i++) {
> +		fds[i] = socket(AF_VSOCK, SOCK_DGRAM, 0);
> +		if (fds[i] < 0) {
> +			perror("socket");
> +			exit(EXIT_FAILURE);
> +		}
> +	}
> +
> +	for (i = 0; i < MULTICONN_NFDS; i++)
> +		sendto_byte(fds[i], &addr.sa, sizeof(addr.svm), 1, 0);
> +
> +	/* Notify the server that the client has finished */
> +	control_writeln("DONE");
> +
> +	for (i = 0; i < MULTICONN_NFDS; i++)
> +		close(fds[i]);
> +}
> +
> +static void test_dgram_multiconn_server(const struct test_opts *opts)
> +{
> +	union {
> +		struct sockaddr sa;
> +		struct sockaddr_vm svm;
> +	} addr = {
> +		.svm = {
> +			.svm_family = AF_VSOCK,
> +			.svm_port = 1234,
> +			.svm_cid = VMADDR_CID_ANY,
> +		},
> +	};
> +	int fd;
> +	int len = sizeof(addr.sa);
> +	int i;
> +
> +	fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
> 
> 
> 
> ^^^
> I think we can check 'socket()' return value;
> 
> 
> 
> +
> +	if (bind(fd, &addr.sa, sizeof(addr.svm)) < 0) {
> +		perror("bind");
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	/* Notify the client that the server is ready */
> +	control_writeln("BIND");
> +
> +	for (i = 0; i < MULTICONN_NFDS; i++)
> +		recvfrom_byte(fd, &addr.sa, &len, 1, 0);
> +
> +	/* Wait for the client to finish */
> +	control_expectln("DONE");
> +
> +	close(fd);
> +}
> +
>  static void test_stream_msg_peek_client(const struct test_opts *opts)
>  {
>  	int fd;
> @@ -1128,6 +1306,21 @@ static struct test_case test_cases[] = {
>  		.run_client = test_stream_virtio_skb_merge_client,
>  		.run_server = test_stream_virtio_skb_merge_server,
>  	},
> +	{
> +		.name = "SOCK_DGRAM client close",
> +		.run_client = test_dgram_sendto_client,
> +		.run_server = test_dgram_sendto_server,
> +	},
> +	{
> +		.name = "SOCK_DGRAM client connect",
> +		.run_client = test_dgram_connect_client,
> +		.run_server = test_dgram_connect_server,
> +	},
> +	{
> +		.name = "SOCK_DGRAM multiple connections",
> +		.run_client = test_dgram_multiconn_client,
> +		.run_server = test_dgram_multiconn_server,
> +	},
> 
> 
> 
> 
> SOCK_DGRAM guarantees message bounds, I think it will be good to add such test like in SOCK_SEQPACKET tests.
> 
> Thanks, Arseniy
> 
> 
>  	{},
>  };
>  
> 

^ permalink raw reply

* Re: [RFC PATCH V6 01/14] x86/sev: Add a #HV exception handler
From: Peter Zijlstra @ 2023-06-06  7:50 UTC (permalink / raw)
  To: Gupta, Pankaj
  Cc: Tom Lendacky, Tianyu Lan, luto, tglx, mingo, bp, dave.hansen, x86,
	hpa, seanjc, pbonzini, jgross, tiala, kirill, jiangshan.ljs,
	ashish.kalra, srutherford, akpm, anshuman.khandual,
	pawan.kumar.gupta, adrian.hunter, daniel.sneddon,
	alexander.shishkin, sandipan.das, ray.huang, brijesh.singh,
	michael.roth, venu.busireddy, sterritt, tony.luck, samitolvanen,
	fenghua.yu, pangupta, linux-kernel, kvm, linux-hyperv, linux-arch
In-Reply-To: <54fa0a4f-9b3e-50d7-57cb-e0d2d39b7761@amd.com>

On Tue, Jun 06, 2023 at 08:00:32AM +0200, Gupta, Pankaj wrote:
> 
> > > That should really say that a nested #HV should never be raised by the
> > > hypervisor, but if it is, then the guest should detect that and
> > > self-terminate knowing that the hypervisor is possibly being malicious.
> > 
> > I've yet to see code that can do that reliably.
> 
> - Currently, we are detecting the direct nested #HV with below check and
>   guest self terminate.
> 
>   <snip>
> 	if (get_stack_info_noinstr(stack, current, &info) &&
> 	    (info.type == (STACK_TYPE_EXCEPTION + ESTACK_HV) ||
> 	     info.type == (STACK_TYPE_EXCEPTION + ESTACK_HV2)))
> 		panic("Nested #HV exception, HV IST corrupted, stack
>                 type = %d\n", info.type);
>   </snip>
> 
> - Thinking about below solution to detect the nested
>   #HV reliably:
> 
>   -- Make reliable IST stack switching for #VC -> #HV -> #VC case
>      (similar to done in __sev_es_ist_enter/__sev_es_ist_exit for NMI
>      IST stack).

I'm not convinced any of that is actually correct; there is a *huge*
window between NMI hitting and calling __sev_es_ist_enter(), idem on the
exit side.

>   -- In addition to this, we can make nested #HV detection (with another
>      exception type) more reliable with refcounting (percpu?).

There is also #DB and the MOVSS shadow.

And no, I don't think any of that is what you'd call 'robust'. This is
what I call a trainwreck :/

And I'm more than willing to say no until the hardware is more sane.

Supervisor Shadow Stack support is in the same boat, that's on hold
until FRED makes things workable.

^ permalink raw reply

* Re: [RFC PATCH V6 01/14] x86/sev: Add a #HV exception handler
From: Gupta, Pankaj @ 2023-06-06  6:00 UTC (permalink / raw)
  To: Peter Zijlstra, Tom Lendacky
  Cc: Tianyu Lan, luto, tglx, mingo, bp, dave.hansen, x86, hpa, seanjc,
	pbonzini, jgross, tiala, kirill, jiangshan.ljs, ashish.kalra,
	srutherford, akpm, anshuman.khandual, pawan.kumar.gupta,
	adrian.hunter, daniel.sneddon, alexander.shishkin, sandipan.das,
	ray.huang, brijesh.singh, michael.roth, venu.busireddy, sterritt,
	tony.luck, samitolvanen, fenghua.yu, pangupta, linux-kernel, kvm,
	linux-hyperv, linux-arch
In-Reply-To: <20230530185232.GA211927@hirez.programming.kicks-ass.net>


>> That should really say that a nested #HV should never be raised by the
>> hypervisor, but if it is, then the guest should detect that and
>> self-terminate knowing that the hypervisor is possibly being malicious.
> 
> I've yet to see code that can do that reliably.

- Currently, we are detecting the direct nested #HV with below check and
   guest self terminate.

   <snip>
	if (get_stack_info_noinstr(stack, current, &info) &&
	    (info.type == (STACK_TYPE_EXCEPTION + ESTACK_HV) ||
	     info.type == (STACK_TYPE_EXCEPTION + ESTACK_HV2)))
		panic("Nested #HV exception, HV IST corrupted, stack
                 type = %d\n", info.type);
   </snip>

- Thinking about below solution to detect the nested
   #HV reliably:

   -- Make reliable IST stack switching for #VC -> #HV -> #VC case
      (similar to done in __sev_es_ist_enter/__sev_es_ist_exit for NMI
      IST stack).

   -- In addition to this, we can make nested #HV detection (with another
      exception type) more reliable with refcounting (percpu?).

Need your inputs before I implement this solution. Or any other idea in 
software you have in mind?

Thanks,
Pankaj


^ permalink raw reply

* Re: [PATCH RFC net-next v3 0/8] virtio/vsock: support datagrams
From: Arseniy Krasnov @ 2023-06-05 20:42 UTC (permalink / raw)
  To: Bobby Eshleman, Stefan Hajnoczi, Stefano Garzarella,
	Michael S. Tsirkin, Jason Wang, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Bryan Tan
  Cc: kvm, virtualization, netdev, linux-kernel, linux-hyperv,
	Bobby Eshleman, Jiang Wang
In-Reply-To: <20230413-b4-vsock-dgram-v3-0-c2414413ef6a@bytedance.com>

Hello Bobby!

Thanks for this patchset, really interesting!

I applied it on head:

https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=d20dd0ea14072e8a90ff864b2c1603bd68920b4b

And tried to run ./vsock_test (client in the guest, server in the host), I had the following crash:

Control socket connected to 192.168.1.1:12345.                          
0 - SOCK_STREAM connection reset...                                     
[    8.050215] BUG: kernel NULL pointer derefer                         
[    8.050960] #PF: supervisor read access in kernel mode               
[    8.050960] #PF: error_code(0x0000) - not-present page               
[    8.050960] PGD 0 P4D 0                                              
[    8.050960] Oops: 0000 [#1] PREEMPT SMP PTI                          
[    8.050960] CPU: 0 PID: 109 Comm: vsock_test Not tainted 6.4.0-rc3-gd707c220a700
[    8.050960] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14
[    8.050960] RIP: 0010:static_key_count+0x0/0x20                      
[    8.050960] Code: 04 4c 8b 46 08 49 29 c0 4c 01 c8 4c 89 47 08 89 0e 89 56 04 4f
[    8.050960] RSP: 0018:ffffa9a1c021bdc0 EFLAGS: 00010202              
[    8.050960] RAX: ffffffffac309880 RBX: ffffffffc02fc140 RCX: 0000000000000000
[    8.050960] RDX: ffff9a5eff944600 RSI: 0000000000000000 RDI: 0000000000000000
[    8.050960] RBP: ffff9a5ec2371900 R08: ffffa9a1c021bd30 R09: ffff9a5eff98e0c0
[    8.050960] R10: 0000000000001000 R11: 0000000000000000 R12: ffffa9a1c021be80
[    8.050960] R13: 0000000000000000 R14: 0000000000000002 R15: ffff9a5ec1cfca80
[    8.050960] FS:  00007fa9bf88c5c0(0000) GS:ffff9a5efe400000(0000) knlGS:00000000
[    8.050960] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033        
[    8.050960] CR2: 0000000000000000 CR3: 00000000023e0000 CR4: 00000000000006f0
[    8.050960] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    8.050960] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    8.050960] Call Trace:                                              
[    8.050960]  <TASK>                                                  
[    8.050960]  once_deferred+0xd/0x30                                  
[    8.050960]  vsock_assign_transport+0xa2/0x1b0 [vsock]               
[    8.050960]  vsock_connect+0xb4/0x3a0 [vsock]                        
[    8.050960]  ? var_wake_function+0x60/0x60                           
[    8.050960]  __sys_connect+0x9e/0xd0                                 
[    8.050960]  ? _raw_spin_unlock_irq+0xe/0x30                         
[    8.050960]  ? do_setitimer+0x128/0x1f0                              
[    8.050960]  ? alarm_setitimer+0x4c/0x90                             
[    8.050960]  ? fpregs_assert_state_consistent+0x1d/0x50              
[    8.050960]  ? exit_to_user_mode_prepare+0x36/0x130                  
[    8.050960]  __x64_sys_connect+0x11/0x20                             
[    8.050960]  do_syscall_64+0x3b/0xc0                                 
[    8.050960]  entry_SYSCALL_64_after_hwframe+0x4b/0xb5                
[    8.050960] RIP: 0033:0x7fa9bf7c4d13                                 
[    8.050960] Code: 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 48
[    8.050960] RSP: 002b:00007ffdf2d96cc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000a
[    8.050960] RAX: ffffffffffffffda RBX: 0000560c305d0020 RCX: 00007fa9bf7c4d13
[    8.050960] RDX: 0000000000000010 RSI: 00007ffdf2d96ce0 RDI: 0000000000000004
[    8.050960] RBP: 0000000000000004 R08: 0000560c317dc018 R09: 0000000000000000
[    8.050960] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[    8.050960] R13: 0000560c305ccc2d R14: 00007ffdf2d96ce0 R15: 00007ffdf2d96d70
[    8.050960]  </TASK>  


I guess crash is somewhere near:

old_info->transport->release(vsk); in vsock_assign_transport(). May be my config is wrong...

Thanks, Arseniy

^ permalink raw reply

* RE: Fwd: nvsp_rndis_pkt_complete error status and net_ratelimit: callbacks suppressed messages on 6.4.0rc4
From: Michael Kelley (LINUX) @ 2023-06-05 14:08 UTC (permalink / raw)
  To: Linux regressions mailing list
  Cc: Haiyang Zhang, Paolo Abeni, David S. Miller, Jakub Kicinski,
	Linux Kernel Mailing List, Linux BPF, Linux on Hyper-V,
	Linux Kernel Network Developers, Bagas Sanjaya
In-Reply-To: <ebdd877d-f143-487c-04cd-606996eb6176@leemhuis.info>

From: Linux regression tracking (Thorsten Leemhuis) <regressions@leemhuis.info> Sent: Monday, June 5, 2023 3:28 AM
> 
> Hi, Thorsten here, the Linux kernel's regression tracker.
> 
> On 30.05.23 14:25, Bagas Sanjaya wrote:
> >
> > I notice a regression report on Bugzilla [1]. Quoting from it:
> 
> Hmmm, nobody replied to this yet (or am I missing something?). Doesn't
> seems like it's something urgent, but nevertheless:
> 
> Michael, Bagas didn't make it obvious at all, hence please allow me to
> ask: did you notice that this is a regression that is apparently caused
> by a commit of yours?
> 

Thanks for flagging this directly to me.  I'll check on it.

Michael

^ permalink raw reply

* Re: [PATCH 1/1] RDMA/mana_ib: Add EQ interrupt support to mana ib driver.
From: Jason Gunthorpe @ 2023-06-05 13:35 UTC (permalink / raw)
  To: Simon Horman
  Cc: Wei Hu, netdev, linux-hyperv, linux-rdma, longli, sharmaajay,
	leon, kys, haiyangz, wei.liu, decui, davem, edumazet, kuba,
	pabeni, vkuznets, ssengar, shradhagupta
In-Reply-To: <ZH3f2abyRU1l/dq6@corigine.com>

On Mon, Jun 05, 2023 at 03:15:05PM +0200, Simon Horman wrote:
> On Mon, Jun 05, 2023 at 11:43:13AM +0000, Wei Hu wrote:
> > Add EQ interrupt support for mana ib driver. Allocate EQs per ucontext
> > to receive interrupt. Attach EQ when CQ is created. Call CQ interrupt
> > handler when completion interrupt happens. EQs are destroyed when
> > ucontext is deallocated.
> > 
> > The change calls some public APIs in mana ethernet driver to
> > allocate EQs and other resources. Ehe EQ process routine is also shared
> > by mana ethernet and mana ib drivers.
> > 
> > Co-developed-by: Ajay Sharma <sharmaajay@microsoft.com>
> > Signed-off-by: Ajay Sharma <sharmaajay@microsoft.com>
> > Signed-off-by: Wei Hu <weh@microsoft.com>
> 
> ...
> 
> > @@ -368,6 +420,24 @@ static int mana_ib_create_qp_raw(struct ib_qp *ibqp, struct ib_pd *ibpd,
> >  	qp->sq_id = wq_spec.queue_index;
> >  	send_cq->id = cq_spec.queue_index;
> >  
> > +	if (gd->gdma_context->cq_table[send_cq->id] == NULL) {
> > +
> > +		gdma_cq = kzalloc(sizeof(*gdma_cq), GFP_KERNEL);
> > +		if (!gdma_cq) {
> > +			pr_err("failed to allocate gdma_cq\n");
> 
> Hi wei Hu,
> 
> I think 'err = -ENOMEM' is needed here.

And no prints like that in drivers.

Jason

^ permalink raw reply

* Re: [PATCH 1/1] RDMA/mana_ib: Add EQ interrupt support to mana ib driver.
From: Simon Horman @ 2023-06-05 13:15 UTC (permalink / raw)
  To: Wei Hu
  Cc: netdev, linux-hyperv, linux-rdma, longli, sharmaajay, jgg, leon,
	kys, haiyangz, wei.liu, decui, davem, edumazet, kuba, pabeni,
	vkuznets, ssengar, shradhagupta
In-Reply-To: <20230605114313.1640883-1-weh@microsoft.com>

On Mon, Jun 05, 2023 at 11:43:13AM +0000, Wei Hu wrote:
> Add EQ interrupt support for mana ib driver. Allocate EQs per ucontext
> to receive interrupt. Attach EQ when CQ is created. Call CQ interrupt
> handler when completion interrupt happens. EQs are destroyed when
> ucontext is deallocated.
> 
> The change calls some public APIs in mana ethernet driver to
> allocate EQs and other resources. Ehe EQ process routine is also shared
> by mana ethernet and mana ib drivers.
> 
> Co-developed-by: Ajay Sharma <sharmaajay@microsoft.com>
> Signed-off-by: Ajay Sharma <sharmaajay@microsoft.com>
> Signed-off-by: Wei Hu <weh@microsoft.com>

...

> @@ -368,6 +420,24 @@ static int mana_ib_create_qp_raw(struct ib_qp *ibqp, struct ib_pd *ibpd,
>  	qp->sq_id = wq_spec.queue_index;
>  	send_cq->id = cq_spec.queue_index;
>  
> +	if (gd->gdma_context->cq_table[send_cq->id] == NULL) {
> +
> +		gdma_cq = kzalloc(sizeof(*gdma_cq), GFP_KERNEL);
> +		if (!gdma_cq) {
> +			pr_err("failed to allocate gdma_cq\n");

Hi wei Hu,

I think 'err = -ENOMEM' is needed here.

> +			goto err_destroy_wqobj_and_cq;
> +		}
> +
> +		pr_debug("gdma cq allocated %p\n", gdma_cq);
> +		gdma_cq->cq.context = send_cq;
> +		gdma_cq->type = GDMA_CQ;
> +		gdma_cq->cq.callback = mana_ib_cq_handler;
> +		gdma_cq->id = send_cq->id;
> +		gd->gdma_context->cq_table[send_cq->id] = gdma_cq;
> +	} else {
> +		gdma_cq = gd->gdma_context->cq_table[send_cq->id];
> +	}
> +
>  	ibdev_dbg(&mdev->ib_dev,
>  		  "ret %d qp->tx_object 0x%llx sq id %llu cq id %llu\n", err,

^ permalink raw reply

* Re: [PATCH 5/9] x86/hyperv: Use vmmcall to implement Hyper-V hypercall in sev-snp enlightened guest
From: Vitaly Kuznetsov @ 2023-06-05 13:00 UTC (permalink / raw)
  To: Tianyu Lan, kys, haiyangz, wei.liu, decui, tglx, mingo, bp,
	dave.hansen, x86, hpa, daniel.lezcano, arnd, michael.h.kelley
  Cc: Tianyu Lan, linux-arch, linux-hyperv, linux-kernel
In-Reply-To: <20230601151624.1757616-6-ltykernel@gmail.com>

Tianyu Lan <ltykernel@gmail.com> writes:

> From: Tianyu Lan <tiala@microsoft.com>
>
> In sev-snp enlightened guest, Hyper-V hypercall needs
> to use vmmcall to trigger vmexit and notify hypervisor
> to handle hypercall request.
>
> There is no x86 SEV SNP feature flag support so far and
> hardware provides MSR_AMD64_SEV register to check SEV-SNP
> capability with MSR_AMD64_SEV_ENABLED bit. ALTERNATIVE can't
> work without SEV-SNP x86 feature flag. May add later when
> the associated flag is introduced. 
>
> Signed-off-by: Tianyu Lan <tiala@microsoft.com>
> ---
>  arch/x86/include/asm/mshyperv.h | 44 ++++++++++++++++++++++++---------
>  1 file changed, 33 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index 31c476f4e656..d859d7c5f5e8 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
> @@ -61,16 +61,25 @@ static inline u64 hv_do_hypercall(u64 control, void *input, void *output)
>  	u64 hv_status;
>  
>  #ifdef CONFIG_X86_64
> -	if (!hv_hypercall_pg)
> -		return U64_MAX;
> +	if (hv_isolation_type_en_snp()) {

Would it be possible to redo 'hv_isolation_type_en_snp()' into a static
inline doing static_branch_unlikely() so we avoid function call penalty
here?

> +		__asm__ __volatile__("mov %4, %%r8\n"
> +				     "vmmcall"
> +				     : "=a" (hv_status), ASM_CALL_CONSTRAINT,
> +				       "+c" (control), "+d" (input_address)
> +				     :  "r" (output_address)
> +				     : "cc", "memory", "r8", "r9", "r10", "r11");
> +	} else {
> +		if (!hv_hypercall_pg)
> +			return U64_MAX;
>  
> -	__asm__ __volatile__("mov %4, %%r8\n"
> -			     CALL_NOSPEC
> -			     : "=a" (hv_status), ASM_CALL_CONSTRAINT,
> -			       "+c" (control), "+d" (input_address)
> -			     :  "r" (output_address),
> -				THUNK_TARGET(hv_hypercall_pg)
> -			     : "cc", "memory", "r8", "r9", "r10", "r11");
> +		__asm__ __volatile__("mov %4, %%r8\n"
> +				     CALL_NOSPEC
> +				     : "=a" (hv_status), ASM_CALL_CONSTRAINT,
> +				       "+c" (control), "+d" (input_address)
> +				     :  "r" (output_address),
> +					THUNK_TARGET(hv_hypercall_pg)
> +				     : "cc", "memory", "r8", "r9", "r10", "r11");
> +	}
>  #else
>  	u32 input_address_hi = upper_32_bits(input_address);
>  	u32 input_address_lo = lower_32_bits(input_address);
> @@ -104,7 +113,13 @@ static inline u64 _hv_do_fast_hypercall8(u64 control, u64 input1)
>  	u64 hv_status;
>  
>  #ifdef CONFIG_X86_64
> -	{
> +	if (hv_isolation_type_en_snp()) {
> +		__asm__ __volatile__(
> +				"vmmcall"
> +				: "=a" (hv_status), ASM_CALL_CONSTRAINT,
> +				"+c" (control), "+d" (input1)
> +				:: "cc", "r8", "r9", "r10", "r11");
> +	} else {
>  		__asm__ __volatile__(CALL_NOSPEC
>  				     : "=a" (hv_status), ASM_CALL_CONSTRAINT,
>  				       "+c" (control), "+d" (input1)
> @@ -149,7 +164,14 @@ static inline u64 _hv_do_fast_hypercall16(u64 control, u64 input1, u64 input2)
>  	u64 hv_status;
>  
>  #ifdef CONFIG_X86_64
> -	{
> +	if (hv_isolation_type_en_snp()) {
> +		__asm__ __volatile__("mov %4, %%r8\n"
> +				     "vmmcall"
> +				     : "=a" (hv_status), ASM_CALL_CONSTRAINT,
> +				       "+c" (control), "+d" (input1)
> +				     : "r" (input2)
> +				     : "cc", "r8", "r9", "r10", "r11");
> +	} else {
>  		__asm__ __volatile__("mov %4, %%r8\n"
>  				     CALL_NOSPEC
>  				     : "=a" (hv_status), ASM_CALL_CONSTRAINT,

-- 
Vitaly


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox