* [PATCH RESEND v9 36/36] x86/fred: Disable FRED by default in its early stage
From: Xin Li @ 2023-08-01 8:35 UTC (permalink / raw)
To: linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm, xen-devel
Cc: Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H . Peter Anvin, Andy Lutomirski, Oleg Nesterov,
Tony Luck, K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov, Sean Christopherson,
Peter Zijlstra, Juergen Gross, Stefano Stabellini,
Oleksandr Tyshchenko, Josh Poimboeuf, Paul E . McKenney,
Catalin Marinas, Randy Dunlap, Steven Rostedt, Kim Phillips,
Xin Li, Hyeonggon Yoo, Liam R . Howlett, Sebastian Reichel,
Kirill A . Shutemov, Suren Baghdasaryan, Pawan Gupta, Babu Moger,
Jim Mattson, Sandipan Das, Lai Jiangshan, Hans de Goede,
Reinette Chatre, Daniel Sneddon, Breno Leitao, Nikunj A Dadhania,
Brian Gerst, Sami Tolvanen, Alexander Potapenko, Andrew Morton,
Arnd Bergmann, Eric W . Biederman, Kees Cook, Masami Hiramatsu,
Masahiro Yamada, Ze Gao, Fei Li, Conghui, Ashok Raj,
Jason A . Donenfeld, Mark Rutland, Jacob Pan, Jiapeng Chong,
Jane Malalane, David Woodhouse, Boris Ostrovsky,
Arnaldo Carvalho de Melo, Yantengsi, Christophe Leroy,
Sathvika Vasireddy
In-Reply-To: <20230801083553.8468-1-xin3.li@intel.com>
Disable FRED by default in its early stage.
To enable FRED, a new kernel command line option "fred" needs to be added.
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
Changes since v7:
* Add a log message when FRED is enabled.
---
Documentation/admin-guide/kernel-parameters.txt | 4 ++++
arch/x86/kernel/cpu/common.c | 3 +++
arch/x86/kernel/fred.c | 3 +++
3 files changed, 10 insertions(+)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index a1457995fd41..cb12decfcdc0 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1513,6 +1513,10 @@
Warning: use of this parameter will taint the kernel
and may cause unknown problems.
+ fred
+ Forcefully enable flexible return and event delivery,
+ which is otherwise disabled by default.
+
ftrace=[tracer]
[FTRACE] will set and start the specified tracer
as early as possible in order to facilitate early
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index b34a8a138755..38cf4f64a56e 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1451,6 +1451,9 @@ static void __init cpu_parse_early_param(void)
char *argptr = arg, *opt;
int arglen, taint = 0;
+ if (!cmdline_find_option_bool(boot_command_line, "fred"))
+ setup_clear_cpu_cap(X86_FEATURE_FRED);
+
#ifdef CONFIG_X86_32
if (cmdline_find_option_bool(boot_command_line, "no387"))
#ifdef CONFIG_MATH_EMULATION
diff --git a/arch/x86/kernel/fred.c b/arch/x86/kernel/fred.c
index 7fdf79c964a8..a4a726ea9fc2 100644
--- a/arch/x86/kernel/fred.c
+++ b/arch/x86/kernel/fred.c
@@ -8,6 +8,9 @@
void cpu_init_fred_exceptions(void)
{
+ /* When FRED is enabled by default, this log message may not needed */
+ pr_info("Initialize FRED on CPU%d\n", smp_processor_id());
+
wrmsrl(MSR_IA32_FRED_CONFIG,
/* Reserve for CALL emulation */
FRED_CONFIG_REDZONE |
--
2.34.1
^ permalink raw reply related
* Re: [PATCH RESEND v9 00/36] x86: enable FRED for x86-64
From: Peter Zijlstra @ 2023-08-01 10:52 UTC (permalink / raw)
To: Xin Li
Cc: linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm, xen-devel,
Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H . Peter Anvin, Andy Lutomirski, Oleg Nesterov,
Tony Luck, K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov, Sean Christopherson,
Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
Josh Poimboeuf, Paul E . McKenney, Catalin Marinas, Randy Dunlap,
Steven Rostedt, Kim Phillips, Hyeonggon Yoo, Liam R . Howlett,
Sebastian Reichel, Kirill A . Shutemov, Suren Baghdasaryan,
Pawan Gupta, Babu Moger, Jim Mattson, Sandipan Das, Lai Jiangshan,
Hans de Goede, Reinette Chatre, Daniel Sneddon, Breno Leitao,
Nikunj A Dadhania, Brian Gerst, Sami Tolvanen,
Alexander Potapenko, Andrew Morton, Arnd Bergmann,
Eric W . Biederman, Kees Cook, Masami Hiramatsu, Masahiro Yamada,
Ze Gao, Fei Li, Conghui, Ashok Raj, Jason A . Donenfeld,
Mark Rutland, Jacob Pan, Jiapeng Chong, Jane Malalane,
David Woodhouse, Boris Ostrovsky, Arnaldo Carvalho de Melo,
Yantengsi, Christophe Leroy, Sathvika Vasireddy
In-Reply-To: <20230801083318.8363-1-xin3.li@intel.com>
On Tue, Aug 01, 2023 at 01:32:42AM -0700, Xin Li wrote:
> Resend because the mail system failed to deliver some messages yesterday.
Well, you need to figure out how to send patches, because both yesterday
and today are screwy.
The one from yesterday came in 6 thread groups: 0-25, 26, 27, 28, 29, 30-36,
while the one from today comes in 2 thread groups: 0-26, 27-36. Which I
suppose one can count as an improvement :/
Seriously, it should not be hard to send 36 patches in a single thread.
I see you're trying to send through the regular corporate email
trainwreck; do you have a linux.intel.com account? Or really anything
else besides intel.com? You can try sending the series to yourself to
see if it arrives correctly as a whole before sending it out to the list
again.
I also believe there is a kernel.org service for sending patch series,
but i'm not sure I remember the details.
^ permalink raw reply
* [PATCH V7 net] net: mana: Fix MANA VF unload when hardware is
From: Souradeep Chakrabarti @ 2023-08-01 12:29 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, davem, edumazet, kuba, pabeni,
longli, sharmaajay, leon, cai.huoqing, ssengar, vkuznets, tglx,
linux-hyperv, netdev, linux-kernel, linux-rdma
Cc: schakrabarti, Souradeep Chakrabarti, stable
When unloading the MANA driver, mana_dealloc_queues() waits for the MANA
hardware to complete any inflight packets and set the pending send count
to zero. But if the hardware has failed, mana_dealloc_queues()
could wait forever.
Fix this by adding a timeout to the wait. Set the timeout to 120 seconds,
which is a somewhat arbitrary value that is more than long enough for
functional hardware to complete any sends.
Cc: stable@vger.kernel.org
Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
Signed-off-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
---
V6 -> V7:
* Optimized the while loop for freeing skb.
V5 -> V6:
* Added pcie_flr to reset the pci after timeout.
* Fixed the position of changelog.
* Removed unused variable like cq.
V4 -> V5:
* Added fixes tag
* Changed the usleep_range from static to incremental value.
* Initialized timeout in the begining.
V3 -> V4:
* Removed the unnecessary braces from mana_dealloc_queues().
V2 -> V3:
* Removed the unnecessary braces from mana_dealloc_queues().
V1 -> V2:
* Added net branch
* Removed the typecasting to (struct mana_context*) of void pointer
* Repositioned timeout variable in mana_dealloc_queues()
* Repositioned vf_unload_timeout in mana_context struct, to utilise the
6 bytes hole
---
drivers/net/ethernet/microsoft/mana/mana_en.c | 37 +++++++++++++++++--
1 file changed, 33 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index a499e460594b..3c5552a176d0 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -8,6 +8,7 @@
#include <linux/ethtool.h>
#include <linux/filter.h>
#include <linux/mm.h>
+#include <linux/pci.h>
#include <net/checksum.h>
#include <net/ip6_checksum.h>
@@ -2345,9 +2346,12 @@ int mana_attach(struct net_device *ndev)
static int mana_dealloc_queues(struct net_device *ndev)
{
struct mana_port_context *apc = netdev_priv(ndev);
+ unsigned long timeout = jiffies + 120 * HZ;
struct gdma_dev *gd = apc->ac->gdma_dev;
struct mana_txq *txq;
+ struct sk_buff *skb;
int i, err;
+ u32 tsleep;
if (apc->port_is_up)
return -EINVAL;
@@ -2363,15 +2367,40 @@ static int mana_dealloc_queues(struct net_device *ndev)
* to false, but it doesn't matter since mana_start_xmit() drops any
* new packets due to apc->port_is_up being false.
*
- * Drain all the in-flight TX packets
+ * Drain all the in-flight TX packets.
+ * A timeout of 120 seconds for all the queues is used.
+ * This will break the while loop when h/w is not responding.
+ * This value of 120 has been decided here considering max
+ * number of queues.
*/
+
for (i = 0; i < apc->num_queues; i++) {
txq = &apc->tx_qp[i].txq;
-
- while (atomic_read(&txq->pending_sends) > 0)
- usleep_range(1000, 2000);
+ tsleep = 1000;
+ while (atomic_read(&txq->pending_sends) > 0 &&
+ time_before(jiffies, timeout)) {
+ usleep_range(tsleep, tsleep + 1000);
+ tsleep <<= 1;
+ }
+ if (atomic_read(&txq->pending_sends)) {
+ err = pcie_flr(to_pci_dev(gd->gdma_context->dev));
+ if (err) {
+ netdev_err(ndev, "flr failed %d with %d pkts pending in txq %u\n",
+ err, atomic_read(&txq->pending_sends),
+ txq->gdma_txq_id);
+ }
+ break;
+ }
}
+ for (i = 0; i < apc->num_queues; i++) {
+ txq = &apc->tx_qp[i].txq;
+ while (skb = skb_dequeue(&txq->pending_skbs)) {
+ mana_unmap_skb(skb, apc);
+ dev_consume_skb_any(skb);
+ }
+ atomic_set(&txq->pending_sends, 0);
+ }
/* We're 100% sure the queues can no longer be woken up, because
* we're sure now mana_poll_tx_cq() can't be running.
*/
--
2.34.1
^ permalink raw reply related
* Re: [PATCH RESEND v9 00/36] x86: enable FRED for x86-64
From: Peter Zijlstra @ 2023-08-01 13:02 UTC (permalink / raw)
To: Xin Li
Cc: linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm, xen-devel,
Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H . Peter Anvin, Andy Lutomirski, Oleg Nesterov,
Tony Luck, K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov, Sean Christopherson,
Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
Josh Poimboeuf, Paul E . McKenney, Catalin Marinas, Randy Dunlap,
Steven Rostedt, Kim Phillips, Hyeonggon Yoo, Liam R . Howlett,
Sebastian Reichel, Kirill A . Shutemov, Suren Baghdasaryan,
Pawan Gupta, Babu Moger, Jim Mattson, Sandipan Das, Lai Jiangshan,
Hans de Goede, Reinette Chatre, Daniel Sneddon, Breno Leitao,
Nikunj A Dadhania, Brian Gerst, Sami Tolvanen,
Alexander Potapenko, Andrew Morton, Arnd Bergmann,
Eric W . Biederman, Kees Cook, Masami Hiramatsu, Masahiro Yamada,
Ze Gao, Fei Li, Conghui, Ashok Raj, Jason A . Donenfeld,
Mark Rutland, Jacob Pan, Jiapeng Chong, Jane Malalane,
David Woodhouse, Boris Ostrovsky, Arnaldo Carvalho de Melo,
Yantengsi, Christophe Leroy, Sathvika Vasireddy
In-Reply-To: <20230801105236.GB79828@hirez.programming.kicks-ass.net>
On Tue, Aug 01, 2023 at 12:52:36PM +0200, Peter Zijlstra wrote:
> I also believe there is a kernel.org service for sending patch series,
> but i'm not sure I remember the details.
https://b4.docs.kernel.org/en/latest/contributor/send.html
^ permalink raw reply
* RE: [PATCH V4,net-next] net: mana: Add page pool for RX buffers
From: Haiyang Zhang @ 2023-08-01 14:19 UTC (permalink / raw)
To: Jakub Kicinski
Cc: linux-hyperv@vger.kernel.org, netdev@vger.kernel.org, Dexuan Cui,
KY Srinivasan, Paul Rosswurm, olaf@aepfle.de, vkuznets,
davem@davemloft.net, wei.liu@kernel.org, edumazet@google.com,
pabeni@redhat.com, leon@kernel.org, Long Li,
ssengar@linux.microsoft.com, linux-rdma@vger.kernel.org,
daniel@iogearbox.net, john.fastabend@gmail.com,
bpf@vger.kernel.org, ast@kernel.org, Ajay Sharma, hawk@kernel.org,
tglx@linutronix.de, shradhagupta@linux.microsoft.com,
linux-kernel@vger.kernel.org
In-Reply-To: <20230731173119.3ca14894@kernel.org>
> -----Original Message-----
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Monday, July 31, 2023 8:31 PM
> To: Haiyang Zhang <haiyangz@microsoft.com>
> Cc: linux-hyperv@vger.kernel.org; netdev@vger.kernel.org; Dexuan Cui
> <decui@microsoft.com>; KY Srinivasan <kys@microsoft.com>; Paul Rosswurm
> <paulros@microsoft.com>; olaf@aepfle.de; vkuznets@redhat.com;
> davem@davemloft.net; wei.liu@kernel.org; edumazet@google.com;
> pabeni@redhat.com; leon@kernel.org; Long Li <longli@microsoft.com>;
> ssengar@linux.microsoft.com; linux-rdma@vger.kernel.org;
> daniel@iogearbox.net; john.fastabend@gmail.com; bpf@vger.kernel.org;
> ast@kernel.org; Ajay Sharma <sharmaajay@microsoft.com>; hawk@kernel.org;
> tglx@linutronix.de; shradhagupta@linux.microsoft.com; linux-
> kernel@vger.kernel.org
> Subject: Re: [PATCH V4,net-next] net: mana: Add page pool for RX buffers
>
> On Fri, 28 Jul 2023 14:46:07 -0700 Haiyang Zhang wrote:
> > static void *mana_get_rxfrag(struct mana_rxq *rxq, struct device *dev,
> > - dma_addr_t *da, bool is_napi)
> > + dma_addr_t *da, bool *from_pool, bool is_napi)
> > {
> > struct page *page;
> > void *va;
> >
> > + *from_pool = false;
> > +
> > /* Reuse XDP dropped page if available */
> > if (rxq->xdp_save_va) {
> > va = rxq->xdp_save_va;
> > @@ -1533,17 +1543,22 @@ static void *mana_get_rxfrag(struct mana_rxq
> *rxq, struct device *dev,
> > return NULL;
> > }
> > } else {
> > - page = dev_alloc_page();
> > + page = page_pool_dev_alloc_pages(rxq->page_pool);
> > if (!page)
> > return NULL;
> >
> > + *from_pool = true;
> > va = page_to_virt(page);
> > }
> >
> > *da = dma_map_single(dev, va + rxq->headroom, rxq->datasize,
> > DMA_FROM_DEVICE);
> > if (dma_mapping_error(dev, *da)) {
> > - put_page(virt_to_head_page(va));
> > + if (*from_pool)
> > + page_pool_put_full_page(rxq->page_pool, page,
> is_napi);
>
> AFAICT you only pass the is_napi to recycle in case of error?
> It's fine to always pass in false, passing true enables some
> optimizations but it's not worth trying to optimize error paths.
Yes, this place is only for the error path. I will pass in false.
>
> Otherwise you may be passing in true, even tho budget was 0,
> see the recently added warnings in this doc:
>
> https://www.ker/
> nel.org%2Fdoc%2Fhtml%2Fnext%2Fnetworking%2Fnapi.html&data=05%7C01%7
> Chaiyangz%40microsoft.com%7C82fcd2d20fe54dd8cd4808db9226a2c7%7C72f9
> 88bf86f141af91ab2d7cd011db47%7C1%7C0%7C638264466881746523%7CUnkn
> own%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1ha
> WwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=D9ac4TnYOSPwGmk%2FKX
> G4buu%2FKT7J%2BuH8lAJNPl%2FRWy4%3D&reserved=0
>
> In general the driver seems to be processing Rx regardless
> of budget? This looks like a bug which should be fixed with
> a separate patch for the net tree..
Thanks, I will look into and fix this in a separate patch.
- Haiyang
^ permalink raw reply
* Re: [PATCH V7 net] net: mana: Fix MANA VF unload when hardware is
From: Simon Horman @ 2023-08-01 15:31 UTC (permalink / raw)
To: Souradeep Chakrabarti
Cc: kys, haiyangz, wei.liu, decui, davem, edumazet, kuba, pabeni,
longli, sharmaajay, leon, cai.huoqing, ssengar, vkuznets, tglx,
linux-hyperv, netdev, linux-kernel, linux-rdma, schakrabarti,
stable
In-Reply-To: <1690892953-25201-1-git-send-email-schakrabarti@linux.microsoft.com>
On Tue, Aug 01, 2023 at 05:29:13AM -0700, Souradeep Chakrabarti wrote:
...
Hi Souradeep,
> + for (i = 0; i < apc->num_queues; i++) {
> + txq = &apc->tx_qp[i].txq;
> + while (skb = skb_dequeue(&txq->pending_skbs)) {
W=1 builds with both clang-16 and gcc-12 complain that
they would like an extra set of parentheses around
an assignment used as a truth value.
> + mana_unmap_skb(skb, apc);
> + dev_consume_skb_any(skb);
> + }
> + atomic_set(&txq->pending_sends, 0);
> + }
> /* We're 100% sure the queues can no longer be woken up, because
> * we're sure now mana_poll_tx_cq() can't be running.
> */
> --
> 2.34.1
>
>
^ permalink raw reply
* Re: [PATCH RFC net-next v5 10/14] virtio/vsock: add VIRTIO_VSOCK_F_DGRAM feature bit
From: Stefano Garzarella @ 2023-08-01 16:04 UTC (permalink / raw)
To: Bobby Eshleman
Cc: Michael S. Tsirkin, Bobby Eshleman, linux-hyperv, Stefan Hajnoczi,
kvm, VMware PV-Drivers Reviewers, Simon Horman, virtualization,
Eric Dumazet, Dan Carpenter, Xuan Zhuo, Wei Liu, Dexuan Cui,
Bryan Tan, Jakub Kicinski, Paolo Abeni, Haiyang Zhang,
Krasnov Arseniy, Vishnu Dasa, Jiang Wang, netdev, linux-kernel,
bpf, David S. Miller
In-Reply-To: <ZMiKXh173b/3Pj1L@bullseye>
On Tue, Aug 01, 2023 at 04:30:22AM +0000, Bobby Eshleman wrote:
>On Thu, Jul 27, 2023 at 09:48:21AM +0200, Stefano Garzarella wrote:
>> On Wed, Jul 26, 2023 at 02:38:08PM -0400, Michael S. Tsirkin wrote:
>> > On Wed, Jul 19, 2023 at 12:50:14AM +0000, Bobby Eshleman wrote:
>> > > This commit adds a feature bit for virtio vsock to support datagrams.
>> > >
>> > > Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
>> > > Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
>> > > ---
>> > > include/uapi/linux/virtio_vsock.h | 1 +
>> > > 1 file changed, 1 insertion(+)
>> > >
>> > > diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
>> > > index 331be28b1d30..27b4b2b8bf13 100644
>> > > --- a/include/uapi/linux/virtio_vsock.h
>> > > +++ b/include/uapi/linux/virtio_vsock.h
>> > > @@ -40,6 +40,7 @@
>> > >
>> > > /* The feature bitmap for virtio vsock */
>> > > #define VIRTIO_VSOCK_F_SEQPACKET 1 /* SOCK_SEQPACKET supported */
>> > > +#define VIRTIO_VSOCK_F_DGRAM 3 /* SOCK_DGRAM supported */
>> > >
>> > > struct virtio_vsock_config {
>> > > __le64 guest_cid;
>> >
>> > pls do not add interface without first getting it accepted in the
>> > virtio spec.
>>
>> Yep, fortunatelly this series is still RFC.
>> I think by now we've seen that the implementation is doable, so we
>> should discuss the changes to the specification ASAP. Then we can
>> merge the series.
>>
>> @Bobby can you start the discussion about spec changes?
>>
>
>No problem at all. Am I right to assume that a new patch to the spec is
>the standard starting point for discussion?
Yep, I think so!
Thanks,
Stefano
^ permalink raw reply
* RE: [PATCH RESEND v9 00/36] x86: enable FRED for x86-64
From: Li, Xin3 @ 2023-08-01 17:09 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-edac@vger.kernel.org, linux-hyperv@vger.kernel.org,
kvm@vger.kernel.org, xen-devel@lists.xenproject.org,
Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86@kernel.org, H . Peter Anvin, Lutomirski, Andy,
Oleg Nesterov, Luck, Tony, K . Y . Srinivasan, Haiyang Zhang,
Wei Liu, Cui, Dexuan, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
Christopherson,, Sean, Gross, Jurgen, Stefano Stabellini,
Oleksandr Tyshchenko, Josh Poimboeuf, Paul E . McKenney,
Catalin Marinas, Randy Dunlap, Steven Rostedt, Kim Phillips,
Hyeonggon Yoo, Liam R . Howlett, Sebastian Reichel,
Kirill A . Shutemov, Suren Baghdasaryan, Pawan Gupta, Babu Moger,
Jim Mattson, Sandipan Das, Lai Jiangshan, Hans de Goede,
Chatre, Reinette, Daniel Sneddon, Breno Leitao, Nikunj A Dadhania,
Brian Gerst, Sami Tolvanen, Alexander Potapenko, Andrew Morton,
Arnd Bergmann, Eric W . Biederman, Kees Cook, Masami Hiramatsu,
Masahiro Yamada, Ze Gao, Li, Fei1, Conghui, Raj, Ashok,
Jason A . Donenfeld, Mark Rutland, Jacob Pan, Jiapeng Chong,
Jane Malalane, Woodhouse, David, Ostrovsky, Boris,
Arnaldo Carvalho de Melo, Yantengsi, Christophe Leroy,
Sathvika Vasireddy
In-Reply-To: <20230801105236.GB79828@hirez.programming.kicks-ass.net>
> > Resend because the mail system failed to deliver some messages yesterday.
>
> The one from yesterday came in 6 thread groups: 0-25, 26, 27, 28, 29, 30-36,
> while the one from today comes in 2 thread groups: 0-26, 27-36. Which I
> suppose one can count as an improvement :/
Sigh, sorry for the chaos.
>
> Seriously, it should not be hard to send 36 patches in a single thread.
No, but it worked fine before thus I didn't realize there an email
service policy which prevents sending email to too many recipients
in a short period (I have a long CC list in this v9 patch set).
> I see you're trying to send through the regular corporate email
> trainwreck; do you have a linux.intel.com account? Or really anything
> else besides intel.com? You can try sending the series to yourself to
> see if it arrives correctly as a whole before sending it out to the list
> again.
I did try sending to myself before to LKML, and it worked fine. But
now you know why it happened.
As mentioned, I should avoid "the regular corporate email trainwreck"
before doing it again. Working on it...
^ permalink raw reply
* RE: [PATCH RESEND v9 00/36] x86: enable FRED for x86-64
From: Li, Xin3 @ 2023-08-01 17:41 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-edac@vger.kernel.org, linux-hyperv@vger.kernel.org,
kvm@vger.kernel.org, xen-devel@lists.xenproject.org,
Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86@kernel.org, H . Peter Anvin, Lutomirski, Andy,
Oleg Nesterov, Luck, Tony, K . Y . Srinivasan, Haiyang Zhang,
Wei Liu, Cui, Dexuan, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
Christopherson,, Sean, Gross, Jurgen, Stefano Stabellini,
Oleksandr Tyshchenko, Josh Poimboeuf, Paul E . McKenney,
Catalin Marinas, Randy Dunlap, Steven Rostedt, Kim Phillips,
Hyeonggon Yoo, Liam R . Howlett, Sebastian Reichel,
Kirill A . Shutemov, Suren Baghdasaryan, Pawan Gupta, Babu Moger,
Jim Mattson, Sandipan Das, Lai Jiangshan, Hans de Goede,
Chatre, Reinette, Daniel Sneddon, Breno Leitao, Nikunj A Dadhania,
Brian Gerst, Sami Tolvanen, Alexander Potapenko, Andrew Morton,
Arnd Bergmann, Eric W . Biederman, Kees Cook, Masami Hiramatsu,
Masahiro Yamada, Ze Gao, Li, Fei1, Conghui, Raj, Ashok,
Jason A . Donenfeld, Mark Rutland, Jacob Pan, Jiapeng Chong,
Jane Malalane, Woodhouse, David, Ostrovsky, Boris,
Arnaldo Carvalho de Melo, Yantengsi, Christophe Leroy,
Sathvika Vasireddy
In-Reply-To: <20230801130240.GA80967@hirez.programming.kicks-ass.net>
> > I also believe there is a kernel.org service for sending patch series,
> > but i'm not sure I remember the details.
>
> https://b4.docs.kernel.org/en/latest/contributor/send.html
It says:
The kernel.org endpoint can only be used for kernel.org-hosted projects.
If there are no recognized mailing lists in the to/cc headers, then the
submission will be rejected.
If I want to test the email sending service, how could I test it with sending
just to myself? Maybe it allows only sending to the sender.
^ permalink raw reply
* Re: [PATCH V7 net] net: mana: Fix MANA VF unload when hardware is
From: Jesse Brandeburg @ 2023-08-01 18:04 UTC (permalink / raw)
To: Souradeep Chakrabarti, kys, haiyangz, wei.liu, decui, davem,
edumazet, kuba, pabeni, longli, sharmaajay, leon, cai.huoqing,
ssengar, vkuznets, tglx, linux-hyperv, netdev, linux-kernel,
linux-rdma
Cc: schakrabarti, stable
In-Reply-To: <1690892953-25201-1-git-send-email-schakrabarti@linux.microsoft.com>
On 8/1/2023 5:29 AM, Souradeep Chakrabarti wrote:
> When unloading the MANA driver, mana_dealloc_queues() waits for the MANA
> hardware to complete any inflight packets and set the pending send count
> to zero. But if the hardware has failed, mana_dealloc_queues()
> could wait forever.
>
> Fix this by adding a timeout to the wait. Set the timeout to 120 seconds,
tx timeout in stack defaults to 5 seconds, do you not have that on? What
happens when you start getting resets while unloading?
> which is a somewhat arbitrary value that is more than long enough for
> functional hardware to complete any sends.
I'd say 2 or 5 seconds is probably plenty of time to hang up a driver
unload.
>
> Cc: stable@vger.kernel.org
> Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
>
> Signed-off-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
keep s-o-b and other trailers together please, no spaces, it messes up
git and doesn't conform to kernel standards.
> ---
> V6 -> V7:
> * Optimized the while loop for freeing skb.
>
> V5 -> V6:
> * Added pcie_flr to reset the pci after timeout.
> * Fixed the position of changelog.
> * Removed unused variable like cq.
>
> V4 -> V5:
> * Added fixes tag
> * Changed the usleep_range from static to incremental value.
> * Initialized timeout in the begining.
>
> V3 -> V4:
> * Removed the unnecessary braces from mana_dealloc_queues().
>
> V2 -> V3:
> * Removed the unnecessary braces from mana_dealloc_queues().
>
> V1 -> V2:
> * Added net branch
> * Removed the typecasting to (struct mana_context*) of void pointer
> * Repositioned timeout variable in mana_dealloc_queues()
> * Repositioned vf_unload_timeout in mana_context struct, to utilise the
> 6 bytes hole
> ---
> drivers/net/ethernet/microsoft/mana/mana_en.c | 37 +++++++++++++++++--
> 1 file changed, 33 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index a499e460594b..3c5552a176d0 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> @@ -8,6 +8,7 @@
> #include <linux/ethtool.h>
> #include <linux/filter.h>
> #include <linux/mm.h>
> +#include <linux/pci.h>
>
> #include <net/checksum.h>
> #include <net/ip6_checksum.h>
> @@ -2345,9 +2346,12 @@ int mana_attach(struct net_device *ndev)
> static int mana_dealloc_queues(struct net_device *ndev)
> {
> struct mana_port_context *apc = netdev_priv(ndev);
> + unsigned long timeout = jiffies + 120 * HZ;
> struct gdma_dev *gd = apc->ac->gdma_dev;
> struct mana_txq *txq;
> + struct sk_buff *skb;
> int i, err;
> + u32 tsleep;
>
> if (apc->port_is_up)
> return -EINVAL;
> @@ -2363,15 +2367,40 @@ static int mana_dealloc_queues(struct net_device *ndev)
> * to false, but it doesn't matter since mana_start_xmit() drops any
> * new packets due to apc->port_is_up being false.
> *
> - * Drain all the in-flight TX packets
> + * Drain all the in-flight TX packets.
> + * A timeout of 120 seconds for all the queues is used.
> + * This will break the while loop when h/w is not responding.
> + * This value of 120 has been decided here considering max
> + * number of queues.
> */
> +
> for (i = 0; i < apc->num_queues; i++) {
> txq = &apc->tx_qp[i].txq;
> -
> - while (atomic_read(&txq->pending_sends) > 0)
> - usleep_range(1000, 2000);
> + tsleep = 1000;
> + while (atomic_read(&txq->pending_sends) > 0 &&
> + time_before(jiffies, timeout)) {
> + usleep_range(tsleep, tsleep + 1000);
> + tsleep <<= 1;
> + }
> + if (atomic_read(&txq->pending_sends)) {
> + err = pcie_flr(to_pci_dev(gd->gdma_context->dev));
> + if (err) {
> + netdev_err(ndev, "flr failed %d with %d pkts pending in txq %u\n",
> + err, atomic_read(&txq->pending_sends),
> + txq->gdma_txq_id);
> + }
> + break;
> + }
> }
>
> + for (i = 0; i < apc->num_queues; i++) {
> + txq = &apc->tx_qp[i].txq;
> + while (skb = skb_dequeue(&txq->pending_skbs)) {
> + mana_unmap_skb(skb, apc);
> + dev_consume_skb_any(skb);
dev_kfree_skb_any() would be more correct here since this is an error
path and the transmit is presumed dropped, isn't it?
> + }
> + atomic_set(&txq->pending_sends, 0);
> + }
> /* We're 100% sure the queues can no longer be woken up, because
> * we're sure now mana_poll_tx_cq() can't be running.
> */
^ permalink raw reply
* RE: [EXTERNAL] Re: [PATCH V7 net] net: mana: Fix MANA VF unload when hardware is
From: Souradeep Chakrabarti @ 2023-08-01 18:57 UTC (permalink / raw)
To: Jesse Brandeburg, Souradeep Chakrabarti, Haiyang Zhang
Cc: KY Srinivasan, wei.liu@kernel.org, Dexuan Cui,
davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
pabeni@redhat.com, Long Li, Ajay Sharma, leon@kernel.org,
cai.huoqing@linux.dev, ssengar@linux.microsoft.com, vkuznets,
tglx@linutronix.de, linux-hyperv@vger.kernel.org,
netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-rdma@vger.kernel.org, stable@vger.kernel.org
In-Reply-To: <8ccbfab0-e24f-b758-cd11-27b6d8ab1d48@intel.com>
>-----Original Message-----
>From: Jesse Brandeburg <jesse.brandeburg@intel.com>
>Sent: Tuesday, August 1, 2023 11:34 PM
>To: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>; KY Srinivasan
><kys@microsoft.com>; Haiyang Zhang <haiyangz@microsoft.com>;
>wei.liu@kernel.org; Dexuan Cui <decui@microsoft.com>;
>davem@davemloft.net; edumazet@google.com; kuba@kernel.org;
>pabeni@redhat.com; Long Li <longli@microsoft.com>; Ajay Sharma
><sharmaajay@microsoft.com>; leon@kernel.org; cai.huoqing@linux.dev;
>ssengar@linux.microsoft.com; vkuznets <vkuznets@redhat.com>;
>tglx@linutronix.de; linux-hyperv@vger.kernel.org; netdev@vger.kernel.org;
>linux-kernel@vger.kernel.org; linux-rdma@vger.kernel.org
>Cc: Souradeep Chakrabarti <schakrabarti@microsoft.com>;
>stable@vger.kernel.org
>Subject: [EXTERNAL] Re: [PATCH V7 net] net: mana: Fix MANA VF unload when
>hardware is
>
>On 8/1/2023 5:29 AM, Souradeep Chakrabarti wrote:
>> When unloading the MANA driver, mana_dealloc_queues() waits for the
>> MANA hardware to complete any inflight packets and set the pending
>> send count to zero. But if the hardware has failed,
>> mana_dealloc_queues() could wait forever.
>>
>> Fix this by adding a timeout to the wait. Set the timeout to 120
>> seconds,
>
>tx timeout in stack defaults to 5 seconds, do you not have that on? What
>happens when you start getting resets while unloading?
>
>> which is a somewhat arbitrary value that is more than long enough for
>> functional hardware to complete any sends.
>
>I'd say 2 or 5 seconds is probably plenty of time to hang up a driver unload.
>
Thank you for the comment. This was already discussed in V4.
I am just sharing the summary here:
This waiting time is usually much shorter than 120 sec.
The long wait only happens in rare and unexpected NIC HW non-responding cases.
At that point, we don't actually care if the pending packets are sent or not.
But if we free the queues too soon, and the HW is slow for unexpected reasons,
a delayed completion notice will DMA into the freed memory and cause corruption.
That's why we have a longer waiting time.
>>
>> Cc: stable@vger.kernel.org
>> Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure
>> Network Adapter (MANA)")
>>
>> Signed-off-by: Souradeep Chakrabarti
>> <schakrabarti@linux.microsoft.com>
>
>keep s-o-b and other trailers together please, no spaces, it messes up git and
>doesn't conform to kernel standards.
>
>
>> ---
>> V6 -> V7:
>> * Optimized the while loop for freeing skb.
>>
>> V5 -> V6:
>> * Added pcie_flr to reset the pci after timeout.
>> * Fixed the position of changelog.
>> * Removed unused variable like cq.
>>
>> V4 -> V5:
>> * Added fixes tag
>> * Changed the usleep_range from static to incremental value.
>> * Initialized timeout in the begining.
>>
>> V3 -> V4:
>> * Removed the unnecessary braces from mana_dealloc_queues().
>>
>> V2 -> V3:
>> * Removed the unnecessary braces from mana_dealloc_queues().
>>
>> V1 -> V2:
>> * Added net branch
>> * Removed the typecasting to (struct mana_context*) of void pointer
>> * Repositioned timeout variable in mana_dealloc_queues()
>> * Repositioned vf_unload_timeout in mana_context struct, to utilise
>> the
>> 6 bytes hole
>> ---
>> drivers/net/ethernet/microsoft/mana/mana_en.c | 37
>> +++++++++++++++++--
>> 1 file changed, 33 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c
>> b/drivers/net/ethernet/microsoft/mana/mana_en.c
>> index a499e460594b..3c5552a176d0 100644
>> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
>> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
>> @@ -8,6 +8,7 @@
>> #include <linux/ethtool.h>
>> #include <linux/filter.h>
>> #include <linux/mm.h>
>> +#include <linux/pci.h>
>>
>> #include <net/checksum.h>
>> #include <net/ip6_checksum.h>
>> @@ -2345,9 +2346,12 @@ int mana_attach(struct net_device *ndev)
>> static int mana_dealloc_queues(struct net_device *ndev) {
>> struct mana_port_context *apc = netdev_priv(ndev);
>> + unsigned long timeout = jiffies + 120 * HZ;
>> struct gdma_dev *gd = apc->ac->gdma_dev;
>> struct mana_txq *txq;
>> + struct sk_buff *skb;
>> int i, err;
>> + u32 tsleep;
>>
>> if (apc->port_is_up)
>> return -EINVAL;
>> @@ -2363,15 +2367,40 @@ static int mana_dealloc_queues(struct
>net_device *ndev)
>> * to false, but it doesn't matter since mana_start_xmit() drops any
>> * new packets due to apc->port_is_up being false.
>> *
>> - * Drain all the in-flight TX packets
>> + * Drain all the in-flight TX packets.
>> + * A timeout of 120 seconds for all the queues is used.
>> + * This will break the while loop when h/w is not responding.
>> + * This value of 120 has been decided here considering max
>> + * number of queues.
>> */
>> +
>> for (i = 0; i < apc->num_queues; i++) {
>> txq = &apc->tx_qp[i].txq;
>> -
>> - while (atomic_read(&txq->pending_sends) > 0)
>> - usleep_range(1000, 2000);
>> + tsleep = 1000;
>> + while (atomic_read(&txq->pending_sends) > 0 &&
>> + time_before(jiffies, timeout)) {
>> + usleep_range(tsleep, tsleep + 1000);
>> + tsleep <<= 1;
>> + }
>> + if (atomic_read(&txq->pending_sends)) {
>> + err = pcie_flr(to_pci_dev(gd->gdma_context->dev));
>> + if (err) {
>> + netdev_err(ndev, "flr failed %d with %d pkts
>pending in txq %u\n",
>> + err, atomic_read(&txq-
>>pending_sends),
>> + txq->gdma_txq_id);
>> + }
>> + break;
>> + }
>> }
>>
>> + for (i = 0; i < apc->num_queues; i++) {
>> + txq = &apc->tx_qp[i].txq;
>> + while (skb = skb_dequeue(&txq->pending_skbs)) {
>> + mana_unmap_skb(skb, apc);
>> + dev_consume_skb_any(skb);
>
>dev_kfree_skb_any() would be more correct here since this is an error path and
>the transmit is presumed dropped, isn't it?
Yes, dev_kfree_skb_any() will be a better approach in this scenario. Will change it in next version.
>
>
>> + }
>> + atomic_set(&txq->pending_sends, 0);
>> + }
>> /* We're 100% sure the queues can no longer be woken up, because
>> * we're sure now mana_poll_tx_cq() can't be running.
>> */
^ permalink raw reply
* RE: [EXTERNAL] Re: [PATCH V7 net] net: mana: Fix MANA VF unload when hardware is
From: Souradeep Chakrabarti @ 2023-08-01 18:58 UTC (permalink / raw)
To: Simon Horman, Souradeep Chakrabarti
Cc: KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org, Dexuan Cui,
davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
pabeni@redhat.com, Long Li, Ajay Sharma, leon@kernel.org,
cai.huoqing@linux.dev, ssengar@linux.microsoft.com, vkuznets,
tglx@linutronix.de, linux-hyperv@vger.kernel.org,
netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-rdma@vger.kernel.org, stable@vger.kernel.org
In-Reply-To: <ZMklUch+vfZBqfAr@kernel.org>
>-----Original Message-----
>From: Simon Horman <horms@kernel.org>
>Sent: Tuesday, August 1, 2023 9:01 PM
>To: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
>Cc: KY Srinivasan <kys@microsoft.com>; Haiyang Zhang
><haiyangz@microsoft.com>; wei.liu@kernel.org; Dexuan Cui
><decui@microsoft.com>; davem@davemloft.net; edumazet@google.com;
>kuba@kernel.org; pabeni@redhat.com; Long Li <longli@microsoft.com>; Ajay
>Sharma <sharmaajay@microsoft.com>; leon@kernel.org;
>cai.huoqing@linux.dev; ssengar@linux.microsoft.com; vkuznets
><vkuznets@redhat.com>; tglx@linutronix.de; linux-hyperv@vger.kernel.org;
>netdev@vger.kernel.org; linux-kernel@vger.kernel.org; linux-
>rdma@vger.kernel.org; Souradeep Chakrabarti
><schakrabarti@microsoft.com>; stable@vger.kernel.org
>Subject: [EXTERNAL] Re: [PATCH V7 net] net: mana: Fix MANA VF unload when
>hardware is
>
>On Tue, Aug 01, 2023 at 05:29:13AM -0700, Souradeep Chakrabarti wrote:
>
>...
>
>Hi Souradeep,
>
>
>> + for (i = 0; i < apc->num_queues; i++) {
>> + txq = &apc->tx_qp[i].txq;
>> + while (skb = skb_dequeue(&txq->pending_skbs)) {
>
>W=1 builds with both clang-16 and gcc-12 complain that they would like an
>extra set of parentheses around an assignment used as a truth value.
Thanks for letting me know. I will fix it in next version.
>
>> + mana_unmap_skb(skb, apc);
>> + dev_consume_skb_any(skb);
>> + }
>> + atomic_set(&txq->pending_sends, 0);
>> + }
>> /* We're 100% sure the queues can no longer be woken up, because
>> * we're sure now mana_poll_tx_cq() can't be running.
>> */
>> --
>> 2.34.1
>>
>>
^ permalink raw reply
* Re: [PATCH RESEND v9 33/36] KVM: VMX: Add VMX_DO_FRED_EVENT_IRQOFF for IRQ/NMI handling
From: Sean Christopherson @ 2023-08-01 19:01 UTC (permalink / raw)
To: Xin Li
Cc: linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm, xen-devel,
Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H . Peter Anvin, Andy Lutomirski, Oleg Nesterov,
Tony Luck, K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov, Peter Zijlstra,
Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
Josh Poimboeuf, Paul E . McKenney, Catalin Marinas, Randy Dunlap,
Steven Rostedt, Kim Phillips, Hyeonggon Yoo, Liam R . Howlett,
Sebastian Reichel, Kirill A . Shutemov, Suren Baghdasaryan,
Pawan Gupta, Babu Moger, Jim Mattson, Sandipan Das, Lai Jiangshan,
Hans de Goede, Reinette Chatre, Daniel Sneddon, Breno Leitao,
Nikunj A Dadhania, Brian Gerst, Sami Tolvanen,
Alexander Potapenko, Andrew Morton, Arnd Bergmann,
Eric W . Biederman, Kees Cook, Masami Hiramatsu, Masahiro Yamada,
Ze Gao, Fei Li, Conghui, Ashok Raj, Jason A . Donenfeld,
Mark Rutland, Jacob Pan, Jiapeng Chong, Jane Malalane,
David Woodhouse, Boris Ostrovsky, Arnaldo Carvalho de Melo,
Yantengsi, Christophe Leroy, Sathvika Vasireddy
In-Reply-To: <20230801083553.8468-7-xin3.li@intel.com>
On Tue, Aug 01, 2023, Xin Li wrote:
>
> diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
> index 07e927d4d099..5ee6a57b59a5 100644
> --- a/arch/x86/kvm/vmx/vmenter.S
> +++ b/arch/x86/kvm/vmx/vmenter.S
> @@ -2,12 +2,14 @@
> #include <linux/linkage.h>
> #include <asm/asm.h>
> #include <asm/bitsperlong.h>
> +#include <asm/fred.h>
> #include <asm/kvm_vcpu_regs.h>
> #include <asm/nospec-branch.h>
> #include <asm/percpu.h>
> #include <asm/segment.h>
> #include "kvm-asm-offsets.h"
> #include "run_flags.h"
> +#include "../../entry/calling.h"
Rather than do the low level PUSH_REGS and POP_REGS, I vote to have core code
expose a FRED-specific wrapper for invoking external_interrupt(). More below.
>
> #define WORD_SIZE (BITS_PER_LONG / 8)
>
> @@ -31,6 +33,80 @@
> #define VCPU_R15 __VCPU_REGS_R15 * WORD_SIZE
> #endif
>
> +#ifdef CONFIG_X86_FRED
> +.macro VMX_DO_FRED_EVENT_IRQOFF branch_insn branch_target nmi=0
objtool isn't happy.
arch/x86/kvm/vmx/vmenter.o: warning: objtool: vmx_do_fred_interrupt_irqoff+0x6c: return with modified stack frame
arch/x86/kvm/vmx/vmenter.o: warning: objtool: vmx_do_fred_nmi_irqoff+0x37: sibling call from callable instruction with modified stack frame
The "return with modified stack frame" goes away with my suggested changes, but
the sibling call remains for the NMI case due to the JMP instead of a call.
> + /*
> + * Unconditionally create a stack frame, getting the correct RSP on the
> + * stack (for x86-64) would take two instructions anyways, and RBP can
> + * be used to restore RSP to make objtool happy (see below).
> + */
> + push %_ASM_BP
> + mov %_ASM_SP, %_ASM_BP
The frame stuff is worth throwing in a macro, if only to avoid a copy+pasted
comment, which by the by, is wrong. (a) it's ERETS, not IRET. (b) the IRQ does
a vanilla RET, not ERETS. E.g. like so:
.macro VMX_DO_EVENT_FRAME_BEGIN
/*
* Unconditionally create a stack frame, getting the correct RSP on the
* stack (for x86-64) would take two instructions anyways, and RBP can
* be used to restore RSP to make objtool happy (see below).
*/
push %_ASM_BP
mov %_ASM_SP, %_ASM_BP
.endm
.macro VMX_DO_EVENT_FRAME_END
/*
* "Restore" RSP from RBP, even though {E,I}RET has already unwound RSP
* to the correct value *in most cases*. KVM's IRQ handling with FRED
* doesn't do ERETS, and objtool doesn't know the callee will IRET/ERET
* and, without the explicit restore, thinks the stack is getting walloped.
* Using an unwind hint is problematic due to x86-64's dynamic alignment.
*/
mov %_ASM_BP, %_ASM_SP
pop %_ASM_BP
.endm
> +
> + /*
> + * Don't check the FRED stack level, the call stack leading to this
> + * helper is effectively constant and shallow (relatively speaking).
> + *
> + * Emulate the FRED-defined redzone and stack alignment.
> + */
> + sub $(FRED_CONFIG_REDZONE_AMOUNT << 6), %rsp
> + and $FRED_STACK_FRAME_RSP_MASK, %rsp
> +
> + /*
> + * A FRED stack frame has extra 16 bytes of information pushed at the
> + * regular stack top compared to an IDT stack frame.
There is pretty much no chance that anyone remembers the layout of an IDT stack
frame off the top of their head. I.e. saying "FRED has 16 bytes more" isn't all
that useful. It also fails to capture the fact that FRED stuff a hell of a lot
more information in those "common" 48 bytes.
It'll be hard/impossible to capture all of the overload info in a comment, but
showing the actual layout of the frame would be super helpful, e.g. something like
this
/*
* FRED stack frames are always 64 bytes:
*
* ------------------------------
* | Bytes | Usage |
* -----------------------------|
* | 63:56 | Reserved |
* | 55:48 | Event Data |
* | 47:40 | SS + Event Info |
* | 39:32 | RSP |
* | 31:24 | RFLAGS |
* | 23:16 | CS + Aux Info |
* | 15:8 | RIP |
* | 7:0 | Error Code |
* ------------------------------
*/
> + */
> + push $0 /* Reserved by FRED, must be 0 */
> + push $0 /* FRED event data, 0 for NMI and external interrupts */
> +
> + shl $32, %rdi /* FRED event type and vector */
> + .if \nmi
> + bts $FRED_SSX_NMI_BIT, %rdi /* Set the NMI bit */
The spec I have from May 2022 says the NMI bit colocated with CS, not SS. And
the cover letter's suggestion to use a search engine to find the spec ain't
exactly helpful, that just gives me the same "May 2022 Revision 3.0" spec. So
either y'all have a spec that I can't find, or this is wrong.
> + .endif
> + bts $FRED_SSX_64_BIT_MODE_BIT, %rdi /* Set the 64-bit mode */
> + or $__KERNEL_DS, %rdi
> + push %rdi
> + push %rbp
> + pushf
> + mov $__KERNEL_CS, %rax
> + push %rax
> +
> + /*
> + * Unlike the IDT event delivery, FRED _always_ pushes an error code
> + * after pushing the return RIP, thus the CALL instruction CANNOT be
> + * used here to push the return RIP, otherwise there is no chance to
> + * push an error code before invoking the IRQ/NMI handler.
> + *
> + * Use LEA to get the return RIP and push it, then push an error code.
> + */
> + lea 1f(%rip), %rax
This is quite misleading for IRQs. It took me a while to figure out that the only
reason it's functionally ok is that external_interrupt() will do RET, not ERETS,
i.e. the RIP that's pushed here isn't used for IRQs! Expanding the above comment
would be quite helpful, e.g.
*
* Use LEA to get the return RIP and push it, then push an error code.
* Note, only NMI handling does an ERETS to the target! IRQ handling
* doesn't need to unmask NMIs and so simply uses CALL+RET, i.e. the
* RIP pushed here is only truly consumed for NMIs!
> + push %rax
> + push $0 /* FRED error code, 0 for NMI and external interrupts */
> +
> + .if \nmi == 0
> + PUSH_REGS
> + mov %rsp, %rdi
Nit, *if* this stays in KVM, please use %_ASM_ARG1 instead of %rdi. I normally
dislike unnecessary abstraction, but in this case using _ASM_ARG1 makes it clear
(without a comment) that this code is loading a param for a funciton call, *not*
for some FRED magic.
> + .endif
Jumping way back to providing a wrapper for FRED, if we do that, then there's no
need to include calling.h, and the weird wrinkle about the ERET target kinda goes
away too. E.g. provide this in arch/x86/entry/entry_64_fred.S
.section .text, "ax"
/* Note, this is instrumentation safe, and returns via RET, not ERETS! */
#if IS_ENABLED(CONFIG_KVM_INTEL)
SYM_CODE_START(fred_irq_entry_from_kvm)
FRED_ENTER
call external_interrupt
FRED_EXIT
RET
SYM_CODE_END(fred_irq_entry_from_kvm)
EXPORT_SYMBOL_GPL(fred_irq_entry_from_kvm);
#endif
and then the KVM side for this particular chunk is more simply:
lea 1f(%rip), %rax
push %rax
push $0 /* FRED error code, 0 for NMI and external interrupts */
\branch_insn \branch_target
1:
VMX_DO_EVENT_FRAME_END
RET
Alternatively, the whole thing could be shoved into arch/x86/entry/entry_64_fred.S,
but at a glance I don't think that would be a net positive due to the need to handle
IRQs vs. NMIs.
> + \branch_insn \branch_target
> +
> + .if \nmi == 0
> + POP_REGS
> + .endif
> +
> +1:
> + /*
> + * "Restore" RSP from RBP, even though IRET has already unwound RSP to
As mentioned above, this is incorrect on two fronts.
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 0ecf4be2c6af..4e90c69a92bf 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6890,6 +6890,14 @@ static void vmx_apicv_post_state_restore(struct kvm_vcpu *vcpu)
> memset(vmx->pi_desc.pir, 0, sizeof(vmx->pi_desc.pir));
> }
>
> +#ifdef CONFIG_X86_FRED
> +void vmx_do_fred_interrupt_irqoff(unsigned int vector);
> +void vmx_do_fred_nmi_irqoff(unsigned int vector);
> +#else
> +#define vmx_do_fred_interrupt_irqoff(x) BUG()
> +#define vmx_do_fred_nmi_irqoff(x) BUG()
> +#endif
My slight preference is to open code the BUG() as a ud2 in assembly, purely to
avoid more #ifdefs.
> +
> void vmx_do_interrupt_irqoff(unsigned long entry);
> void vmx_do_nmi_irqoff(void);
>
> @@ -6932,14 +6940,16 @@ static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu)
> {
> u32 intr_info = vmx_get_intr_info(vcpu);
> unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK;
> - gate_desc *desc = (gate_desc *)host_idt_base + vector;
>
> if (KVM_BUG(!is_external_intr(intr_info), vcpu->kvm,
> "unexpected VM-Exit interrupt info: 0x%x", intr_info))
> return;
>
> kvm_before_interrupt(vcpu, KVM_HANDLING_IRQ);
> - vmx_do_interrupt_irqoff(gate_offset(desc));
> + if (cpu_feature_enabled(X86_FEATURE_FRED))
> + vmx_do_fred_interrupt_irqoff(vector); /* Event type is 0 */
I strongly prefer to use code to document what's going on. E.g. the tail comment
just left me wondering, what event type is 0? Whereas this makes it quite clear
that KVM is signaling a hardware interrupt. The fact that it's a nop as far as
code generation goes is irrelevant.
vmx_do_fred_interrupt_irqoff((EVENT_TYPE_HWINT << 16) | vector);
> + else
> + vmx_do_interrupt_irqoff(gate_offset((gate_desc *)host_idt_base + vector));
> kvm_after_interrupt(vcpu);
>
> vcpu->arch.at_instruction_boundary = true;
Here's a diff for (hopefully) everything I've suggested above.
---
arch/x86/entry/entry_64_fred.S | 17 ++++++-
arch/x86/kernel/traps.c | 5 --
arch/x86/kvm/vmx/vmenter.S | 84 +++++++++++++++-------------------
arch/x86/kvm/vmx/vmx.c | 7 +--
4 files changed, 55 insertions(+), 58 deletions(-)
diff --git a/arch/x86/entry/entry_64_fred.S b/arch/x86/entry/entry_64_fred.S
index 12063267d2ac..a973c0bd29f6 100644
--- a/arch/x86/entry/entry_64_fred.S
+++ b/arch/x86/entry/entry_64_fred.S
@@ -10,7 +10,6 @@
#include "calling.h"
.code64
- .section ".noinstr.text", "ax"
.macro FRED_ENTER
UNWIND_HINT_END_OF_STACK
@@ -24,6 +23,22 @@
POP_REGS
.endm
+ .section .text, "ax"
+
+/* Note, this is instrumentation safe, and returns via RET, not ERETS! */
+#if IS_ENABLED(CONFIG_KVM_INTEL)
+SYM_CODE_START(fred_irq_entry_from_kvm)
+ FRED_ENTER
+ call external_interrupt
+ FRED_EXIT
+ RET
+SYM_CODE_END(fred_irq_entry_from_kvm)
+EXPORT_SYMBOL_GPL(fred_irq_entry_from_kvm);
+#endif
+
+ .section ".noinstr.text", "ax"
+
+
/*
* The new RIP value that FRED event delivery establishes is
* IA32_FRED_CONFIG & ~FFFH for events that occur in ring 3.
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 21eeba7b188f..cbcb83c71dab 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -1566,11 +1566,6 @@ int external_interrupt(struct pt_regs *regs)
return 0;
}
-#if IS_ENABLED(CONFIG_KVM_INTEL)
-/* For KVM VMX to handle IRQs in IRQ induced VM exits. */
-EXPORT_SYMBOL_GPL(external_interrupt);
-#endif
-
#endif /* CONFIG_X86_64 */
void __init trap_init(void)
diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
index 79a4c91d9434..e25df565c3f8 100644
--- a/arch/x86/kvm/vmx/vmenter.S
+++ b/arch/x86/kvm/vmx/vmenter.S
@@ -9,7 +9,6 @@
#include <asm/segment.h>
#include "kvm-asm-offsets.h"
#include "run_flags.h"
-#include "../../entry/calling.h"
#define WORD_SIZE (BITS_PER_LONG / 8)
@@ -33,15 +32,31 @@
#define VCPU_R15 __VCPU_REGS_R15 * WORD_SIZE
#endif
+.macro VMX_DO_EVENT_FRAME_BEGIN
+ /*
+ * Unconditionally create a stack frame, getting the correct RSP on the
+ * stack (for x86-64) would take two instructions anyways, and RBP can
+ * be used to restore RSP to make objtool happy (see below).
+ */
+ push %_ASM_BP
+ mov %_ASM_SP, %_ASM_BP
+.endm
+
+.macro VMX_DO_EVENT_FRAME_END
+ /*
+ * "Restore" RSP from RBP, even though {E,I}RET has already unwound RSP
+ * to the correct value *in most cases*. KVM's IRQ handling with FRED
+ * doesn't do ERETS, and objtool doesn't know the callee will IRET/ERET
+ * and, without the explicit restore, thinks the stack is getting walloped.
+ * Using an unwind hint is problematic due to x86-64's dynamic alignment.
+ */
+ mov %_ASM_BP, %_ASM_SP
+ pop %_ASM_BP
+.endm
+
#ifdef CONFIG_X86_FRED
.macro VMX_DO_FRED_EVENT_IRQOFF branch_insn branch_target nmi=0
- /*
- * Unconditionally create a stack frame, getting the correct RSP on the
- * stack (for x86-64) would take two instructions anyways, and RBP can
- * be used to restore RSP to make objtool happy (see below).
- */
- push %_ASM_BP
- mov %_ASM_SP, %_ASM_BP
+ VMX_DO_EVENT_FRAME_BEGIN
/*
* Don't check the FRED stack level, the call stack leading to this
@@ -78,43 +93,23 @@
* push an error code before invoking the IRQ/NMI handler.
*
* Use LEA to get the return RIP and push it, then push an error code.
+ * Note, only NMI handling does an ERETS to the target! IRQ handling
+ * doesn't need to unmask NMIs and so simply uses CALL+RET, i.e. the
+ * RIP pushed here is only truly consumed for NMIs!
*/
lea 1f(%rip), %rax
push %rax
push $0 /* FRED error code, 0 for NMI and external interrupts */
- .if \nmi == 0
- PUSH_REGS
- mov %rsp, %rdi
- .endif
-
\branch_insn \branch_target
-
- .if \nmi == 0
- POP_REGS
- .endif
-
1:
- /*
- * "Restore" RSP from RBP, even though IRET has already unwound RSP to
- * the correct value. objtool doesn't know the callee will IRET and,
- * without the explicit restore, thinks the stack is getting walloped.
- * Using an unwind hint is problematic due to x86-64's dynamic alignment.
- */
- mov %_ASM_BP, %_ASM_SP
- pop %_ASM_BP
+ VMX_DO_EVENT_FRAME_END
RET
.endm
#endif
.macro VMX_DO_EVENT_IRQOFF call_insn call_target
- /*
- * Unconditionally create a stack frame, getting the correct RSP on the
- * stack (for x86-64) would take two instructions anyways, and RBP can
- * be used to restore RSP to make objtool happy (see below).
- */
- push %_ASM_BP
- mov %_ASM_SP, %_ASM_BP
+ VMX_DO_EVENT_FRAME_BEGIN
#ifdef CONFIG_X86_64
/*
@@ -129,14 +124,7 @@
push $__KERNEL_CS
\call_insn \call_target
- /*
- * "Restore" RSP from RBP, even though IRET has already unwound RSP to
- * the correct value. objtool doesn't know the callee will IRET and,
- * without the explicit restore, thinks the stack is getting walloped.
- * Using an unwind hint is problematic due to x86-64's dynamic alignment.
- */
- mov %_ASM_BP, %_ASM_SP
- pop %_ASM_BP
+ VMX_DO_EVENT_FRAME_END
RET
.endm
@@ -375,11 +363,13 @@ SYM_INNER_LABEL_ALIGN(vmx_vmexit, SYM_L_GLOBAL)
SYM_FUNC_END(__vmx_vcpu_run)
-#ifdef CONFIG_X86_FRED
SYM_FUNC_START(vmx_do_fred_nmi_irqoff)
+#ifdef CONFIG_X86_FRED
VMX_DO_FRED_EVENT_IRQOFF jmp fred_entrypoint_kernel nmi=1
+#else
+ ud2
+#endif
SYM_FUNC_END(vmx_do_fred_nmi_irqoff)
-#endif
SYM_FUNC_START(vmx_do_nmi_irqoff)
VMX_DO_EVENT_IRQOFF call asm_exc_nmi_kvm_vmx
@@ -438,11 +428,13 @@ SYM_FUNC_END(vmread_error_trampoline)
#endif
.section .text, "ax"
-#ifdef CONFIG_X86_FRED
SYM_FUNC_START(vmx_do_fred_interrupt_irqoff)
- VMX_DO_FRED_EVENT_IRQOFF call external_interrupt
+#ifdef CONFIG_X86_FRED
+ VMX_DO_FRED_EVENT_IRQOFF call fred_irq_entry_from_kvm
+#else
+ ud2
+#endif
SYM_FUNC_END(vmx_do_fred_interrupt_irqoff)
-#endif
SYM_FUNC_START(vmx_do_interrupt_irqoff)
VMX_DO_EVENT_IRQOFF CALL_NOSPEC _ASM_ARG1
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index bf757f5071e4..cb4675dd87df 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6919,13 +6919,8 @@ static void vmx_apicv_post_state_restore(struct kvm_vcpu *vcpu)
memset(vmx->pi_desc.pir, 0, sizeof(vmx->pi_desc.pir));
}
-#ifdef CONFIG_X86_FRED
void vmx_do_fred_interrupt_irqoff(unsigned int vector);
void vmx_do_fred_nmi_irqoff(unsigned int vector);
-#else
-#define vmx_do_fred_interrupt_irqoff(x) BUG()
-#define vmx_do_fred_nmi_irqoff(x) BUG()
-#endif
void vmx_do_interrupt_irqoff(unsigned long entry);
void vmx_do_nmi_irqoff(void);
@@ -6976,7 +6971,7 @@ static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu)
kvm_before_interrupt(vcpu, KVM_HANDLING_IRQ);
if (cpu_feature_enabled(X86_FEATURE_FRED))
- vmx_do_fred_interrupt_irqoff(vector); /* Event type is 0 */
+ vmx_do_fred_interrupt_irqoff((EVENT_TYPE_HWINT << 16) | vector);
else
vmx_do_interrupt_irqoff(gate_offset((gate_desc *)host_idt_base + vector));
kvm_after_interrupt(vcpu);
base-commit: 8961078ffe509a97ec7803b17912e57c47b93fa2
--
^ permalink raw reply related
* RE: [PATCH v4 1/1] RDMA/mana_ib: Add EQ interrupt support to mana ib driver.
From: Long Li @ 2023-08-01 19:06 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Wei Hu, netdev@vger.kernel.org, linux-hyperv@vger.kernel.org,
linux-rdma@vger.kernel.org, Ajay Sharma, leon@kernel.org,
KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org, Dexuan Cui,
davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
pabeni@redhat.com, vkuznets, ssengar@linux.microsoft.com,
shradhagupta@linux.microsoft.com
In-Reply-To: <ZMQLW4elDj0vV1ld@ziepe.ca>
> Subject: Re: [PATCH v4 1/1] RDMA/mana_ib: Add EQ interrupt support to mana ib
> driver.
>
> On Fri, Jul 28, 2023 at 06:22:53PM +0000, Long Li wrote:
> > > Subject: Re: [PATCH v4 1/1] RDMA/mana_ib: Add EQ interrupt support
> > > to mana ib driver.
> > >
> > > On Fri, Jul 28, 2023 at 05:51:46PM +0000, Long Li wrote:
> > > > > Subject: Re: [PATCH v4 1/1] RDMA/mana_ib: Add EQ interrupt
> > > > > support to mana ib driver.
> > > > >
> > > > > On Fri, Jul 28, 2023 at 05:07:49PM +0000, Wei Hu wrote:
> > > > > > Add EQ interrupt support for mana ib driver. Allocate EQs per
> > > > > > ucontext to receive interrupt. Attach EQ when CQ is created.
> > > > > > Call CQ interrupt handler when completion interrupt happens.
> > > > > > EQs are destroyed when ucontext is deallocated.
> > > > >
> > > > > It seems strange that interrupts would be somehow linked to a ucontext?
> > > > > interrupts are highly limited, you can DOS the entire system if
> > > > > someone abuses this.
> > > > >
> > > > > Generally I expect a properly functioning driver to use one
> > > > > interrupt per CPU
> > > core.
> > > >
> > > > Yes, MANA uses one interrupt per CPU. One interrupt is shared
> > > > among multiple EQs.
> > >
> > > So you have another multiplexing layer between the interrupt and the
> > > EQ? That is alot of multiplexing layers..
> > >
> > > > > You should tie the CQ to a shared EQ belong to the core that the
> > > > > CQ wants to have affinity to.
> > > >
> > > > The reason for using a separate EQ for a ucontext, is for
> > > > preventing DOS. If we use a shared EQ, a single ucontext can storm
> > > > this shared EQ affecting other users.
> > >
> > > With a proper design it should not be possible. The CQ adds an entry
> > > to the EQ and that should be rate limited by the ability of
> > > userspace to schedule to re-arm the CQ.
> >
> > I think DPDK user space can sometimes storm the EQ by arming the CQ
> > from user-mode.
>
> Maybe maliciously you can do a blind re-arm, but nothing sane should do that.
Yes, we don't expect a sane user would do that. But in a containerized cloud VM, we can't trust any user. The hardware/driver is designed to isolate the damage from those bad behaviors to their own environment.
>
> > With a malicious DPDK user, this code can be abused to arm the CQ at
> > extremely high rate.
>
> Again, the rate of CQ re-arm is limited by the ability of userspace to schedule, I'm
> reluctant to consider that a DOS vector. Doesn't your HW have EQ overflow
> recovery?
The HW supports detecting and recovery of EQ overflow, but it is on the slow path. A bad user can still affect other users if they use the same EQ and get into recovery mode from time to time.
>
> Frankly, stacking more layers of IRQ multiplexing doesn't seem like it should solve
> any problems, you are just shifting where the DOS can occure. Allowing userspace
> to create EQs is its own DOS direction, either you exhaust and DOS the number of
> EQs or you DOS the multiplexing layer between the interrupt and the EQ.
The hardware is designed to support a very large number EQs. In practice, this hardware limit is unlikely to be reached before other resources are running out.
The driver interrupt code limits the CPU processing time of each EQ by reading a small batch of EQEs in this interrupt. It guarantees all the EQs are checked on this CPU, and limits the interrupt processing time for any given EQ. In this way, a bad EQ (which is stormed by a bad user doing unreasonable re-arming on the CQ) can't storm other EQs on this CPU.
Thanks,
Long
^ permalink raw reply
* Re: [PATCH RESEND v9 33/36] KVM: VMX: Add VMX_DO_FRED_EVENT_IRQOFF for IRQ/NMI handling
From: Peter Zijlstra @ 2023-08-01 19:37 UTC (permalink / raw)
To: Sean Christopherson
Cc: Xin Li, linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm,
xen-devel, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H . Peter Anvin,
Andy Lutomirski, Oleg Nesterov, Tony Luck, K . Y . Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Paolo Bonzini, Wanpeng Li,
Vitaly Kuznetsov, Juergen Gross, Stefano Stabellini,
Oleksandr Tyshchenko, Josh Poimboeuf, Paul E . McKenney,
Catalin Marinas, Randy Dunlap, Steven Rostedt, Kim Phillips,
Hyeonggon Yoo, Liam R . Howlett, Sebastian Reichel,
Kirill A . Shutemov, Suren Baghdasaryan, Pawan Gupta, Babu Moger,
Jim Mattson, Sandipan Das, Lai Jiangshan, Hans de Goede,
Reinette Chatre, Daniel Sneddon, Breno Leitao, Nikunj A Dadhania,
Brian Gerst, Sami Tolvanen, Alexander Potapenko, Andrew Morton,
Arnd Bergmann, Eric W . Biederman, Kees Cook, Masami Hiramatsu,
Masahiro Yamada, Ze Gao, Fei Li, Conghui, Ashok Raj,
Jason A . Donenfeld, Mark Rutland, Jacob Pan, Jiapeng Chong,
Jane Malalane, David Woodhouse, Boris Ostrovsky,
Arnaldo Carvalho de Melo, Yantengsi, Christophe Leroy,
Sathvika Vasireddy
In-Reply-To: <ZMlWe5TgS6HM98Mg@google.com>
On Tue, Aug 01, 2023 at 07:01:15PM +0000, Sean Christopherson wrote:
> The spec I have from May 2022 says the NMI bit colocated with CS, not SS. And
> the cover letter's suggestion to use a search engine to find the spec ain't
> exactly helpful, that just gives me the same "May 2022 Revision 3.0" spec. So
> either y'all have a spec that I can't find, or this is wrong.
https://intel.com/sdm
is a useful shorthand I've recently been told about. On that page is
also "Flexible Return and Event Delivery Specification", when clicked it
will gift you a FRED v5.0 PDF.
^ permalink raw reply
* Re: [PATCH RESEND v9 33/36] KVM: VMX: Add VMX_DO_FRED_EVENT_IRQOFF for IRQ/NMI handling
From: Sean Christopherson @ 2023-08-01 19:49 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Xin Li, linux-doc, linux-kernel, linux-edac, linux-hyperv, kvm,
xen-devel, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H . Peter Anvin,
Andy Lutomirski, Oleg Nesterov, Tony Luck, K . Y . Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Paolo Bonzini, Wanpeng Li,
Vitaly Kuznetsov, Juergen Gross, Stefano Stabellini,
Oleksandr Tyshchenko, Josh Poimboeuf, Paul E . McKenney,
Catalin Marinas, Randy Dunlap, Steven Rostedt, Kim Phillips,
Hyeonggon Yoo, Liam R . Howlett, Sebastian Reichel,
Kirill A . Shutemov, Suren Baghdasaryan, Pawan Gupta, Babu Moger,
Jim Mattson, Sandipan Das, Lai Jiangshan, Hans de Goede,
Reinette Chatre, Daniel Sneddon, Breno Leitao, Nikunj A Dadhania,
Brian Gerst, Sami Tolvanen, Alexander Potapenko, Andrew Morton,
Arnd Bergmann, Eric W . Biederman, Kees Cook, Masami Hiramatsu,
Masahiro Yamada, Ze Gao, Fei Li, Conghui, Ashok Raj,
Jason A . Donenfeld, Mark Rutland, Jacob Pan, Jiapeng Chong,
Jane Malalane, David Woodhouse, Boris Ostrovsky,
Arnaldo Carvalho de Melo, Yantengsi, Christophe Leroy,
Sathvika Vasireddy
In-Reply-To: <20230801193750.GA119080@hirez.programming.kicks-ass.net>
On Tue, Aug 01, 2023, Peter Zijlstra wrote:
> On Tue, Aug 01, 2023 at 07:01:15PM +0000, Sean Christopherson wrote:
> > The spec I have from May 2022 says the NMI bit colocated with CS, not SS. And
> > the cover letter's suggestion to use a search engine to find the spec ain't
> > exactly helpful, that just gives me the same "May 2022 Revision 3.0" spec. So
> > either y'all have a spec that I can't find, or this is wrong.
>
> https://intel.com/sdm
>
> is a useful shorthand I've recently been told about.
Hallelujah!
> On that page is also "Flexible Return and Event Delivery Specification", when
> clicked it will gift you a FRED v5.0 PDF.
Worked for me, too. Thanks!
^ permalink raw reply
* Re: [EXTERNAL] Re: [PATCH v3] x86/hyperv: add noop functions to x86_init mpparse functions
From: Dave Hansen @ 2023-08-01 21:52 UTC (permalink / raw)
To: Saurabh Singh Sengar, Wei Liu, Saurabh Sengar,
dave.hansen@linux.intel.com
Cc: KY Srinivasan, Haiyang Zhang, Dexuan Cui, tglx@linutronix.de,
mingo@redhat.com, bp@alien8.de, x86@kernel.org,
Michael Kelley (LINUX), linux-kernel@vger.kernel.org,
linux-hyperv@vger.kernel.org, hpa@zytor.com
In-Reply-To: <PUZP153MB06357202BB90D434C909DE8CBE3FA@PUZP153MB0635.APCP153.PROD.OUTLOOK.COM>
On 7/21/23 05:58, Saurabh Singh Sengar wrote:
>> On Fri, Jun 23, 2023 at 09:28:08AM -0700, Saurabh Sengar wrote:
>>> Hyper-V can run VMs at different privilege "levels" known as Virtual
>>> Trust Levels (VTL). Sometimes, it chooses to run two different VMs at
>>> different levels but they share some of their address space. In such
>>> setups VTL2 (higher level VM) has visibility of all of the
>>> VTL0 (level 0) memory space.
>>>
>>> When the CONFIG_X86_MPPARSE is enabled for VTL2, the VTL2 kernel
>>> performs a search within the low memory to locate MP tables. However,
>>> in systems where VTL0 manages the low memory and may contain valid
>>> tables, this scanning can result in incorrect MP table information
>>> being provided to the VTL2 kernel, mistakenly considering VTL0's MP
>>> table as its own
>>>
>>> Add noop functions to avoid MP parse scan by VTL2.
>>>
>>> Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
>> Hi Dave, are you happy with the commit message?
> HI Dave,
>
> If there is no concern, can I get your ack
Looks sane:
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Although, if you muck with this any more having actual facts on what the
"incorrect MP table information" causes would be nice too.
^ permalink raw reply
* RE: [patch v2 21/38] x86/cpu: Provide cpu_init/parse_topology()
From: Thomas Gleixner @ 2023-08-01 22:25 UTC (permalink / raw)
To: Michael Kelley (LINUX), Peter Zijlstra
Cc: LKML, x86@kernel.org, Tom Lendacky, Andrew Cooper,
Arjan van de Ven, James E.J. Bottomley, Dick Kennedy, James Smart,
Martin K. Petersen, linux-scsi@vger.kernel.org, Guenter Roeck,
linux-hwmon@vger.kernel.org, Jean Delvare, Huang Rui,
Juergen Gross, Steve Wahl, Mike Travis, Dimitri Sivanich,
Russ Anderson, linux-hyperv, Linus Torvalds, Greg Kroah-Hartman
In-Reply-To: <873513n31m.ffs@tglx>
Michael!
On Tue, Aug 01 2023 at 00:12, Thomas Gleixner wrote:
> On Mon, Jul 31 2023 at 21:27, Michael Kelley wrote:
> Clearly the hyper-v BIOS people put a lot of thoughts into this:
>
>> x2APIC features / processor topology (0xb):
>> extended APIC ID = 0
>> --- level 0 ---
>> level number = 0x0 (0)
>> level type = thread (1)
>> bit width of level = 0x1 (1)
>> number of logical processors at level = 0x2 (2)
>> --- level 1 ---
>> level number = 0x1 (1)
>> level type = core (2)
>> bit width of level = 0x6 (6)
>> number of logical processors at level = 0x40 (64)
>
> FAIL: ^^^^^
>
> While that field is not meant for topology evaluation it is at least
> expected to tell the actual number of logical processors at that level
> which are actually available.
>
> The CPUID APIC ID aka initial_apicid clearly tells that the topology has
> 36 logical CPUs in package 0 and 36 in package 1 according to your
> table.
>
> On real hardware this looks like this:
>
> --- level 1 ---
> level number = 0x1 (1)
> level type = core (2)
> bit width of level = 0x6 (6)
> number of logical processors at level = 0x38 (56)
>
> Which corresponds to reality and is consistent. But sure, consistency is
> overrated.
So I looked really hard to find some hint how to detect this situation
on the boot CPU, which allows us to mitigate it, but there is none at
all.
So we are caught between a rock and a hard place, which provides us two
mutually exclusive options to chose from:
1) Have a sane topology evaluation mechanism which solves the known
problems of hybrid systems, wrong sizing estimates and other
unpleasantries.
2) Support the Hyper-V BIOS trainwreck forever.
Unsurprisingly #2 is not really an option as #1 is a crucial issue for
the kernel and we need it resolved urgently as of yesterday.
So while I'm definitely a strong supporter of no-regression policy, I
have to make an argument here why this particular issue is _not_
covered:
1) Hyper-V BIOS/firmware violates the firmware specification and
requirements which are clearly spelled out in the SDM.
2) This violatation is reported on every boot with one promiment
message per brought up AP where the initial APIC ID as provided by
CPUID leaf 0xB deviates from the APIC ID read from "hardware", which is
also provided by MADT starting with CPU 36 in the provided example:
"[FIRMWARE BUG] CPU36: APIC id mismatch. Firmware: 40 APIC: 24"
repeating itself up to CPU71 with the relevant diverging APIC IDs.
At least that's what the upstream kernel produces according to
validate_apic_and_package_id() in such an situation.
3) This is known for years and the Hyper-V Linux team tried to get this
resolved, but obviously their arguments fell on deaf ears.
IOW, the firmware BUG message has been ignored willfully for years
due to "works for me, why should I care?" attitude.
Seriously, kernel development cannot be held hostage forever by the
wilful ignorance of a BIOS team, which refuses to adhere to
specifications and defines their own world order.
The x86 maintainer team is chosing the lesser of two evils and lets
those who created the problem and refused to resolve it deal with the
outcome.
Just to clarify. This is not preventing affected guests from booting.
The worst consequence is a slight performance regression because the
firmware provided topology information is not matching reality and
therefore the scheduler placement vs. L3 affinity sucks. That's clearly
not a kernel problem.
I'm happy to aid accelerating this thought process by elevating the
existing pr_err(FW_BUG....) to a solid WARN_ON_ONCE(). See below.
Thanks,
tglx
---
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1688,7 +1688,7 @@ static void validate_apic_and_package_id
apicid = apic->cpu_present_to_apicid(cpu);
- if (apicid != c->topo.apicid) {
+ if (WARN_ON_ONCE(apicid != c->topo.apicid)) {
pr_err(FW_BUG "CPU%u: APIC id mismatch. Firmware: %x APIC: %x\n",
cpu, apicid, c->topo.initial_apicid);
}
^ permalink raw reply
* Re: [patch v2 21/38] x86/cpu: Provide cpu_init/parse_topology()
From: Andrew Cooper @ 2023-08-01 22:35 UTC (permalink / raw)
To: Thomas Gleixner, Michael Kelley (LINUX), Peter Zijlstra
Cc: LKML, x86@kernel.org, Tom Lendacky, Arjan van de Ven,
James E.J. Bottomley, Dick Kennedy, James Smart,
Martin K. Petersen, linux-scsi@vger.kernel.org, Guenter Roeck,
linux-hwmon@vger.kernel.org, Jean Delvare, Huang Rui,
Juergen Gross, Steve Wahl, Mike Travis, Dimitri Sivanich,
Russ Anderson, linux-hyperv, Linus Torvalds, Greg Kroah-Hartman
In-Reply-To: <87r0omjt8c.ffs@tglx>
On 01/08/2023 11:25 pm, Thomas Gleixner wrote:
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -1688,7 +1688,7 @@ static void validate_apic_and_package_id
>
> apicid = apic->cpu_present_to_apicid(cpu);
>
> - if (apicid != c->topo.apicid) {
> + if (WARN_ON_ONCE(apicid != c->topo.apicid)) {
> pr_err(FW_BUG "CPU%u: APIC id mismatch. Firmware: %x APIC: %x\n",
While you're fixing this, care to remove the chaotic-evil path of mixing
%u and %x with no 0x prefix?
~Andrew
^ permalink raw reply
* RE: [PATCH RESEND v9 33/36] KVM: VMX: Add VMX_DO_FRED_EVENT_IRQOFF for IRQ/NMI handling
From: Li, Xin3 @ 2023-08-01 23:18 UTC (permalink / raw)
To: Christopherson,, Sean
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-edac@vger.kernel.org, linux-hyperv@vger.kernel.org,
kvm@vger.kernel.org, xen-devel@lists.xenproject.org,
Jonathan Corbet, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86@kernel.org, H . Peter Anvin, Lutomirski, Andy,
Oleg Nesterov, Luck, Tony, K . Y . Srinivasan, Haiyang Zhang,
Wei Liu, Cui, Dexuan, Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov,
Peter Zijlstra, Gross, Jurgen, Stefano Stabellini,
Oleksandr Tyshchenko, Josh Poimboeuf, Paul E . McKenney,
Catalin Marinas, Randy Dunlap, Steven Rostedt, Kim Phillips,
Hyeonggon Yoo, Liam R . Howlett, Sebastian Reichel,
Kirill A . Shutemov, Suren Baghdasaryan, Pawan Gupta, Babu Moger,
Jim Mattson, Sandipan Das, Lai Jiangshan, Hans de Goede,
Chatre, Reinette, Daniel Sneddon, Breno Leitao, Nikunj A Dadhania,
Brian Gerst, Sami Tolvanen, Alexander Potapenko, Andrew Morton,
Arnd Bergmann, Eric W . Biederman, Kees Cook, Masami Hiramatsu,
Masahiro Yamada, Ze Gao, Li, Fei1, Conghui, Raj, Ashok,
Jason A . Donenfeld, Mark Rutland, Jacob Pan, Jiapeng Chong,
Jane Malalane, Woodhouse, David, Ostrovsky, Boris,
Arnaldo Carvalho de Melo, Yantengsi, Christophe Leroy,
Sathvika Vasireddy
In-Reply-To: <ZMlWe5TgS6HM98Mg@google.com>
> > +#include "../../entry/calling.h"
>
> Rather than do the low level PUSH_REGS and POP_REGS, I vote to have core code
> expose a FRED-specific wrapper for invoking external_interrupt(). More below.
Nice idea!
> > + /*
> > + * A FRED stack frame has extra 16 bytes of information pushed at the
> > + * regular stack top compared to an IDT stack frame.
>
> There is pretty much no chance that anyone remembers the layout of an IDT stack
> frame off the top of their head. I.e. saying "FRED has 16 bytes more" isn't all
> that useful. It also fails to capture the fact that FRED stuff a hell of a lot
> more information in those "common" 48 bytes.
>
> It'll be hard/impossible to capture all of the overload info in a comment, but
> showing the actual layout of the frame would be super helpful, e.g. something
> like
> this
>
> /*
> * FRED stack frames are always 64 bytes:
> *
> * ------------------------------
> * | Bytes | Usage |
> * -----------------------------|
> * | 63:56 | Reserved |
> * | 55:48 | Event Data |
> * | 47:40 | SS + Event Info |
> * | 39:32 | RSP |
> * | 31:24 | RFLAGS |
> * | 23:16 | CS + Aux Info |
> * | 15:8 | RIP |
> * | 7:0 | Error Code |
> * ------------------------------
> */
> *
> * Use LEA to get the return RIP and push it, then push an error code.
> * Note, only NMI handling does an ERETS to the target! IRQ handling
> * doesn't need to unmask NMIs and so simply uses CALL+RET, i.e. the
> * RIP pushed here is only truly consumed for NMIs!
I take these as ASM code does need more love, i.e., nice comments :/
Otherwise only more confusion.
>
> Jumping way back to providing a wrapper for FRED, if we do that, then there's no
> need to include calling.h, and the weird wrinkle about the ERET target kinda goes
> away too. E.g. provide this in arch/x86/entry/entry_64_fred.S
>
> .section .text, "ax"
>
> /* Note, this is instrumentation safe, and returns via RET, not ERETS! */
> #if IS_ENABLED(CONFIG_KVM_INTEL)
> SYM_CODE_START(fred_irq_entry_from_kvm)
> FRED_ENTER
> call external_interrupt
> FRED_EXIT
> RET
> SYM_CODE_END(fred_irq_entry_from_kvm)
> EXPORT_SYMBOL_GPL(fred_irq_entry_from_kvm);
> #endif
>
> and then the KVM side for this particular chunk is more simply:
>
> lea 1f(%rip), %rax
> push %rax
> push $0 /* FRED error code, 0 for NMI and external
> interrupts */
>
> \branch_insn \branch_target
> 1:
> VMX_DO_EVENT_FRAME_END
> RET
>
>
> Alternatively, the whole thing could be shoved into
> arch/x86/entry/entry_64_fred.S,
> but at a glance I don't think that would be a net positive due to the need to
> handle
> IRQs vs. NMIs.
>
> > + \branch_insn \branch_target
> > +
> > + .if \nmi == 0
> > + POP_REGS
> > + .endif
> > +
> > +1:
> > + /*
> > + * "Restore" RSP from RBP, even though IRET has already unwound RSP to
>
> As mentioned above, this is incorrect on two fronts.
>
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 0ecf4be2c6af..4e90c69a92bf 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -6890,6 +6890,14 @@ static void vmx_apicv_post_state_restore(struct
> kvm_vcpu *vcpu)
> > memset(vmx->pi_desc.pir, 0, sizeof(vmx->pi_desc.pir));
> > }
> >
> > +#ifdef CONFIG_X86_FRED
> > +void vmx_do_fred_interrupt_irqoff(unsigned int vector);
> > +void vmx_do_fred_nmi_irqoff(unsigned int vector);
> > +#else
> > +#define vmx_do_fred_interrupt_irqoff(x) BUG()
> > +#define vmx_do_fred_nmi_irqoff(x) BUG()
> > +#endif
>
> My slight preference is to open code the BUG() as a ud2 in assembly, purely to
> avoid more #ifdefs.
>
> > +
> > void vmx_do_interrupt_irqoff(unsigned long entry);
> > void vmx_do_nmi_irqoff(void);
> >
> > @@ -6932,14 +6940,16 @@ static void handle_external_interrupt_irqoff(struct
> kvm_vcpu *vcpu)
> > {
> > u32 intr_info = vmx_get_intr_info(vcpu);
> > unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK;
> > - gate_desc *desc = (gate_desc *)host_idt_base + vector;
> >
> > if (KVM_BUG(!is_external_intr(intr_info), vcpu->kvm,
> > "unexpected VM-Exit interrupt info: 0x%x", intr_info))
> > return;
> >
> > kvm_before_interrupt(vcpu, KVM_HANDLING_IRQ);
> > - vmx_do_interrupt_irqoff(gate_offset(desc));
> > + if (cpu_feature_enabled(X86_FEATURE_FRED))
> > + vmx_do_fred_interrupt_irqoff(vector); /* Event type is 0 */
>
> I strongly prefer to use code to document what's going on. E.g. the tail comment
> just left me wondering, what event type is 0? Whereas this makes it quite clear
> that KVM is signaling a hardware interrupt. The fact that it's a nop as far as
> code generation goes is irrelevant.
>
> vmx_do_fred_interrupt_irqoff((EVENT_TYPE_HWINT << 16) | vector);
Better readability.
>
> > + else
> > + vmx_do_interrupt_irqoff(gate_offset((gate_desc *)host_idt_base
> + vector));
> > kvm_after_interrupt(vcpu);
> >
> > vcpu->arch.at_instruction_boundary = true;
>
> Here's a diff for (hopefully) everything I've suggested above.
Thanks a lot! I will test it and update the patch in this mail thread.
>
> ---
> arch/x86/entry/entry_64_fred.S | 17 ++++++-
> arch/x86/kernel/traps.c | 5 --
> arch/x86/kvm/vmx/vmenter.S | 84 +++++++++++++++-------------------
> arch/x86/kvm/vmx/vmx.c | 7 +--
> 4 files changed, 55 insertions(+), 58 deletions(-)
>
> diff --git a/arch/x86/entry/entry_64_fred.S b/arch/x86/entry/entry_64_fred.S
> index 12063267d2ac..a973c0bd29f6 100644
> --- a/arch/x86/entry/entry_64_fred.S
> +++ b/arch/x86/entry/entry_64_fred.S
> @@ -10,7 +10,6 @@
> #include "calling.h"
>
> .code64
> - .section ".noinstr.text", "ax"
>
> .macro FRED_ENTER
> UNWIND_HINT_END_OF_STACK
> @@ -24,6 +23,22 @@
> POP_REGS
> .endm
>
> + .section .text, "ax"
> +
> +/* Note, this is instrumentation safe, and returns via RET, not ERETS! */
> +#if IS_ENABLED(CONFIG_KVM_INTEL)
> +SYM_CODE_START(fred_irq_entry_from_kvm)
> + FRED_ENTER
> + call external_interrupt
> + FRED_EXIT
> + RET
> +SYM_CODE_END(fred_irq_entry_from_kvm)
> +EXPORT_SYMBOL_GPL(fred_irq_entry_from_kvm);
> +#endif
> +
> + .section ".noinstr.text", "ax"
> +
> +
> /*
> * The new RIP value that FRED event delivery establishes is
> * IA32_FRED_CONFIG & ~FFFH for events that occur in ring 3.
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index 21eeba7b188f..cbcb83c71dab 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -1566,11 +1566,6 @@ int external_interrupt(struct pt_regs *regs)
> return 0;
> }
>
> -#if IS_ENABLED(CONFIG_KVM_INTEL)
> -/* For KVM VMX to handle IRQs in IRQ induced VM exits. */
> -EXPORT_SYMBOL_GPL(external_interrupt);
> -#endif
> -
> #endif /* CONFIG_X86_64 */
>
> void __init trap_init(void)
> diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
> index 79a4c91d9434..e25df565c3f8 100644
> --- a/arch/x86/kvm/vmx/vmenter.S
> +++ b/arch/x86/kvm/vmx/vmenter.S
> @@ -9,7 +9,6 @@
> #include <asm/segment.h>
> #include "kvm-asm-offsets.h"
> #include "run_flags.h"
> -#include "../../entry/calling.h"
>
> #define WORD_SIZE (BITS_PER_LONG / 8)
>
> @@ -33,15 +32,31 @@
> #define VCPU_R15 __VCPU_REGS_R15 * WORD_SIZE
> #endif
>
> +.macro VMX_DO_EVENT_FRAME_BEGIN
> + /*
> + * Unconditionally create a stack frame, getting the correct RSP on the
> + * stack (for x86-64) would take two instructions anyways, and RBP can
> + * be used to restore RSP to make objtool happy (see below).
> + */
> + push %_ASM_BP
> + mov %_ASM_SP, %_ASM_BP
> +.endm
> +
> +.macro VMX_DO_EVENT_FRAME_END
> + /*
> + * "Restore" RSP from RBP, even though {E,I}RET has already unwound RSP
> + * to the correct value *in most cases*. KVM's IRQ handling with FRED
> + * doesn't do ERETS, and objtool doesn't know the callee will IRET/ERET
> + * and, without the explicit restore, thinks the stack is getting walloped.
> + * Using an unwind hint is problematic due to x86-64's dynamic alignment.
> + */
> + mov %_ASM_BP, %_ASM_SP
> + pop %_ASM_BP
> +.endm
> +
> #ifdef CONFIG_X86_FRED
> .macro VMX_DO_FRED_EVENT_IRQOFF branch_insn branch_target nmi=0
> - /*
> - * Unconditionally create a stack frame, getting the correct RSP on the
> - * stack (for x86-64) would take two instructions anyways, and RBP can
> - * be used to restore RSP to make objtool happy (see below).
> - */
> - push %_ASM_BP
> - mov %_ASM_SP, %_ASM_BP
> + VMX_DO_EVENT_FRAME_BEGIN
>
> /*
> * Don't check the FRED stack level, the call stack leading to this
> @@ -78,43 +93,23 @@
> * push an error code before invoking the IRQ/NMI handler.
> *
> * Use LEA to get the return RIP and push it, then push an error code.
> + * Note, only NMI handling does an ERETS to the target! IRQ handling
> + * doesn't need to unmask NMIs and so simply uses CALL+RET, i.e. the
> + * RIP pushed here is only truly consumed for NMIs!
> */
> lea 1f(%rip), %rax
> push %rax
> push $0 /* FRED error code, 0 for NMI and external
> interrupts */
>
> - .if \nmi == 0
> - PUSH_REGS
> - mov %rsp, %rdi
> - .endif
> -
> \branch_insn \branch_target
> -
> - .if \nmi == 0
> - POP_REGS
> - .endif
> -
> 1:
> - /*
> - * "Restore" RSP from RBP, even though IRET has already unwound RSP to
> - * the correct value. objtool doesn't know the callee will IRET and,
> - * without the explicit restore, thinks the stack is getting walloped.
> - * Using an unwind hint is problematic due to x86-64's dynamic alignment.
> - */
> - mov %_ASM_BP, %_ASM_SP
> - pop %_ASM_BP
> + VMX_DO_EVENT_FRAME_END
> RET
> .endm
> #endif
>
> .macro VMX_DO_EVENT_IRQOFF call_insn call_target
> - /*
> - * Unconditionally create a stack frame, getting the correct RSP on the
> - * stack (for x86-64) would take two instructions anyways, and RBP can
> - * be used to restore RSP to make objtool happy (see below).
> - */
> - push %_ASM_BP
> - mov %_ASM_SP, %_ASM_BP
> + VMX_DO_EVENT_FRAME_BEGIN
>
> #ifdef CONFIG_X86_64
> /*
> @@ -129,14 +124,7 @@
> push $__KERNEL_CS
> \call_insn \call_target
>
> - /*
> - * "Restore" RSP from RBP, even though IRET has already unwound RSP to
> - * the correct value. objtool doesn't know the callee will IRET and,
> - * without the explicit restore, thinks the stack is getting walloped.
> - * Using an unwind hint is problematic due to x86-64's dynamic alignment.
> - */
> - mov %_ASM_BP, %_ASM_SP
> - pop %_ASM_BP
> + VMX_DO_EVENT_FRAME_END
> RET
> .endm
>
> @@ -375,11 +363,13 @@ SYM_INNER_LABEL_ALIGN(vmx_vmexit,
> SYM_L_GLOBAL)
>
> SYM_FUNC_END(__vmx_vcpu_run)
>
> -#ifdef CONFIG_X86_FRED
> SYM_FUNC_START(vmx_do_fred_nmi_irqoff)
> +#ifdef CONFIG_X86_FRED
> VMX_DO_FRED_EVENT_IRQOFF jmp fred_entrypoint_kernel nmi=1
> +#else
> + ud2
> +#endif
> SYM_FUNC_END(vmx_do_fred_nmi_irqoff)
> -#endif
>
> SYM_FUNC_START(vmx_do_nmi_irqoff)
> VMX_DO_EVENT_IRQOFF call asm_exc_nmi_kvm_vmx
> @@ -438,11 +428,13 @@ SYM_FUNC_END(vmread_error_trampoline)
> #endif
>
> .section .text, "ax"
> -#ifdef CONFIG_X86_FRED
> SYM_FUNC_START(vmx_do_fred_interrupt_irqoff)
> - VMX_DO_FRED_EVENT_IRQOFF call external_interrupt
> +#ifdef CONFIG_X86_FRED
> + VMX_DO_FRED_EVENT_IRQOFF call fred_irq_entry_from_kvm
> +#else
> + ud2
> +#endif
> SYM_FUNC_END(vmx_do_fred_interrupt_irqoff)
> -#endif
>
> SYM_FUNC_START(vmx_do_interrupt_irqoff)
> VMX_DO_EVENT_IRQOFF CALL_NOSPEC _ASM_ARG1
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index bf757f5071e4..cb4675dd87df 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6919,13 +6919,8 @@ static void vmx_apicv_post_state_restore(struct
> kvm_vcpu *vcpu)
> memset(vmx->pi_desc.pir, 0, sizeof(vmx->pi_desc.pir));
> }
>
> -#ifdef CONFIG_X86_FRED
> void vmx_do_fred_interrupt_irqoff(unsigned int vector);
> void vmx_do_fred_nmi_irqoff(unsigned int vector);
> -#else
> -#define vmx_do_fred_interrupt_irqoff(x) BUG()
> -#define vmx_do_fred_nmi_irqoff(x) BUG()
> -#endif
>
> void vmx_do_interrupt_irqoff(unsigned long entry);
> void vmx_do_nmi_irqoff(void);
> @@ -6976,7 +6971,7 @@ static void handle_external_interrupt_irqoff(struct
> kvm_vcpu *vcpu)
>
> kvm_before_interrupt(vcpu, KVM_HANDLING_IRQ);
> if (cpu_feature_enabled(X86_FEATURE_FRED))
> - vmx_do_fred_interrupt_irqoff(vector); /* Event type is 0 */
> + vmx_do_fred_interrupt_irqoff((EVENT_TYPE_HWINT << 16) |
> vector);
> else
> vmx_do_interrupt_irqoff(gate_offset((gate_desc *)host_idt_base
> + vector));
> kvm_after_interrupt(vcpu);
>
> base-commit: 8961078ffe509a97ec7803b17912e57c47b93fa2
> --
^ permalink raw reply
* Re: [PATCH v4 1/1] RDMA/mana_ib: Add EQ interrupt support to mana ib driver.
From: Jason Gunthorpe @ 2023-08-01 23:46 UTC (permalink / raw)
To: Long Li
Cc: Wei Hu, netdev@vger.kernel.org, linux-hyperv@vger.kernel.org,
linux-rdma@vger.kernel.org, Ajay Sharma, leon@kernel.org,
KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org, Dexuan Cui,
davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
pabeni@redhat.com, vkuznets, ssengar@linux.microsoft.com,
shradhagupta@linux.microsoft.com
In-Reply-To: <PH7PR21MB326367A455B78A1F230C5C34CE0AA@PH7PR21MB3263.namprd21.prod.outlook.com>
On Tue, Aug 01, 2023 at 07:06:57PM +0000, Long Li wrote:
> The driver interrupt code limits the CPU processing time of each EQ
> by reading a small batch of EQEs in this interrupt. It guarantees
> all the EQs are checked on this CPU, and limits the interrupt
> processing time for any given EQ. In this way, a bad EQ (which is
> stormed by a bad user doing unreasonable re-arming on the CQ) can't
> storm other EQs on this CPU.
Of course it can, the bad use just creates a million EQs and pushes a
bit of work through them constantly. How is that really any different
from pushing more EQEs into a single EQ?
And how does your EQ multiplexing work anyhow? Do you poll every EQ on
every interrupt? That itself is a DOS vector.
Jason
^ permalink raw reply
* [PATCH bpf-next 1/3] eth: add missing xdp.h includes in drivers
From: Jakub Kicinski @ 2023-08-02 0:32 UTC (permalink / raw)
To: ast
Cc: netdev, bpf, hawk, amritha.nambiar, aleksander.lobakin,
Jakub Kicinski, j.vosburgh, andy, shayagr, akiyano, ioana.ciornei,
claudiu.manoil, vladimir.oltean, wei.fang, shenwei.wang,
xiaoning.wang, linux-imx, dmichail, jeroendb, pkaligineedi,
shailend, jesse.brandeburg, anthony.l.nguyen, horatiu.vultur,
UNGLinuxDriver, kys, haiyangz, wei.liu, decui, peppe.cavallaro,
alexandre.torgue, joabreu, mcoquelin.stm32, grygorii.strashko,
longli, sharmaajay, daniel, john.fastabend, gerhard, simon.horman,
leon, linux-hyperv
In-Reply-To: <20230802003246.2153774-1-kuba@kernel.org>
Handful of drivers currently expect to get xdp.h by virtue
of including netdevice.h. This will soon no longer be the case
so add explicit includes.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
CC: j.vosburgh@gmail.com
CC: andy@greyhouse.net
CC: shayagr@amazon.com
CC: akiyano@amazon.com
CC: ioana.ciornei@nxp.com
CC: claudiu.manoil@nxp.com
CC: vladimir.oltean@nxp.com
CC: wei.fang@nxp.com
CC: shenwei.wang@nxp.com
CC: xiaoning.wang@nxp.com
CC: linux-imx@nxp.com
CC: dmichail@fungible.com
CC: jeroendb@google.com
CC: pkaligineedi@google.com
CC: shailend@google.com
CC: jesse.brandeburg@intel.com
CC: anthony.l.nguyen@intel.com
CC: horatiu.vultur@microchip.com
CC: UNGLinuxDriver@microchip.com
CC: kys@microsoft.com
CC: haiyangz@microsoft.com
CC: wei.liu@kernel.org
CC: decui@microsoft.com
CC: peppe.cavallaro@st.com
CC: alexandre.torgue@foss.st.com
CC: joabreu@synopsys.com
CC: mcoquelin.stm32@gmail.com
CC: grygorii.strashko@ti.com
CC: longli@microsoft.com
CC: sharmaajay@microsoft.com
CC: daniel@iogearbox.net
CC: hawk@kernel.org
CC: john.fastabend@gmail.com
CC: gerhard@engleder-embedded.com
CC: simon.horman@corigine.com
CC: leon@kernel.org
CC: linux-hyperv@vger.kernel.org
CC: bpf@vger.kernel.org
---
drivers/net/bonding/bond_main.c | 1 +
drivers/net/ethernet/amazon/ena/ena_netdev.h | 1 +
drivers/net/ethernet/engleder/tsnep.h | 1 +
drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h | 1 +
drivers/net/ethernet/freescale/enetc/enetc.h | 1 +
drivers/net/ethernet/freescale/fec.h | 1 +
drivers/net/ethernet/fungible/funeth/funeth_txrx.h | 1 +
drivers/net/ethernet/google/gve/gve.h | 1 +
drivers/net/ethernet/intel/igc/igc.h | 1 +
drivers/net/ethernet/microchip/lan966x/lan966x_main.h | 1 +
drivers/net/ethernet/microsoft/mana/mana_en.c | 1 +
drivers/net/ethernet/stmicro/stmmac/stmmac.h | 1 +
drivers/net/ethernet/ti/cpsw_priv.h | 1 +
drivers/net/hyperv/hyperv_net.h | 1 +
drivers/net/tap.c | 1 +
include/net/mana/mana.h | 2 ++
16 files changed, 17 insertions(+)
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 7a0f25301f7e..2f21cca4fdaf 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -90,6 +90,7 @@
#include <net/tls.h>
#endif
#include <net/ip6_route.h>
+#include <net/xdp.h>
#include "bonding_priv.h"
diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.h b/drivers/net/ethernet/amazon/ena/ena_netdev.h
index 248b715b4d68..a1134152ecce 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.h
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.h
@@ -15,6 +15,7 @@
#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <uapi/linux/bpf.h>
+#include <net/xdp.h>
#include "ena_com.h"
#include "ena_eth_com.h"
diff --git a/drivers/net/ethernet/engleder/tsnep.h b/drivers/net/ethernet/engleder/tsnep.h
index 11b29f56aaf9..6e14c918e3fb 100644
--- a/drivers/net/ethernet/engleder/tsnep.h
+++ b/drivers/net/ethernet/engleder/tsnep.h
@@ -14,6 +14,7 @@
#include <linux/net_tstamp.h>
#include <linux/ptp_clock_kernel.h>
#include <linux/miscdevice.h>
+#include <net/xdp.h>
#define TSNEP "tsnep"
diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h
index d56d7a13262e..bfb6c96c3b2f 100644
--- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h
+++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h
@@ -12,6 +12,7 @@
#include <linux/fsl/mc.h>
#include <linux/net_tstamp.h>
#include <net/devlink.h>
+#include <net/xdp.h>
#include <soc/fsl/dpaa2-io.h>
#include <soc/fsl/dpaa2-fd.h>
diff --git a/drivers/net/ethernet/freescale/enetc/enetc.h b/drivers/net/ethernet/freescale/enetc/enetc.h
index 8577cf7699a0..7439739cd81a 100644
--- a/drivers/net/ethernet/freescale/enetc/enetc.h
+++ b/drivers/net/ethernet/freescale/enetc/enetc.h
@@ -11,6 +11,7 @@
#include <linux/if_vlan.h>
#include <linux/phylink.h>
#include <linux/dim.h>
+#include <net/xdp.h>
#include "enetc_hw.h"
diff --git a/drivers/net/ethernet/freescale/fec.h b/drivers/net/ethernet/freescale/fec.h
index 8f1edcca96c4..5a0974e62f99 100644
--- a/drivers/net/ethernet/freescale/fec.h
+++ b/drivers/net/ethernet/freescale/fec.h
@@ -22,6 +22,7 @@
#include <linux/timecounter.h>
#include <dt-bindings/firmware/imx/rsrc.h>
#include <linux/firmware/imx/sci.h>
+#include <net/xdp.h>
#if defined(CONFIG_M523x) || defined(CONFIG_M527x) || defined(CONFIG_M528x) || \
defined(CONFIG_M520x) || defined(CONFIG_M532x) || defined(CONFIG_ARM) || \
diff --git a/drivers/net/ethernet/fungible/funeth/funeth_txrx.h b/drivers/net/ethernet/fungible/funeth/funeth_txrx.h
index 53b7e95213a8..5eec552a1f24 100644
--- a/drivers/net/ethernet/fungible/funeth/funeth_txrx.h
+++ b/drivers/net/ethernet/fungible/funeth/funeth_txrx.h
@@ -5,6 +5,7 @@
#include <linux/netdevice.h>
#include <linux/u64_stats_sync.h>
+#include <net/xdp.h>
/* Tx descriptor size */
#define FUNETH_SQE_SIZE 64U
diff --git a/drivers/net/ethernet/google/gve/gve.h b/drivers/net/ethernet/google/gve/gve.h
index 4b425bf71ede..a31256f70348 100644
--- a/drivers/net/ethernet/google/gve/gve.h
+++ b/drivers/net/ethernet/google/gve/gve.h
@@ -11,6 +11,7 @@
#include <linux/netdevice.h>
#include <linux/pci.h>
#include <linux/u64_stats_sync.h>
+#include <net/xdp.h>
#include "gve_desc.h"
#include "gve_desc_dqo.h"
diff --git a/drivers/net/ethernet/intel/igc/igc.h b/drivers/net/ethernet/intel/igc/igc.h
index 9db384f66a8e..4bffc3cb502f 100644
--- a/drivers/net/ethernet/intel/igc/igc.h
+++ b/drivers/net/ethernet/intel/igc/igc.h
@@ -15,6 +15,7 @@
#include <linux/net_tstamp.h>
#include <linux/bitfield.h>
#include <linux/hrtimer.h>
+#include <net/xdp.h>
#include "igc_hw.h"
diff --git a/drivers/net/ethernet/microchip/lan966x/lan966x_main.h b/drivers/net/ethernet/microchip/lan966x/lan966x_main.h
index 27f272831ea5..eb7d81b5e9f8 100644
--- a/drivers/net/ethernet/microchip/lan966x/lan966x_main.h
+++ b/drivers/net/ethernet/microchip/lan966x/lan966x_main.h
@@ -14,6 +14,7 @@
#include <net/pkt_cls.h>
#include <net/pkt_sched.h>
#include <net/switchdev.h>
+#include <net/xdp.h>
#include <vcap_api.h>
#include <vcap_api_client.h>
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index ac2acc9aca9d..21665f114fe9 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -11,6 +11,7 @@
#include <net/checksum.h>
#include <net/ip6_checksum.h>
+#include <net/xdp.h>
#include <net/mana/mana.h>
#include <net/mana/mana_auxiliary.h>
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac.h b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
index 4ce5eaaae513..f838a13b9447 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac.h
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
@@ -23,6 +23,7 @@
#include <linux/reset.h>
#include <net/page_pool.h>
#include <uapi/linux/bpf.h>
+#include <net/xdp.h>
struct stmmac_resources {
void __iomem *addr;
diff --git a/drivers/net/ethernet/ti/cpsw_priv.h b/drivers/net/ethernet/ti/cpsw_priv.h
index 34230145ca0b..67ca005fd990 100644
--- a/drivers/net/ethernet/ti/cpsw_priv.h
+++ b/drivers/net/ethernet/ti/cpsw_priv.h
@@ -7,6 +7,7 @@
#define DRIVERS_NET_ETHERNET_TI_CPSW_PRIV_H_
#include <uapi/linux/bpf.h>
+#include <net/xdp.h>
#include "davinci_cpdma.h"
diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index c9dd69dbe1b8..810977952f95 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -16,6 +16,7 @@
#include <linux/hyperv.h>
#include <linux/rndis.h>
#include <linux/jhash.h>
+#include <net/xdp.h>
/* RSS related */
#define OID_GEN_RECEIVE_SCALE_CAPABILITIES 0x00010203 /* query only */
diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 9137fb8c1c42..b196a2a54355 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -22,6 +22,7 @@
#include <net/net_namespace.h>
#include <net/rtnetlink.h>
#include <net/sock.h>
+#include <net/xdp.h>
#include <linux/virtio_net.h>
#include <linux/skb_array.h>
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index 024ad8ddb27e..1ccdca03e166 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -4,6 +4,8 @@
#ifndef _MANA_H
#define _MANA_H
+#include <net/xdp.h>
+
#include "gdma.h"
#include "hw_channel.h"
--
2.41.0
^ permalink raw reply related
* RE: [PATCH bpf-next 1/3] eth: add missing xdp.h includes in drivers
From: Wei Fang @ 2023-08-02 2:26 UTC (permalink / raw)
To: Jakub Kicinski, ast@kernel.org
Cc: netdev@vger.kernel.org, bpf@vger.kernel.org, hawk@kernel.org,
amritha.nambiar@intel.com, aleksander.lobakin@intel.com,
j.vosburgh@gmail.com, andy@greyhouse.net, shayagr@amazon.com,
akiyano@amazon.com, Ioana Ciornei, Claudiu Manoil,
Vladimir Oltean, Shenwei Wang, Clark Wang, dl-linux-imx,
dmichail@fungible.com, jeroendb@google.com,
pkaligineedi@google.com, shailend@google.com,
jesse.brandeburg@intel.com, anthony.l.nguyen@intel.com,
horatiu.vultur@microchip.com, UNGLinuxDriver@microchip.com,
kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
decui@microsoft.com, peppe.cavallaro@st.com,
alexandre.torgue@foss.st.com, joabreu@synopsys.com,
mcoquelin.stm32@gmail.com, grygorii.strashko@ti.com,
longli@microsoft.com, sharmaajay@microsoft.com,
daniel@iogearbox.net, john.fastabend@gmail.com,
gerhard@engleder-embedded.com, simon.horman@corigine.com,
leon@kernel.org, linux-hyperv@vger.kernel.org
In-Reply-To: <20230802003246.2153774-2-kuba@kernel.org>
> Handful of drivers currently expect to get xdp.h by virtue of including
> netdevice.h. This will soon no longer be the case so add explicit includes.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> CC: j.vosburgh@gmail.com
> CC: andy@greyhouse.net
> CC: shayagr@amazon.com
> CC: akiyano@amazon.com
> CC: ioana.ciornei@nxp.com
> CC: claudiu.manoil@nxp.com
> CC: vladimir.oltean@nxp.com
> CC: wei.fang@nxp.com
> CC: shenwei.wang@nxp.com
> CC: xiaoning.wang@nxp.com
> CC: linux-imx@nxp.com
> CC: dmichail@fungible.com
> CC: jeroendb@google.com
> CC: pkaligineedi@google.com
> CC: shailend@google.com
> CC: jesse.brandeburg@intel.com
> CC: anthony.l.nguyen@intel.com
> CC: horatiu.vultur@microchip.com
> CC: UNGLinuxDriver@microchip.com
> CC: kys@microsoft.com
> CC: haiyangz@microsoft.com
> CC: wei.liu@kernel.org
> CC: decui@microsoft.com
> CC: peppe.cavallaro@st.com
> CC: alexandre.torgue@foss.st.com
> CC: joabreu@synopsys.com
> CC: mcoquelin.stm32@gmail.com
> CC: grygorii.strashko@ti.com
> CC: longli@microsoft.com
> CC: sharmaajay@microsoft.com
> CC: daniel@iogearbox.net
> CC: hawk@kernel.org
> CC: john.fastabend@gmail.com
> CC: gerhard@engleder-embedded.com
> CC: simon.horman@corigine.com
> CC: leon@kernel.org
> CC: linux-hyperv@vger.kernel.org
> CC: bpf@vger.kernel.org
> ---
> drivers/net/bonding/bond_main.c | 1 +
> drivers/net/ethernet/amazon/ena/ena_netdev.h | 1 +
> drivers/net/ethernet/engleder/tsnep.h | 1 +
> drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h | 1 +
> drivers/net/ethernet/freescale/enetc/enetc.h | 1 +
> drivers/net/ethernet/freescale/fec.h | 1 +
> drivers/net/ethernet/fungible/funeth/funeth_txrx.h | 1 +
> drivers/net/ethernet/google/gve/gve.h | 1 +
> drivers/net/ethernet/intel/igc/igc.h | 1 +
> drivers/net/ethernet/microchip/lan966x/lan966x_main.h | 1 +
> drivers/net/ethernet/microsoft/mana/mana_en.c | 1 +
> drivers/net/ethernet/stmicro/stmmac/stmmac.h | 1 +
> drivers/net/ethernet/ti/cpsw_priv.h | 1 +
> drivers/net/hyperv/hyperv_net.h | 1 +
> drivers/net/tap.c | 1 +
> include/net/mana/mana.h | 2 ++
> 16 files changed, 17 insertions(+)
>
[...]
> diff --git a/drivers/net/ethernet/freescale/fec.h
> b/drivers/net/ethernet/freescale/fec.h
> index 8f1edcca96c4..5a0974e62f99 100644
> --- a/drivers/net/ethernet/freescale/fec.h
> +++ b/drivers/net/ethernet/freescale/fec.h
> @@ -22,6 +22,7 @@
> #include <linux/timecounter.h>
> #include <dt-bindings/firmware/imx/rsrc.h> #include
> <linux/firmware/imx/sci.h>
> +#include <net/xdp.h>
>
> #if defined(CONFIG_M523x) || defined(CONFIG_M527x) ||
> defined(CONFIG_M528x) || \
> defined(CONFIG_M520x) || defined(CONFIG_M532x) ||
> defined(CONFIG_ARM) || \ diff --git
For fec.h, it looks good to me.
Reviewed-by: Wei Fang <wei.fang@nxp.com>
^ permalink raw reply
* Re: [EXTERNAL] Re: [PATCH v4 1/1] RDMA/mana_ib: Add EQ interrupt support to mana ib driver.
From: Ajay Sharma @ 2023-08-02 4:11 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Long Li, Wei Hu, netdev@vger.kernel.org,
linux-hyperv@vger.kernel.org, linux-rdma@vger.kernel.org,
leon@kernel.org, KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org,
Dexuan Cui, davem@davemloft.net, edumazet@google.com,
kuba@kernel.org, pabeni@redhat.com, vkuznets,
ssengar@linux.microsoft.com, shradhagupta@linux.microsoft.com
In-Reply-To: <ZMmZO9IPmXNEB49t@ziepe.ca>
> On Aug 1, 2023, at 6:46 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Tue, Aug 01, 2023 at 07:06:57PM +0000, Long Li wrote:
>
>> The driver interrupt code limits the CPU processing time of each EQ
>> by reading a small batch of EQEs in this interrupt. It guarantees
>> all the EQs are checked on this CPU, and limits the interrupt
>> processing time for any given EQ. In this way, a bad EQ (which is
>> stormed by a bad user doing unreasonable re-arming on the CQ) can't
>> storm other EQs on this CPU.
>
> Of course it can, the bad use just creates a million EQs and pushes a
> bit of work through them constantly. How is that really any different
> from pushing more EQEs into a single EQ?
>
> And how does your EQ multiplexing work anyhow? Do you poll every EQ on
> every interrupt? That itself is a DOS vector.
>
> Jason
User does not create eqs directly . EQ creation is by product of opening device ie allocating context. I am not sure if the same process is allowed to open device multiple times - must be some kind of lock implemented. So million eqs are probably far fetched .
As for how the eq servicing is done - only those eq’s for which the interrupt is raised are checked. And each eq is tied only once and only to a single interrupt.
Ajay
^ permalink raw reply
* Re: [EXTERNAL] Re: [PATCH V7 net] net: mana: Fix MANA VF unload when hardware is
From: Souradeep Chakrabarti @ 2023-08-02 5:37 UTC (permalink / raw)
To: Kalesh Anakkur Purayil
Cc: Souradeep Chakrabarti, Simon Horman, KY Srinivasan, Haiyang Zhang,
wei.liu@kernel.org, Dexuan Cui, davem@davemloft.net,
edumazet@google.com, kuba@kernel.org, pabeni@redhat.com, Long Li,
Ajay Sharma, leon@kernel.org, cai.huoqing@linux.dev,
ssengar@linux.microsoft.com, vkuznets, tglx@linutronix.de,
linux-hyperv@vger.kernel.org, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org,
stable@vger.kernel.org
In-Reply-To: <CAH-L+nPsuoJfCQcJnpMWk5DPGev8f+YWi0K4V+fU=5-bxP5GVw@mail.gmail.com>
On Wed, Aug 02, 2023 at 10:57:52AM +0530, Kalesh Anakkur Purayil wrote:
> Hi Souradeep,
>
> It looks like the subject line is not complete. I could see "net: mana: Fix
> MANA VF unload when hardware is".
>
> Is that correct?
>
> Regards,
> Kalesh
>
Yes, it got truncated. Will fix it in next version.
> On Wed, Aug 2, 2023 at 12:29 AM Souradeep Chakrabarti <
> schakrabarti@microsoft.com> wrote:
>
> >
> >
> > >-----Original Message-----
> > >From: Simon Horman <horms@kernel.org>
> > >Sent: Tuesday, August 1, 2023 9:01 PM
> > >To: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
> > >Cc: KY Srinivasan <kys@microsoft.com>; Haiyang Zhang
> > ><haiyangz@microsoft.com>; wei.liu@kernel.org; Dexuan Cui
> > ><decui@microsoft.com>; davem@davemloft.net; edumazet@google.com;
> > >kuba@kernel.org; pabeni@redhat.com; Long Li <longli@microsoft.com>; Ajay
> > >Sharma <sharmaajay@microsoft.com>; leon@kernel.org;
> > >cai.huoqing@linux.dev; ssengar@linux.microsoft.com; vkuznets
> > ><vkuznets@redhat.com>; tglx@linutronix.de; linux-hyperv@vger.kernel.org;
> > >netdev@vger.kernel.org; linux-kernel@vger.kernel.org; linux-
> > >rdma@vger.kernel.org; Souradeep Chakrabarti
> > ><schakrabarti@microsoft.com>; stable@vger.kernel.org
> > >Subject: [EXTERNAL] Re: [PATCH V7 net] net: mana: Fix MANA VF unload when
> > >hardware is
> > >
> > >On Tue, Aug 01, 2023 at 05:29:13AM -0700, Souradeep Chakrabarti wrote:
> > >
> > >...
> > >
> > >Hi Souradeep,
> > >
> > >
> > >> + for (i = 0; i < apc->num_queues; i++) {
> > >> + txq = &apc->tx_qp[i].txq;
> > >> + while (skb = skb_dequeue(&txq->pending_skbs)) {
> > >
> > >W=1 builds with both clang-16 and gcc-12 complain that they would like an
> > >extra set of parentheses around an assignment used as a truth value.
> > Thanks for letting me know. I will fix it in next version.
> > >
> > >> + mana_unmap_skb(skb, apc);
> > >> + dev_consume_skb_any(skb);
> > >> + }
> > >> + atomic_set(&txq->pending_sends, 0);
> > >> + }
> > >> /* We're 100% sure the queues can no longer be woken up, because
> > >> * we're sure now mana_poll_tx_cq() can't be running.
> > >> */
> > >> --
> > >> 2.34.1
> > >>
> > >>
> >
> >
>
> --
> Regards,
> Kalesh A P
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox