* Re: [PATCH v2 1/1] RDMA/mana_ib: Add EQ interrupt support to mana ib driver.
From: Jason Gunthorpe @ 2023-06-12 18:16 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Leon Romanovsky, Wei Hu, netdev@vger.kernel.org,
linux-hyperv@vger.kernel.org, linux-rdma@vger.kernel.org, Long Li,
Ajay Sharma, KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org,
Dexuan Cui, davem@davemloft.net, edumazet@google.com,
pabeni@redhat.com, vkuznets@redhat.com,
ssengar@linux.microsoft.com, shradhagupta@linux.microsoft.com
In-Reply-To: <20230612102221.2ca726fd@kernel.org>
On Mon, Jun 12, 2023 at 10:22:21AM -0700, Jakub Kicinski wrote:
> On Mon, 12 Jun 2023 09:13:49 +0300 Leon Romanovsky wrote:
> > > Thanks for you comment. I am new to the process. I have a few
> > > questions regarding to this and hope you can help. First of all,
> > > the patch is mostly for IB. Is it possible for the patch to just go
> > > through the RDMA branch, since most of the changes are in RDMA?
> >
> > Yes, it can, we (RDMA) will handle it.
>
> Probably, although it's better to teach them some process sooner
> rather than later?
I've been of the opinion the shared branch process is difficult - we
took a long time to fine tune the process. If you don't fully
understand how to do this with git you can make a real mess of it.
So I would say MS is welcome to use it if they can do it right, but I
wouldn't push them to do so or expect they must to be
successful. Really only Mellanox and Intel seem to have enough churn
to justify it right now.
If they don't use shared branches then they must be responsible to
avoid conflicts, even if that means they have to delay sending patches
for a cycle.
Jason
^ permalink raw reply
* RE: [PATCH net-next] tcp: Make pingpong threshold tunable
From: Haiyang Zhang @ 2023-06-12 22:05 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: ncardwell@google.com, atenart@kernel.org, bagasdotme@gmail.com,
corbet@lwn.net, davem@davemloft.net, dsahern@kernel.org,
edumazet@google.com, kuba@kernel.org, KY Srinivasan,
linux-doc@vger.kernel.org, linux-hyperv@vger.kernel.org,
linux-kernel@vger.kernel.org, liushixin2@huawei.com,
maheshb@google.com, netdev@vger.kernel.org, olaf@aepfle.de,
pabeni@redhat.com, simon.horman@corigine.com, soheil@google.com,
stephen@networkplumber.org, Tim Gardner, vkuznets@redhat.com,
weiwan@google.com, ycheng@google.com, ykaliuta@redhat.com
In-Reply-To: <20230609195338.27299-1-kuniyu@amazon.com>
> -----Original Message-----
> From: Kuniyuki Iwashima <kuniyu@amazon.com>
> Sent: Friday, June 9, 2023 3:54 PM
> To: Haiyang Zhang <haiyangz@microsoft.com>
> Cc: ncardwell@google.com; atenart@kernel.org; bagasdotme@gmail.com;
> corbet@lwn.net; davem@davemloft.net; dsahern@kernel.org;
> edumazet@google.com; kuba@kernel.org; kuniyu@amazon.com; KY
> Srinivasan <kys@microsoft.com>; linux-doc@vger.kernel.org; linux-
> hyperv@vger.kernel.org; linux-kernel@vger.kernel.org;
> liushixin2@huawei.com; maheshb@google.com; netdev@vger.kernel.org;
> olaf@aepfle.de; pabeni@redhat.com; simon.horman@corigine.com;
> soheil@google.com; stephen@networkplumber.org; Tim Gardner
> <tim.gardner@canonical.com>; vkuznets@redhat.com; weiwan@google.com;
> ycheng@google.com; ykaliuta@redhat.com
> Subject: Re: [PATCH net-next] tcp: Make pingpong threshold tunable
>
> [You don't often get email from kuniyu@amazon.com. Learn why this is
> important at https://aka.ms/LearnAboutSenderIdentification ]
>
> From: Neal Cardwell <ncardwell@google.com>
> Date: Fri, 9 Jun 2023 15:16:00 -0400
> > On Fri, Jun 9, 2023 at 12:26 PM Haiyang Zhang <haiyangz@microsoft.com>
> wrote:
> >
> > Regarding the patch title:
> > > [PATCH net-next] tcp: Make pingpong threshold tunable
> >
> > There are many ways to make something tunable these days, including
> > BPF, setsockopt(), and sysctl. :-) This patch only uses sysctl. Please
> > consider a more clear/specific title, like:
> >
> > [PATCH net-next] tcp: set pingpong threshold via sysctl
> >
> > > TCP pingpong threshold is 1 by default. But some applications, like SQL DB
> > > may prefer a higher pingpong threshold to activate delayed acks in quick
> > > ack mode for better performance.
> > >
> > > The pingpong threshold and related code were changed to 3 in the year
> > > 2019, and reverted to 1 in the year 2022.
> >
> > Please include the specific commit, like:
> >
> > The pingpong threshold and related code were changed to 3 in the year
> > 2019 in:
> > commit 4a41f453bedf ("tcp: change pingpong threshold to 3")
> > and reverted to 1 in the year 2022 in:
> > commit 4d8f24eeedc5 ("Revert "tcp: change pingpong threshold to 3"")
> >
> > Then please make sure to use scripts/checkpatch.pl on your resulting
> > patch to check the formatting of the commit references, among other
> > things.
> >
> > > There is no single value that
> > > fits all applications.
> > >
> > > Add net.core.tcp_pingpong_thresh sysctl tunable,
> >
> > For consistency, TCP sysctls should be in net.ipv4 rather than
> > net.core. Yes, that is awkward, given IPv6 support. But consistency is
> > very important here. :-)
> >
> > > so it can be tuned for
> > > optimal performance based on the application needs.
> > >
> > > Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
> > > ---
> > > Documentation/admin-guide/sysctl/net.rst | 8 ++++++++
> > > include/net/inet_connection_sock.h | 14 +++++++++++---
> > > net/core/sysctl_net_core.c | 9 +++++++++
> > > net/ipv4/tcp.c | 2 ++
> > > net/ipv4/tcp_output.c | 17 +++++++++++++++--
> > > 5 files changed, 45 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/Documentation/admin-guide/sysctl/net.rst
> b/Documentation/admin-guide/sysctl/net.rst
> > > index 4877563241f3..16f54be9461f 100644
> > > --- a/Documentation/admin-guide/sysctl/net.rst
> > > +++ b/Documentation/admin-guide/sysctl/net.rst
> > > @@ -413,6 +413,14 @@ historical importance.
> > >
> > > Default: 0
> > >
> > > +tcp_pingpong_thresh
> > > +-------------------
> > > +
> > > +TCP pingpong threshold is 1 by default, but some application may need a
> higher
> > > +threshold for optimal performance.
> > > +
> > > +Default: 1, min: 1, max: 3
> >
> > If we want to make this tunable, it seems sad to make the max 3. I'd
> > suggest making the max 255, since we have 8 bits of space anyway in
> > the inet_csk(sk)->icsk_ack.pingpong field.
> >
> > > +
> > > 2. /proc/sys/net/unix - Parameters for Unix domain sockets
> > > ----------------------------------------------------------
> > >
> > > diff --git a/include/net/inet_connection_sock.h
> b/include/net/inet_connection_sock.h
> > > index c2b15f7e5516..e84e33ddae49 100644
> > > --- a/include/net/inet_connection_sock.h
> > > +++ b/include/net/inet_connection_sock.h
> > > @@ -324,11 +324,11 @@ void inet_csk_update_fastreuse(struct
> inet_bind_bucket *tb,
> > >
> > > struct dst_entry *inet_csk_update_pmtu(struct sock *sk, u32 mtu);
> > >
> > > -#define TCP_PINGPONG_THRESH 1
> > > +extern int tcp_pingpong_thresh;
> >
> > To match most TCP sysctls, this should be per-namespace, rather than
> global.
>
> Also, please change int to u8.
>
>
> >
> > Please follow a recent example by Eric, perhaps:
> > 65466904b015f6eeb9225b51aeb29b01a1d4b59c
> > tcp: adjust TSO packet sizes based on min_rtt
> >
> >
> > >
> > > static inline void inet_csk_enter_pingpong_mode(struct sock *sk)
> > > {
> > > - inet_csk(sk)->icsk_ack.pingpong = TCP_PINGPONG_THRESH;
> > > + inet_csk(sk)->icsk_ack.pingpong = tcp_pingpong_thresh;
> > > }
> >
> > inet_csk(sk)->icsk_ack.pingpong = sock_net(sk)-
> >sysctl_tcp_pingpong_thresh;
>
> Let's use READ_ONCE(sock_net(sk)->sysctl_tcp_pingpong_thresh).
> Same for other sysctl reads.
>
>
> >
> > > static inline void inet_csk_exit_pingpong_mode(struct sock *sk)
> > > @@ -338,7 +338,15 @@ static inline void
> inet_csk_exit_pingpong_mode(struct sock *sk)
> > >
> > > static inline bool inet_csk_in_pingpong_mode(struct sock *sk)
> > > {
> > > - return inet_csk(sk)->icsk_ack.pingpong >= TCP_PINGPONG_THRESH;
> > > + return inet_csk(sk)->icsk_ack.pingpong >= tcp_pingpong_thresh;
> > > +}
> >
> > Again, sock_net(sk)->sysctl_tcp_pingpong_thresh rather than
> tcp_pingpong_thresh.
> >
> > > +static inline void inet_csk_inc_pingpong_cnt(struct sock *sk)
> > > +{
> > > + struct inet_connection_sock *icsk = inet_csk(sk);
> > > +
> > > + if (icsk->icsk_ack.pingpong < U8_MAX)
> > > + icsk->icsk_ack.pingpong++;
> > > }
> > >
> > > static inline bool inet_csk_has_ulp(struct sock *sk)
> > > diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
> > > index 782273bb93c2..b5253567f2bd 100644
> > > --- a/net/core/sysctl_net_core.c
> > > +++ b/net/core/sysctl_net_core.c
> > > @@ -653,6 +653,15 @@ static struct ctl_table net_core_table[] = {
> >
> > Again, in net.ipv4, not net.core.
> >
> > > .proc_handler = proc_dointvec_minmax,
> > > .extra1 = SYSCTL_ZERO,
> > > },
> > > + {
> > > + .procname = "tcp_pingpong_thresh",
> > > + .data = &tcp_pingpong_thresh,
> > > + .maxlen = sizeof(int),
> > > + .mode = 0644,
> > > + .proc_handler = proc_dointvec_minmax,
> > > + .extra1 = SYSCTL_ONE,
> > > + .extra2 = SYSCTL_THREE,
> >
> > Please make the max U8_MAX to allow more flexibility (since we have 8
> > bits of space anyway in the inet_csk(sk)->icsk_ack.pingpong field).
>
> Please use proc_dou8vec_minmax(), then you can drop .extra2.
>
> .maxlen = sizeof(u8),
> .mode = 0644,
> .proc_handler = proc_dou8vec_minmax,
> .extra1 = SYSCTL_ONE,
>
> Thanks,
> Kuniyuki
>
> >
> > > + },
> > > { }
> > > };
> > >
> > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > > index 53b7751b68e1..dcd143193d41 100644
> > > --- a/net/ipv4/tcp.c
> > > +++ b/net/ipv4/tcp.c
> > > @@ -308,6 +308,8 @@ EXPORT_SYMBOL(tcp_have_smc);
> > > struct percpu_counter tcp_sockets_allocated
> ____cacheline_aligned_in_smp;
> > > EXPORT_SYMBOL(tcp_sockets_allocated);
> > >
> > > +int tcp_pingpong_thresh __read_mostly = 1;
> > > +
> >
> > Again, per-network-namespace. You will need to initialize the
> > per-netns value in tcp_sk_init(). Again, see Eric's
> > 65466904b015f6eeb9225b51aeb29b01a1d4b59c commit for an example.
> >
> > > * TCP splice context
> > > */
> > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > > index cfe128b81a01..576d21621778 100644
> > > --- a/net/ipv4/tcp_output.c
> > > +++ b/net/ipv4/tcp_output.c
> > > @@ -167,12 +167,25 @@ static void tcp_event_data_sent(struct
> tcp_sock *tp,
> > > if (tcp_packets_in_flight(tp) == 0)
> > > tcp_ca_event(sk, CA_EVENT_TX_START);
> > >
> > > + /* If tcp_pingpong_thresh > 1, and
> > > + * this is the first data packet sent in response to the
> > > + * previous received data,
> > > + * and it is a reply for ato after last received packet,
> > > + * increase pingpong count.
> > > + */
> > > + if (tcp_pingpong_thresh > 1 &&
> > > + before(tp->lsndtime, icsk->icsk_ack.lrcvtime) &&
> > > + (u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato)
> > > + inet_csk_inc_pingpong_cnt(sk);
> > > +
> >
> > Introducing this new code re-introduces a bug fixed in 4d8f24eeedc5.
> > As that commit description noted:
> >
> > This to-be-reverted commit was meant to apply a stricter rule for the
> > stack to enter pingpong mode. However, the condition used to check for
> > interactive session "before(tp->lsndtime, icsk->icsk_ack.lrcvtime)" is
> > jiffy based and might be too coarse, which delays the stack entering
> > pingpong mode.
> > We revert this patch so that we no longer use the above condition to
> > determine interactive session,
> >
> > > tp->lsndtime = now;
> > >
> > > - /* If it is a reply for ato after last received
> > > + /* If tcp_pingpong_thresh == 1, and
> >
> > Please remove the "If tcp_pingpong_thresh == 1, and" part, since this
> > is the correct code path no matter the value of the threshold.
> >
> > > + * it is a reply for ato after last received
> > > * packet, enter pingpong mode.
> > > */
> > > - if ((u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato)
> > > + if (tcp_pingpong_thresh == 1 &&
> >
> > Please remove the "if (tcp_pingpong_thresh == 1 &&" part, since this
> > is the correct code path no matter the value of the threshold.
> >
> > > + (u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato)
> > > inet_csk_enter_pingpong_mode(sk);
> >
> > Please make this call inet_csk_inc_pingpong_cnt(), since this is the
> > correct code path no matter the value of the threshold.
Thank Neal and Kuniyuki for the detailed reviews! I will update the code
accordingly.
- Haiyang
^ permalink raw reply
* [PATCH v2] x86/hyperv: add noop functions to x86_init mpparse functions
From: Saurabh Sengar @ 2023-06-13 7:13 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, x86,
mikelley, linux-kernel, linux-hyperv, hpa
In certain configurations, VTL0 and VTL2 can share the same VM
partition and hence share the same memory address space. In such
systems VTL2 has visibility of all of the VTL0 memory space.
When CONFIG_X86_MPPARSE is enabled for VTL2, the kernel performs
a scan of low memory to search for MP tables. However, in systems
where VTL0 controls the low memory and may contain valid tables
specific to VTL0, this scanning process can lead to confusion
within the VTL2 kernel.
In !ACPI system, there is no way to disable CONFIG_X86_MPPARSE
hence add the noop function instead.
Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
---
[V2]: Improve commit message
arch/x86/hyperv/hv_vtl.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/arch/x86/hyperv/hv_vtl.c b/arch/x86/hyperv/hv_vtl.c
index 85d38b9f3586..db5d2ea39fc0 100644
--- a/arch/x86/hyperv/hv_vtl.c
+++ b/arch/x86/hyperv/hv_vtl.c
@@ -25,6 +25,10 @@ void __init hv_vtl_init_platform(void)
x86_init.irqs.pre_vector_init = x86_init_noop;
x86_init.timers.timer_init = x86_init_noop;
+ /* Avoid searching for BIOS MP tables */
+ x86_init.mpparse.find_smp_config = x86_init_noop;
+ x86_init.mpparse.get_smp_config = x86_init_uint_noop;
+
x86_platform.get_wallclock = get_rtc_noop;
x86_platform.set_wallclock = set_rtc_noop;
x86_platform.get_nmi_reason = hv_get_nmi_reason;
--
2.34.1
^ permalink raw reply related
* Re: [PATCH v2 1/1] RDMA/mana_ib: Add EQ interrupt support to mana ib driver.
From: Leon Romanovsky @ 2023-06-13 7:29 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Wei Hu, netdev@vger.kernel.org, linux-hyperv@vger.kernel.org,
linux-rdma@vger.kernel.org, Long Li, Ajay Sharma, jgg@ziepe.ca,
KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org, Dexuan Cui,
davem@davemloft.net, edumazet@google.com, pabeni@redhat.com,
vkuznets@redhat.com, ssengar@linux.microsoft.com,
shradhagupta@linux.microsoft.com
In-Reply-To: <20230612102221.2ca726fd@kernel.org>
On Mon, Jun 12, 2023 at 10:22:21AM -0700, Jakub Kicinski wrote:
> On Mon, 12 Jun 2023 09:13:49 +0300 Leon Romanovsky wrote:
> > > Thanks for you comment. I am new to the process. I have a few
> > > questions regarding to this and hope you can help. First of all,
> > > the patch is mostly for IB. Is it possible for the patch to just go
> > > through the RDMA branch, since most of the changes are in RDMA?
> >
> > Yes, it can, we (RDMA) will handle it.
>
> Probably, although it's better to teach them some process sooner
> rather than later?
Like Jason wrote above, shared branch is valuable once the submitter has
enough volume to justify. In previous cycles, when we needed it, I simply
created shared branch [1] for them.
Even we (Nvidia), who has enough patch volume, sometimes prefer to
delay submission to next cycle and don't bother ourselves with shared
branch.
[1] https://lore.kernel.org/all/Y2v2CGEWC70g+Ot+@unreal/
Thanks
^ permalink raw reply
* Re: [PATCH v2] x86/hyperv: add noop functions to x86_init mpparse functions
From: Dave Hansen @ 2023-06-13 17:33 UTC (permalink / raw)
To: Saurabh Sengar, kys, haiyangz, wei.liu, decui, tglx, mingo, bp,
dave.hansen, x86, mikelley, linux-kernel, linux-hyperv, hpa
In-Reply-To: <1686640391-13001-1-git-send-email-ssengar@linux.microsoft.com>
On 6/13/23 00:13, Saurabh Sengar wrote:
> In certain configurations, VTL0 and VTL2 can share the same VM
> partition and hence share the same memory address space. In such
> systems VTL2 has visibility of all of the VTL0 memory space.
FWIW, this is pretty much gibberish to me. The way I suggest avoiding
producing gibberish is avoiding acronyms:
Hyper-V can run VMs at different privilege "levels". Sometimes,
it chooses to run two different VMs at different levels but
they share some of their address space. This <insert reason
that someone might want to do this>.
That's not gibberish.
> When CONFIG_X86_MPPARSE is enabled for VTL2, the kernel performs
> a scan of low memory to search for MP tables. However, in systems
> where VTL0 controls the low memory and may contain valid tables
> specific to VTL0, this scanning process can lead to confusion
> within the VTL2 kernel.
What is the end-user-visible effect of this "confusion"? A crash? A
warning? An error message? What is the impact on end users?
This information will help the maintainers decide how to disposition
your patch. Should we send it upstream immediately because it's
impacting millions of users? Or can we do it in a bit more leisurely
fashion because nobody cares?
> In !ACPI system, there is no way to disable CONFIG_X86_MPPARSE
> hence add the noop function instead.
This makes zero sense to me.
Like I told you before, we don't compile things out just because they
don't work on one weirdo system. If we did that, we'd have a billion
incompatible x86 kernel images that can't boot across systems.
^ permalink raw reply
* RE: [EXTERNAL] Re: [PATCH v2] x86/hyperv: add noop functions to x86_init mpparse functions
From: Saurabh Singh Sengar @ 2023-06-14 4:05 UTC (permalink / raw)
To: Dave Hansen, Saurabh Sengar, KY Srinivasan, Haiyang Zhang,
wei.liu@kernel.org, Dexuan Cui, tglx@linutronix.de,
mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
x86@kernel.org, Michael Kelley (LINUX),
linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
hpa@zytor.com
In-Reply-To: <23749756-022a-5574-af4d-a4a03d9542e1@intel.com>
> -----Original Message-----
> From: Dave Hansen <dave.hansen@intel.com>
> Sent: Tuesday, June 13, 2023 11:03 PM
> To: Saurabh Sengar <ssengar@linux.microsoft.com>; KY Srinivasan
> <kys@microsoft.com>; Haiyang Zhang <haiyangz@microsoft.com>;
> wei.liu@kernel.org; Dexuan Cui <decui@microsoft.com>; tglx@linutronix.de;
> mingo@redhat.com; bp@alien8.de; dave.hansen@linux.intel.com;
> x86@kernel.org; Michael Kelley (LINUX) <mikelley@microsoft.com>; linux-
> kernel@vger.kernel.org; linux-hyperv@vger.kernel.org; hpa@zytor.com
> Subject: [EXTERNAL] Re: [PATCH v2] x86/hyperv: add noop functions to
> x86_init mpparse functions
>
> On 6/13/23 00:13, Saurabh Sengar wrote:
> > In certain configurations, VTL0 and VTL2 can share the same VM
> > partition and hence share the same memory address space. In such
> > systems VTL2 has visibility of all of the VTL0 memory space.
>
> FWIW, this is pretty much gibberish to me. The way I suggest avoiding
> producing gibberish is avoiding acronyms:
>
> Hyper-V can run VMs at different privilege "levels". Sometimes,
> it chooses to run two different VMs at different levels but
> they share some of their address space. This <insert reason
> that someone might want to do this>.
>
> That's not gibberish.
Thanks for your suggestion I can add this in v3.
>
> > When CONFIG_X86_MPPARSE is enabled for VTL2, the kernel performs a
> > scan of low memory to search for MP tables. However, in systems where
> > VTL0 controls the low memory and may contain valid tables specific to
> > VTL0, this scanning process can lead to confusion within the VTL2
> > kernel.
>
> What is the end-user-visible effect of this "confusion"? A crash? A warning?
> An error message? What is the impact on end users?
The VTL2 kernel is currently scanning the VTL0 MP table and incorporating that
information, which is incorrect because VTL2 doesn't support MP tables and
instead, is booted with DT. While I don't have an immediate crash or error
message to present, this situation could potentially result in incorrect behaviour.
>
> This information will help the maintainers decide how to disposition your
> patch. Should we send it upstream immediately because it's impacting
> millions of users? Or can we do it in a bit more leisurely fashion because
> nobody cares?
I understand, I have provided all the information I have please consider what is
appropriate in this case.
>
> > In !ACPI system, there is no way to disable CONFIG_X86_MPPARSE hence
> > add the noop function instead.
>
> This makes zero sense to me.
My intention was to tell that this fix is required because we are in !ACPI system.
If it was ACPI system we could have simply disable this CONFIG_X86_MPPARSE
Option. But as you suggested I am fine removing these 2 lines.
>
> Like I told you before, we don't compile things out just because they don't
> work on one weirdo system. If we did that, we'd have a billion incompatible
> x86 kernel images that can't boot across systems.
>
Understood, thank you. I was just considering the option of keeping the default
setting for CONFIG_X86_MPPARSE as 'Y' and allowing the choice to change it to
'N' if someone specifically desires to do so. I also considered that nowadays, not
many kernels are likely to utilize MP tables for booting x86 machines, which is why
I thought this change wouldn't have a significant impact. Moreover, there is a
potential benefit in terms of reducing the kernel's size. However, I completely
respect and am open to whatever you decide, having better visibility of wider
kernel community needs.
- Saurabh
^ permalink raw reply
* [PATCH v2 4/5] tools: hv: Remove hv_fcopy_daemon
From: Saurabh Sengar @ 2023-06-14 18:15 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, mikelley, gregkh, corbet,
linux-kernel, linux-hyperv, linux-doc
In-Reply-To: <1686766512-2589-1-git-send-email-ssengar@linux.microsoft.com>
As the new fcopy application using uio is introduced, remove
obsolete hv_fcopy_daemon.c
Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
---
tools/hv/hv_fcopy_daemon.c | 266 -------------------------------------
1 file changed, 266 deletions(-)
delete mode 100644 tools/hv/hv_fcopy_daemon.c
diff --git a/tools/hv/hv_fcopy_daemon.c b/tools/hv/hv_fcopy_daemon.c
deleted file mode 100644
index 16d629b22c25..000000000000
--- a/tools/hv/hv_fcopy_daemon.c
+++ /dev/null
@@ -1,266 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0-only
-/*
- * An implementation of host to guest copy functionality for Linux.
- *
- * Copyright (C) 2014, Microsoft, Inc.
- *
- * Author : K. Y. Srinivasan <kys@microsoft.com>
- */
-
-
-#include <sys/types.h>
-#include <stdio.h>
-#include <stdlib.h>
-#include <unistd.h>
-#include <string.h>
-#include <errno.h>
-#include <linux/hyperv.h>
-#include <linux/limits.h>
-#include <syslog.h>
-#include <sys/stat.h>
-#include <fcntl.h>
-#include <getopt.h>
-
-static int target_fd;
-static char target_fname[PATH_MAX];
-static unsigned long long filesize;
-
-static int hv_start_fcopy(struct hv_start_fcopy *smsg)
-{
- int error = HV_E_FAIL;
- char *q, *p;
-
- filesize = 0;
- p = (char *)smsg->path_name;
- snprintf(target_fname, sizeof(target_fname), "%s/%s",
- (char *)smsg->path_name, (char *)smsg->file_name);
-
- syslog(LOG_INFO, "Target file name: %s", target_fname);
- /*
- * Check to see if the path is already in place; if not,
- * create if required.
- */
- while ((q = strchr(p, '/')) != NULL) {
- if (q == p) {
- p++;
- continue;
- }
- *q = '\0';
- if (access((char *)smsg->path_name, F_OK)) {
- if (smsg->copy_flags & CREATE_PATH) {
- if (mkdir((char *)smsg->path_name, 0755)) {
- syslog(LOG_ERR, "Failed to create %s",
- (char *)smsg->path_name);
- goto done;
- }
- } else {
- syslog(LOG_ERR, "Invalid path: %s",
- (char *)smsg->path_name);
- goto done;
- }
- }
- p = q + 1;
- *q = '/';
- }
-
- if (!access(target_fname, F_OK)) {
- syslog(LOG_INFO, "File: %s exists", target_fname);
- if (!(smsg->copy_flags & OVER_WRITE)) {
- error = HV_ERROR_ALREADY_EXISTS;
- goto done;
- }
- }
-
- target_fd = open(target_fname,
- O_RDWR | O_CREAT | O_TRUNC | O_CLOEXEC, 0744);
- if (target_fd == -1) {
- syslog(LOG_INFO, "Open Failed: %s", strerror(errno));
- goto done;
- }
-
- error = 0;
-done:
- if (error)
- target_fname[0] = '\0';
- return error;
-}
-
-static int hv_copy_data(struct hv_do_fcopy *cpmsg)
-{
- ssize_t bytes_written;
- int ret = 0;
-
- bytes_written = pwrite(target_fd, cpmsg->data, cpmsg->size,
- cpmsg->offset);
-
- filesize += cpmsg->size;
- if (bytes_written != cpmsg->size) {
- switch (errno) {
- case ENOSPC:
- ret = HV_ERROR_DISK_FULL;
- break;
- default:
- ret = HV_E_FAIL;
- break;
- }
- syslog(LOG_ERR, "pwrite failed to write %llu bytes: %ld (%s)",
- filesize, (long)bytes_written, strerror(errno));
- }
-
- return ret;
-}
-
-/*
- * Reset target_fname to "" in the two below functions for hibernation: if
- * the fcopy operation is aborted by hibernation, the daemon should remove the
- * partially-copied file; to achieve this, the hv_utils driver always fakes a
- * CANCEL_FCOPY message upon suspend, and later when the VM resumes back,
- * the daemon calls hv_copy_cancel() to remove the file; if a file is copied
- * successfully before suspend, hv_copy_finished() must reset target_fname to
- * avoid that the file can be incorrectly removed upon resume, since the faked
- * CANCEL_FCOPY message is spurious in this case.
- */
-static int hv_copy_finished(void)
-{
- close(target_fd);
- target_fname[0] = '\0';
- return 0;
-}
-static int hv_copy_cancel(void)
-{
- close(target_fd);
- if (strlen(target_fname) > 0) {
- unlink(target_fname);
- target_fname[0] = '\0';
- }
- return 0;
-
-}
-
-void print_usage(char *argv[])
-{
- fprintf(stderr, "Usage: %s [options]\n"
- "Options are:\n"
- " -n, --no-daemon stay in foreground, don't daemonize\n"
- " -h, --help print this help\n", argv[0]);
-}
-
-int main(int argc, char *argv[])
-{
- int fcopy_fd = -1;
- int error;
- int daemonize = 1, long_index = 0, opt;
- int version = FCOPY_CURRENT_VERSION;
- union {
- struct hv_fcopy_hdr hdr;
- struct hv_start_fcopy start;
- struct hv_do_fcopy copy;
- __u32 kernel_modver;
- } buffer = { };
- int in_handshake;
-
- static struct option long_options[] = {
- {"help", no_argument, 0, 'h' },
- {"no-daemon", no_argument, 0, 'n' },
- {0, 0, 0, 0 }
- };
-
- while ((opt = getopt_long(argc, argv, "hn", long_options,
- &long_index)) != -1) {
- switch (opt) {
- case 'n':
- daemonize = 0;
- break;
- case 'h':
- default:
- print_usage(argv);
- exit(EXIT_FAILURE);
- }
- }
-
- if (daemonize && daemon(1, 0)) {
- syslog(LOG_ERR, "daemon() failed; error: %s", strerror(errno));
- exit(EXIT_FAILURE);
- }
-
- openlog("HV_FCOPY", 0, LOG_USER);
- syslog(LOG_INFO, "starting; pid is:%d", getpid());
-
-reopen_fcopy_fd:
- if (fcopy_fd != -1)
- close(fcopy_fd);
- /* Remove any possible partially-copied file on error */
- hv_copy_cancel();
- in_handshake = 1;
- fcopy_fd = open("/dev/vmbus/hv_fcopy", O_RDWR);
-
- if (fcopy_fd < 0) {
- syslog(LOG_ERR, "open /dev/vmbus/hv_fcopy failed; error: %d %s",
- errno, strerror(errno));
- exit(EXIT_FAILURE);
- }
-
- /*
- * Register with the kernel.
- */
- if ((write(fcopy_fd, &version, sizeof(int))) != sizeof(int)) {
- syslog(LOG_ERR, "Registration failed: %s", strerror(errno));
- exit(EXIT_FAILURE);
- }
-
- while (1) {
- /*
- * In this loop we process fcopy messages after the
- * handshake is complete.
- */
- ssize_t len;
-
- len = pread(fcopy_fd, &buffer, sizeof(buffer), 0);
- if (len < 0) {
- syslog(LOG_ERR, "pread failed: %s", strerror(errno));
- goto reopen_fcopy_fd;
- }
-
- if (in_handshake) {
- if (len != sizeof(buffer.kernel_modver)) {
- syslog(LOG_ERR, "invalid version negotiation");
- exit(EXIT_FAILURE);
- }
- in_handshake = 0;
- syslog(LOG_INFO, "kernel module version: %u",
- buffer.kernel_modver);
- continue;
- }
-
- switch (buffer.hdr.operation) {
- case START_FILE_COPY:
- error = hv_start_fcopy(&buffer.start);
- break;
- case WRITE_TO_FILE:
- error = hv_copy_data(&buffer.copy);
- break;
- case COMPLETE_FCOPY:
- error = hv_copy_finished();
- break;
- case CANCEL_FCOPY:
- error = hv_copy_cancel();
- break;
-
- default:
- error = HV_E_FAIL;
- syslog(LOG_ERR, "Unknown operation: %d",
- buffer.hdr.operation);
-
- }
-
- /*
- * pwrite() may return an error due to the faked CANCEL_FCOPY
- * message upon hibernation. Ignore the error by resetting the
- * dev file, i.e. closing and re-opening it.
- */
- if (pwrite(fcopy_fd, &error, sizeof(int), 0) != sizeof(int)) {
- syslog(LOG_ERR, "pwrite failed: %s", strerror(errno));
- goto reopen_fcopy_fd;
- }
- }
-}
--
2.34.1
^ permalink raw reply related
* [PATCH v2 1/5] uio: Add hv_vmbus_client driver
From: Saurabh Sengar @ 2023-06-14 18:15 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, mikelley, gregkh, corbet,
linux-kernel, linux-hyperv, linux-doc
In-Reply-To: <1686766512-2589-1-git-send-email-ssengar@linux.microsoft.com>
Generic VMBus driver that can be dynamically bound, to any simple
low speed Hyper-V VMBus device. This driver supports single
channel and uses hypercall instead of monitor bits to interrupt
host, which is suitable for low speed devices. Additionally, the
driver also provide the flexibility of custom ring buffer sizes and
avoid memory allocation for gpadl to offer memory optimization.
Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
---
[V2]
- Update driver info in Documentation/driver-api/uio-howto.rst
- Update ring_size sysfs info in Documentation/ABI/stable/sysfs-bus-vmbus
- Remove DRIVER_VERSION
- Remove refcnt
- scnprintf -> sysfs_emit
- sysfs_create_file -> ATTRIBUTE_GROUPS + ".driver.groups";
- sysfs_create_bin_file -> device_create_bin_file
- dev_notice -> dev_err
- remove MODULE_VERSION
Documentation/ABI/stable/sysfs-bus-vmbus | 7 +
Documentation/driver-api/uio-howto.rst | 46 +++++
drivers/uio/Kconfig | 12 ++
drivers/uio/Makefile | 1 +
drivers/uio/uio_hv_vmbus_client.c | 217 +++++++++++++++++++++++
5 files changed, 283 insertions(+)
create mode 100644 drivers/uio/uio_hv_vmbus_client.c
diff --git a/Documentation/ABI/stable/sysfs-bus-vmbus b/Documentation/ABI/stable/sysfs-bus-vmbus
index 3066feae1d8d..5d075fbd150b 100644
--- a/Documentation/ABI/stable/sysfs-bus-vmbus
+++ b/Documentation/ABI/stable/sysfs-bus-vmbus
@@ -153,6 +153,13 @@ Contact: Stephen Hemminger <sthemmin@microsoft.com>
Description: Binary file created by uio_hv_generic for ring buffer
Users: Userspace drivers
+What: /sys/bus/vmbus/devices/<UUID>/ring_size
+Date: June. 2023
+KernelVersion: 6.4
+Contact: Saurabh Sengar <ssengar@microsoft.com>
+Description: File created by uio_hv_vmbus_client for setting device ring buffer size
+Users: Userspace drivers
+
What: /sys/bus/vmbus/devices/<UUID>/channels/<N>/intr_in_full
Date: February 2019
KernelVersion: 5.0
diff --git a/Documentation/driver-api/uio-howto.rst b/Documentation/driver-api/uio-howto.rst
index 907ffa3b38f5..33b67f876b96 100644
--- a/Documentation/driver-api/uio-howto.rst
+++ b/Documentation/driver-api/uio-howto.rst
@@ -722,6 +722,52 @@ For example::
/sys/bus/vmbus/devices/3811fe4d-0fa0-4b62-981a-74fc1084c757/channels/21/ring
+Generic Hyper-V driver for low speed devices
+============================================
+
+The generic driver is a kernel module named uio_hv_vmbus_client. It
+supports slow devices on the Hyper-V VMBus similar to uio_hv_generic
+for faster devices. This driver also gives flexibility of customized
+ring buffer sizes.
+
+Making the driver recognize the device
+--------------------------------------
+
+Since the driver does not declare any device GUID's, it will not get
+loaded automatically and will not automatically bind to any devices, you
+must load it and allocate id to the driver yourself. For example, to use
+the fcopy device class GUID::
+
+ DEV_UUID=eb765408-105f-49b6-b4aa-c123b64d17d4
+ driverctl -b vmbus set-override $DEV_UUID uio_hv_vmbus_client
+
+You can verify that the device has been bound to the driver by looking
+for it in sysfs, for example like the following::
+
+ ls -l /sys/bus/vmbus/devices/eb765408-105f-49b6-b4aa-c123b64d17d4/driver
+
+Which if successful should print::
+
+ .../eb765408-105f-49b6-b4aa-c123b64d17d4/driver -> ../../../bus/vmbus/drivers/uio_hv_vmbus_client
+
+Things to know about uio_hv_vmbus_client
+----------------------------------------
+
+The uio_hv_vmbus_client driver facilitates the mapping of the Hyper-V device
+ring buffer to userspace and offers an interface to manage it.
+
+The userspace API for mapping and performing read/write operations on the device
+ring buffer is implemented in tools/hv/vmbus_bufring.c. Userspace applications
+can utilize this file as a library to build their logic on top of it.
+
+Additionally, the uio_hv_vmbus_client driver offers a sysfs entry called
+"ring_size" that allows users to adjust the size of the device ring buffer
+before opening it.
+
+For example::
+
+ /sys/bus/vmbus/devices/eb765408-105f-49b6-b4aa-c123b64d17d4/ring_size
+
Further information
===================
diff --git a/drivers/uio/Kconfig b/drivers/uio/Kconfig
index 2e16c5338e5b..dcc727e6fd3f 100644
--- a/drivers/uio/Kconfig
+++ b/drivers/uio/Kconfig
@@ -166,6 +166,18 @@ config UIO_HV_GENERIC
If you compile this as a module, it will be called uio_hv_generic.
+config UIO_HV_SLOW_DEVICES
+ tristate "Generic driver for low speed VMBus devices"
+ depends on HYPERV
+ help
+ Generic driver that you can bind, dynamically, to any simple
+ Hyper-V VMBus device with single channel and these devices
+ uses hypercall instead of monitor bits to interrupt host. This
+ driver also provide the flexibility of custom ring buffer sizes.
+ It is useful for slower VMbus devices.
+
+ If you compile this as a module, it will be called uio_hv_vmbus_client.
+
config UIO_DFL
tristate "Generic driver for DFL (Device Feature List) bus"
depends on FPGA_DFL
diff --git a/drivers/uio/Makefile b/drivers/uio/Makefile
index f2f416a14228..44be0f96da34 100644
--- a/drivers/uio/Makefile
+++ b/drivers/uio/Makefile
@@ -11,4 +11,5 @@ obj-$(CONFIG_UIO_PRUSS) += uio_pruss.o
obj-$(CONFIG_UIO_MF624) += uio_mf624.o
obj-$(CONFIG_UIO_FSL_ELBC_GPCM) += uio_fsl_elbc_gpcm.o
obj-$(CONFIG_UIO_HV_GENERIC) += uio_hv_generic.o
+obj-$(CONFIG_UIO_HV_SLOW_DEVICES) += uio_hv_vmbus_client.o
obj-$(CONFIG_UIO_DFL) += uio_dfl.o
diff --git a/drivers/uio/uio_hv_vmbus_client.c b/drivers/uio/uio_hv_vmbus_client.c
new file mode 100644
index 000000000000..92fa54271971
--- /dev/null
+++ b/drivers/uio/uio_hv_vmbus_client.c
@@ -0,0 +1,217 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * uio_hv_vmbus_client - UIO driver for low speed VMBus devices
+ *
+ * Copyright (c) 2023, Microsoft Corporation.
+ *
+ * Authors:
+ * Saurabh Sengar <ssengar@microsoft.com>
+ *
+ * Since the driver does not declare any device ids, you must allocate
+ * id and bind the device to the driver yourself. For example:
+ * driverctl -b vmbus set-override <dev uuid> uio_hv_vmbus_client
+ */
+
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/uio_driver.h>
+#include <linux/hyperv.h>
+
+#define DRIVER_AUTHOR "Saurabh Sengar <ssengar@microsoft.com>"
+#define DRIVER_DESC "Generic UIO driver for low speed VMBus devices"
+
+#define DEFAULT_HV_RING_SIZE VMBUS_RING_SIZE(3 * HV_HYP_PAGE_SIZE)
+static int ring_size = DEFAULT_HV_RING_SIZE;
+
+struct uio_hv_vmbus_dev {
+ struct uio_info info;
+ struct hv_device *device;
+};
+
+/* Sysfs API to allow mmap of the ring buffers */
+static int uio_hv_vmbus_mmap(struct file *filp, struct kobject *kobj,
+ struct bin_attribute *attr, struct vm_area_struct *vma)
+{
+ struct device *dev = container_of(kobj, struct device, kobj);
+ struct hv_device *hv_dev = container_of(dev, struct hv_device, device);
+ struct vmbus_channel *channel = hv_dev->channel;
+ void *ring_buffer = page_address(channel->ringbuffer_page);
+
+ return vm_iomap_memory(vma, virt_to_phys(ring_buffer),
+ channel->ringbuffer_pagecount << PAGE_SHIFT);
+}
+
+static const struct bin_attribute ring_buffer_bin_attr = {
+ .attr = {
+ .name = "ringbuffer",
+ .mode = 0600,
+ },
+ .mmap = uio_hv_vmbus_mmap,
+};
+
+/*
+ * This is the irqcontrol callback to be registered to uio_info.
+ * It can be used to disable/enable interrupt from user space processes.
+ *
+ * @param info
+ * pointer to uio_info.
+ * @param irq_state
+ * state value. 1 to enable interrupt, 0 to disable interrupt.
+ */
+static int uio_hv_vmbus_irqcontrol(struct uio_info *info, s32 irq_state)
+{
+ struct uio_hv_vmbus_dev *pdata = info->priv;
+ struct hv_device *hv_dev = pdata->device;
+
+ /* Issue a full memory barrier before triggering the notification */
+ virt_mb();
+
+ vmbus_setevent(hv_dev->channel);
+ return 0;
+}
+
+/*
+ * Callback from vmbus_event when something is in inbound ring.
+ */
+static void uio_hv_vmbus_channel_cb(void *context)
+{
+ struct uio_hv_vmbus_dev *pdata = context;
+
+ /* Issue a full memory barrier before sending the event to userspace */
+ virt_mb();
+
+ uio_event_notify(&pdata->info);
+}
+
+static int uio_hv_vmbus_open(struct uio_info *info, struct inode *inode)
+{
+ struct uio_hv_vmbus_dev *pdata = container_of(info, struct uio_hv_vmbus_dev, info);
+ struct hv_device *hv_dev = pdata->device;
+ struct vmbus_channel *channel = hv_dev->channel;
+ int ret;
+
+ ret = vmbus_open(channel, ring_size, ring_size, NULL, 0,
+ uio_hv_vmbus_channel_cb, pdata);
+ if (ret) {
+ dev_err(&hv_dev->device, "error %d when opening the channel\n", ret);
+ return ret;
+ }
+ channel->inbound.ring_buffer->interrupt_mask = 0;
+ set_channel_read_mode(channel, HV_CALL_ISR);
+
+ ret = device_create_bin_file(&hv_dev->device, &ring_buffer_bin_attr);
+ if (ret)
+ dev_err(&hv_dev->device, "sysfs create ring bin file failed; %d\n", ret);
+
+ return ret;
+}
+
+/* VMbus primary channel is closed on last close */
+static int uio_hv_vmbus_release(struct uio_info *info, struct inode *inode)
+{
+ struct uio_hv_vmbus_dev *pdata = container_of(info, struct uio_hv_vmbus_dev, info);
+ struct hv_device *hv_dev = pdata->device;
+ struct vmbus_channel *channel = hv_dev->channel;
+
+ device_remove_bin_file(&hv_dev->device, &ring_buffer_bin_attr);
+ vmbus_close(channel);
+
+ return 0;
+}
+
+static ssize_t ring_size_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ return sysfs_emit(buf, "%d\n", ring_size);
+}
+
+static ssize_t ring_size_store(struct device *dev, struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ unsigned int val;
+
+ if (kstrtouint(buf, 0, &val) < 0)
+ return -EINVAL;
+
+ if (val < HV_HYP_PAGE_SIZE)
+ return -EINVAL;
+
+ ring_size = val;
+
+ return count;
+}
+
+static DEVICE_ATTR_RW(ring_size);
+
+static struct attribute *uio_hv_vmbus_client_attrs[] = {
+ &dev_attr_ring_size.attr,
+ NULL,
+};
+ATTRIBUTE_GROUPS(uio_hv_vmbus_client);
+
+static int uio_hv_vmbus_probe(struct hv_device *dev, const struct hv_vmbus_device_id *dev_id)
+{
+ struct uio_hv_vmbus_dev *pdata;
+ int ret;
+ char *name = NULL;
+
+ pdata = devm_kzalloc(&dev->device, sizeof(*pdata), GFP_KERNEL);
+ if (!pdata)
+ return -ENOMEM;
+
+ name = kasprintf(GFP_KERNEL, "%pUl", &dev->dev_instance);
+
+ /* Fill general uio info */
+ pdata->info.name = name; /* /sys/class/uio/uioX/name */
+ pdata->info.version = "1";
+ pdata->info.irqcontrol = uio_hv_vmbus_irqcontrol;
+ pdata->info.open = uio_hv_vmbus_open;
+ pdata->info.release = uio_hv_vmbus_release;
+ pdata->info.irq = UIO_IRQ_CUSTOM;
+ pdata->info.priv = pdata;
+ pdata->device = dev;
+
+ ret = uio_register_device(&dev->device, &pdata->info);
+ if (ret) {
+ dev_err(&dev->device, "uio_hv_vmbus register failed\n");
+ return ret;
+ }
+
+ hv_set_drvdata(dev, pdata);
+
+ return 0;
+}
+
+static void uio_hv_vmbus_remove(struct hv_device *dev)
+{
+ struct uio_hv_vmbus_dev *pdata = hv_get_drvdata(dev);
+
+ if (pdata)
+ uio_unregister_device(&pdata->info);
+}
+
+static struct hv_driver uio_hv_vmbus_drv = {
+ .driver.dev_groups = uio_hv_vmbus_client_groups,
+ .name = "uio_hv_vmbus_client",
+ .id_table = NULL, /* only dynamic id's */
+ .probe = uio_hv_vmbus_probe,
+ .remove = uio_hv_vmbus_remove,
+};
+
+static int __init uio_hv_vmbus_init(void)
+{
+ return vmbus_driver_register(&uio_hv_vmbus_drv);
+}
+
+static void __exit uio_hv_vmbus_exit(void)
+{
+ vmbus_driver_unregister(&uio_hv_vmbus_drv);
+}
+
+module_init(uio_hv_vmbus_init);
+module_exit(uio_hv_vmbus_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
--
2.34.1
^ permalink raw reply related
* [PATCH v2 2/5] tools: hv: Add vmbus_bufring
From: Saurabh Sengar @ 2023-06-14 18:15 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, mikelley, gregkh, corbet,
linux-kernel, linux-hyperv, linux-doc
In-Reply-To: <1686766512-2589-1-git-send-email-ssengar@linux.microsoft.com>
Common userspace interface for read/write from VMBus ringbuffer.
This implementation is open for use by any userspace driver or
application seeking direct control over VMBus ring buffers.
A significant part of this code is borrowed from DPDK.
Link: https://github.com/DPDK/dpdk/
Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
---
[V2]
- simpler sysfs path, less parsing
tools/hv/vmbus_bufring.c | 322 +++++++++++++++++++++++++++++++++++++++
tools/hv/vmbus_bufring.h | 158 +++++++++++++++++++
2 files changed, 480 insertions(+)
create mode 100644 tools/hv/vmbus_bufring.c
create mode 100644 tools/hv/vmbus_bufring.h
diff --git a/tools/hv/vmbus_bufring.c b/tools/hv/vmbus_bufring.c
new file mode 100644
index 000000000000..d44a06d45b03
--- /dev/null
+++ b/tools/hv/vmbus_bufring.c
@@ -0,0 +1,322 @@
+// SPDX-License-Identifier: BSD-3-Clause
+/*
+ * Copyright (c) 2009-2012,2016,2023 Microsoft Corp.
+ * Copyright (c) 2012 NetApp Inc.
+ * Copyright (c) 2012 Citrix Inc.
+ * All rights reserved.
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <emmintrin.h>
+#include <linux/limits.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <sys/uio.h>
+#include <unistd.h>
+#include "vmbus_bufring.h"
+
+#define rte_compiler_barrier() ({ asm volatile ("" : : : "memory"); })
+
+#define rte_smp_rwmb() ({ asm volatile ("" : : : "memory"); })
+
+#define VMBUS_RQST_ERROR 0xFFFFFFFFFFFFFFFF
+#define ALIGN(val, align) ((typeof(val))((val) & (~((typeof(val))((align) - 1)))))
+
+void *vmbus_uio_map(char *sys_ring_buf_path, int size)
+{
+ void *map;
+ int fd;
+
+ fd = open(sys_ring_buf_path, O_RDWR);
+ if (fd < 0)
+ return NULL;
+
+ map = mmap(NULL, 2 * size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ close(fd);
+ if (map == MAP_FAILED)
+ return NULL;
+
+ return map;
+}
+
+/* Increase bufring index by inc with wraparound */
+static inline uint32_t vmbus_br_idxinc(uint32_t idx, uint32_t inc, uint32_t sz)
+{
+ idx += inc;
+ if (idx >= sz)
+ idx -= sz;
+
+ return idx;
+}
+
+void vmbus_br_setup(struct vmbus_br *br, void *buf, unsigned int blen)
+{
+ br->vbr = buf;
+ br->windex = br->vbr->windex;
+ br->dsize = blen - sizeof(struct vmbus_bufring);
+}
+
+static inline __always_inline void
+rte_smp_mb(void)
+{
+ asm volatile("lock addl $0, -128(%%rsp); " ::: "memory");
+}
+
+static inline int
+rte_atomic32_cmpset(volatile uint32_t *dst, uint32_t exp, uint32_t src)
+{
+ uint8_t res;
+
+ asm volatile("lock ; "
+ "cmpxchgl %[src], %[dst];"
+ "sete %[res];"
+ : [res] "=a" (res), /* output */
+ [dst] "=m" (*dst)
+ : [src] "r" (src), /* input */
+ "a" (exp),
+ "m" (*dst)
+ : "memory"); /* no-clobber list */
+ return res;
+}
+
+static inline uint32_t
+vmbus_txbr_copyto(const struct vmbus_br *tbr, uint32_t windex,
+ const void *src0, uint32_t cplen)
+{
+ uint8_t *br_data = tbr->vbr->data;
+ uint32_t br_dsize = tbr->dsize;
+ const uint8_t *src = src0;
+
+ /* XXX use double mapping like Linux kernel? */
+ if (cplen > br_dsize - windex) {
+ uint32_t fraglen = br_dsize - windex;
+
+ /* Wrap-around detected */
+ memcpy(br_data + windex, src, fraglen);
+ memcpy(br_data, src + fraglen, cplen - fraglen);
+ } else {
+ memcpy(br_data + windex, src, cplen);
+ }
+
+ return vmbus_br_idxinc(windex, cplen, br_dsize);
+}
+
+/*
+ * Write scattered channel packet to TX bufring.
+ *
+ * The offset of this channel packet is written as a 64bits value
+ * immediately after this channel packet.
+ *
+ * The write goes through three stages:
+ * 1. Reserve space in ring buffer for the new data.
+ * Writer atomically moves priv_write_index.
+ * 2. Copy the new data into the ring.
+ * 3. Update the tail of the ring (visible to host) that indicates
+ * next read location. Writer updates write_index
+ */
+static int
+vmbus_txbr_write(struct vmbus_br *tbr, const struct iovec iov[], int iovlen,
+ bool *need_sig)
+{
+ struct vmbus_bufring *vbr = tbr->vbr;
+ uint32_t ring_size = tbr->dsize;
+ uint32_t old_windex, next_windex, windex, total;
+ uint64_t save_windex;
+ int i;
+
+ total = 0;
+ for (i = 0; i < iovlen; i++)
+ total += iov[i].iov_len;
+ total += sizeof(save_windex);
+
+ /* Reserve space in ring */
+ do {
+ uint32_t avail;
+
+ /* Get current free location */
+ old_windex = tbr->windex;
+
+ /* Prevent compiler reordering this with calculation */
+ rte_compiler_barrier();
+
+ avail = vmbus_br_availwrite(tbr, old_windex);
+
+ /* If not enough space in ring, then tell caller. */
+ if (avail <= total)
+ return -EAGAIN;
+
+ next_windex = vmbus_br_idxinc(old_windex, total, ring_size);
+
+ /* Atomic update of next write_index for other threads */
+ } while (!rte_atomic32_cmpset(&tbr->windex, old_windex, next_windex));
+
+ /* Space from old..new is now reserved */
+ windex = old_windex;
+ for (i = 0; i < iovlen; i++)
+ windex = vmbus_txbr_copyto(tbr, windex, iov[i].iov_base, iov[i].iov_len);
+
+ /* Set the offset of the current channel packet. */
+ save_windex = ((uint64_t)old_windex) << 32;
+ windex = vmbus_txbr_copyto(tbr, windex, &save_windex,
+ sizeof(save_windex));
+
+ /* The region reserved should match region used */
+ if (windex != next_windex)
+ return -EINVAL;
+
+ /* Ensure that data is available before updating host index */
+ rte_smp_rwmb();
+
+ /* Checkin for our reservation. wait for our turn to update host */
+ while (!rte_atomic32_cmpset(&vbr->windex, old_windex, next_windex))
+ _mm_pause();
+
+ return 0;
+}
+
+int rte_vmbus_chan_send(struct vmbus_br *txbr, uint16_t type, void *data,
+ uint32_t dlen, uint32_t flags)
+{
+ struct vmbus_chanpkt pkt;
+ unsigned int pktlen, pad_pktlen;
+ const uint32_t hlen = sizeof(pkt);
+ bool send_evt = false;
+ uint64_t pad = 0;
+ struct iovec iov[3];
+ int error;
+
+ pktlen = hlen + dlen;
+ pad_pktlen = ALIGN(pktlen, sizeof(uint64_t));
+
+ pkt.hdr.type = type;
+ pkt.hdr.flags = flags;
+ pkt.hdr.hlen = hlen >> VMBUS_CHANPKT_SIZE_SHIFT;
+ pkt.hdr.tlen = pad_pktlen >> VMBUS_CHANPKT_SIZE_SHIFT;
+ pkt.hdr.xactid = VMBUS_RQST_ERROR;
+
+ iov[0].iov_base = &pkt;
+ iov[0].iov_len = hlen;
+ iov[1].iov_base = data;
+ iov[1].iov_len = dlen;
+ iov[2].iov_base = &pad;
+ iov[2].iov_len = pad_pktlen - pktlen;
+
+ error = vmbus_txbr_write(txbr, iov, 3, &send_evt);
+
+ return error;
+}
+
+static inline uint32_t
+vmbus_rxbr_copyfrom(const struct vmbus_br *rbr, uint32_t rindex,
+ void *dst0, size_t cplen)
+{
+ const uint8_t *br_data = rbr->vbr->data;
+ uint32_t br_dsize = rbr->dsize;
+ uint8_t *dst = dst0;
+
+ if (cplen > br_dsize - rindex) {
+ uint32_t fraglen = br_dsize - rindex;
+
+ /* Wrap-around detected. */
+ memcpy(dst, br_data + rindex, fraglen);
+ memcpy(dst + fraglen, br_data, cplen - fraglen);
+ } else {
+ memcpy(dst, br_data + rindex, cplen);
+ }
+
+ return vmbus_br_idxinc(rindex, cplen, br_dsize);
+}
+
+/* Copy data from receive ring but don't change index */
+static int
+vmbus_rxbr_peek(const struct vmbus_br *rbr, void *data, size_t dlen)
+{
+ uint32_t avail;
+
+ /*
+ * The requested data and the 64bits channel packet
+ * offset should be there at least.
+ */
+ avail = vmbus_br_availread(rbr);
+ if (avail < dlen + sizeof(uint64_t))
+ return -EAGAIN;
+
+ vmbus_rxbr_copyfrom(rbr, rbr->vbr->rindex, data, dlen);
+ return 0;
+}
+
+/*
+ * Copy data from receive ring and change index
+ * NOTE:
+ * We assume (dlen + skip) == sizeof(channel packet).
+ */
+static int
+vmbus_rxbr_read(struct vmbus_br *rbr, void *data, size_t dlen, size_t skip)
+{
+ struct vmbus_bufring *vbr = rbr->vbr;
+ uint32_t br_dsize = rbr->dsize;
+ uint32_t rindex;
+
+ if (vmbus_br_availread(rbr) < dlen + skip + sizeof(uint64_t))
+ return -EAGAIN;
+
+ /* Record where host was when we started read (for debug) */
+ rbr->windex = rbr->vbr->windex;
+
+ /*
+ * Copy channel packet from RX bufring.
+ */
+ rindex = vmbus_br_idxinc(rbr->vbr->rindex, skip, br_dsize);
+ rindex = vmbus_rxbr_copyfrom(rbr, rindex, data, dlen);
+
+ /*
+ * Discard this channel packet's 64bits offset, which is useless to us.
+ */
+ rindex = vmbus_br_idxinc(rindex, sizeof(uint64_t), br_dsize);
+
+ /* Update the read index _after_ the channel packet is fetched. */
+ rte_compiler_barrier();
+
+ vbr->rindex = rindex;
+
+ return 0;
+}
+
+int rte_vmbus_chan_recv_raw(struct vmbus_br *rxbr,
+ void *data, uint32_t *len)
+{
+ struct vmbus_chanpkt_hdr pkt;
+ uint32_t dlen, bufferlen = *len;
+ int error;
+
+ error = vmbus_rxbr_peek(rxbr, &pkt, sizeof(pkt));
+ if (error)
+ return error;
+
+ if (unlikely(pkt.hlen < VMBUS_CHANPKT_HLEN_MIN))
+ /* XXX this channel is dead actually. */
+ return -EIO;
+
+ if (unlikely(pkt.hlen > pkt.tlen))
+ return -EIO;
+
+ /* Length are in quad words */
+ dlen = pkt.tlen << VMBUS_CHANPKT_SIZE_SHIFT;
+ *len = dlen;
+
+ /* If caller buffer is not large enough */
+ if (unlikely(dlen > bufferlen))
+ return -ENOBUFS;
+
+ /* Read data and skip packet header */
+ error = vmbus_rxbr_read(rxbr, data, dlen, 0);
+ if (error)
+ return error;
+
+ /* Return the number of bytes read */
+ return dlen + sizeof(uint64_t);
+}
diff --git a/tools/hv/vmbus_bufring.h b/tools/hv/vmbus_bufring.h
new file mode 100644
index 000000000000..9c82822587a4
--- /dev/null
+++ b/tools/hv/vmbus_bufring.h
@@ -0,0 +1,158 @@
+/* SPDX-License-Identifier: BSD-3-Clause */
+
+#ifndef _VMBUS_BUF_H_
+#define _VMBUS_BUF_H_
+
+#include <stdbool.h>
+#include <stdint.h>
+
+#define __packed __attribute__((__packed__))
+#define unlikely(x) __builtin_expect(!!(x), 0)
+
+#define ICMSGHDRFLAG_TRANSACTION 1
+#define ICMSGHDRFLAG_REQUEST 2
+#define ICMSGHDRFLAG_RESPONSE 4
+
+#define IC_VERSION_NEGOTIATION_MAX_VER_COUNT 100
+#define ICMSG_HDR (sizeof(struct vmbuspipe_hdr) + sizeof(struct icmsg_hdr))
+#define ICMSG_NEGOTIATE_PKT_SIZE(icframe_vercnt, icmsg_vercnt) \
+ (ICMSG_HDR + sizeof(struct icmsg_negotiate) + \
+ (((icframe_vercnt) + (icmsg_vercnt)) * sizeof(struct ic_version)))
+
+/*
+ * Channel packets
+ */
+
+/* Channel packet flags */
+#define VMBUS_CHANPKT_TYPE_INBAND 0x0006
+#define VMBUS_CHANPKT_TYPE_RXBUF 0x0007
+#define VMBUS_CHANPKT_TYPE_GPA 0x0009
+#define VMBUS_CHANPKT_TYPE_COMP 0x000b
+
+#define VMBUS_CHANPKT_FLAG_NONE 0
+#define VMBUS_CHANPKT_FLAG_RC 0x0001 /* report completion */
+
+#define VMBUS_CHANPKT_SIZE_SHIFT 3
+#define VMBUS_CHANPKT_SIZE_ALIGN BIT(VMBUS_CHANPKT_SIZE_SHIFT)
+#define VMBUS_CHANPKT_HLEN_MIN \
+ (sizeof(struct vmbus_chanpkt_hdr) >> VMBUS_CHANPKT_SIZE_SHIFT)
+
+/*
+ * Buffer ring
+ */
+struct vmbus_bufring {
+ volatile uint32_t windex;
+ volatile uint32_t rindex;
+
+ /*
+ * Interrupt mask {0,1}
+ *
+ * For TX bufring, host set this to 1, when it is processing
+ * the TX bufring, so that we can safely skip the TX event
+ * notification to host.
+ *
+ * For RX bufring, once this is set to 1 by us, host will not
+ * further dispatch interrupts to us, even if there are data
+ * pending on the RX bufring. This effectively disables the
+ * interrupt of the channel to which this RX bufring is attached.
+ */
+ volatile uint32_t imask;
+
+ /*
+ * Win8 uses some of the reserved bits to implement
+ * interrupt driven flow management. On the send side
+ * we can request that the receiver interrupt the sender
+ * when the ring transitions from being full to being able
+ * to handle a message of size "pending_send_sz".
+ *
+ * Add necessary state for this enhancement.
+ */
+ volatile uint32_t pending_send;
+ uint32_t reserved1[12];
+
+ union {
+ struct {
+ uint32_t feat_pending_send_sz:1;
+ };
+ uint32_t value;
+ } feature_bits;
+
+ /* Pad it to rte_mem_page_size() so that data starts on page boundary */
+ uint8_t reserved2[4028];
+
+ /*
+ * Ring data starts here + RingDataStartOffset
+ * !!! DO NOT place any fields below this !!!
+ */
+ uint8_t data[];
+} __packed;
+
+struct vmbus_br {
+ struct vmbus_bufring *vbr;
+ uint32_t dsize;
+ uint32_t windex; /* next available location */
+};
+
+struct vmbus_chanpkt_hdr {
+ uint16_t type; /* VMBUS_CHANPKT_TYPE_ */
+ uint16_t hlen; /* header len, in 8 bytes */
+ uint16_t tlen; /* total len, in 8 bytes */
+ uint16_t flags; /* VMBUS_CHANPKT_FLAG_ */
+ uint64_t xactid;
+} __packed;
+
+struct vmbus_chanpkt {
+ struct vmbus_chanpkt_hdr hdr;
+} __packed;
+
+struct vmbuspipe_hdr {
+ unsigned int flags;
+ unsigned int msgsize;
+} __packed;
+
+struct ic_version {
+ unsigned short major;
+ unsigned short minor;
+} __packed;
+
+struct icmsg_negotiate {
+ unsigned short icframe_vercnt;
+ unsigned short icmsg_vercnt;
+ unsigned int reserved;
+ struct ic_version icversion_data[]; /* any size array */
+} __packed;
+
+struct icmsg_hdr {
+ struct ic_version icverframe;
+ unsigned short icmsgtype;
+ struct ic_version icvermsg;
+ unsigned short icmsgsize;
+ unsigned int status;
+ unsigned char ictransaction_id;
+ unsigned char icflags;
+ unsigned char reserved[2];
+} __packed;
+
+int rte_vmbus_chan_recv_raw(struct vmbus_br *rxbr, void *data, uint32_t *len);
+int rte_vmbus_chan_send(struct vmbus_br *txbr, uint16_t type, void *data,
+ uint32_t dlen, uint32_t flags);
+void vmbus_br_setup(struct vmbus_br *br, void *buf, unsigned int blen);
+void *vmbus_uio_map(char *sys_ring_buf_path, int size);
+
+/* Amount of space available for write */
+static inline uint32_t vmbus_br_availwrite(const struct vmbus_br *br, uint32_t windex)
+{
+ uint32_t rindex = br->vbr->rindex;
+
+ if (windex >= rindex)
+ return br->dsize - (windex - rindex);
+ else
+ return rindex - windex;
+}
+
+static inline uint32_t vmbus_br_availread(const struct vmbus_br *br)
+{
+ return br->dsize - vmbus_br_availwrite(br, br->vbr->windex);
+}
+
+#endif /* !_VMBUS_BUF_H_ */
--
2.34.1
^ permalink raw reply related
* [PATCH v2 3/5] tools: hv: Add new fcopy application based on uio driver
From: Saurabh Sengar @ 2023-06-14 18:15 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, mikelley, gregkh, corbet,
linux-kernel, linux-hyperv, linux-doc
In-Reply-To: <1686766512-2589-1-git-send-email-ssengar@linux.microsoft.com>
New fcopy application which utilizes uio_hv_vmbus_client driver
Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
---
[V2]
- simpler sysfs path
tools/hv/Build | 3 +-
tools/hv/Makefile | 10 +-
tools/hv/hv_fcopy_uio_daemon.c | 489 +++++++++++++++++++++++++++++++++
3 files changed, 496 insertions(+), 6 deletions(-)
create mode 100644 tools/hv/hv_fcopy_uio_daemon.c
diff --git a/tools/hv/Build b/tools/hv/Build
index 6cf51fa4b306..7d1f1698069b 100644
--- a/tools/hv/Build
+++ b/tools/hv/Build
@@ -1,3 +1,4 @@
hv_kvp_daemon-y += hv_kvp_daemon.o
hv_vss_daemon-y += hv_vss_daemon.o
-hv_fcopy_daemon-y += hv_fcopy_daemon.o
+hv_fcopy_uio_daemon-y += hv_fcopy_uio_daemon.o
+hv_fcopy_uio_daemon-y += vmbus_bufring.o
diff --git a/tools/hv/Makefile b/tools/hv/Makefile
index fe770e679ae8..944180cf916e 100644
--- a/tools/hv/Makefile
+++ b/tools/hv/Makefile
@@ -17,7 +17,7 @@ MAKEFLAGS += -r
override CFLAGS += -O2 -Wall -g -D_GNU_SOURCE -I$(OUTPUT)include
-ALL_TARGETS := hv_kvp_daemon hv_vss_daemon hv_fcopy_daemon
+ALL_TARGETS := hv_kvp_daemon hv_vss_daemon hv_fcopy_uio_daemon
ALL_PROGRAMS := $(patsubst %,$(OUTPUT)%,$(ALL_TARGETS))
ALL_SCRIPTS := hv_get_dhcp_info.sh hv_get_dns_info.sh hv_set_ifconfig.sh
@@ -39,10 +39,10 @@ $(HV_VSS_DAEMON_IN): FORCE
$(OUTPUT)hv_vss_daemon: $(HV_VSS_DAEMON_IN)
$(QUIET_LINK)$(CC) $(CFLAGS) $(LDFLAGS) $< -o $@
-HV_FCOPY_DAEMON_IN := $(OUTPUT)hv_fcopy_daemon-in.o
-$(HV_FCOPY_DAEMON_IN): FORCE
- $(Q)$(MAKE) $(build)=hv_fcopy_daemon
-$(OUTPUT)hv_fcopy_daemon: $(HV_FCOPY_DAEMON_IN)
+HV_FCOPY_UIO_DAEMON_IN := $(OUTPUT)hv_fcopy_uio_daemon-in.o
+$(HV_FCOPY_UIO_DAEMON_IN): FORCE
+ $(Q)$(MAKE) $(build)=hv_fcopy_uio_daemon
+$(OUTPUT)hv_fcopy_uio_daemon: $(HV_FCOPY_UIO_DAEMON_IN)
$(QUIET_LINK)$(CC) $(CFLAGS) $(LDFLAGS) $< -o $@
clean:
diff --git a/tools/hv/hv_fcopy_uio_daemon.c b/tools/hv/hv_fcopy_uio_daemon.c
new file mode 100644
index 000000000000..a84210fdc888
--- /dev/null
+++ b/tools/hv/hv_fcopy_uio_daemon.c
@@ -0,0 +1,489 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * An implementation of host to guest copy functionality for Linux.
+ *
+ * Copyright (C) 2023, Microsoft, Inc.
+ *
+ * Author : K. Y. Srinivasan <kys@microsoft.com>
+ * Author : Saurabh Sengar <ssengar@microsoft.com>
+ *
+ */
+
+#include <dirent.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <getopt.h>
+#include <locale.h>
+#include <stdbool.h>
+#include <stddef.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <syslog.h>
+#include <unistd.h>
+#include <wchar.h>
+#include <sys/stat.h>
+#include <linux/hyperv.h>
+#include <linux/limits.h>
+#include "vmbus_bufring.h"
+
+#define ICMSGTYPE_NEGOTIATE 0
+#define ICMSGTYPE_FCOPY 7
+
+#define WIN8_SRV_MAJOR 1
+#define WIN8_SRV_MINOR 1
+#define WIN8_SRV_VERSION (WIN8_SRV_MAJOR << 16 | WIN8_SRV_MINOR)
+
+#define MAX_FOLDER_NAME 15
+#define MAX_PATH_LEN 15
+#define FCOPY_SYSFS_RINGBUF "/sys/bus/vmbus/devices/eb765408-105f-49b6-b4aa-c123b64d17d4/ringbuffer"
+#define FCOPY_UIO "/sys/bus/vmbus/devices/eb765408-105f-49b6-b4aa-c123b64d17d4/uio"
+
+#define FCOPY_VER_COUNT 1
+static const int fcopy_versions[] = {
+ WIN8_SRV_VERSION
+};
+
+#define FW_VER_COUNT 1
+static const int fw_versions[] = {
+ UTIL_FW_VERSION
+};
+
+#define HV_RING_SIZE (4 * 4096)
+
+unsigned char desc[HV_RING_SIZE];
+
+static int target_fd;
+static char target_fname[PATH_MAX];
+static unsigned long long filesize;
+
+static int hv_fcopy_create_file(char *file_name, char *path_name, __u32 flags)
+{
+ int error = HV_E_FAIL;
+ char *q, *p;
+
+ filesize = 0;
+ p = (char *)path_name;
+ snprintf(target_fname, sizeof(target_fname), "%s/%s",
+ (char *)path_name, (char *)file_name);
+
+ /*
+ * Check to see if the path is already in place; if not,
+ * create if required.
+ */
+ while ((q = strchr(p, '/')) != NULL) {
+ if (q == p) {
+ p++;
+ continue;
+ }
+ *q = '\0';
+ if (access(path_name, F_OK)) {
+ if (flags & CREATE_PATH) {
+ if (mkdir(path_name, 0755)) {
+ syslog(LOG_ERR, "Failed to create %s",
+ path_name);
+ goto done;
+ }
+ } else {
+ syslog(LOG_ERR, "Invalid path: %s", path_name);
+ goto done;
+ }
+ }
+ p = q + 1;
+ *q = '/';
+ }
+
+ if (!access(target_fname, F_OK)) {
+ syslog(LOG_INFO, "File: %s exists", target_fname);
+ if (!(flags & OVER_WRITE)) {
+ error = HV_ERROR_ALREADY_EXISTS;
+ goto done;
+ }
+ }
+
+ target_fd = open(target_fname,
+ O_RDWR | O_CREAT | O_TRUNC | O_CLOEXEC, 0744);
+ if (target_fd == -1) {
+ syslog(LOG_INFO, "Open Failed: %s", strerror(errno));
+ goto done;
+ }
+
+ error = 0;
+done:
+ if (error)
+ target_fname[0] = '\0';
+ return error;
+}
+
+/* copy the data into the file */
+static int hv_copy_data(struct hv_do_fcopy *cpmsg)
+{
+ ssize_t len;
+ int ret = 0;
+
+ len = pwrite(target_fd, cpmsg->data, cpmsg->size, cpmsg->offset);
+
+ filesize += cpmsg->size;
+ if (len != cpmsg->size) {
+ switch (errno) {
+ case ENOSPC:
+ ret = HV_ERROR_DISK_FULL;
+ break;
+ default:
+ ret = HV_E_FAIL;
+ break;
+ }
+ syslog(LOG_ERR, "pwrite failed to write %llu bytes: %ld (%s)",
+ filesize, (long)len, strerror(errno));
+ }
+
+ return ret;
+}
+
+static int hv_copy_finished(void)
+{
+ close(target_fd);
+ target_fname[0] = '\0';
+
+ return 0;
+}
+
+static void print_usage(char *argv[])
+{
+ fprintf(stderr, "Usage: %s [options]\n"
+ "Options are:\n"
+ " -n, --no-daemon stay in foreground, don't daemonize\n"
+ " -h, --help print this help\n", argv[0]);
+}
+
+static bool vmbus_prep_negotiate_resp(struct icmsg_hdr *icmsghdrp, unsigned char *buf,
+ unsigned int buflen, const int *fw_version, int fw_vercnt,
+ const int *srv_version, int srv_vercnt,
+ int *nego_fw_version, int *nego_srv_version)
+{
+ int icframe_major, icframe_minor;
+ int icmsg_major, icmsg_minor;
+ int fw_major, fw_minor;
+ int srv_major, srv_minor;
+ int i, j;
+ bool found_match = false;
+ struct icmsg_negotiate *negop;
+
+ /* Check that there's enough space for icframe_vercnt, icmsg_vercnt */
+ if (buflen < ICMSG_HDR + offsetof(struct icmsg_negotiate, reserved)) {
+ syslog(LOG_ERR, "Invalid icmsg negotiate");
+ return false;
+ }
+
+ icmsghdrp->icmsgsize = 0x10;
+ negop = (struct icmsg_negotiate *)&buf[ICMSG_HDR];
+
+ icframe_major = negop->icframe_vercnt;
+ icframe_minor = 0;
+
+ icmsg_major = negop->icmsg_vercnt;
+ icmsg_minor = 0;
+
+ /* Validate negop packet */
+ if (icframe_major > IC_VERSION_NEGOTIATION_MAX_VER_COUNT ||
+ icmsg_major > IC_VERSION_NEGOTIATION_MAX_VER_COUNT ||
+ ICMSG_NEGOTIATE_PKT_SIZE(icframe_major, icmsg_major) > buflen) {
+ syslog(LOG_ERR, "Invalid icmsg negotiate - icframe_major: %u, icmsg_major: %u\n",
+ icframe_major, icmsg_major);
+ goto fw_error;
+ }
+
+ /*
+ * Select the framework version number we will
+ * support.
+ */
+
+ for (i = 0; i < fw_vercnt; i++) {
+ fw_major = (fw_version[i] >> 16);
+ fw_minor = (fw_version[i] & 0xFFFF);
+
+ for (j = 0; j < negop->icframe_vercnt; j++) {
+ if (negop->icversion_data[j].major == fw_major &&
+ negop->icversion_data[j].minor == fw_minor) {
+ icframe_major = negop->icversion_data[j].major;
+ icframe_minor = negop->icversion_data[j].minor;
+ found_match = true;
+ break;
+ }
+ }
+
+ if (found_match)
+ break;
+ }
+
+ if (!found_match)
+ goto fw_error;
+
+ found_match = false;
+
+ for (i = 0; i < srv_vercnt; i++) {
+ srv_major = (srv_version[i] >> 16);
+ srv_minor = (srv_version[i] & 0xFFFF);
+
+ for (j = negop->icframe_vercnt;
+ (j < negop->icframe_vercnt + negop->icmsg_vercnt);
+ j++) {
+ if (negop->icversion_data[j].major == srv_major &&
+ negop->icversion_data[j].minor == srv_minor) {
+ icmsg_major = negop->icversion_data[j].major;
+ icmsg_minor = negop->icversion_data[j].minor;
+ found_match = true;
+ break;
+ }
+ }
+
+ if (found_match)
+ break;
+ }
+
+ /*
+ * Respond with the framework and service
+ * version numbers we can support.
+ */
+fw_error:
+ if (!found_match) {
+ negop->icframe_vercnt = 0;
+ negop->icmsg_vercnt = 0;
+ } else {
+ negop->icframe_vercnt = 1;
+ negop->icmsg_vercnt = 1;
+ }
+
+ if (nego_fw_version)
+ *nego_fw_version = (icframe_major << 16) | icframe_minor;
+
+ if (nego_srv_version)
+ *nego_srv_version = (icmsg_major << 16) | icmsg_minor;
+
+ negop->icversion_data[0].major = icframe_major;
+ negop->icversion_data[0].minor = icframe_minor;
+ negop->icversion_data[1].major = icmsg_major;
+ negop->icversion_data[1].minor = icmsg_minor;
+
+ return found_match;
+}
+
+static void wcstoutf8(char *dest, const __u16 *src, size_t dest_size)
+{
+ size_t len = 0;
+
+ while (len < dest_size) {
+ if (src[len] < 0x80)
+ dest[len++] = (char)(*src++);
+ else
+ dest[len++] = 'X';
+ }
+
+ dest[len] = '\0';
+}
+
+static int hv_fcopy_start(struct hv_start_fcopy *smsg_in)
+{
+ setlocale(LC_ALL, "en_US.utf8");
+ size_t file_size, path_size;
+ char *file_name, *path_name;
+ char *in_file_name = (char *)smsg_in->file_name;
+ char *in_path_name = (char *)smsg_in->path_name;
+
+ file_size = wcstombs(NULL, (const wchar_t *restrict)in_file_name, 0) + 1;
+ path_size = wcstombs(NULL, (const wchar_t *restrict)in_path_name, 0) + 1;
+
+ file_name = (char *)malloc(file_size * sizeof(char));
+ path_name = (char *)malloc(path_size * sizeof(char));
+
+ wcstoutf8(file_name, (__u16 *)in_file_name, file_size);
+ wcstoutf8(path_name, (__u16 *)in_path_name, path_size);
+
+ return hv_fcopy_create_file(file_name, path_name, smsg_in->copy_flags);
+}
+
+static int hv_fcopy_send_data(struct hv_fcopy_hdr *fcopy_msg, int recvlen)
+{
+ int operation = fcopy_msg->operation;
+
+ /*
+ * The strings sent from the host are encoded in
+ * utf16; convert it to utf8 strings.
+ * The host assures us that the utf16 strings will not exceed
+ * the max lengths specified. We will however, reserve room
+ * for the string terminating character - in the utf16s_utf8s()
+ * function we limit the size of the buffer where the converted
+ * string is placed to W_MAX_PATH -1 to guarantee
+ * that the strings can be properly terminated!
+ */
+
+ switch (operation) {
+ case START_FILE_COPY:
+ return hv_fcopy_start((struct hv_start_fcopy *)fcopy_msg);
+ case WRITE_TO_FILE:
+ return hv_copy_data((struct hv_do_fcopy *)fcopy_msg);
+ case COMPLETE_FCOPY:
+ return hv_copy_finished();
+ }
+
+ return HV_E_FAIL;
+}
+
+/* process the packet recv from host */
+static int fcopy_pkt_process(struct vmbus_br *txbr)
+{
+ int ret, offset, pktlen;
+ int fcopy_srv_version;
+ const struct vmbus_chanpkt_hdr *pkt;
+ struct hv_fcopy_hdr *fcopy_msg;
+ struct icmsg_hdr *icmsghdr;
+
+ pkt = (const struct vmbus_chanpkt_hdr *)desc;
+ offset = pkt->hlen << 3;
+ pktlen = (pkt->tlen << 3) - offset;
+ icmsghdr = (struct icmsg_hdr *)&desc[offset + sizeof(struct vmbuspipe_hdr)];
+ icmsghdr->status = HV_E_FAIL;
+
+ if (icmsghdr->icmsgtype == ICMSGTYPE_NEGOTIATE) {
+ if (vmbus_prep_negotiate_resp(icmsghdr, desc + offset, pktlen, fw_versions,
+ FW_VER_COUNT, fcopy_versions, FCOPY_VER_COUNT,
+ NULL, &fcopy_srv_version)) {
+ syslog(LOG_INFO, "FCopy IC version %d.%d",
+ fcopy_srv_version >> 16, fcopy_srv_version & 0xFFFF);
+ icmsghdr->status = 0;
+ }
+ } else if (icmsghdr->icmsgtype == ICMSGTYPE_FCOPY) {
+ /* Ensure recvlen is big enough to contain hv_fcopy_hdr */
+ if (pktlen < ICMSG_HDR + sizeof(struct hv_fcopy_hdr)) {
+ syslog(LOG_ERR, "Invalid Fcopy hdr. Packet length too small: %u",
+ pktlen);
+ return -ENOBUFS;
+ }
+
+ fcopy_msg = (struct hv_fcopy_hdr *)&desc[offset + ICMSG_HDR];
+ icmsghdr->status = hv_fcopy_send_data(fcopy_msg, pktlen);
+ }
+
+ icmsghdr->icflags = ICMSGHDRFLAG_TRANSACTION | ICMSGHDRFLAG_RESPONSE;
+ ret = rte_vmbus_chan_send(txbr, 0x6, desc + offset, pktlen, 0);
+ if (ret) {
+ syslog(LOG_ERR, "Write to ringbuffer failed err: %d", ret);
+ return ret;
+ }
+
+ return 0;
+}
+
+static void fcopy_get_first_folder(char *path, char *chan_no)
+{
+ DIR *dir = opendir(path);
+ struct dirent *entry;
+
+ if (!dir) {
+ syslog(LOG_ERR, "Failed to open directory (errno=%s).\n", strerror(errno));
+ return;
+ }
+
+ while ((entry = readdir(dir)) != NULL) {
+ if (entry->d_type == DT_DIR && strcmp(entry->d_name, ".") != 0 &&
+ strcmp(entry->d_name, "..") != 0) {
+ strcpy(chan_no, entry->d_name);
+ break;
+ }
+ }
+
+ closedir(dir);
+}
+
+int main(int argc, char *argv[])
+{
+ int fcopy_fd = -1, tmp = 1;
+ int daemonize = 1, long_index = 0, opt, ret = -EINVAL;
+ struct vmbus_br txbr, rxbr;
+ void *ring;
+ uint32_t len = HV_RING_SIZE;
+ char uio_name[MAX_FOLDER_NAME] = {0};
+ char uio_dev_path[MAX_PATH_LEN] = {0};
+
+ static struct option long_options[] = {
+ {"help", no_argument, 0, 'h' },
+ {"no-daemon", no_argument, 0, 'n' },
+ {0, 0, 0, 0 }
+ };
+
+ while ((opt = getopt_long(argc, argv, "hn", long_options,
+ &long_index)) != -1) {
+ switch (opt) {
+ case 'n':
+ daemonize = 0;
+ break;
+ case 'h':
+ default:
+ print_usage(argv);
+ goto exit;
+ }
+ }
+
+ if (daemonize && daemon(1, 0)) {
+ syslog(LOG_ERR, "daemon() failed; error: %s", strerror(errno));
+ goto exit;
+ }
+
+ openlog("HV_UIO_FCOPY", 0, LOG_USER);
+ syslog(LOG_INFO, "starting; pid is:%d", getpid());
+
+ fcopy_get_first_folder(FCOPY_UIO, uio_name);
+ snprintf(uio_dev_path, sizeof(uio_dev_path), "/dev/%s", uio_name);
+ fcopy_fd = open(uio_dev_path, O_RDWR);
+
+ if (fcopy_fd < 0) {
+ syslog(LOG_ERR, "open %s failed; error: %d %s",
+ uio_dev_path, errno, strerror(errno));
+ ret = fcopy_fd;
+ goto exit;
+ }
+
+ ring = vmbus_uio_map(FCOPY_SYSFS_RINGBUF, HV_RING_SIZE);
+ if (!ring) {
+ ret = errno;
+ syslog(LOG_ERR, "mmap ringbuffer failed; error: %d %s", ret, strerror(ret));
+ goto close;
+ }
+ vmbus_br_setup(&txbr, ring, HV_RING_SIZE);
+ vmbus_br_setup(&rxbr, (char *)ring + HV_RING_SIZE, HV_RING_SIZE);
+
+ while (1) {
+ /*
+ * In this loop we process fcopy messages after the
+ * handshake is complete.
+ */
+ ret = pread(fcopy_fd, &tmp, sizeof(int), 0);
+ if (ret < 0) {
+ syslog(LOG_ERR, "pread failed: %s", strerror(errno));
+ continue;
+ }
+
+ len = HV_RING_SIZE;
+ ret = rte_vmbus_chan_recv_raw(&rxbr, desc, &len);
+ if (unlikely(ret <= 0)) {
+ /* This indicates a failure to communicate (or worse) */
+ syslog(LOG_ERR, "VMBus channel recv error: %d", ret);
+ } else {
+ ret = fcopy_pkt_process(&txbr);
+ if (ret < 0)
+ goto close;
+
+ /* Signal host */
+ if ((write(fcopy_fd, &tmp, sizeof(int))) != sizeof(int)) {
+ ret = errno;
+ syslog(LOG_ERR, "Registration failed: %s\n", strerror(ret));
+ goto close;
+ }
+ }
+ }
+close:
+ close(fcopy_fd);
+exit:
+ return ret;
+}
--
2.34.1
^ permalink raw reply related
* [PATCH v2 5/5] Drivers: hv: Remove fcopy driver
From: Saurabh Sengar @ 2023-06-14 18:15 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, mikelley, gregkh, corbet,
linux-kernel, linux-hyperv, linux-doc
In-Reply-To: <1686766512-2589-1-git-send-email-ssengar@linux.microsoft.com>
As the new fcopy driver using uio is introduced, remove obsolete driver.
Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
---
drivers/hv/Makefile | 2 +-
drivers/hv/hv_fcopy.c | 427 ------------------------------------------
drivers/hv/hv_util.c | 12 --
3 files changed, 1 insertion(+), 440 deletions(-)
delete mode 100644 drivers/hv/hv_fcopy.c
diff --git a/drivers/hv/Makefile b/drivers/hv/Makefile
index d76df5c8c2a9..b992c0ed182b 100644
--- a/drivers/hv/Makefile
+++ b/drivers/hv/Makefile
@@ -10,7 +10,7 @@ hv_vmbus-y := vmbus_drv.o \
hv.o connection.o channel.o \
channel_mgmt.o ring_buffer.o hv_trace.o
hv_vmbus-$(CONFIG_HYPERV_TESTING) += hv_debugfs.o
-hv_utils-y := hv_util.o hv_kvp.o hv_snapshot.o hv_fcopy.o hv_utils_transport.o
+hv_utils-y := hv_util.o hv_kvp.o hv_snapshot.o hv_utils_transport.o
# Code that must be built-in
obj-$(subst m,y,$(CONFIG_HYPERV)) += hv_common.o
diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c
deleted file mode 100644
index 922d83eb7ddf..000000000000
--- a/drivers/hv/hv_fcopy.c
+++ /dev/null
@@ -1,427 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0-only
-/*
- * An implementation of file copy service.
- *
- * Copyright (C) 2014, Microsoft, Inc.
- *
- * Author : K. Y. Srinivasan <ksrinivasan@novell.com>
- */
-
-#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
-
-#include <linux/nls.h>
-#include <linux/workqueue.h>
-#include <linux/hyperv.h>
-#include <linux/sched.h>
-#include <asm/hyperv-tlfs.h>
-
-#include "hyperv_vmbus.h"
-#include "hv_utils_transport.h"
-
-#define WIN8_SRV_MAJOR 1
-#define WIN8_SRV_MINOR 1
-#define WIN8_SRV_VERSION (WIN8_SRV_MAJOR << 16 | WIN8_SRV_MINOR)
-
-#define FCOPY_VER_COUNT 1
-static const int fcopy_versions[] = {
- WIN8_SRV_VERSION
-};
-
-#define FW_VER_COUNT 1
-static const int fw_versions[] = {
- UTIL_FW_VERSION
-};
-
-/*
- * Global state maintained for transaction that is being processed.
- * For a class of integration services, including the "file copy service",
- * the specified protocol is a "request/response" protocol which means that
- * there can only be single outstanding transaction from the host at any
- * given point in time. We use this to simplify memory management in this
- * driver - we cache and process only one message at a time.
- *
- * While the request/response protocol is guaranteed by the host, we further
- * ensure this by serializing packet processing in this driver - we do not
- * read additional packets from the VMBUs until the current packet is fully
- * handled.
- */
-
-static struct {
- int state; /* hvutil_device_state */
- int recv_len; /* number of bytes received. */
- struct hv_fcopy_hdr *fcopy_msg; /* current message */
- struct vmbus_channel *recv_channel; /* chn we got the request */
- u64 recv_req_id; /* request ID. */
-} fcopy_transaction;
-
-static void fcopy_respond_to_host(int error);
-static void fcopy_send_data(struct work_struct *dummy);
-static void fcopy_timeout_func(struct work_struct *dummy);
-static DECLARE_DELAYED_WORK(fcopy_timeout_work, fcopy_timeout_func);
-static DECLARE_WORK(fcopy_send_work, fcopy_send_data);
-static const char fcopy_devname[] = "vmbus/hv_fcopy";
-static u8 *recv_buffer;
-static struct hvutil_transport *hvt;
-/*
- * This state maintains the version number registered by the daemon.
- */
-static int dm_reg_value;
-
-static void fcopy_poll_wrapper(void *channel)
-{
- /* Transaction is finished, reset the state here to avoid races. */
- fcopy_transaction.state = HVUTIL_READY;
- tasklet_schedule(&((struct vmbus_channel *)channel)->callback_event);
-}
-
-static void fcopy_timeout_func(struct work_struct *dummy)
-{
- /*
- * If the timer fires, the user-mode component has not responded;
- * process the pending transaction.
- */
- fcopy_respond_to_host(HV_E_FAIL);
- hv_poll_channel(fcopy_transaction.recv_channel, fcopy_poll_wrapper);
-}
-
-static void fcopy_register_done(void)
-{
- pr_debug("FCP: userspace daemon registered\n");
- hv_poll_channel(fcopy_transaction.recv_channel, fcopy_poll_wrapper);
-}
-
-static int fcopy_handle_handshake(u32 version)
-{
- u32 our_ver = FCOPY_CURRENT_VERSION;
-
- switch (version) {
- case FCOPY_VERSION_0:
- /* Daemon doesn't expect us to reply */
- dm_reg_value = version;
- break;
- case FCOPY_VERSION_1:
- /* Daemon expects us to reply with our own version */
- if (hvutil_transport_send(hvt, &our_ver, sizeof(our_ver),
- fcopy_register_done))
- return -EFAULT;
- dm_reg_value = version;
- break;
- default:
- /*
- * For now we will fail the registration.
- * If and when we have multiple versions to
- * deal with, we will be backward compatible.
- * We will add this code when needed.
- */
- return -EINVAL;
- }
- pr_debug("FCP: userspace daemon ver. %d connected\n", version);
- return 0;
-}
-
-static void fcopy_send_data(struct work_struct *dummy)
-{
- struct hv_start_fcopy *smsg_out = NULL;
- int operation = fcopy_transaction.fcopy_msg->operation;
- struct hv_start_fcopy *smsg_in;
- void *out_src;
- int rc, out_len;
-
- /*
- * The strings sent from the host are encoded in
- * utf16; convert it to utf8 strings.
- * The host assures us that the utf16 strings will not exceed
- * the max lengths specified. We will however, reserve room
- * for the string terminating character - in the utf16s_utf8s()
- * function we limit the size of the buffer where the converted
- * string is placed to W_MAX_PATH -1 to guarantee
- * that the strings can be properly terminated!
- */
-
- switch (operation) {
- case START_FILE_COPY:
- out_len = sizeof(struct hv_start_fcopy);
- smsg_out = kzalloc(sizeof(*smsg_out), GFP_KERNEL);
- if (!smsg_out)
- return;
-
- smsg_out->hdr.operation = operation;
- smsg_in = (struct hv_start_fcopy *)fcopy_transaction.fcopy_msg;
-
- utf16s_to_utf8s((wchar_t *)smsg_in->file_name, W_MAX_PATH,
- UTF16_LITTLE_ENDIAN,
- (__u8 *)&smsg_out->file_name, W_MAX_PATH - 1);
-
- utf16s_to_utf8s((wchar_t *)smsg_in->path_name, W_MAX_PATH,
- UTF16_LITTLE_ENDIAN,
- (__u8 *)&smsg_out->path_name, W_MAX_PATH - 1);
-
- smsg_out->copy_flags = smsg_in->copy_flags;
- smsg_out->file_size = smsg_in->file_size;
- out_src = smsg_out;
- break;
-
- case WRITE_TO_FILE:
- out_src = fcopy_transaction.fcopy_msg;
- out_len = sizeof(struct hv_do_fcopy);
- break;
- default:
- out_src = fcopy_transaction.fcopy_msg;
- out_len = fcopy_transaction.recv_len;
- break;
- }
-
- fcopy_transaction.state = HVUTIL_USERSPACE_REQ;
- rc = hvutil_transport_send(hvt, out_src, out_len, NULL);
- if (rc) {
- pr_debug("FCP: failed to communicate to the daemon: %d\n", rc);
- if (cancel_delayed_work_sync(&fcopy_timeout_work)) {
- fcopy_respond_to_host(HV_E_FAIL);
- fcopy_transaction.state = HVUTIL_READY;
- }
- }
- kfree(smsg_out);
-}
-
-/*
- * Send a response back to the host.
- */
-
-static void
-fcopy_respond_to_host(int error)
-{
- struct icmsg_hdr *icmsghdr;
- u32 buf_len;
- struct vmbus_channel *channel;
- u64 req_id;
-
- /*
- * Copy the global state for completing the transaction. Note that
- * only one transaction can be active at a time. This is guaranteed
- * by the file copy protocol implemented by the host. Furthermore,
- * the "transaction active" state we maintain ensures that there can
- * only be one active transaction at a time.
- */
-
- buf_len = fcopy_transaction.recv_len;
- channel = fcopy_transaction.recv_channel;
- req_id = fcopy_transaction.recv_req_id;
-
- icmsghdr = (struct icmsg_hdr *)
- &recv_buffer[sizeof(struct vmbuspipe_hdr)];
-
- if (channel->onchannel_callback == NULL)
- /*
- * We have raced with util driver being unloaded;
- * silently return.
- */
- return;
-
- icmsghdr->status = error;
- icmsghdr->icflags = ICMSGHDRFLAG_TRANSACTION | ICMSGHDRFLAG_RESPONSE;
- vmbus_sendpacket(channel, recv_buffer, buf_len, req_id,
- VM_PKT_DATA_INBAND, 0);
-}
-
-void hv_fcopy_onchannelcallback(void *context)
-{
- struct vmbus_channel *channel = context;
- u32 recvlen;
- u64 requestid;
- struct hv_fcopy_hdr *fcopy_msg;
- struct icmsg_hdr *icmsghdr;
- int fcopy_srv_version;
-
- if (fcopy_transaction.state > HVUTIL_READY)
- return;
-
- if (vmbus_recvpacket(channel, recv_buffer, HV_HYP_PAGE_SIZE * 2, &recvlen, &requestid)) {
- pr_err_ratelimited("Fcopy request received. Could not read into recv buf\n");
- return;
- }
-
- if (!recvlen)
- return;
-
- /* Ensure recvlen is big enough to read header data */
- if (recvlen < ICMSG_HDR) {
- pr_err_ratelimited("Fcopy request received. Packet length too small: %d\n",
- recvlen);
- return;
- }
-
- icmsghdr = (struct icmsg_hdr *)&recv_buffer[
- sizeof(struct vmbuspipe_hdr)];
-
- if (icmsghdr->icmsgtype == ICMSGTYPE_NEGOTIATE) {
- if (vmbus_prep_negotiate_resp(icmsghdr,
- recv_buffer, recvlen,
- fw_versions, FW_VER_COUNT,
- fcopy_versions, FCOPY_VER_COUNT,
- NULL, &fcopy_srv_version)) {
-
- pr_info("FCopy IC version %d.%d\n",
- fcopy_srv_version >> 16,
- fcopy_srv_version & 0xFFFF);
- }
- } else if (icmsghdr->icmsgtype == ICMSGTYPE_FCOPY) {
- /* Ensure recvlen is big enough to contain hv_fcopy_hdr */
- if (recvlen < ICMSG_HDR + sizeof(struct hv_fcopy_hdr)) {
- pr_err_ratelimited("Invalid Fcopy hdr. Packet length too small: %u\n",
- recvlen);
- return;
- }
- fcopy_msg = (struct hv_fcopy_hdr *)&recv_buffer[ICMSG_HDR];
-
- /*
- * Stash away this global state for completing the
- * transaction; note transactions are serialized.
- */
-
- fcopy_transaction.recv_len = recvlen;
- fcopy_transaction.recv_req_id = requestid;
- fcopy_transaction.fcopy_msg = fcopy_msg;
-
- if (fcopy_transaction.state < HVUTIL_READY) {
- /* Userspace is not registered yet */
- fcopy_respond_to_host(HV_E_FAIL);
- return;
- }
- fcopy_transaction.state = HVUTIL_HOSTMSG_RECEIVED;
-
- /*
- * Send the information to the user-level daemon.
- */
- schedule_work(&fcopy_send_work);
- schedule_delayed_work(&fcopy_timeout_work,
- HV_UTIL_TIMEOUT * HZ);
- return;
- } else {
- pr_err_ratelimited("Fcopy request received. Invalid msg type: %d\n",
- icmsghdr->icmsgtype);
- return;
- }
- icmsghdr->icflags = ICMSGHDRFLAG_TRANSACTION | ICMSGHDRFLAG_RESPONSE;
- vmbus_sendpacket(channel, recv_buffer, recvlen, requestid,
- VM_PKT_DATA_INBAND, 0);
-}
-
-/* Callback when data is received from userspace */
-static int fcopy_on_msg(void *msg, int len)
-{
- int *val = (int *)msg;
-
- if (len != sizeof(int))
- return -EINVAL;
-
- if (fcopy_transaction.state == HVUTIL_DEVICE_INIT)
- return fcopy_handle_handshake(*val);
-
- if (fcopy_transaction.state != HVUTIL_USERSPACE_REQ)
- return -EINVAL;
-
- /*
- * Complete the transaction by forwarding the result
- * to the host. But first, cancel the timeout.
- */
- if (cancel_delayed_work_sync(&fcopy_timeout_work)) {
- fcopy_transaction.state = HVUTIL_USERSPACE_RECV;
- fcopy_respond_to_host(*val);
- hv_poll_channel(fcopy_transaction.recv_channel,
- fcopy_poll_wrapper);
- }
-
- return 0;
-}
-
-static void fcopy_on_reset(void)
-{
- /*
- * The daemon has exited; reset the state.
- */
- fcopy_transaction.state = HVUTIL_DEVICE_INIT;
-
- if (cancel_delayed_work_sync(&fcopy_timeout_work))
- fcopy_respond_to_host(HV_E_FAIL);
-}
-
-int hv_fcopy_init(struct hv_util_service *srv)
-{
- recv_buffer = srv->recv_buffer;
- fcopy_transaction.recv_channel = srv->channel;
- fcopy_transaction.recv_channel->max_pkt_size = HV_HYP_PAGE_SIZE * 2;
-
- /*
- * When this driver loads, the user level daemon that
- * processes the host requests may not yet be running.
- * Defer processing channel callbacks until the daemon
- * has registered.
- */
- fcopy_transaction.state = HVUTIL_DEVICE_INIT;
-
- hvt = hvutil_transport_init(fcopy_devname, 0, 0,
- fcopy_on_msg, fcopy_on_reset);
- if (!hvt)
- return -EFAULT;
-
- return 0;
-}
-
-static void hv_fcopy_cancel_work(void)
-{
- cancel_delayed_work_sync(&fcopy_timeout_work);
- cancel_work_sync(&fcopy_send_work);
-}
-
-int hv_fcopy_pre_suspend(void)
-{
- struct vmbus_channel *channel = fcopy_transaction.recv_channel;
- struct hv_fcopy_hdr *fcopy_msg;
-
- /*
- * Fake a CANCEL_FCOPY message for the user space daemon in case the
- * daemon is in the middle of copying some file. It doesn't matter if
- * there is already a message pending to be delivered to the user
- * space since we force fcopy_transaction.state to be HVUTIL_READY, so
- * the user space daemon's write() will fail with EINVAL (see
- * fcopy_on_msg()), and the daemon will reset the device by closing
- * and re-opening it.
- */
- fcopy_msg = kzalloc(sizeof(*fcopy_msg), GFP_KERNEL);
- if (!fcopy_msg)
- return -ENOMEM;
-
- tasklet_disable(&channel->callback_event);
-
- fcopy_msg->operation = CANCEL_FCOPY;
-
- hv_fcopy_cancel_work();
-
- /* We don't care about the return value. */
- hvutil_transport_send(hvt, fcopy_msg, sizeof(*fcopy_msg), NULL);
-
- kfree(fcopy_msg);
-
- fcopy_transaction.state = HVUTIL_READY;
-
- /* tasklet_enable() will be called in hv_fcopy_pre_resume(). */
- return 0;
-}
-
-int hv_fcopy_pre_resume(void)
-{
- struct vmbus_channel *channel = fcopy_transaction.recv_channel;
-
- tasklet_enable(&channel->callback_event);
-
- return 0;
-}
-
-void hv_fcopy_deinit(void)
-{
- fcopy_transaction.state = HVUTIL_DEVICE_DYING;
-
- hv_fcopy_cancel_work();
-
- hvutil_transport_destroy(hvt);
-}
diff --git a/drivers/hv/hv_util.c b/drivers/hv/hv_util.c
index 42aec2c5606a..4bf16e58ee50 100644
--- a/drivers/hv/hv_util.c
+++ b/drivers/hv/hv_util.c
@@ -154,14 +154,6 @@ static struct hv_util_service util_vss = {
.util_deinit = hv_vss_deinit,
};
-static struct hv_util_service util_fcopy = {
- .util_cb = hv_fcopy_onchannelcallback,
- .util_init = hv_fcopy_init,
- .util_pre_suspend = hv_fcopy_pre_suspend,
- .util_pre_resume = hv_fcopy_pre_resume,
- .util_deinit = hv_fcopy_deinit,
-};
-
static void perform_shutdown(struct work_struct *dummy)
{
orderly_poweroff(true);
@@ -671,10 +663,6 @@ static const struct hv_vmbus_device_id id_table[] = {
{ HV_VSS_GUID,
.driver_data = (unsigned long)&util_vss
},
- /* File copy GUID */
- { HV_FCOPY_GUID,
- .driver_data = (unsigned long)&util_fcopy
- },
{ },
};
--
2.34.1
^ permalink raw reply related
* [PATCH v2 0/5] UIO driver for low speed Hyper-V devices
From: Saurabh Sengar @ 2023-06-14 18:15 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, mikelley, gregkh, corbet,
linux-kernel, linux-hyperv, linux-doc
Hyper-V is adding some low speed "specialty" synthetic devices.
This patch series propose the solution to support these devices.
Instead of writing new kernel-level VMBus drivers for all of
these devices, we propose a solution wherein these devices are
made accessible to user space through a dedicated UIO-based
hv_vmbus_client driver, allowing for efficient device handling
via user space drivers. This solution aims to optimize the
development process by eliminating the need to create
individual kernel-level VMBus drivers for each device and
provide flexibility to user space applications to control the
ring buffer independently.
Since all these new synthetic devices are low speed devices,
they don't support monitor bits and we must use vmbus_setevent()
to enable interrupts from the host. The new uio driver supports
all these requirements effectively. Additionally, this new driver
also provide the support for having smaller/cutom ringbuffer
size.
Furthermore, this patch series includes a revision of the fcopy
application to leverage the new interface seamlessly along with
removal of old driver and application. However, please note that
the development of other similar drivers is still a work in
progress, and will be shared as they become available.
[V2]
1. Update driver info in Documentation/driver-api/uio-howto.rst
2. Update ring_size sysfs info in Documentation/ABI/stable/sysfs-bus-vmbus
3. Remove DRIVER_VERSION
4. Remove refcnt
5. scnprintf -> sysfs_emit
6. sysfs_create_file -> ATTRIBUTE_GROUPS + ".driver.groups";
7. sysfs_create_bin_file -> device_create_bin_file
8. dev_notice -> dev_err
9. Removed invalid free of devm_ allocated data
10. Updated application with simpler sysfs path
Saurabh Sengar (5):
uio: Add hv_vmbus_client driver
tools: hv: Add vmbus_bufring
tools: hv: Add new fcopy application based on uio driver
tools: hv: Remove hv_fcopy_daemon
Drivers: hv: Remove fcopy driver
Documentation/ABI/stable/sysfs-bus-vmbus | 7 +
Documentation/driver-api/uio-howto.rst | 46 +++
drivers/hv/Makefile | 2 +-
drivers/hv/hv_fcopy.c | 427 --------------------
drivers/hv/hv_util.c | 12 -
drivers/uio/Kconfig | 12 +
drivers/uio/Makefile | 1 +
drivers/uio/uio_hv_vmbus_client.c | 217 ++++++++++
tools/hv/Build | 3 +-
tools/hv/Makefile | 10 +-
tools/hv/hv_fcopy_daemon.c | 266 ------------
tools/hv/hv_fcopy_uio_daemon.c | 489 +++++++++++++++++++++++
tools/hv/vmbus_bufring.c | 322 +++++++++++++++
tools/hv/vmbus_bufring.h | 158 ++++++++
14 files changed, 1260 insertions(+), 712 deletions(-)
delete mode 100644 drivers/hv/hv_fcopy.c
create mode 100644 drivers/uio/uio_hv_vmbus_client.c
delete mode 100644 tools/hv/hv_fcopy_daemon.c
create mode 100644 tools/hv/hv_fcopy_uio_daemon.c
create mode 100644 tools/hv/vmbus_bufring.c
create mode 100644 tools/hv/vmbus_bufring.h
--
2.34.1
^ permalink raw reply
* Re: [PATCH v2 1/5] uio: Add hv_vmbus_client driver
From: Greg KH @ 2023-06-14 21:13 UTC (permalink / raw)
To: Saurabh Sengar
Cc: kys, haiyangz, wei.liu, decui, mikelley, corbet, linux-kernel,
linux-hyperv, linux-doc
In-Reply-To: <1686766512-2589-2-git-send-email-ssengar@linux.microsoft.com>
On Wed, Jun 14, 2023 at 11:15:08AM -0700, Saurabh Sengar wrote:
> --- a/Documentation/ABI/stable/sysfs-bus-vmbus
> +++ b/Documentation/ABI/stable/sysfs-bus-vmbus
> @@ -153,6 +153,13 @@ Contact: Stephen Hemminger <sthemmin@microsoft.com>
> Description: Binary file created by uio_hv_generic for ring buffer
> Users: Userspace drivers
>
> +What: /sys/bus/vmbus/devices/<UUID>/ring_size
> +Date: June. 2023
No need for the "."
> +KernelVersion: 6.4
6.4 will be released without this, sorry.
> +Contact: Saurabh Sengar <ssengar@microsoft.com>
> +Description: File created by uio_hv_vmbus_client for setting device ring buffer size
> +Users: Userspace drivers
> +
> What: /sys/bus/vmbus/devices/<UUID>/channels/<N>/intr_in_full
> Date: February 2019
> KernelVersion: 5.0
> diff --git a/Documentation/driver-api/uio-howto.rst b/Documentation/driver-api/uio-howto.rst
> index 907ffa3b38f5..33b67f876b96 100644
> --- a/Documentation/driver-api/uio-howto.rst
> +++ b/Documentation/driver-api/uio-howto.rst
> @@ -722,6 +722,52 @@ For example::
>
> /sys/bus/vmbus/devices/3811fe4d-0fa0-4b62-981a-74fc1084c757/channels/21/ring
>
> +Generic Hyper-V driver for low speed devices
> +============================================
> +
> +The generic driver is a kernel module named uio_hv_vmbus_client. It
> +supports slow devices on the Hyper-V VMBus similar to uio_hv_generic
> +for faster devices. This driver also gives flexibility of customized
> +ring buffer sizes.
> +
> +Making the driver recognize the device
> +--------------------------------------
> +
> +Since the driver does not declare any device GUID's, it will not get
> +loaded automatically and will not automatically bind to any devices, you
> +must load it and allocate id to the driver yourself. For example, to use
> +the fcopy device class GUID::
> +
> + DEV_UUID=eb765408-105f-49b6-b4aa-c123b64d17d4
> + driverctl -b vmbus set-override $DEV_UUID uio_hv_vmbus_client
Why are you adding a dependancy on a 300 line bash script that is not
used by most distros?
Why not just show the "real" commands that you can use here that don't
require an external tool not controlled by the kernel at all.
> --- /dev/null
> +++ b/drivers/uio/uio_hv_vmbus_client.c
> @@ -0,0 +1,217 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * uio_hv_vmbus_client - UIO driver for low speed VMBus devices
> + *
> + * Copyright (c) 2023, Microsoft Corporation.
> + *
> + * Authors:
> + * Saurabh Sengar <ssengar@microsoft.com>
> + *
> + * Since the driver does not declare any device ids, you must allocate
> + * id and bind the device to the driver yourself. For example:
> + * driverctl -b vmbus set-override <dev uuid> uio_hv_vmbus_client
Again, no need to discuss driverctl.
> + */
> +
> +#include <linux/device.h>
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/uio_driver.h>
> +#include <linux/hyperv.h>
> +
> +#define DRIVER_AUTHOR "Saurabh Sengar <ssengar@microsoft.com>"
> +#define DRIVER_DESC "Generic UIO driver for low speed VMBus devices"
You only use these defines in one place, so why not just spell them out
there, no need for 2 extra lines, right?
> +
> +#define DEFAULT_HV_RING_SIZE VMBUS_RING_SIZE(3 * HV_HYP_PAGE_SIZE)
> +static int ring_size = DEFAULT_HV_RING_SIZE;
You only use that #define in one place, why have it at all?
And you are defining a "global" variable that can be modified by an
individual sysfs file for ANY device bound to this driver, messing with
the other device's ring buffer size, right? This needs to be
per-device, or explain in huge detail here why not.
> +
> +struct uio_hv_vmbus_dev {
> + struct uio_info info;
> + struct hv_device *device;
> +};
> +
> +/* Sysfs API to allow mmap of the ring buffers */
> +static int uio_hv_vmbus_mmap(struct file *filp, struct kobject *kobj,
> + struct bin_attribute *attr, struct vm_area_struct *vma)
> +{
> + struct device *dev = container_of(kobj, struct device, kobj);
> + struct hv_device *hv_dev = container_of(dev, struct hv_device, device);
> + struct vmbus_channel *channel = hv_dev->channel;
> + void *ring_buffer = page_address(channel->ringbuffer_page);
> +
> + return vm_iomap_memory(vma, virt_to_phys(ring_buffer),
> + channel->ringbuffer_pagecount << PAGE_SHIFT);
> +}
> +
> +static const struct bin_attribute ring_buffer_bin_attr = {
> + .attr = {
> + .name = "ringbuffer",
> + .mode = 0600,
> + },
> + .mmap = uio_hv_vmbus_mmap,
> +};
> +
> +/*
> + * This is the irqcontrol callback to be registered to uio_info.
> + * It can be used to disable/enable interrupt from user space processes.
> + *
> + * @param info
> + * pointer to uio_info.
> + * @param irq_state
> + * state value. 1 to enable interrupt, 0 to disable interrupt.
> + */
> +static int uio_hv_vmbus_irqcontrol(struct uio_info *info, s32 irq_state)
> +{
> + struct uio_hv_vmbus_dev *pdata = info->priv;
> + struct hv_device *hv_dev = pdata->device;
> +
> + /* Issue a full memory barrier before triggering the notification */
> + virt_mb();
> +
> + vmbus_setevent(hv_dev->channel);
> + return 0;
> +}
> +
> +/*
> + * Callback from vmbus_event when something is in inbound ring.
> + */
> +static void uio_hv_vmbus_channel_cb(void *context)
> +{
> + struct uio_hv_vmbus_dev *pdata = context;
> +
> + /* Issue a full memory barrier before sending the event to userspace */
> + virt_mb();
> +
> + uio_event_notify(&pdata->info);
> +}
> +
> +static int uio_hv_vmbus_open(struct uio_info *info, struct inode *inode)
> +{
> + struct uio_hv_vmbus_dev *pdata = container_of(info, struct uio_hv_vmbus_dev, info);
> + struct hv_device *hv_dev = pdata->device;
> + struct vmbus_channel *channel = hv_dev->channel;
> + int ret;
> +
> + ret = vmbus_open(channel, ring_size, ring_size, NULL, 0,
> + uio_hv_vmbus_channel_cb, pdata);
> + if (ret) {
> + dev_err(&hv_dev->device, "error %d when opening the channel\n", ret);
> + return ret;
> + }
> + channel->inbound.ring_buffer->interrupt_mask = 0;
> + set_channel_read_mode(channel, HV_CALL_ISR);
> +
> + ret = device_create_bin_file(&hv_dev->device, &ring_buffer_bin_attr);
> + if (ret)
> + dev_err(&hv_dev->device, "sysfs create ring bin file failed; %d\n", ret);
> +
> + return ret;
> +}
> +
> +/* VMbus primary channel is closed on last close */
> +static int uio_hv_vmbus_release(struct uio_info *info, struct inode *inode)
> +{
> + struct uio_hv_vmbus_dev *pdata = container_of(info, struct uio_hv_vmbus_dev, info);
> + struct hv_device *hv_dev = pdata->device;
> + struct vmbus_channel *channel = hv_dev->channel;
> +
> + device_remove_bin_file(&hv_dev->device, &ring_buffer_bin_attr);
> + vmbus_close(channel);
> +
> + return 0;
> +}
> +
> +static ssize_t ring_size_show(struct device *dev, struct device_attribute *attr,
> + char *buf)
> +{
> + return sysfs_emit(buf, "%d\n", ring_size);
> +}
> +
> +static ssize_t ring_size_store(struct device *dev, struct device_attribute *attr,
> + const char *buf, size_t count)
> +{
> + unsigned int val;
> +
> + if (kstrtouint(buf, 0, &val) < 0)
> + return -EINVAL;
> +
> + if (val < HV_HYP_PAGE_SIZE)
> + return -EINVAL;
> +
> + ring_size = val;
> +
> + return count;
> +}
> +
> +static DEVICE_ATTR_RW(ring_size);
> +
> +static struct attribute *uio_hv_vmbus_client_attrs[] = {
> + &dev_attr_ring_size.attr,
> + NULL,
> +};
> +ATTRIBUTE_GROUPS(uio_hv_vmbus_client);
> +
> +static int uio_hv_vmbus_probe(struct hv_device *dev, const struct hv_vmbus_device_id *dev_id)
> +{
> + struct uio_hv_vmbus_dev *pdata;
> + int ret;
> + char *name = NULL;
> +
> + pdata = devm_kzalloc(&dev->device, sizeof(*pdata), GFP_KERNEL);
> + if (!pdata)
> + return -ENOMEM;
> +
> + name = kasprintf(GFP_KERNEL, "%pUl", &dev->dev_instance);
> +
> + /* Fill general uio info */
> + pdata->info.name = name; /* /sys/class/uio/uioX/name */
> + pdata->info.version = "1";
> + pdata->info.irqcontrol = uio_hv_vmbus_irqcontrol;
> + pdata->info.open = uio_hv_vmbus_open;
> + pdata->info.release = uio_hv_vmbus_release;
> + pdata->info.irq = UIO_IRQ_CUSTOM;
> + pdata->info.priv = pdata;
> + pdata->device = dev;
> +
> + ret = uio_register_device(&dev->device, &pdata->info);
> + if (ret) {
> + dev_err(&dev->device, "uio_hv_vmbus register failed\n");
> + return ret;
> + }
> +
> + hv_set_drvdata(dev, pdata);
> +
> + return 0;
> +}
> +
> +static void uio_hv_vmbus_remove(struct hv_device *dev)
> +{
> + struct uio_hv_vmbus_dev *pdata = hv_get_drvdata(dev);
> +
> + if (pdata)
> + uio_unregister_device(&pdata->info);
> +}
> +
> +static struct hv_driver uio_hv_vmbus_drv = {
> + .driver.dev_groups = uio_hv_vmbus_client_groups,
> + .name = "uio_hv_vmbus_client",
> + .id_table = NULL, /* only dynamic id's */
No need to set this if it's NULL.
thanks,
greg k-h
^ permalink raw reply
* Re: [PATCH v2 2/5] tools: hv: Add vmbus_bufring
From: Greg KH @ 2023-06-14 21:16 UTC (permalink / raw)
To: Saurabh Sengar
Cc: kys, haiyangz, wei.liu, decui, mikelley, corbet, linux-kernel,
linux-hyperv, linux-doc
In-Reply-To: <1686766512-2589-3-git-send-email-ssengar@linux.microsoft.com>
On Wed, Jun 14, 2023 at 11:15:09AM -0700, Saurabh Sengar wrote:
> Common userspace interface for read/write from VMBus ringbuffer.
> This implementation is open for use by any userspace driver or
> application seeking direct control over VMBus ring buffers.
> A significant part of this code is borrowed from DPDK.
" "?
Anyway, this does not explain what this is at all.
And if you "borrowed" it from DPDK, that feels odd, are you sure you are
allowed to do so?
> Link: https://github.com/DPDK/dpdk/
Not what a Link: tag is for, sorry.
>
> Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
> ---
> [V2]
> - simpler sysfs path, less parsing
>
> tools/hv/vmbus_bufring.c | 322 +++++++++++++++++++++++++++++++++++++++
> tools/hv/vmbus_bufring.h | 158 +++++++++++++++++++
> 2 files changed, 480 insertions(+)
> create mode 100644 tools/hv/vmbus_bufring.c
> create mode 100644 tools/hv/vmbus_bufring.h
You add new files to the tools directory, yet say nothing about how to
use them or even how to build them.
Why is there a .h file for a single .c file? That seems pointless,
right?
> diff --git a/tools/hv/vmbus_bufring.c b/tools/hv/vmbus_bufring.c
> new file mode 100644
> index 000000000000..d44a06d45b03
> --- /dev/null
> +++ b/tools/hv/vmbus_bufring.c
> @@ -0,0 +1,322 @@
> +// SPDX-License-Identifier: BSD-3-Clause
> +/*
> + * Copyright (c) 2009-2012,2016,2023 Microsoft Corp.
> + * Copyright (c) 2012 NetApp Inc.
> + * Copyright (c) 2012 Citrix Inc.
> + * All rights reserved.
No copyright for the work you did?
> + */
> +
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <emmintrin.h>
> +#include <linux/limits.h>
> +#include <stdbool.h>
> +#include <stdint.h>
> +#include <stdio.h>
> +#include <string.h>
> +#include <sys/mman.h>
> +#include <sys/uio.h>
> +#include <unistd.h>
> +#include "vmbus_bufring.h"
> +
> +#define rte_compiler_barrier() ({ asm volatile ("" : : : "memory"); })
> +
> +#define rte_smp_rwmb() ({ asm volatile ("" : : : "memory"); })
These aren't in any common header file already?
thanks,
greg k-h
^ permalink raw reply
* Re: [PATCH v2 3/5] tools: hv: Add new fcopy application based on uio driver
From: Greg KH @ 2023-06-14 21:17 UTC (permalink / raw)
To: Saurabh Sengar
Cc: kys, haiyangz, wei.liu, decui, mikelley, corbet, linux-kernel,
linux-hyperv, linux-doc
In-Reply-To: <1686766512-2589-4-git-send-email-ssengar@linux.microsoft.com>
On Wed, Jun 14, 2023 at 11:15:10AM -0700, Saurabh Sengar wrote:
> New fcopy application which utilizes uio_hv_vmbus_client driver
Please read the kernel documentation for how to write a good kernel
changelog.
THen get another microsoft kernel developer to review this patchset
before you send it out again, it needs some basic rework for simple
things like this.
I've stopped reviewing this series here.
thanks,
greg k-h
^ permalink raw reply
* Re: [PATCH 1/1] scsi: storvsc: Always set no_report_opcodes
From: Martin K. Petersen @ 2023-06-15 2:15 UTC (permalink / raw)
To: kys, longli, wei.liu, decui, jejb, linux-hyperv, linux-kernel,
linux-scsi, Michael Kelley
Cc: Martin K . Petersen
In-Reply-To: <1686343101-18930-1-git-send-email-mikelley@microsoft.com>
On Fri, 09 Jun 2023 13:38:21 -0700, Michael Kelley wrote:
> Hyper-V synthetic SCSI devices do not support the MAINTENANCE_IN SCSI
> command, so scsi_report_opcode() always fails, resulting in messages
> like this:
>
> hv_storvsc <guid>: tag#205 cmd 0xa3 status: scsi 0x2 srb 0x86 hv 0xc0000001
>
> The recently added support for command duration limits calls
> scsi_report_opcode() four times as each device comes online, which
> significantly increases the number of messages logged in a system with
> many disks.
>
> [...]
Applied to 6.4/scsi-fixes, thanks!
[1/1] scsi: storvsc: Always set no_report_opcodes
https://git.kernel.org/mkp/scsi/c/31d16e712bdc
--
Martin K. Petersen Oracle Linux Engineering
^ permalink raw reply
* RE: [PATCH v3 1/6] PCI: hv: Fix a race condition bug in hv_pci_query_relations()
From: Dexuan Cui @ 2023-06-15 3:55 UTC (permalink / raw)
To: Lorenzo Pieralisi
Cc: bhelgaas@google.com, davem@davemloft.net, edumazet@google.com,
Haiyang Zhang, Jake Oshins, kuba@kernel.org, kw@linux.com,
KY Srinivasan, leon@kernel.org, linux-pci@vger.kernel.org,
Michael Kelley (LINUX), pabeni@redhat.com, robh@kernel.org,
saeedm@nvidia.com, wei.liu@kernel.org, Long Li,
boqun.feng@gmail.com, Saurabh Singh Sengar, helgaas@kernel.org,
linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-rdma@vger.kernel.org, netdev@vger.kernel.org,
Jose Teuttli Carranco, stable@vger.kernel.org
In-Reply-To: <ZG8YzuK/5+8iE8He@lpieralisi>
> From: Lorenzo Pieralisi <lpieralisi@kernel.org>
> Sent: Thursday, May 25, 2023 1:14 AM
> To: Dexuan Cui <decui@microsoft.com>
>
> On Wed, Apr 19, 2023 at 07:40:32PM -0700, Dexuan Cui wrote:
> > Fix the longstanding race between hv_pci_query_relations() and
> > survey_child_resources() by flushing the workqueue before we exit from
> > hv_pci_query_relations().
>
> "Fix the longstanding race" is vague. Please describe the race
> succinctly at least to give an idea of what the problem is.
Hi Lozenzo, I'm sorry for the late response -- finally I'm back on the patchset...
I'll use the below commit message in v4:
Since day 1 of the driver, there has been a race between
hv_pci_query_relations() and survey_child_resources(): during fast
device hotplug, hv_pci_query_relations() may error out due to
device-remove and the stack variable 'comp' is no longer valid;
however, pci_devices_present_work() -> survey_child_resources() ->
complete() may be running on another CPU and accessing the no-longer-valid
'comp'. Fix the race by flushing the workqueue before we exit from
hv_pci_query_relations().
I'll also update the comment to share more details of the race:
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -3401,6 +3401,24 @@ static int hv_pci_query_relations(struct hv_device *hdev)
if (!ret)
ret = wait_for_response(hdev, &comp);
+ /*
+ * In the case of fast device addition/removal, it's possible that
+ * vmbus_sendpacket() or wait_for_response() returns -ENODEV but we
+ * already got a PCI_BUS_RELATIONS* message from the host and the
+ * channel callback already scheduled a work to hbus->wq, which can be
+ * running pci_devices_present_work() -> survey_child_resources() ->
+ * complete(&hbus->survey_event), even after hv_pci_query_relations()
+ * exits and the stack variable 'comp' is no longer valid; as a result,
+ * a hang or a page fault may happen when the complete() calls
+ * raw_spin_lock_irqsave(). Flush hbus->wq before we exit from
+ * hv_pci_query_relations() to avoid the issues. Note: if 'ret' is
+ * -ENODEV, there can't be any more work item scheduled to hbus->wq
+ * after the flush_workqueue(): see vmbus_onoffer_rescind() ->
+ * vmbus_reset_channel_cb(), vmbus_rescind_cleanup() ->
+ * channel->rescind = true.
+ */
+ flush_workqueue(hbus->wq);
+
return ret;
}
> > + * In the case of fast device addition/removal, it's possible that
> > + * vmbus_sendpacket() or wait_for_response() returns -ENODEV but we
> > + * already got a PCI_BUS_RELATIONS* message from the host and the
> > + * channel callback already scheduled a work to hbus->wq, which can
> > be
> > + * running survey_child_resources() -> complete(&hbus->survey_event),
> > + * even after hv_pci_query_relations() exits and the stack variable
> > + * 'comp' is no longer valid. This can cause a strange hang issue
>
> "A strange hang" sounds like we don't understand what's happening, it
> does not feel like it is a solid understanding of the issue.
>
> I would remove it - given that you already explain that comp is no
> longer valid - that is already a bug that needs fixing.
I share more details in the comment (see the above).
> Acked-by: Lorenzo Pieralisi <lpieralisi@kernel.org>
Thanks!
^ permalink raw reply
* RE: [PATCH v3 2/6] PCI: hv: Fix a race condition in hv_irq_unmask() that can cause panic
From: Dexuan Cui @ 2023-06-15 4:27 UTC (permalink / raw)
To: Lorenzo Pieralisi
Cc: bhelgaas@google.com, davem@davemloft.net, edumazet@google.com,
Haiyang Zhang, Jake Oshins, kuba@kernel.org, kw@linux.com,
KY Srinivasan, leon@kernel.org, linux-pci@vger.kernel.org,
Michael Kelley (LINUX), pabeni@redhat.com, robh@kernel.org,
saeedm@nvidia.com, wei.liu@kernel.org, Long Li,
boqun.feng@gmail.com, Saurabh Singh Sengar, helgaas@kernel.org,
linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-rdma@vger.kernel.org, netdev@vger.kernel.org,
Jose Teuttli Carranco, stable@vger.kernel.org
In-Reply-To: <ZG81WpJBBegbLSbT@lpieralisi>
> From: Lorenzo Pieralisi <lpieralisi@kernel.org>
> Sent: Thursday, May 25, 2023 3:16 AM
> > --- a/drivers/pci/controller/pci-hyperv.c
> > +++ b/drivers/pci/controller/pci-hyperv.c
> > @@ -643,6 +643,11 @@ static void hv_arch_irq_unmask(struct irq_data
> > *data)
> > pbus = pdev->bus;
> > hbus = container_of(pbus->sysdata, struct hv_pcibus_device, sysdata);
> > int_desc = data->chip_data;
> > + if (!int_desc) {
> > + dev_warn(&hbus->hdev->device, "%s() can not unmask irq %u\n",
> > + __func__, data->irq);
> > + return;
> > + }
>
> That's a check that should be there regardless ?
Yes. Normally data->chip_data is set at the end of hv_compose_msi_msg(),
and it should not be NULL. However, in rare circumstances, we might see a
failure in hv_compose_msi_msg(), e.g. the hypervisor/host might return an
error in comp.comp_pkt.completion_status (at least this is possible in theory).
In case we see a failure in hv_compose_msi_msg(), data->chip_data stays
with its default value of NULL; because the return type of
hv_compose_msi_msg() is "void", we can not return an error to the upper
layer; later, when the upper layer calls request_irq() -> ... -> hv_irq_unmask(),
hv_arch_irq_unmask() crashes because data->chip_data is NULL -- with this
check, we're able to error out gracefully, and the user can better understand
what goes wrong.
> > spin_lock_irqsave(&hbus->retarget_msi_interrupt_lock, flags);
> >
> > @@ -1911,12 +1916,6 @@ static void hv_compose_msi_msg(struct
> irq_data *data, struct msi_msg *msg)
> > hv_pci_onchannelcallback(hbus);
> > spin_unlock_irqrestore(&channel->sched_lock, flags);
> >
> > - if (hpdev->state == hv_pcichild_ejecting) {
> > - dev_err_once(&hbus->hdev->device,
> > - "the device is being ejected\n");
> > - goto enable_tasklet;
> > - }
> > -
> > udelay(100);
> > }
>
> I don't understand why this code is in hv_compose_msi_msg() in the first
> place (and why only in that function ?) to me this looks like you are
> adding plasters in the code that can turn out to be problematic while
> ejecting a device, this does not seem robust at all - that's my opinion.
The code was incorrectly added by
de0aa7b2f97d ("PCI: hv: Fix 2 hang issues in hv_compose_msi_msg()")
de0aa7b2f97d says
"
2. If the host is ejecting the VF device before we reach
hv_compose_msi_msg(), in a UP VM, we can hang in hv_compose_msi_msg()
forever, because at this time the host doesn't respond to the
CREATE_INTERRUPT request.
"
de0aa7b2f97d implies that the host doesn't respond to the guest's
CREATE_INTERRUPT request once the guest receives the PCI_EJECT message -- this
is incorrect: after the guest receives the PCI_EJECT message, actually the host
still responds to the guest's request, as long as the guest sends the request within
1 minute AND the guest doesn't send a PCI_EJECTION_COMPLETE message to
the host in hv_eject_device_work(). The real issue is that currently the guest
can send PCI_EJECTION_COMPLETE to the host before the guest finishes the
device-probing/removing handling -- once the guest sends PCI_EJECTION_COMPLETE,
the host unassigns the PCI device from the guest and ignores any request
from the guest.
So here the check "hpdev->state == hv_pcichild_ejecting" is incorrect. We
should remove the check since it can cause a panic (see the commit messsage
for the detailed explanation)
The "premature PCI_EJECTION_COMPLETE" issue is resolved by:
[PATCH v3 5/6] PCI: hv: Add a per-bus mutex state_lock
> Feel free to merge this code, I can't ACK it, sorry.
>
> Lorenzo
Thanks for sharing the thougths!
^ permalink raw reply
* RE: [PATCH v3 3/6] PCI: hv: Remove the useless hv_pcichild_state from struct hv_pci_dev
From: Dexuan Cui @ 2023-06-15 4:36 UTC (permalink / raw)
To: Lorenzo Pieralisi
Cc: bhelgaas@google.com, davem@davemloft.net, edumazet@google.com,
Haiyang Zhang, Jake Oshins, kuba@kernel.org, kw@linux.com,
KY Srinivasan, leon@kernel.org, linux-pci@vger.kernel.org,
Michael Kelley (LINUX), pabeni@redhat.com, robh@kernel.org,
saeedm@nvidia.com, wei.liu@kernel.org, Long Li,
boqun.feng@gmail.com, Saurabh Singh Sengar, helgaas@kernel.org,
linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-rdma@vger.kernel.org, netdev@vger.kernel.org,
Jose Teuttli Carranco, stable@vger.kernel.org
In-Reply-To: <ZG8ZQ1U4kmGBVe4/@lpieralisi>
> From: Lorenzo Pieralisi <lpieralisi@kernel.org>
> Sent: Thursday, May 25, 2023 1:16 AM
> To: Dexuan Cui <decui@microsoft.com>
>
> Is this patch _required_ for subsequent fixes ?
It's not required. IMO it's good to have this patch with patch 2 since
patch 2 also touches hv_pcichild_ejecting. I'm OK if people think
it's better to merge this patch through the "next" branch rather than
the "fixes" branch.
> It is not a fix itself so I am asking.
> Acked-by: Lorenzo Pieralisi <lpieralisi@kernel.org>
Thanks!
^ permalink raw reply
* RE: [PATCH v3 4/6] Revert "PCI: hv: Fix a timing issue which causes kdump to fail occasionally"
From: Dexuan Cui @ 2023-06-15 4:41 UTC (permalink / raw)
To: Lorenzo Pieralisi
Cc: bhelgaas@google.com, davem@davemloft.net, edumazet@google.com,
Haiyang Zhang, Jake Oshins, kuba@kernel.org, kw@linux.com,
KY Srinivasan, leon@kernel.org, linux-pci@vger.kernel.org,
Michael Kelley (LINUX), pabeni@redhat.com, robh@kernel.org,
saeedm@nvidia.com, wei.liu@kernel.org, Long Li,
boqun.feng@gmail.com, Saurabh Singh Sengar, helgaas@kernel.org,
linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-rdma@vger.kernel.org, netdev@vger.kernel.org,
Jose Teuttli Carranco, stable@vger.kernel.org
In-Reply-To: <ZG8axpfFQArZ0Hj/@lpieralisi>
> From: Lorenzo Pieralisi <lpieralisi@kernel.org>
> Sent: Thursday, May 25, 2023 1:23 AM
> ...
> On Wed, Apr 19, 2023 at 07:40:35PM -0700, Dexuan Cui wrote:
> > This reverts commit d6af2ed29c7c1c311b96dac989dcb991e90ee195.
> >
> > The statement "the hv_pci_bus_exit() call releases structures of all its
> > child devices" in commit d6af2ed29c7c is not true: in the path
> > hv_pci_probe() -> hv_pci_enter_d0() -> hv_pci_bus_exit(hdev, true): the
> > parameter "keep_devs" is true, so hv_pci_bus_exit() does *not* release the
> > child "struct hv_pci_dev *hpdev" that is created earlier in
> > pci_devices_present_work() -> new_pcichild_device().
> >
> > The commit d6af2ed29c7c was originally made in July 2020 for RHEL 7.7,
> > where the old version of hv_pci_bus_exit() was used; when the commit
> > was
> > rebased and merged into the upstream, people didn't notice that it's
> > not really necessary. The commit itself doesn't cause any issue, but it
> > makes hv_pci_probe() more complicated. Revert it to facilitate some
> > upcoming changes to hv_pci_probe().
>
> If d6af2ed29c7c does not cause any issue this is not a fix and should be
> merged only with subsequent changes.
d6af2ed29c7c does not cause any functional issue, but it makes the code
less readable, and so I'd like to not merge this patch with patch 5 -- this way
people can easily know what the real change is in patch 5.
^ permalink raw reply
* [PATCH v4 0/5] pci-hyperv: Fix race condition bugs for fast device hotplug
From: Dexuan Cui @ 2023-06-15 4:44 UTC (permalink / raw)
To: bhelgaas, davem, decui, edumazet, haiyangz, jakeo, kuba, kw, kys,
leon, linux-pci, lpieralisi, mikelley, pabeni, robh, saeedm,
wei.liu, longli, boqun.feng, ssengar, helgaas
Cc: linux-hyperv, linux-kernel, linux-rdma, netdev, josete,
simon.horman
Before the guest finishes probing a device, the host may be already starting
to remove the device. Currently there are multiple race condition bugs in the
pci-hyperv driver, which can cause the guest to panic. The patchset fixes
the crashes.
The patchset also does some cleanup work: patch 3 removes the useless
hv_pcichild_state, and patch 4 reverts an old patch which is not really
useful (without patch 4, it would be hard to make patch 5 clean).
Patch 6 in v3 is dropped for now since it's a feature rather than a fix.
Patch 6 will be split into two patches as suggested by Lorenzo and will be
posted after the 5 patches are accepted first.
The v4 addressed Lorenzo's comments and added Lorenzo' Acks to patch
1, 3 and 5.
The v4 is based on v6.4-rc6, and can apply cleanly to the Hyper-V tree's
hyperv-fixes branch.
The patchset is also availsble in my github branch:
https://github.com/dcui/tdx/commits/decui/vpci/v6.4-rc6-vpci-v4
FYI, v3 can be found here:
https://lwn.net/ml/linux-kernel/20230420024037.5921-1-decui@microsoft.com/
Please review. Thanks!
Dexuan Cui (5):
PCI: hv: Fix a race condition bug in hv_pci_query_relations()
PCI: hv: Fix a race condition in hv_irq_unmask() that can cause panic
PCI: hv: Remove the useless hv_pcichild_state from struct hv_pci_dev
Revert "PCI: hv: Fix a timing issue which causes kdump to fail
occasionally"
PCI: hv: Add a per-bus mutex state_lock
drivers/pci/controller/pci-hyperv.c | 139 ++++++++++++++++------------
1 file changed, 82 insertions(+), 57 deletions(-)
--
2.25.1
^ permalink raw reply
* [PATCH v4 1/5] PCI: hv: Fix a race condition bug in hv_pci_query_relations()
From: Dexuan Cui @ 2023-06-15 4:44 UTC (permalink / raw)
To: bhelgaas, davem, decui, edumazet, haiyangz, jakeo, kuba, kw, kys,
leon, linux-pci, lpieralisi, mikelley, pabeni, robh, saeedm,
wei.liu, longli, boqun.feng, ssengar, helgaas
Cc: linux-hyperv, linux-kernel, linux-rdma, netdev, josete,
simon.horman, stable
In-Reply-To: <20230615044451.5580-1-decui@microsoft.com>
Since day 1 of the driver, there has been a race between
hv_pci_query_relations() and survey_child_resources(): during fast
device hotplug, hv_pci_query_relations() may error out due to
device-remove and the stack variable 'comp' is no longer valid;
however, pci_devices_present_work() -> survey_child_resources() ->
complete() may be running on another CPU and accessing the no-longer-valid
'comp'. Fix the race by flushing the workqueue before we exit from
hv_pci_query_relations().
Fixes: 4daace0d8ce8 ("PCI: hv: Add paravirtual PCI front-end for Microsoft Hyper-V VMs")
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Acked-by: Lorenzo Pieralisi <lpieralisi@kernel.org>
Cc: stable@vger.kernel.org
---
v2:
Removed the "debug code".
No change to the patch body.
Added Cc:stable
v3:
Added Michael's Reviewed-by.
v4:
Provided more details of the issue in the comment and commit log.
Added Lorenzo's Acked-by.
drivers/pci/controller/pci-hyperv.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index bc32662c6bb7f..ea8862e656b68 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -3401,6 +3401,24 @@ static int hv_pci_query_relations(struct hv_device *hdev)
if (!ret)
ret = wait_for_response(hdev, &comp);
+ /*
+ * In the case of fast device addition/removal, it's possible that
+ * vmbus_sendpacket() or wait_for_response() returns -ENODEV but we
+ * already got a PCI_BUS_RELATIONS* message from the host and the
+ * channel callback already scheduled a work to hbus->wq, which can be
+ * running pci_devices_present_work() -> survey_child_resources() ->
+ * complete(&hbus->survey_event), even after hv_pci_query_relations()
+ * exits and the stack variable 'comp' is no longer valid; as a result,
+ * a hang or a page fault may happen when the complete() calls
+ * raw_spin_lock_irqsave(). Flush hbus->wq before we exit from
+ * hv_pci_query_relations() to avoid the issues. Note: if 'ret' is
+ * -ENODEV, there can't be any more work item scheduled to hbus->wq
+ * after the flush_workqueue(): see vmbus_onoffer_rescind() ->
+ * vmbus_reset_channel_cb(), vmbus_rescind_cleanup() ->
+ * channel->rescind = true.
+ */
+ flush_workqueue(hbus->wq);
+
return ret;
}
--
2.25.1
^ permalink raw reply related
* [PATCH v4 2/5] PCI: hv: Fix a race condition in hv_irq_unmask() that can cause panic
From: Dexuan Cui @ 2023-06-15 4:44 UTC (permalink / raw)
To: bhelgaas, davem, decui, edumazet, haiyangz, jakeo, kuba, kw, kys,
leon, linux-pci, lpieralisi, mikelley, pabeni, robh, saeedm,
wei.liu, longli, boqun.feng, ssengar, helgaas
Cc: linux-hyperv, linux-kernel, linux-rdma, netdev, josete,
simon.horman, stable
In-Reply-To: <20230615044451.5580-1-decui@microsoft.com>
When the host tries to remove a PCI device, the host first sends a
PCI_EJECT message to the guest, and the guest is supposed to gracefully
remove the PCI device and send a PCI_EJECTION_COMPLETE message to the host;
the host then sends a VMBus message CHANNELMSG_RESCIND_CHANNELOFFER to
the guest (when the guest receives this message, the device is already
unassigned from the guest) and the guest can do some final cleanup work;
if the guest fails to respond to the PCI_EJECT message within one minute,
the host sends the VMBus message CHANNELMSG_RESCIND_CHANNELOFFER and
removes the PCI device forcibly.
In the case of fast device addition/removal, it's possible that the PCI
device driver is still configuring MSI-X interrupts when the guest receives
the PCI_EJECT message; the channel callback calls hv_pci_eject_device(),
which sets hpdev->state to hv_pcichild_ejecting, and schedules a work
hv_eject_device_work(); if the PCI device driver is calling
pci_alloc_irq_vectors() -> ... -> hv_compose_msi_msg(), we can break the
while loop in hv_compose_msi_msg() due to the updated hpdev->state, and
leave data->chip_data with its default value of NULL; later, when the PCI
device driver calls request_irq() -> ... -> hv_irq_unmask(), the guest
crashes in hv_arch_irq_unmask() due to data->chip_data being NULL.
Fix the issue by not testing hpdev->state in the while loop: when the
guest receives PCI_EJECT, the device is still assigned to the guest, and
the guest has one minute to finish the device removal gracefully. We don't
really need to (and we should not) test hpdev->state in the loop.
Fixes: de0aa7b2f97d ("PCI: hv: Fix 2 hang issues in hv_compose_msi_msg()")
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Cc: stable@vger.kernel.org
---
v2:
Removed the "debug code".
No change to the patch body.
Added Cc:stable
v3:
Added Michael's Reviewed-by.
v4:
No change since v3.
drivers/pci/controller/pci-hyperv.c | 11 +++++------
1 file changed, 5 insertions(+), 6 deletions(-)
diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index ea8862e656b68..733637d967654 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -635,6 +635,11 @@ static void hv_arch_irq_unmask(struct irq_data *data)
pbus = pdev->bus;
hbus = container_of(pbus->sysdata, struct hv_pcibus_device, sysdata);
int_desc = data->chip_data;
+ if (!int_desc) {
+ dev_warn(&hbus->hdev->device, "%s() can not unmask irq %u\n",
+ __func__, data->irq);
+ return;
+ }
local_irq_save(flags);
@@ -2004,12 +2009,6 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
hv_pci_onchannelcallback(hbus);
spin_unlock_irqrestore(&channel->sched_lock, flags);
- if (hpdev->state == hv_pcichild_ejecting) {
- dev_err_once(&hbus->hdev->device,
- "the device is being ejected\n");
- goto enable_tasklet;
- }
-
udelay(100);
}
--
2.25.1
^ permalink raw reply related
* [PATCH v4 3/5] PCI: hv: Remove the useless hv_pcichild_state from struct hv_pci_dev
From: Dexuan Cui @ 2023-06-15 4:44 UTC (permalink / raw)
To: bhelgaas, davem, decui, edumazet, haiyangz, jakeo, kuba, kw, kys,
leon, linux-pci, lpieralisi, mikelley, pabeni, robh, saeedm,
wei.liu, longli, boqun.feng, ssengar, helgaas
Cc: linux-hyperv, linux-kernel, linux-rdma, netdev, josete,
simon.horman, stable
In-Reply-To: <20230615044451.5580-1-decui@microsoft.com>
The hpdev->state is never really useful. The only use in
hv_pci_eject_device() and hv_eject_device_work() is not really necessary.
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Acked-by: Lorenzo Pieralisi <lpieralisi@kernel.org>
Cc: stable@vger.kernel.org
---
v2:
No change to the patch body.
Added Cc:stable
v3:
Added Michael's Reviewed-by.
v3:
Added Lorenzo's Acked-by.
drivers/pci/controller/pci-hyperv.c | 12 ------------
1 file changed, 12 deletions(-)
diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index 733637d967654..a826b41c949a1 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -545,19 +545,10 @@ struct hv_dr_state {
struct hv_pcidev_description func[];
};
-enum hv_pcichild_state {
- hv_pcichild_init = 0,
- hv_pcichild_requirements,
- hv_pcichild_resourced,
- hv_pcichild_ejecting,
- hv_pcichild_maximum
-};
-
struct hv_pci_dev {
/* List protected by pci_rescan_remove_lock */
struct list_head list_entry;
refcount_t refs;
- enum hv_pcichild_state state;
struct pci_slot *pci_slot;
struct hv_pcidev_description desc;
bool reported_missing;
@@ -2843,8 +2834,6 @@ static void hv_eject_device_work(struct work_struct *work)
hpdev = container_of(work, struct hv_pci_dev, wrk);
hbus = hpdev->hbus;
- WARN_ON(hpdev->state != hv_pcichild_ejecting);
-
/*
* Ejection can come before or after the PCI bus has been set up, so
* attempt to find it and tear down the bus state, if it exists. This
@@ -2901,7 +2890,6 @@ static void hv_pci_eject_device(struct hv_pci_dev *hpdev)
return;
}
- hpdev->state = hv_pcichild_ejecting;
get_pcichild(hpdev);
INIT_WORK(&hpdev->wrk, hv_eject_device_work);
queue_work(hbus->wq, &hpdev->wrk);
--
2.25.1
^ permalink raw reply related
* [PATCH v4 4/5] Revert "PCI: hv: Fix a timing issue which causes kdump to fail occasionally"
From: Dexuan Cui @ 2023-06-15 4:44 UTC (permalink / raw)
To: bhelgaas, davem, decui, edumazet, haiyangz, jakeo, kuba, kw, kys,
leon, linux-pci, lpieralisi, mikelley, pabeni, robh, saeedm,
wei.liu, longli, boqun.feng, ssengar, helgaas
Cc: linux-hyperv, linux-kernel, linux-rdma, netdev, josete,
simon.horman, stable
In-Reply-To: <20230615044451.5580-1-decui@microsoft.com>
This reverts commit d6af2ed29c7c1c311b96dac989dcb991e90ee195.
The statement "the hv_pci_bus_exit() call releases structures of all its
child devices" in commit d6af2ed29c7c is not true: in the path
hv_pci_probe() -> hv_pci_enter_d0() -> hv_pci_bus_exit(hdev, true): the
parameter "keep_devs" is true, so hv_pci_bus_exit() does *not* release the
child "struct hv_pci_dev *hpdev" that is created earlier in
pci_devices_present_work() -> new_pcichild_device().
The commit d6af2ed29c7c was originally made in July 2020 for RHEL 7.7,
where the old version of hv_pci_bus_exit() was used; when the commit was
rebased and merged into the upstream, people didn't notice that it's
not really necessary. The commit itself doesn't cause any issue, but it
makes hv_pci_probe() more complicated. Revert it to facilitate some
upcoming changes to hv_pci_probe().
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Acked-by: Wei Hu <weh@microsoft.com>
Cc: stable@vger.kernel.org
---
v2:
No change to the patch body.
Added Wei Hu's Acked-by.
Added Cc:stable
v3:
Added Michael's Reviewed-by.
v4:
NO change since v3.
drivers/pci/controller/pci-hyperv.c | 71 ++++++++++++++---------------
1 file changed, 34 insertions(+), 37 deletions(-)
diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index a826b41c949a1..1a5296fad1c48 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -3318,8 +3318,10 @@ static int hv_pci_enter_d0(struct hv_device *hdev)
struct pci_bus_d0_entry *d0_entry;
struct hv_pci_compl comp_pkt;
struct pci_packet *pkt;
+ bool retry = true;
int ret;
+enter_d0_retry:
/*
* Tell the host that the bus is ready to use, and moved into the
* powered-on state. This includes telling the host which region
@@ -3346,6 +3348,38 @@ static int hv_pci_enter_d0(struct hv_device *hdev)
if (ret)
goto exit;
+ /*
+ * In certain case (Kdump) the pci device of interest was
+ * not cleanly shut down and resource is still held on host
+ * side, the host could return invalid device status.
+ * We need to explicitly request host to release the resource
+ * and try to enter D0 again.
+ */
+ if (comp_pkt.completion_status < 0 && retry) {
+ retry = false;
+
+ dev_err(&hdev->device, "Retrying D0 Entry\n");
+
+ /*
+ * Hv_pci_bus_exit() calls hv_send_resource_released()
+ * to free up resources of its child devices.
+ * In the kdump kernel we need to set the
+ * wslot_res_allocated to 255 so it scans all child
+ * devices to release resources allocated in the
+ * normal kernel before panic happened.
+ */
+ hbus->wslot_res_allocated = 255;
+
+ ret = hv_pci_bus_exit(hdev, true);
+
+ if (ret == 0) {
+ kfree(pkt);
+ goto enter_d0_retry;
+ }
+ dev_err(&hdev->device,
+ "Retrying D0 failed with ret %d\n", ret);
+ }
+
if (comp_pkt.completion_status < 0) {
dev_err(&hdev->device,
"PCI Pass-through VSP failed D0 Entry with status %x\n",
@@ -3591,7 +3625,6 @@ static int hv_pci_probe(struct hv_device *hdev,
struct hv_pcibus_device *hbus;
u16 dom_req, dom;
char *name;
- bool enter_d0_retry = true;
int ret;
bridge = devm_pci_alloc_host_bridge(&hdev->device, 0);
@@ -3708,47 +3741,11 @@ static int hv_pci_probe(struct hv_device *hdev,
if (ret)
goto free_fwnode;
-retry:
ret = hv_pci_query_relations(hdev);
if (ret)
goto free_irq_domain;
ret = hv_pci_enter_d0(hdev);
- /*
- * In certain case (Kdump) the pci device of interest was
- * not cleanly shut down and resource is still held on host
- * side, the host could return invalid device status.
- * We need to explicitly request host to release the resource
- * and try to enter D0 again.
- * Since the hv_pci_bus_exit() call releases structures
- * of all its child devices, we need to start the retry from
- * hv_pci_query_relations() call, requesting host to send
- * the synchronous child device relations message before this
- * information is needed in hv_send_resources_allocated()
- * call later.
- */
- if (ret == -EPROTO && enter_d0_retry) {
- enter_d0_retry = false;
-
- dev_err(&hdev->device, "Retrying D0 Entry\n");
-
- /*
- * Hv_pci_bus_exit() calls hv_send_resources_released()
- * to free up resources of its child devices.
- * In the kdump kernel we need to set the
- * wslot_res_allocated to 255 so it scans all child
- * devices to release resources allocated in the
- * normal kernel before panic happened.
- */
- hbus->wslot_res_allocated = 255;
- ret = hv_pci_bus_exit(hdev, true);
-
- if (ret == 0)
- goto retry;
-
- dev_err(&hdev->device,
- "Retrying D0 failed with ret %d\n", ret);
- }
if (ret)
goto free_irq_domain;
--
2.25.1
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox