lustre-devel-lustre.org archive mirror
* [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
@ 2025-01-16 21:25 Day, Timothy
       [not found] ` <C9513675-3287-4784-90B7-AD133328C42A@ddn.com>
                   ` (2 more replies)
  0 siblings, 3 replies; 61+ messages in thread
From: Day, Timothy @ 2025-01-16 21:25 UTC (permalink / raw)
  To: lustre-devel@lists.lustre.org



The following is a draft topic for the upcoming LSF/MM conference.
I wanted to solicit feedback from the wider Lustre development
community before submitting this to fsdevel. If I’ve omitted anything,
something doesn’t seem right, or you know of something that strengthens
the argument, please let me know!

----------------------------------------------------

Lustre is a high-performance parallel filesystem used for HPC and AI/ML
compute clusters, available under GPLv2. Lustre has achieved widespread
adoption in HPC and AI/ML and is commercially supported by numerous
vendors and cloud service providers [1].

After 21 years and an ill-fated stint in staging, Lustre is still maintained as
an out-of-tree module [6]. The previous upstreaming effort suffered from a
lack of developer focus and user adoption, which eventually led to Lustre
being removed from staging altogether [2].

However, the work to improve Lustre has not stopped. In the intervening
years, the code improvements that would preempt a return to mainline
have been steadily progressing. At least 25% of patches accepted for
Lustre 2.16 were related to the upstreaming effort [3]. And all of the
remaining work is in-flight [4][5]. Our eventual goal is to get a minimal
TCP/IP-only Lustre client to an acceptable quality before submitting to
mainline.

I propose to discuss:

- Expectations for a new filesystem to be accepted to mainline
- Weaknesses in the previous upstreaming effort in staging

Lustre has already received a plethora of feedback in the past. While much
of that has been addressed since then, the kernel is a moving target. Several
filesystems have been merged (and removed) since Lustre left staging. We're
aiming to avoid the mistakes of the past and hope to address as many
concerns as possible before submitting for inclusion.

Thanks!

Timothy Day (Amazon Web Services - AWS)
James Simmons (Oak Ridge National Labs - ORNL)

[1] Lustre Community Update: https://youtu.be/BE--ySVQb2M?si=YMHitJfcE4ASWQcE&t=960
[2] Kicked out of staging: https://lwn.net/Articles/756565/
[3] ORNL, Aeon, SuSe, AWS, and more: https://youtu.be/BE--ySVQb2M?si=YMHitJfcE4ASWQcE&t=960
[4] LUG24 Upstreaming Update: https://www.depts.ttu.edu/hpcc/events/LUG24/slides/Day1/LUG_2024_Talk_02-Native_Linux_client_status.pdf
[5] Lustre Jira Upstream Progress: https://jira.whamcloud.com/browse/LU-12511
[6] Out-of-tree codebase: https://git.whamcloud.com/?p=fs/lustre-release.git;a=tree



* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
       [not found] ` <C9513675-3287-4784-90B7-AD133328C42A@ddn.com>
@ 2025-01-17 22:46   ` Day, Timothy
  0 siblings, 0 replies; 61+ messages in thread
From: Day, Timothy @ 2025-01-17 22:46 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: lustre-devel@lists.lustre.org

On 1/16/25, 7:12 PM, "Andreas Dilger" <adilger@ddn.com> wrote:
> On Jan 16, 2025, at 14:25, Day, Timothy <timday@amazon.com> wrote:
> >
> > The following is a draft topic for the upcoming LSF/MM conference.
> > I wanted to solicit feedback from the wider Lustre development
> > community before submitting this to fsdevel. If I’ve omitted anything,
> > something doesn’t seem right, or you know of something that strengthens
> > the argument, please let me know!
> > ----------------------------------------------------
> > Lustre is a high-performance parallel filesystem used for HPC and AI/ML
> > compute clusters, available under GPLv2. Lustre has achieved widespread
> > adoption in HPC and AI/ML and is commercially supported by numerous
> > vendors and cloud service providers [1].
>
>
> I don't see Peter's graph that shows adoption as a fraction of the Top-100
> systems. I think this could be listed here explicitly, like:
>
>
> "Lustre is currently used by 65% of the Top-500 (9 of Top-10) systems in HPC
> and is used by the largest AI/ML clusters in the world, and is commercially ..."

That's a good stat. I also want to show that Lustre is gaining a lot of adoption
outside of HPC. I should probably just say that explicitly.

> > After 21 years and an ill-fated stint in staging, Lustre is still maintained as
> > an out-of-tree module [6]. The previous upstreaming effort suffered from a
> > lack of developer focus and user adoption, which eventually led to Lustre
> > being removed from staging altogether [2].
> > However, the work to improve Lustre has not stopped.
>
> "... has continued regardless."
>
> > In the intervening
> > years, the code improvements that would preempt a return to mainline
>
> s/would preempt/previously prevented/
>
> > have been steadily progressing. At least 25% of patches accepted for
> > Lustre 2.16 were related to the upstreaming effort [3].
>
> While [3] is showing the distribution of submissions between organizations,
> it isn't clear how that translates to "25% of patches relate to upstreaming",
> unless you count all of the ORNL, AWS, and AEON submissions toward this?

It's kind of heuristic - the majority of patches from ORNL/AEON/SuSe/AWS are
related to upstreaming work. And quite a lot from other contributors as well. I
should include that thinking in the appendix part.

> > And all of the
> > remaining work is in-flight [4][5].
>
>
> Looking at [5] it would appear that most of the items in LU-12511 are *not* finished, so it
> would make sense to go through those linked tickets and tasks listed in the Description
> to see if they can be closed and/or struck out so that it shows that we are nearly complete.
>
> Similarly, James' presentation in [4] is missing the commentary that would explain which
> of the listed items/tickets were actually finished and which ones are enumerating the
> "todo" list.

I'm going to update LU-12511 before sending the email more widely. I synced with James
to discuss some of the outstanding work.

> > Our eventual goal is to get a minimal
> > TCP/IP-only Lustre client to an acceptable quality before submitting to
> > mainline.
> > I propose to discuss:
> > - Expectations for a new filesystem to be accepted to mainline
> > - Weaknesses in the previous upstreaming effort in staging
>
>
> Rather than discuss the "weaknesses" in the previous upstreaming, this should
> focus in the positive direction:
>
>
> - Expectations for a new filesystem to be accepted to mainline
> - How to manage inclusion of a large code base without rewriting (2.5x XFS)

The second bullet probably sounds too negative. But a lot of folks upstream
were not happy with the state of the codebase at the time Lustre was
removed from staging. And quite a lot of Lustre has been reworked/rewritten
since then. Maybe something in the middle:

- Expectations for a new filesystem to be accepted to mainline
- How to manage inclusion of a large code base (client is 200kLoC) without
increasing the burden on fs/net maintainers

I imagine (2) would require at least some rewriting to avoid deprecated
interfaces. Support for the new mount API might be one ask, for example.
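
To make that concrete: "support for the new mount API" means converting
from the legacy mount entry point to fs_context. A minimal sketch of the
shape that takes (the lustre_* names are illustrative, not the actual
client code, and option parsing is omitted):

    #include <linux/module.h>
    #include <linux/fs.h>
    #include <linux/fs_context.h>

    /* Build the superblock from the already-parsed fs_context. */
    static int lustre_fill_super(struct super_block *sb, struct fs_context *fc)
    {
            return 0;
    }

    static int lustre_get_tree(struct fs_context *fc)
    {
            return get_tree_nodev(fc, lustre_fill_super);
    }

    static const struct fs_context_operations lustre_context_ops = {
            .get_tree        = lustre_get_tree,
    };

    /* Replaces the legacy .mount callback in struct file_system_type. */
    static int lustre_init_fs_context(struct fs_context *fc)
    {
            fc->ops = &lustre_context_ops;
            return 0;
    }

    static struct file_system_type lustre_fs_type = {
            .owner           = THIS_MODULE,
            .name            = "lustre",
            .init_fs_context = lustre_init_fs_context,
            .kill_sb         = kill_anon_super,
    };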

> > Lustre has already received a plethora of feedback in the past. While much
> > of that has been addressed since - the kernel is a moving target. Several
> > filesystems have been merged (and removed) since Lustre left staging. We're
> > aiming to avoid the mistakes of the past and hope to address as many
> > concerns as possible before submitting for inclusion.
> > Thanks!
> > Timothy Day (Amazon Web Services - AWS)
> > James Simmons (Oak Ridge National Labs - ORNL)
> > [1] Lustre Community Update: https://youtu.be/BE--ySVQb2M?si=YMHitJfcE4ASWQcE&t=960
> > [2] Kicked out of staging: https://lwn.net/Articles/756565/
> > [3] ORNL, Aeon, SuSe, AWS, and more: https://youtu.be/BE--ySVQb2M?si=YMHitJfcE4ASWQcE&t=960
>
>
> Your [1] and [3] URLs are exactly the same. Did you mean to be showing something
> different for each (e.g. different "t=NNN" for one or the other)?

For [1], I should probably just enumerate some organizations. I want to
demonstrate that the Lustre community and its adoption are growing. I could
also just link to https://en.wikipedia.org/wiki/Lustre_(file_system)#Commercial_technical_support.

> > [4] LUG24 Upstreaming Update: https://www.depts.ttu.edu/hpcc/events/LUG24/slides/Day1/LUG_2024_Talk_02-Native_Linux_client_status.pdf
> > [5] Lustre Jira Upstream Progress: https://jira.whamcloud.com/browse/LU-12511
> > [6] Out-of-tree codebase: https://git.whamcloud.com/?p=fs/lustre-release.git;a=tree
>
>
> Cheers, Andreas
> —
> Andreas Dilger
> Lustre Principal Architect
> Whamcloud/DDN

I tried to reply inline - hopefully Outlook doesn't mangle the email.

Tim Day


* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-01-16 21:25 [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming Day, Timothy
       [not found] ` <C9513675-3287-4784-90B7-AD133328C42A@ddn.com>
@ 2025-01-18  0:45 ` NeilBrown
  2025-01-18  3:16   ` Oleg Drokin
                     ` (2 more replies)
  2025-01-22  6:35 ` Day, Timothy
  2 siblings, 3 replies; 61+ messages in thread
From: NeilBrown @ 2025-01-18  0:45 UTC (permalink / raw)
  To: Day, Timothy; +Cc: lustre-devel@lists.lustre.org

On Fri, 17 Jan 2025, Day, Timothy wrote:
> The following is a draft topic for the upcoming LSF/MM conference.
> I wanted to solicit feedback from the wider Lustre development
> community before submitting this to fsdevel. If I’ve omitted anything,
> something doesn’t seem right, or you know of something that strengthens
> the argument, please let me know!
> 
> ----------------------------------------------------
> 
> Lustre is a high-performance parallel filesystem used for HPC and AI/ML
> compute clusters, available under GPLv2. Lustre has achieved widespread
> adoption in HPC and AI/ML and is commercially supported by numerous
> vendors and cloud service providers [1].
> 
> After 21 years and an ill-fated stint in staging, Lustre is still
> maintained as an out-of-tree module [6]. The previous upstreaming effort
> suffered from a lack of developer focus and user adoption, which
> eventually led to Lustre being removed from staging altogether [2].
> 
> However, the work to improve Lustre has not stopped. In the intervening
> years, the code improvements that would preempt a return to mainline
> have been steadily progressing. At least 25% of patches accepted for
> Lustre 2.16 were related to the upstreaming effort [3]. And all of the
> remaining work is in-flight [4][5]. Our eventual goal is to get a
> minimal TCP/IP-only Lustre client to an acceptable quality before
> submitting to mainline.

"Go big, or go home"!!

If our eventual goal is not "Get lustre, both client and server, into
mainline linux with support for TCP/IP and infiniband transports (at
least)"
then we really shouldn't bother.

There is no formal, or even semi-formal, specification of the Lustre
protocol.  The lustre protocol is "what the code does" so it cannot work
to develop client and server separately like it can for, e.g., NFS.

The goal you describe is an interim goal.  A first step (from the
upstream community perspective).

> I propose to discuss:
> 
> - Expectations for a new filesystem to be accepted to mainline
> - Weaknesses in the previous upstreaming effort in staging

I think we know at least one perspective on the weaknesses in the
previous upstreaming effort and we need to demonstrate that we will do
better. 

   https://lore.kernel.org/all/20180601091133.GA27521@kroah.com/

   There is a whole separate out-of-tree copy of this codebase where the
   developers work on it, and then random changes are thrown over the
   wall at staging at some later point in time.  This dual-tree
   development model has never worked, and the state of this codebase is
   proof of that.

We need to demonstrate a process for, and commitment to, moving away
from the dual-tree model.  We need patches to those parts of Lustre
that are upstream to land in upstream first (mostly).

That means we need the model for supporting older kernels to be completely
based on libcfs holding compatibility code with no kernel-version
#ifdefs in the code.
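
To illustrate the pattern (the names and version cutoff below are invented,
not actual libcfs code): the version check lives only in the libcfs compat
header, and filesystem code calls the wrapper unconditionally.

    /* libcfs compat header: the only place a kernel version is checked */
    #include <linux/version.h>

    #if LINUX_VERSION_CODE < KERNEL_VERSION(6, 1, 0)
    static inline int cfs_register_thing(void)
    {
            /* open-coded fallback for older kernels */
            return 0;
    }
    #else
    static inline int cfs_register_thing(void)
    {
            /* thin wrapper over the interface this kernel provides */
            return 0;
    }
    #endif

fs/lustre code then just calls cfs_register_thing(); in the upstream tree,
where only the current kernel matters, the compat header shrinks to nothing.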

We need a strong separation between server and client so that we can
justify everything that goes upstream as being to support the client,
and when we add server support to that, it just adds files.  Possibly we
could patch a few files to add server support, but we need to maintain
those as patches, not as alternate versions of upstream files.

We need to quickly reach a point where a lustre release is:

 - a verbatim copy of relevant files from a chosen upstream release,
   or just a dependency on that kernel source.
 - a bunch of extra files that might one day go upstream: server code
   and LNet protocol code
 - a *few* patches to integrate that code
 - some number of patches which have since gone upstream - bugfixes etc.
 - libcfs which contains a compat layer for older kernels.
 - user-space code, documentation, test scripts, etc for which there
   is no expectation of upstreaming to linux kernel.
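
As a rough sketch of that shape (the directory names are invented, purely
to illustrate the split):

    lustre-release/
        upstream/    verbatim files from the chosen kernel release
        extra/       server code and LNet protocol code, not yet upstream
        patches/     the few integration patches, plus backported fixes
        libcfs/      compat layer for older kernels
        user/        tools, docs, test scripts - never upstreamed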

Maybe the question for LSF is : what is a sufficient demonstration of commitment?

The big question for us is : how are we going to transition our
infrastructure to this model?

It would be nice to have a timeline for getting the second and third
bullet points down to zero.  Obviously it would be aspirational at best,
but a list of steps could be useful.

Thanks,
NeilBrown

> Lustre has already received a plethora of feedback in the past. While much
> of that has been addressed since then, the kernel is a moving target. Several
> filesystems have been merged (and removed) since Lustre left staging. We're
> aiming to avoid the mistakes of the past and hope to address as many
> concerns as possible before submitting for inclusion.
> 
> Thanks!
> 
> Timothy Day (Amazon Web Services - AWS)
> James Simmons (Oak Ridge National Labs - ORNL)
> 
> [1] Lustre Community Update: https://youtu.be/BE--ySVQb2M?si=YMHitJfcE4ASWQcE&t=960
> [2] Kicked out of staging: https://lwn.net/Articles/756565/
> [3] ORNL, Aeon, SuSe, AWS, and more: https://youtu.be/BE--ySVQb2M?si=YMHitJfcE4ASWQcE&t=960
> [4] LUG24 Upstreaming Update: https://www.depts.ttu.edu/hpcc/events/LUG24/slides/Day1/LUG_2024_Talk_02-Native_Linux_client_status.pdf
> [5] Lustre Jira Upstream Progress: https://jira.whamcloud.com/browse/LU-12511
> [6] Out-of-tree codebase: https://git.whamcloud.com/?p=fs/lustre-release.git;a=tree


* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-01-18  0:45 ` NeilBrown
@ 2025-01-18  3:16   ` Oleg Drokin
  2025-01-18 21:46     ` Day, Timothy
  2025-01-18 22:48     ` NeilBrown
  2025-01-18 17:51   ` Day, Timothy
       [not found]   ` <E4481869-E21A-4941-9A97-8C59B7104528@ddn.com>
  2 siblings, 2 replies; 61+ messages in thread
From: Oleg Drokin @ 2025-01-18  3:16 UTC (permalink / raw)
  To: timday@amazon.com, neilb@suse.de; +Cc: lustre-devel@lists.lustre.org

On Sat, 2025-01-18 at 11:45 +1100, NeilBrown wrote:
> We need to demonstrate a process for, and commitment to, moving away
> from the dual-tree model.  We need patches to those parts of Lustre
> that are upstream to land in upstream first (mostly).

I think this is not very realistic.
A large chunk (100%?) of users not only don't run the latest kernel
release, they don't run the latest LTS either.

When we were last in staging, this manifested as random patches being
landed that broke the client completely, with nobody noticing for
months.

Of course some automatic infrastructure could be built up to make it
somewhat better, but it does not remove the problem of "nobody would
run this mainline tree", I am afraid.

It does not help that there are, what, 3? 4? trees, not "dual-tree" by any
stretch of the imagination.

There's DDN/whamcloud (that's really two trees), there's HPE, LLNL
keeps their fork still I think (though it's mostly backports?). There
are likely others I am less exposed to.

Sure, only one of those trees is considered "community Lustre", but if
it detaches too much from what the majority of developers really run
and get paid to work on, the "community Lustre" contributions would
probably diminish greatly, I am afraid.

The past situation of "oh, this new enterprise linux comes with a
community lustre version, so the first step to get something usable is
to rip it out entirely and then apply the new good version" is not
exactly desirable either, I am afraid.

And solving this problem is mostly outside of hands of individual
developers no matter how cool I think it would be to actually have an
up to date Lustre in the mainline linux kernel.

> > That means we need the model for supporting older kernels to be completely
> > based on libcfs holding compatibility code with no kernel-version
> > #ifdefs in the code.
> > 
> > We need a strong separation between server and client so that we can
> > justify everything that goes upstream as being to support the client,
> > and when we add server support to that, it just adds files.  Possibly we
> > could patch a few files to add server support, but we need to maintain
> > those as patches, not as alternate versions of upstream files.
> > 
> > We need to quickly reach a point where a lustre release is:
> > 
> >  - a verbatim copy of relevant files from a chosen upstream release,
> >    or just a dependency on that kernel source.
> >  - a bunch of extra files that might one day go upstream: server code
> >    and LNet protocol code
> >  - a *few* patches to integrate that code
> >  - some number of patches which have since gone upstream - bugfixes etc.
> >  - libcfs which contains a compat layer for older kernels.
> >  - user-space code, documentation, test scripts, etc for which there
> >    is no expectation of upstreaming to linux kernel.

All these sound like an awful lot of dedicated developer-hours.

> Maybe the question for LSF is : what is a sufficient demonstration of
> commitment?
> 
> The big question for us is : how are we going to transition our
> infrastructure to this model?

and who would pay for it.

This in the end was the downfall of the previous attempt. There never
was any serious funding behind the effort so it became an afterthought
for most.

> It would be nice to have a timeline for getting the second and third
> bullet points down to zero.  Obviously it would be aspirational at
> best,
> but a list of steps could be useful.
> 
> Thanks,
> NeilBrown
> 

Bye,
   Oleg


* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-01-18  0:45 ` NeilBrown
  2025-01-18  3:16   ` Oleg Drokin
@ 2025-01-18 17:51   ` Day, Timothy
  2025-01-18 22:21     ` NeilBrown
       [not found]   ` <E4481869-E21A-4941-9A97-8C59B7104528@ddn.com>
  2 siblings, 1 reply; 61+ messages in thread
From: Day, Timothy @ 2025-01-18 17:51 UTC (permalink / raw)
  To: NeilBrown; +Cc: lustre-devel@lists.lustre.org



On 1/17/25, 7:46 PM, "NeilBrown" <neilb@suse.de> wrote:
> On Fri, 17 Jan 2025, Day, Timothy wrote:
> > The following is a draft topic for the upcoming LSF/MM conference.
> > I wanted to solicit feedback from the wider Lustre development
> > community before submitting this to fsdevel. If I’ve omitted anything,
> > something doesn’t seem right, or you know of something that strengthens
> > the argument, please let me know!
> >
> > ----------------------------------------------------
> >
> > Lustre is a high-performance parallel filesystem used for HPC and AI/ML
> > compute clusters, available under GPLv2. Lustre has achieved widespread
> > adoption in HPC and AI/ML and is commercially supported by numerous
> > vendors and cloud service providers [1].
> >
> > After 21 years and an ill-fated stint in staging, Lustre is still
> > maintained as an out-of-tree module [6]. The previous upstreaming effort
> > suffered from a lack of developer focus and user adoption, which
> > eventually led to Lustre being removed from staging altogether [2].
> >
> > However, the work to improve Lustre has not stopped. In the intervening
> > years, the code improvements that would preempt a return to mainline
> > have been steadily progressing. At least 25% of patches accepted for
> > Lustre 2.16 were related to the upstreaming effort [3]. And all of the
> > remaining work is in-flight [4][5]. Our eventual goal is to get a
> > minimal TCP/IP-only Lustre client to an acceptable quality before
> > submitting to mainline.
>
> "Go big, or go home"!!
>
> If our eventual goal is not "Get lustre, both client and server, into
> mainline linux with support for TCP/IP and infiniband transports (at
> least)"
> then we really shouldn't bother.
>
> There is no formal, or even semi-formal, specification of the Lustre
> protocol. The lustre protocol is "what the code does" so it cannot work
> to develop client and server separately like it can for, e.g., NFS.
>
> The goal you describe is an interim goal. A first step (from the
> upstream community perspective).

Getting everything upstream is definitely the goal. The near term goal
is much smaller, of course - getting anything Lustre at all upstream. I've
even wondered at times if we could start with only LNET - standalone LNET
is pretty manageable and can be used as an LNET router on its own. So it can
be used for something besides out-of-tree Lustre. But I'm skeptical upstream
would be in favor of that approach, since the primary users would be out-of-tree
Lustre regardless.

Like Andreas said in another thread, I think the Lustre protocol is fairly stable.
So we wouldn't have too much trouble maintaining an independent client
in mainline, although ideally the server would follow afterwards.

On the other hand, I wonder if we should upstream the whole thing all at once. Besides
the code being a bit nicer, the client isn't really that much closer to being upstream
than the server is. And no one else can test the client without having a Lustre
server on-hand. So no-one can easily run xfstests or similar. And doing everything
all at once would preempt questions of client/server split or the server upstreaming
timeline. But upstreaming so much all at once is probably more unrealistic.

> > I propose to discuss:
> >
> > - Expectations for a new filesystem to be accepted to mainline
> > - Weaknesses in the previous upstreaming effort in staging
>
> I think we know at least one perspective on the weaknesses in the
> previous upstreaming effort and we need to demonstrate that we will do
> better.
>
> https://lore.kernel.org/all/20180601091133.GA27521@kroah.com/
>
> There is a whole separate out-of-tree copy of this codebase where the
> developers work on it, and then random changes are thrown over the
> wall at staging at some later point in time. This dual-tree
> development model has never worked, and the state of this codebase is
> proof of that.
>
> We need to demonstrate a process for, and commitment to, moving away
> from the dual-tree model. We need patches to those parts of Lustre
> that are upstream to land in upstream first (mostly).
>
> That means we need the model for supporting older kernels to be completely
> based on libcfs holding compatibility code with no kernel-version
> #ifdefs in the code.
>
> We need a strong separation between server and client so that we can
> justify everything that goes upstream as being to support the client,
> and when we add server support to that, it just adds files. Possibly we
> could patch a few files to add server support, but we need to maintain
> those as patches, not as alternate versions of upstream files.
>
> We need to quickly reach a point where a lustre release is:
>
> - a verbatim copy of relevant files from a chosen upstream release,
>   or just a dependency on that kernel source.
> - a bunch of extra files that might one day go upstream: server code
>   and LNet protocol code
> - a *few* patches to integrate that code
> - some number of patches which have since gone upstream - bugfixes etc.
> - libcfs which contains a compat layer for older kernels.
> - user-space code, documentation, test scripts, etc for which there
>   is no expectation of upstreaming to linux kernel.
>
> Maybe the question for LSF is : what is a sufficient demonstration of commitment?
>
> The big question for us is : how are we going to transition our
> infrastructure to this model?
>
> It would be nice to have a timeline for getting the second and third
> bullet points down to zero. Obviously it would be aspirational at best,
> but a list of steps could be useful.

I agree that the development model needs to adapt - otherwise, we'd have to
soft-fork whatever code goes upstream. Keeping the two trees in-sync while
also doing feature development is unworkable.

The tricky part is: how do we support most Lustre developers' current
workflows? Most developers and vendors only care about having a
functional client for older distro kernels. And all developers submit
patches via Whamcloud/DDN Gerrit and CI/CD. So everyone aligns their
workflows to whatever that system enforces (assuming it isn't too arduous).

Your proposed model (as I understand it) is to use the upstream client as
a build dependency of the complete Lustre package? I think that could be
workable. But whatever we do, we need to find a way to move to that
development model before anything lands upstream. I think that would
be enough to demonstrate commitment, IMHO.

I wonder how AMDGPU does this? AMDGPU is significantly more complex
than Lustre and it's supported on older kernels via DKMS. I'll have to look
into this.

> Thanks,
> NeilBrown
>
> > Lustre has already received a plethora of feedback in the past. While much
> > of that has been addressed since then, the kernel is a moving target. Several
> > filesystems have been merged (and removed) since Lustre left staging. We're
> > aiming to avoid the mistakes of the past and hope to address as many
> > concerns as possible before submitting for inclusion.
> >
> > Thanks!
> >
> > Timothy Day (Amazon Web Services - AWS)
> > James Simmons (Oak Ridge National Labs - ORNL)
> >
> > [1] Lustre Community Update: https://youtu.be/BE--ySVQb2M?si=YMHitJfcE4ASWQcE&t=960
> > [2] Kicked out of staging: https://lwn.net/Articles/756565/
> > [3] ORNL, Aeon, SuSe, AWS, and more: https://youtu.be/BE--ySVQb2M?si=YMHitJfcE4ASWQcE&t=960
> > [4] LUG24 Upstreaming Update: https://www.depts.ttu.edu/hpcc/events/LUG24/slides/Day1/LUG_2024_Talk_02-Native_Linux_client_status.pdf
> > [5] Lustre Jira Upstream Progress: https://jira.whamcloud.com/browse/LU-12511
> > [6] Out-of-tree codebase: https://git.whamcloud.com/?p=fs/lustre-release.git;a=tree

Tim Day





* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-01-18  3:16   ` Oleg Drokin
@ 2025-01-18 21:46     ` Day, Timothy
  2025-01-19 20:46       ` Oleg Drokin
  2025-01-18 22:48     ` NeilBrown
  1 sibling, 1 reply; 61+ messages in thread
From: Day, Timothy @ 2025-01-18 21:46 UTC (permalink / raw)
  To: Oleg Drokin, neilb@suse.de; +Cc: lustre-devel@lists.lustre.org



> On 1/17/25, 10:17 PM, "Oleg Drokin" <green@whamcloud.com> wrote:
> > On Sat, 2025-01-18 at 11:45 +1100, NeilBrown wrote:
> > We need to demonstrate a process for, and commitment to, moving away
> > from the dual-tree model. We need patches to those parts of Lustre
> > that are upstream to land in upstream first (mostly).
>
> I think this is not very realistic.
> A large chunk (100%?) of users not only don't run the latest kernel
> release, they don't run the latest LTS either.
>
> When we were last in staging, this manifested as random patches being
> landed that broke the client completely, with nobody noticing for months.
>
> Of course some automatic infrastructure could be built up to make it
> somewhat better, but it does not remove the problem of "nobody would
> run this mainline tree", I am afraid.

I think there's a decent chunk of users on newer kernels. Ubuntu 22/24 is
on the (slightly past latest) LTS 6.8 kernel [1], AL2023 is on the previous
LTS 6.1 [2], and is working on the upcoming LTS 6.12 [3].

When a patch lands in lustre-release/master, it could be around 1 - 1.5 years
before it lands in a proper Lustre release. At that point, it might see real
production usage.

If a patch landed in a hypothetical upstream client, it might be around 6
months until a production kernel is using that client.

So I think it's mostly a matter of convincing people to use an upstream
client. I don't think people wanted to use the staging client because it
didn't work well and wasn't stable. And vendors don't want to work on
something that no one uses. If the client is "good enough" and people
are confident it'll continue to be updated, I think they will use it. The
staging client was neither of those things.

So I think the problem at hand is molding the existing development
practices to allow us to deliver an upstream client that has a baseline of
functionality and stability while, at the same time, supporting older vendor
kernels. I don't think it'd be a quick transition, but I think it's a tractable
problem.

[1] Ubuntu kernels - https://ubuntu.com/kernel/lifecycle
[2] AL2023 6.1 - https://github.com/amazonlinux/linux/commit/ef9660091712fa9edd137180b8925ea6316c8043
[3] AL2023 6.12 - https://github.com/amazonlinux/linux/commits/amazon-6.12.y/mainline/

> It does not help that there are, what, 3? 4? trees, not "dual-tree" by any
> stretch of the imagination.
>
> There's DDN/whamcloud (that's really two trees), there's HPE, LLNL
> keeps their fork still I think (though it's mostly backports?). There
> are likely others I am less exposed to.

I think most non-community Lustre releases are derived from the
community release and periodically rebased. I think AWS,
Whamcloud, LLNL, Microsoft would fall into that bucket. And I
doubt DDN and HPE significantly diverge from community Lustre. But
if someone is diverging significantly from community Lustre, I think
they are opting into a significant maintenance burden regardless of
what we do with lustre-release/master.

> Sure, only one of those trees is considered "community Lustre", but if
> it detaches too much from what the majority of developers really run
> and get paid to work on, the "community Lustre" contributions would
> probably diminish greatly, I am afraid.

As long as the community Lustre development process is sane, I think
most organizations will opt to continue deriving their releases from
it and to continue contributing their changes back upstream. We just need
to make sure we get buy-in from the people contributing to Lustre.

> The past situation of "oh, this new enterprise linux comes with a
> community lustre version, so the first step to get something usable is
> to rip it out entirely and then apply the new good version" is not
> exactly desirable either I am afraid.
>
>
> And solving this problem is mostly outside of hands of individual
> developers no matter how cool I think it would be to actually have an
> up to date Lustre in the mainline linux kernel.
>
> > That means we need the model for supporting older kernels to be completely
> > based on libcfs holding compatibility code with no kernel-version
> > #ifdefs in the code.
> >
> > We need a strong separation between server and client so that we can
> > justify everything that goes upstream as being to support the client,
> > and when we add server support to that, it just adds files. Possibly we
> > could patch a few files to add server support, but we need to maintain
> > those as patches, not as alternate versions of upstream files.
> >
> > We need to quickly reach a point where a lustre release is:
> >
> > - a verbatim copy of relevant files from a chosen upstream release,
> >   or just a dependency on that kernel source.
> > - a bunch of extra files that might one day go upstream: server code
> >   and LNet protocol code
> > - a *few* patches to integrate that code
> > - some number of patches which have since gone upstream - bugfixes etc.
> > - libcfs which contains a compat layer for older kernels.
> > - user-space code, documentation, test scripts, etc for which there
> >   is no expectation of upstreaming to linux kernel.
>
>
> All these sound like an awful lot of dedicated developer-hours.
>
>
> > Maybe the question for LSF is : what is a sufficient demonstration of
> > commitment?
> >
> > The big question for us is : how are we going to transition our
> > infrastructure to this model?
>
>
> and who would pay for it.
>
>
> This in the end was the downfall of the previous attempt. There never
> was any serious funding behind the effort so it became an afterthought
> for most.
>
>
> > It would be nice to have a timeline for getting the second and third
> > bullet points down to zero. Obviously it would be aspirational at
> > best,
> > but a list of steps could be useful.
> >
> > Thanks,
> > NeilBrown
> >
>
>
> Bye,
> Oleg
>



* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-01-18 17:51   ` Day, Timothy
@ 2025-01-18 22:21     ` NeilBrown
  2025-01-20  3:57       ` Day, Timothy
  0 siblings, 1 reply; 61+ messages in thread
From: NeilBrown @ 2025-01-18 22:21 UTC (permalink / raw)
  To: Day, Timothy; +Cc: lustre-devel@lists.lustre.org

On Sun, 19 Jan 2025, Day, Timothy wrote:
> 
> On the other hand, I wonder if we should upstream the whole thing all at once. Besides
> the code being a bit nicer, the client isn't really that much closer to being upstream
> than the server is. And no one else can test the client without having a Lustre
> server on-hand. So no-one can easily run xfstests or similar. And doing everything
> all at once would preempt questions of client/server split or the server upstreaming
> timeline. But upstreaming so much all at once is probably more unrealistic.

The main difference I see between server and client in upstreaming terms
is the storage backend.  It would need to use un-patched ext4 - ideally
using VFS interfaces though we might be able to negotiate with the ext4
team to get some exports.  I don't know much about the delta between
ldiskfs and ext4 and understand it is much smaller than it once was, but
it would need to be zero.  I'm working towards getting the pdirop patch
upstreamable.  Andreas would know what else is needed better than I.

The other difference is that a lot of the "revise code to match upstream
style" work has focused on client and ignored server-only code.

It might be sensible to set the goal as "client and server" including
only the ext4 backend and possibly only the socklnd network interface.
It will be a big code drop either way.  People aren't going to go over
every line with a fine-tooth-comb.  They will mostly look at whichever
bit particularly interests them, and look at the process and community
behind the code.

Being able to build a pure upstream kernel, add a user-space tools
package, and test would certainly be a plus.  That would be something
worth canvassing at LSF - is there any value in landing the client
without the server?

NeilBrown

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
       [not found]   ` <E4481869-E21A-4941-9A97-8C59B7104528@ddn.com>
@ 2025-01-18 22:25     ` NeilBrown
  2025-01-20  4:54     ` Day, Timothy
  1 sibling, 0 replies; 61+ messages in thread
From: NeilBrown @ 2025-01-18 22:25 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: lustre-devel@lists.lustre.org

On Sat, 18 Jan 2025, Andreas Dilger wrote:
> 
> This will definitely need some reorganization of files and directories
> in the Lustre source tree to align with the Linux kernel (e.g. moving
> everything under fs/lustre and net/lnet).
> 
> That would probably be a question to get answered, whether LNet is
> "too Lustre specific" to be in net/ and should live in the Lustre tree?

I think the existence of 
   net/sunrpc   used only by NFS
   net/rxrpc    used only by AFS
   net/ceph     used only by cephfs and rbd
strongly suggests that net/lnet is the right place for lnet code.
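
If it did land there, the wiring would presumably follow the same pattern
as those subsystems. A sketch, assuming the config symbol keeps its
current LNET name (an assumption, not an actual proposal):

    # net/lnet/Kconfig
    config LNET
            tristate "Lustre networking subsystem (LNet)"
            depends on INET
            help
              RPC transport layer used by the Lustre filesystem client.

    # added to net/Makefile
    obj-$(CONFIG_LNET) += lnet/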

NeilBrown

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-01-18  3:16   ` Oleg Drokin
  2025-01-18 21:46     ` Day, Timothy
@ 2025-01-18 22:48     ` NeilBrown
  2025-01-19  6:37       ` Alexey Lyahkov
  2025-01-19 21:20       ` Oleg Drokin
  1 sibling, 2 replies; 61+ messages in thread
From: NeilBrown @ 2025-01-18 22:48 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: lustre-devel@lists.lustre.org

On Sat, 18 Jan 2025, Oleg Drokin wrote:
> On Sat, 2025-01-18 at 11:45 +1100, NeilBrown wrote:
> > We need to demonstrate a process for, and commitment to, moving away
> > from the dual-tree model.  We need patches to those parts of Lustre
> > that are upstream to land in upstream first (mostly).
> 
> I think this is not very realistic.
> A large chunk (100%?) of users not only don't run the latest kernel
> release, they don't run the latest LTS either.

Are you referring to lustre users or all Linux users?
If the latter, then xfs etc face the same problem and seem to manage.
If lustre users: they can't because the latest kernel doesn't include
lustre.  Maybe you are seeing a chicken-and-egg problem?


> 
> When we were last in staging, this manifested as random patches being
> landed that broke the client completely, with nobody noticing for
> months.

The staging exercise was a mess in various ways and suffered a lot of
problems that we wouldn't expect if we did the upstreaming properly.
If net/lnet and fs/lustre were managed by lustre developers rather than
by GregKH, then the "random patches" would be avoided and we could, as
you say below and as all other fs teams do, run tests before committing
to patches.

> 
> Of course some automatic infrastructure could be built up to make it
> somewhat better, but it does not remove the problem of "nobody would
> run this mainline tree", I am afraid.

We've never had a credible lustre in a mainline tree, so we cannot know
how many people would use it.  Importantly, developers would use it
because that is where development would happen.

> 
> It does not help that there are, what, 3? 4? trees, not "dual-tree" by any
> stretch of the imagination.
> 
> There's DDN/whamcloud (that's really two trees), there's HPE, LLNL
> keeps their fork still I think (though it's mostly backports?). There
> are likely others I am less exposed to.

"dual-tree" maybe isn't the best way of describing what was wrong with
the previous approach.  "upstream-first" is one way of describing how it
should be run, though that needs to be understood correctly.

Patches should always flow upstream first, then flow downstream into
distro.  So I write a patch in my own devel tree.  I post it or submit a
pull request and eventually it is accepted into the maintainers
"testing" tree (upsream from me).  There it gets more testing and moves
to the maintainers "next" tree from which it is pulled into linux-next
for integration testing.  Then it goes upstream to Linus (possibly
through an intermediary).  From Linus it goes to -stable and to various
distros etc.  Individual patches are selected for further backporting to
all sorts of different LTS tree.

Occasionally there are short-cuts.  I might submit a patch from my tree
to a SUSE kernel before it is accepted upstream, or maybe even before it
is sent if it is urgent.  But these are not the norm.

But you know all this I expect.  It isn't about the total number of
trees. It is about the flow of patches which must all flow through Linus.
And developers must develop against current linus, or something very
close to that.  Developing against an older kernel is simply making more
work for yourself.

> 
> Sure, only one of those trees is considered "community Lustre", but if
> it will detach too much from what majority of developers really runs
> and gets paid to do - the "community Lustre" contributions probably
> would diminish greatly, I am afraid.
> 
> The past situation of "oh, this new enterprise linux comes with a
> community lustre version, so the first step to get something usable is
> to rip it entirely off and then apply the new good version" is not
> exactly desirable either I am afraid.

Obviously that is not what we want, and clearly people aren't tempted to
do that with any other fs, so why do you think it will happen with lustre?
The "new good version" will simply be a few patches on top of whatever
kernel you have.  Hopefully the distributor of that kernel will have
applied those already if any of their customers care about the filesystem.

> > 
> > We need to quickly reach a point where a lustre release is:
> > 
> >  - a verbatim copy of relevant files from a chosen upstream release,
> >    or just a dependency on that kernel source.
> >  - a bunch of extra files that might one day go upstream: server code
> >    and LNet protocol code
> >  - a *few* patches to integrate that code
> >  - some number of patches which have since gone upstream - bugfixes
> > etc.
> >  - libcfs which contains a compat layer for older kernels.
> >  - user-space code, documentation, test scripts, etc for which there
> >    is no expectation of upstreaming to linux kernel.
> 
> All these sound like an awful lot of dedicated developer-hours.
> 
> > Maybe the question for LSF is : what is a sufficient demonstration of
> > commitment?
> > 
> > The big question for us is : how are we going to transition our
> > infrastructure to this model?
> 
> and who would pay for it.

Obviously there will be a cost to transition.  It seems someone is
already willing to pay some of that because patches have been landing
which are only there to make the ultimate transition easier.  Why do you
think that will stop?

Once the transition completes there will still be process difficulties,
but there are plenty of process difficulties now (gerrit: how do I
hate thee, let me count the ways...) but people seem to simply include
that in the cost of doing business.

> 
> This in the end was the downfall of the previous attempt. There never
> was any serious funding behind the effort so it became an afterthought
> for most.

I don't think funding is the big problem.  I think it is "buy-in".
Individual people in positions of power - such as yourself - need to see
the value and be willing to change the way they work.  If you,
personally, are not willing to change then there is no point even
talking about this any more.

NeilBrown

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-01-18 22:48     ` NeilBrown
@ 2025-01-19  6:37       ` Alexey Lyahkov
  2025-01-19  8:03         ` NeilBrown
  2025-01-19 21:20       ` Oleg Drokin
  1 sibling, 1 reply; 61+ messages in thread
From: Alexey Lyahkov @ 2025-01-19  6:37 UTC (permalink / raw)
  To: NeilBrown; +Cc: lustre-devel@lists.lustre.org


Neil, 
> 
> 
>> 
>> It does not help that there are, what, 3? 4? trees, not "dual-tree" by any
>> stretch of the imagination.
>> 
>> There's DDN/whamcloud (that's really two trees), there's HPE, LLNL
>> keeps their fork still I think (though it's mostly backports?). There
>> are likely others I am less exposed to.
> 
> "dual-tree" maybe isn't the best way of describing what was wrong with
> the previous approach.  "upstream-first" is one way of describing how it
> should be run, though that needs to be understood correctly.
> 
> Patches should always flow upstream first, then flow downstream into
> distro.  So I write a patch in my own devel tree.  I post it or submit a
> pull request and eventually it is accepted into the maintainers
> "testing" tree (upsream from me).  There it gets more testing and moves
> to the maintainers "next" tree from which it is pulled into linux-next
> for integration testing.  Then it goes upstream to Linus (possibly
> through an intermediary).  From Linus it goes to -stable and to various
> distros etc.  Individual patches are selected for further backporting to
> all sorts of different LTS tree.
> 
> Occasionally there are short-cuts.  I might submit a patch from my tree
> to a SUSE kernel before it is accepted upstream, or maybe even before it
> is sent if it is urgent.  But these are not the norm.
> 
> But you know all this I expect.  It isn't about the total number of
> trees. It is about the flow of patches which must all flow through Linus.
> And developers must develop against current linus, or something very
> close to that.  Developing against an older kernel is simply making more
> work for yourself.

This won't work. Let me explain the situation we had in the past.
In a previous iteration, Ubuntu built kernels with Lustre support enabled,
but Ubuntu didn't have the resources to keep its own kernel fixed up with
Lustre backports. Those clients were installed, and some users expected them
to just work. But that wasn't true, and building a new Lustre module produced
name conflicts between the in-kernel and the out-of-tree Lustre.
On the other side, it fragmented the Lustre platform: the Lustre version in
the Ubuntu kernel went stale and fell outside the general Lustre
compatibility rule, which is supported for just one version up and down. It
caused so many problems for the support team.

And Ubuntu is just one distro. RedHat, SuSe, LTS kernels… all of them are
used in HPC, all of them have their own release cycles, and so many kernel
versions each ended up holding some Lustre version in use.


Alex

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-01-19  6:37       ` Alexey Lyahkov
@ 2025-01-19  8:03         ` NeilBrown
  2025-01-19 16:12           ` Alexey Lyahkov
  0 siblings, 1 reply; 61+ messages in thread
From: NeilBrown @ 2025-01-19  8:03 UTC (permalink / raw)
  To: Alexey Lyahkov; +Cc: lustre-devel@lists.lustre.org

On Sun, 19 Jan 2025, Alexey Lyahkov wrote:
> Neil, 
> > 
> > 
> >> 
> >> It does not help that there are, what, 3? 4? trees, not "dual-tree" by any
> >> stretch of the imagination.
> >> 
> >> There's DDN/whamcloud (that's really two trees), there's HPE, LLNL
> >> keeps their fork still I think (though it's mostly backports?). There
> >> are likely others I am less exposed to.
> > 
> > "dual-tree" maybe isn't the best way of describing what was wrong with
> > the previous approach.  "upstream-first" is one way of describing how it
> > should be run, though that needs to be understood correctly.
> > 
> > Patches should always flow upstream first, then flow downstream into
> > distro.  So I write a patch in my own devel tree.  I post it or submit a
> > pull request and eventually it is accepted into the maintainers
> > "testing" tree (upsream from me).  There it gets more testing and moves
> > to the maintainers "next" tree from which it is pulled into linux-next
> > for integration testing.  Then it goes upstream to Linus (possibly
> > through an intermediary).  From Linus it goes to -stable and to various
> > distros etc.  Individual patches are selected for further backporting to
> > all sorts of different LTS tree.
> > 
> > Occasionally there are short-cuts.  I might submit a patch from my tree
> > to a SUSE kernel before it is accepted upstream, or maybe even before it
> > is sent if it is urgent.  But these are not the norm.
> > 
> > But you know all this I expect.  It isn't about the total number of
> > trees. It is about the flow of patches which must all flow through Linus.
> > And developers must develop against current linus, or something very
> > close to that.  Developing against an older kernel is simply making more
> > work for yourself.
> 
> This won't work. Let me explain the situation we had in the past.

No.  I'm not at all interested in explanations of why it won't work.

I'm only interested in suggestions of how to make it work, and offers of
help.

And yes - describing important use-cases which need to work and might be
difficult would be a helpful thing to do.

NeilBrown


> In a previous iteration, Ubuntu built kernels with Lustre support enabled,
> but Ubuntu didn't have the resources to keep its own kernel fixed up with
> Lustre backports. Those clients were installed, and some users expected them
> to just work. But that wasn't true, and building a new Lustre module produced
> name conflicts between the in-kernel and the out-of-tree Lustre.
> On the other side, it fragmented the Lustre platform: the Lustre version in
> the Ubuntu kernel went stale and fell outside the general Lustre
> compatibility rule, which is supported for just one version up and down. It
> caused so many problems for the support team.
> 
> And Ubuntu is just one distro. RedHat, SuSe, LTS kernels… all of them are
> used in HPC, all of them have their own release cycles, and so many kernel
> versions each ended up holding some Lustre version in use.
> 
> 
> Alex


* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-01-19  8:03         ` NeilBrown
@ 2025-01-19 16:12           ` Alexey Lyahkov
  2025-01-22 20:54             ` NeilBrown
  0 siblings, 1 reply; 61+ messages in thread
From: Alexey Lyahkov @ 2025-01-19 16:12 UTC (permalink / raw)
  To: NeilBrown; +Cc: lustre-devel@lists.lustre.org



> On 19 Jan 2025, at 11:03, NeilBrown <neilb@suse.de> wrote:
> 
> On Sun, 19 Jan 2025, Alexey Lyahkov wrote:
>> Neil,
>>> 
>>> 
>>>> 
>>>> It does not help that there are, what, 3? 4? trees, not "dual-tree" by any
>>>> stretch of the imagination.
>>>> 
>>>> There's DDN/whamcloud (that's really two trees), there's HPE, LLNL
>>>> keeps their fork still I think (though it's mostly backports?). There
>>>> are likely others I am less exposed to.
>>> 
>>> "dual-tree" maybe isn't the best way of describing what was wrong with
>>> the previous approach.  "upstream-first" is one way of describing how it
>>> should be run, though that needs to be understood correctly.
>>> 
>>> Patches should always flow upstream first, then flow downstream into
>>> distro.  So I write a patch in my own devel tree.  I post it or submit a
>>> pull request and eventually it is accepted into the maintainers
>>> "testing" tree (upsream from me).  There it gets more testing and moves
>>> to the maintainers "next" tree from which it is pulled into linux-next
>>> for integration testing.  Then it goes upstream to Linus (possibly
>>> through an intermediary).  From Linus it goes to -stable and to various
>>> distros etc.  Individual patches are selected for further backporting to
>>> all sorts of different LTS tree.
>>> 
>>> Occasionally there are short-cuts.  I might submit a patch from my tree
>>> to a SUSE kernel before it is accepted upstream, or maybe even before it
>>> is sent if it is urgent.  But these are not the norm.
>>> 
>>> But you know all this I expect.  It isn't about the total number of
>>> trees. It is about the flow of patches which must all flow through Linus.
>>> And developers must develop against current linus, or something very
>>> close to that.  Developing against an older kernel is simply making more
>>> work for yourself.
>> 
>> This won't work. Let me explain the situation we had in the past.
> 
> No.  I'm not at all interested in explanations of why it won't work.
> 
> I'm only interested in suggestions of how to make it work, and offers of
> help.
> 
If you have a good way to solve this situation, which has been a crazy mess in the past, please share it:
how do we avoid Lustre source code fragmentation caused by the code being frozen at different stages in different distributions?
Ubuntu might ship a modern Lustre with a 6.5 kernel, while RedHat froze a Lustre version from three releases back, and those clients are not compatible with each other.
Nor are they compatible with the installed server.
So the question is: who will do that support? Do you have any ideas on how to solve this problem?


> And yes - describing important use-cases which need to work and might be
> difficult would be a helpful thing to do.
> 
> NeilBrown
> 


> 
>> In a previous iteration, Ubuntu built kernels with Lustre support enabled,
>> but Ubuntu didn't have the resources to keep its own kernel fixed up with
>> Lustre backports. Those clients were installed, and some users expected them
>> to just work. But that wasn't true, and building a new Lustre module produced
>> name conflicts between the in-kernel and the out-of-tree Lustre.
>> On the other side, it fragmented the Lustre platform: the Lustre version in
>> the Ubuntu kernel went stale and fell outside the general Lustre
>> compatibility rule, which is supported for just one version up and down. It
>> caused so many problems for the support team.
>> 
>> And Ubuntu is just one distro. RedHat, SuSe, LTS kernels… all of them are
>> used in HPC, all of them have their own release cycles, and so many kernel
>> versions each ended up holding some Lustre version in use.
>> 
>> 
>> Alex
> 


* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-01-18 21:46     ` Day, Timothy
@ 2025-01-19 20:46       ` Oleg Drokin
  2025-01-20  4:38         ` Day, Timothy
  0 siblings, 1 reply; 61+ messages in thread
From: Oleg Drokin @ 2025-01-19 20:46 UTC (permalink / raw)
  To: timday@amazon.com, neilb@suse.de; +Cc: lustre-devel@lists.lustre.org

On Sat, 2025-01-18 at 21:46 +0000, Day, Timothy wrote:
> 
> 
> > On 1/17/25, 10:17 PM, "Oleg Drokin"
> > <green@whamcloud.com <mailto:green@whamcloud.com>> wrote:
> > > On Sat, 2025-01-18 at 11:45 +1100, NeilBrown wrote:
> > > We need to demonstrate a process for, and commitment to, moving
> > > away
> > > from the dual-tree model. We need patches to those parts of
> > > Lustre
> > > that are upstream to land in upstream first (mostly).
> > 
> > 
> > I think this is not very realistic.
> > A large chunk (100%?) of users not only don't run the latest kernel
> > release, they don't run the latest LTS either.
> > 
> > 
> > When we were in staging last this manifested in random patches
> > being
> > landed and breaking the client completely and nobody noticing for
> > months.
> > 
> > 
> > Of course some automatic infrastructure could be built up to make
> > it
> > somewhat better, but it does not remove the problem of "nobody
> > would
> > run this mainline tree", I am afraid.
> 
> I think there's a decent chunk of users on newer kernels. Ubuntu
> 22/24 is
> on (a bit past latest) LTS 6.8 kernel [1], AL2023 is on previous LTS
> 6.1 [2], and
> working on upcoming LTS 6.12 [3].

Well, I mostly mean in the context of Lustre client use, and sure, there's
some 6.8 LTS in use on those Ubuntu clients, though I cannot assess the
real numbers; the majority of reports I see are still on 5.x, even on
Ubuntu.

> When a patch lands in lustre-release/master, it could be around 1 -
> 1.5 years
> before it lands in a proper Lustre release. At that point, it might
> see real
> production usage.

Well, not really.
I guess it might not be seen as easily from the outside, but "lustre-
release/master" patches are backports from "true production" branches.
The number approaches 100% for features, but even a sizable number of
fixes are backports.
In particular, anything that comes from HPE is a backport: they run
their production stuff, sometimes hit problems, create fixes, and
eventually determine that the problem is present in master as well (or
sometimes the b2_x branches) and submit their ports there.

The actual lag between features being developed and then getting into
the master branch could be rather long too.

> So I think it's mostly a matter of convincing people to use an
> upstream
> client. I don't think people wanted to use the staging client because
> it
> didn't work well and wasn't stable. And vendors don't want to work on
> > something that no one uses. If the client is "good enough" and people
> are confident it'll continue to be updated, I think they will use it.
> The
> staging client was neither of those things.

I agree once you convince people (both users and developers) to use the
upstream client things will move in this desirable direction, but right
now I don't know how to convince them.
on RHEL (and derivatives) front the time lag is huge in particular.

> > It does not help that there are what, 3? 4? trees, not "dual-tree" by
> > any
> > stretch of imagination.
> > 
> > 
> > There's DDN/whamcloud (that's really two trees), there's HPE, LLNL
> > keeps their fork still I think (though it's mostly backports?).
> > There
> > are likely others I am less exposed to.
> 
> I think most non-community Lustre releases are derived from the
> community release and periodically rebased. I think AWS,
> Whamcloud, LLNL, Microsoft would fall into that bucket. And I
> doubt DDN and HPE significantly diverge from community Lustre. But
> if someone is diverging significantly from community Lustre, I think
> they are opting into a significant maintenance burden regardless of
> what we do with lustre-release/master.

Both DDN and HPE significantly diverge with new features and such.
There's also a (now mostly dormant) Fujitsu "FEFS" fork that they got
tired of maintaining and tried to fold back in, but could not. (also
Cray's secure data appliance that seems to have met a similar fate:
https://github.com/Cray/lustre-sda )

Yes, maintenance burden consideration is always there of course, so
there's some coordination nowadays (like reserving feature flags ahead
of time and such), but it's not outside the realm of possibility that if
what's perceived as "tip of the community tree" becomes inconvenient,
it'll be dropped.
In fact a similar thing happened to the staging lustre in the past I
guess, only before it even became the perceived tip (for a variety of
reasons).


> > Sure, only one of those trees is considered "community Lustre", but
> > if
> > it will detach too much from what majority of developers really
> > runs
> > and gets paid to do - the "community Lustre" contributions probably
> > would diminish greatly, I am afraid.
> 
> As long as the community Lustre development process is sane, I think
> most organizations will opt to continue deriving their releases from
> it and opt to continue contributing releases upstream. We just need
> to make sure we get buy-in from the people contributing to Lustre.

Well, there's another half of it, the kernel side. Previous run-ins with
other kernel maintainers had left a bit of a sour taste in people's
mouths.
Of course they have their own reasons to dictate whatever they want to
newcomers (and all incoming patches), but on the other hand Lustre is a
mature product that could not just drop everything and rewrite
significant chunks of the code (several times at that) to better align
with the ever-changing demands (bcachefs, I think, was a highly
paraded-around example of that, and they could accommodate those often
conflicting demands because there were not many deployments in the wild).
I don't know how possible it is to overcome. Kernel maintainers don't
really care about Lustre (and rightfully so, we are but a blip to them)
and then we also have our own priorities.

And while for Lustre developers there's a benefit of "the adjusting to
new interfaces comes for free", there's no benefit to the kernel
maintainers, so they don't have much incentive.
(and again we saw this in the previous attempt)

And even imagine by some magic the actual inclusion and all the
relevant rework happened. Now HPE or DDN wants to add a new feature,
they implement it, then submit it, and are met with the usual "now rework
it in these other ways" demands.
Of course, again, from the kernel maintainers' perspective this is
entirely reasonable and it's not their problem that the development process
is wrong and backwards, and that instead of developing everything in the open
on the public branch with input from all parties interested there's
this closed development going on. But good luck convincing the respective
management of those companies to agree.


_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-01-18 22:48     ` NeilBrown
  2025-01-19  6:37       ` Alexey Lyahkov
@ 2025-01-19 21:20       ` Oleg Drokin
  2025-01-24 23:12         ` NeilBrown
  1 sibling, 1 reply; 61+ messages in thread
From: Oleg Drokin @ 2025-01-19 21:20 UTC (permalink / raw)
  To: neilb@suse.de; +Cc: lustre-devel@lists.lustre.org

On Sun, 2025-01-19 at 09:48 +1100, NeilBrown wrote:
> On Sat, 18 Jan 2025, Oleg Drokin wrote:
> > On Sat, 2025-01-18 at 11:45 +1100, NeilBrown wrote:
> > > We need to demonstrate a process for, and commitment to, moving
> > > away
> > > from the dual-tree model.  We need patches to those parts of
> > > Lustre
> > > that are upstream to land in upstream first (mostly).
> > 
> > I think this is not very realistic.
> > A large chunk (100%?) of users not only don't run the latest kernel
> > release, they don't run the latest LTS either.
> Are you referring to lustre users or all Linux users?

Lustre users (a very minuscule part of Linux users run Lustre anyway)

> If the latter, then xfs etc face the same problem and seem to manage.
> If lustre users: they can't because the latest kernel doesn't include
> lustre.  Maybe you are seeing a chicken-and-egg problem?

They are related. A regular person can run xfs at home (it being a
single-node local filesystem and all; I know cluster xfs exists, but I am
not sure Linux actually supports it), and then when Fedora/Red Hat decided
xfs is the default install fs in some cases, the adoption
understandably shot up too.
Now once we get into networked filesystems, even commonplace things
like NFS are barely run by anybody. Lustre? It would still remain in large
datacenters that are not exactly known for running on the bleeding edge
(RHEL 7 is still going strong, apparently).

> 
> > Of course some automatic infrastructure could be built up to make
> > it
> > somewhat better, but it does not remove the problem of "nobody
> > would
> > run this mainline tree", I am afraid.
> 
> We've never had a credible lustre in a mainline tree, so we cannot
> know
> how many people would use it.  Importantly developers would use it
> because that is where development would happen.

Well, I agree if we could have actual development happen there it would
change everything. But the problem here is this decision is outside the
hands of developers as I just wrote to Tim in the other email.
Too many management types are anti-open development, and Lustre is kinda
niche, so only a handful of companies control the actual
developers.

> > It does not help that there are what, 3? 4? trees, not "dual-tree" by
> > any
> > stretch of imagination.
> > 
> > There's DDN/whamcloud (that's really two trees), there's HPE, LLNL
> > keeps their fork still I think (though it's mostly backports?).
> > There
> > are likely others I am less exposed to.
> 
> "dual-tree" maybe isn't the best way of describing what was wrong
> with
> the previous approach.  "upstream-first" is one way of describing how
> it
> should be run, though that needs to be understood correctly.

Yes. I agree. And this is exactly what kernel maintainers demand (or
would if they don't yet). But in the land of "we must have this
differentiating feature in order to sell our product over the
competitor's offering" it does not fly.
In the past, when a lot of the market was controlled by various
government labs that mandated open source, it was easier. But with the
foray into various commercial deployments, and especially with huge demand
from AI installations that don't care about anything besides "I want
things to work the best right now", this factor has mostly evaporated and
we are descending into "closed hell" with increasing speed, I am afraid.

> > Sure, only one of those trees is considered "community Lustre", but
> > if
> > it will detach too much from what majority of developers really
> > runs
> > and gets paid to do - the "community Lustre" contributions probably
> > would diminish greatly, I am afraid.
> > 
> > The past situation of "oh, this new enterprise linux comes with a
> > community lustre version, so the first step to get something usable
> > is
> > to rip it entirely off and then apply the new good version" is not
> > exactly desirable either I am afraid.
> 
> Obviously that is not what we want, and clearly people aren't tempted
> to
> do that with any of FS so why do you think it will happen with
> lustre?

It already happened (several times, in different ways).
I think I have a faint memory of other kernel components having a
similar problem.

I guess MOFED is the most current example.
"rip out in kernel ib stuff, replace with our greatest shiny"

> The "new good version" will simply be a few patches on top of
> whatever
> kernel you have.  Hopefully the distributor of that kernel will have
> applied those already if any of their customers care about the
> filesystem.

That is the ideal, anyway, but seems somewhat hard to reach.

> > > We need to quickly reach a point where a lustre release is:
> > > 
> > >  - a verbatim copy of relevant files from a chosen upstream
> > > release,
> > >    or just a dependency on that kernel source.
> > >  - a bunch of extra files that might one day go upstream: server
> > > code
> > >    and LNet protocol code
> > >  - a *few* patches to integrate that code
> > >  - some number of patches which have since gone upstream -
> > > bugfixes
> > > etc.
> > >  - libcfs which contains a compat layer for older kernels.
> > >  - user-space code, documentation, test scripts, etc for which
> > > there
> > >    is no expectation of upstreaming to linux kernel.
> > 
> > All these sound like an awful lot of dedicated developer-hours.
> > 
> > > Maybe the question for LSF is : what is a sufficient
> > > demonstration of
> > > commitment?
> > > 
> > > The big question for us is : how are we going to transition our
> > > infrastructure to this model?
> > 
> > and who would pay for it.
> Obviously there will be a cost to transition.  It seems someone is
> already willing to pay some of that because patches have been landing
which are only there to make the ultimate transition easier.  Why do you
think that will stop?

it won't, but at the current rate I am not even sure conversion is
happening faster than breakage ;)

> 
> Once the transition completes there will still be process
> difficulties,
but there are plenty of process difficulties now (gerrit: how do I
hate thee, let me count the ways...) and people seem to simply include
that in the cost of doing business.

It's been a while since I did patch reviews by email, but I think
gerrit is much more user-friendly (if you have internet, anyway)

> > This in the end was the downfall of the previous attempt. There
> > never
> > was any serious funding behind the effort so it became an
> > afterthought
> > for most.
> 
> I don't think funding is the big problem.  I think it is "buy-in".
> Individual people in positions of power - such as yourself - need to
> see
> the value and be willing to change the way they work.  If you,
> personally, are not willing to change then there is no point even
> talking about this any more.

While I see value, people in position of actual power (e.g. those that
pay my salary and get to dictate priorities) don't agree this is a good
idea to change the development process to the fully open model.

_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-01-18 22:21     ` NeilBrown
@ 2025-01-20  3:57       ` Day, Timothy
  2025-01-21 17:02         ` Patrick Farrell
  0 siblings, 1 reply; 61+ messages in thread
From: Day, Timothy @ 2025-01-20  3:57 UTC (permalink / raw)
  To: NeilBrown; +Cc: lustre-devel@lists.lustre.org



> On 1/18/25, 5:21 PM, "NeilBrown" <neilb@suse.de <mailto:neilb@suse.de>> wrote:
> > On Sun, 19 Jan 2025, Day, Timothy wrote:
> >
> > On the other hand, I wonder if we upstream the whole thing all at once. Besides
> > the code being a bit nicer, the client isn't really that much closer to being upstream
> > than the server is. And no one else can test the client without having a Lustre
> > server on-hand. So no-one can easily run xfstests or similar. And doing everything
> > all at once would preempt questions of client/server split or the server upstreaming
> > timeline. But upstreaming so much all at once is probably more unrealistic.
>
>
> The main difference I see between server and client in upstreaming terms
> is the storage backend. It would need to use un-patched ext4 - ideally
> using VFS interfaces though we might be able to negotiate with the ext4
> team to get some exports. I don't know much about the delta between
> ldiskfs and ext4 and understand it is much smaller than it once was, but
> it would need to be zero. I'm working towards getting the pdirop patch
> upstreamable. Andreas would know what else is needed better than I.

I've been working on a third storage backend [1]. It'll likely be done
well before we submit anything upstream. It's just a memory-only
target. That might be justification enough to keep the OSD APIs.

[1] https://review.whamcloud.com/c/fs/lustre-release/+/55594

> The other difference is that a lot of the "revise code to match upstream
> style" work has focused on client and ignored server-only code.
>
>
> It might be sensible to set the goal as "client and server" including
> only the ext4 backend and possibly only the socklnd network interface.
> It will be a big code drop either way. People aren't going to go over
> every line with a fine-tooth-comb. They will mostly look at whichever
> bit particularly interests them, and look at the process and community
> behind the code.
>
>
> Being able to build a pure upstream kernel, add a user-space tools
> package, and test would certainly be a plus. That would be something
> worth canvassing at LSF - is there any value in landing the client
> without the server?

Yeah, I'm leaning towards setting the goal as both client/server and
gathering opinions from LSF. The client and server are still pretty
intertwined. I think having the client go upstream and then basing
the server on top an in-tree client would make server development
noticeably more difficult. Thinking on it more - I don't think
upstreaming the server is more ambitious than the client. We
have more of a process problem than a code problem. And I don't
think the server is in particularly bad shape.

>
> NeilBrown
>

Tim Day

_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-01-19 20:46       ` Oleg Drokin
@ 2025-01-20  4:38         ` Day, Timothy
  2025-01-20  5:37           ` Oleg Drokin
  2025-01-23  9:00           ` Alexey Lyahkov
  0 siblings, 2 replies; 61+ messages in thread
From: Day, Timothy @ 2025-01-20  4:38 UTC (permalink / raw)
  To: Oleg Drokin, neilb@suse.de; +Cc: lustre-devel@lists.lustre.org



> On 1/19/25, 3:46 PM, "Oleg Drokin" <green@whamcloud.com <mailto:green@whamcloud.com>> wrote:
> > On Sat, 2025-01-18 at 21:46 +0000, Day, Timothy wrote:
> >
> >
> > > On 1/17/25, 10:17 PM, "Oleg Drokin"
> > > <green@whamcloud.com <mailto:green@whamcloud.com> <mailto:green@whamcloud.com <mailto:green@whamcloud.com>>> wrote:
> > > > On Sat, 2025-01-18 at 11:45 +1100, NeilBrown wrote:
> > > > We need to demonstrate a process for, and commitment to, moving
> > > > away
> > > > from the dual-tree model. We need patches to those parts of
> > > > Lustre
> > > > that are upstream to land in upstream first (mostly).
> > >
> > >
> > > I think this is not very realistic.
> > > A large chunk (100%?) of users not only don't run the latest kernel
> > > release, they don't run the latest LTS either.
> > >
> > >
> > > When we were in staging last this manifested in random patches
> > > being
> > > landed and breaking the client completely and nobody noticing for
> > > months.
> > >
> > >
> > > Of course some automatic infrastructure could be built up to make
> > > it
> > > somewhat better, but it does not remove the problem of "nobody
> > > would
> > > run this mainline tree", I am afraid.
> >
> > I think there's a decent chunk of users on newer kernels. Ubuntu
> > 22/24 is
> > on (a bit past latest) LTS 6.8 kernel [1], AL2023 is on previous LTS
> > 6.1 [2], and
> > working on upcoming LTS 6.12 [3].
>
>
> Well, I mostly mean in the context of Lustre client use, and sure, there's
> some 6.8 LTS in use on those Ubuntu clients, though I cannot assess the
> real numbers; the majority of reports I see are still on 5.x, even on
> Ubuntu.

Yeah, I'm not sure of the real numbers. It's just my personal experience
that newer kernels are getting a lot of traction.

> > When a patch lands in lustre-release/master, it could be around 1 -
> > 1.5 years
> > before it lands in a proper Lustre release. At that point, it might
> > see real
> > production usage.
>
>
> Well, not really.
> I guess it might not be seen as easily from the outside, but "lustre-
> release/master" patches are backports from "true production" branches.
> the number approaches 100% for features, but even a sizable number of
> fixes are backports.
> In particular anything that comes from HPE are backports, they run
> their production stuff, sometimes hit problems, create fixes, and the
> eventually determine that the problem is present in master as well (or
> sometimes b2_x branches) and submit their ports there.
>
>
> The actual lag between features being developed and then getting into
> the master branch could be rather long too.

I think every organization that uses Lustre has a model similar to
this. But I don't think this is uncommon for other subsystems. The
various OFED flavors come to mind (I think MOFED was mentioned
in another thread). Everything is ultimately rebased on the
community version, AFAIK.

> > So I think it's mostly a matter of convincing people to use an
> > upstream
> > client. I don't think people wanted to use the staging client because
> > it
> > didn't work well and wasn't stable. And vendors don't want to work on
> > something that no one uses. If the client is "good enough" and people
> > are confident it'll continue to be updated, I think they will use it.
> > The
> > staging client was neither of those things.
>
>
> I agree once you convince people (both users and developers) to use the
> upstream client things will move in this desirable direction, but right
> now I don't know how to convince them.
> on RHEL (and derivatives) front the time lag is huge in particular.

Strictly speaking, the proposal from Neil was to derive the client from
the upstream release. For example, say Lustre got merged in Linux 7.4.
To support RHEL 8, we'd copy the Linux 7.4 client and combine it with
some Lustre compatibility code to generate a working client on the
older RHEL kernel. This is exactly what AMDGPU is doing [1], based on
my research.

So in this case, everyone would eventually be running the upstream
client - since the clients for vendor kernels would be derived from it.

[1] https://github.com/geohot/amdgpu-dkms/tree/master
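
To make that concrete, the flow for an older distro kernel could look
roughly like this - a sketch only, with made-up paths, versions, and
package names, and assuming a suitable dkms.conf exists:

    # start from the mainline release that carries the Lustre client
    git clone --depth 1 --branch v7.4 \
        https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
    mkdir -p lustre-dkms
    cp -a linux/fs/lustre linux/net/lnet lustre-dkms/

    # overlay the compat shims that paper over older RHEL kernel APIs
    cp -a lustre_compat lustre-dkms/

    # build and install against the running (older) kernel via DKMS
    dkms add ./lustre-dkms
    dkms build lustre/7.4 -k "$(uname -r)"
    dkms install lustre/7.4 -k "$(uname -r)"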

> > > It does not help that there are what, 3? 4? trees, not "dual-tree" by
> > > any
> > > stretch of imagination.
> > >
> > >
> > > There's DDN/whamcloud (that's really two trees), there's HPE, LLNL
> > > keeps their fork still I think (though it's mostly backports?).
> > > There
> > > are likely others I am less exposed to.
> >
> > I think most non-community Lustre releases are derived from the
> > community release and periodically rebased. I think AWS,
> > Whamcloud, LLNL, Microsoft would fall into that bucket. And I
> > doubt DDN and HPE significantly diverge from community Lustre. But
> > if someone is diverging significantly from community Lustre, I think
> > they are opting into a significant maintenance burden regardless of
> > what we do with lustre-release/master.
>
>
> Both DDN and HPE significantly diverge with new features and such.
> There's also a (now mostly dormant) Fujitsu "FEFS" fork that they got
> tired of maintaining and tried to fold back in, but could not. (also
> Cray's secure data appliance that seems to have met a similar fate:
> https://github.com/Cray/lustre-sda <https://github.com/Cray/lustre-sda> )
>
>
> Yes, maintenance burden consideration is always there of course, so
> there's some coordination nowadays (like reserving feature flags ahead
> of time and such), but it's not outside of realm of possibility that if
> what's perceived as "tip of the community tree" becomes inconvenient,
> it'll be dropped.
> In fact a similar thing happened to the staging lustre in the past I
> guess, only before it even became the perceived tip (for a variety of
> reasons).

Both DDN and HPE regularly contribute fixes/features back to the community
branch from their respective production branches. HPE seems to rebase
their branches fairly often on community Lustre [1]. You would have more
context on whether that's true for DDN - I couldn't find much online.

But Fujitsu and the SDA team in HPE were not contributing back as
much and eventually abandoned their forks. So based on those examples,
it seems most sustainable for organizations to contribute to the community
release. So I think the risk of contributions being lessened because Lustre
moves towards upstream is low, IMHO.

But I agree with your fundamental point - we can't make submitting patches
to community Lustre arduous.

[1] https://github.com/Cray/lustre

> > > Sure, only one of those trees is considered "community Lustre", but
> > > if
> > > it will detach too much from what majority of developers really
> > > runs
> > > and gets paid to do - the "community Lustre" contributions probably
> > > would diminish greatly, I am afraid.
> >
> > As long as the community Lustre development process is sane, I think
> > most organizations will opt to continue deriving their releases from
> > it and opt to continue contributing releases upstream. We just need
> > to make sure we get buy-in from the people contributing to Lustre.
>
>
> Well, there's another half of it, the kernel side. Previous run-ins
> with other kernel maintainers had left a bit of a sour taste in
> people's mouths.
> Of course they have their own reasons to dictate whatever they want to
> newcomers (and all incoming patches), but on the other hand Lustre is a
> mature product that could not just drop everything and rewrite
> significant chunks of the code (several times at that) to better align
> with the ever-changing demands (bcachefs, I think, was a highly
> paraded-around example of that, and they could accommodate those often
> conflicting demands because there were not many deployments in the wild).
> I don't know how possible it is to overcome. Kernel maintainers don't
> and then we also have our own priorities.

LSF/MM could be a good opportunity to improve our
relationship with the upstream maintainers. :)

> And while for Lustre developers there's a benefit of "the adjusting to
> new interfaces comes for free", there's no benefit to the kernel
> maintainers, so they don't have much incentive.
> (and again we saw this in the previous attempt)
>
>
> And even imagine by some magic the actual inclusion and all the
> relevant rework happened. Now HPE or DDN wants to add a new feature,
> they implement it, then submit it, and are met with the usual "now rework
> it in these other ways" demands.
> Of course, again, from the kernel maintainers' perspective this is
> entirely reasonable and it's not their problem that the development process
> is wrong and backwards and instead of developing everything in the open
> on the public branch with input from all parties interested there's
> this closed development going on. But good luck convincing respective
> management of those companies to agree.

Backporting from production branches to the community release
already takes some work. Especially if the feature is based on an
older LTS. So I don't think porting to upstream Linux would be a huge
amount of extra work.

On the other hand, if Lustre was included in mainline properly rather
than in staging - I think we’d have more leverage to implement things
the way we want to. After all, the kernel maintainers don't really care
about Lustre. :)

Tim Day

_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
       [not found]   ` <E4481869-E21A-4941-9A97-8C59B7104528@ddn.com>
  2025-01-18 22:25     ` NeilBrown
@ 2025-01-20  4:54     ` Day, Timothy
  1 sibling, 0 replies; 61+ messages in thread
From: Day, Timothy @ 2025-01-20  4:54 UTC (permalink / raw)
  To: Andreas Dilger, NeilBrown; +Cc: lustre-devel@lists.lustre.org

> This will definitely need some reorganization of files and directories in the
> Lustre source tree to align with the Linux kernel (e.g. moving everything
> under fs/lustre and net/lnet).

I think the Lustre tree ought to have a clearer separation between the
kernel code and user space. It should be pretty easy to shuffle the
directory structure to achieve this. We'd have something like:

fs/lustre/
net/lnet/lnet/
net/lnet/libcfs/
lustre_compat/ <- This gets compiled into libcfs
tests/
utils/

And eventually the fs/ and net/ would live in Linux mainline.
lustre_compat/ would remain to allow us to compile mainline
clients for older kernel, similar to AMDGPU (that I mentioned in
the other thread).

Another benefit of a cleaner split between kernel and user space:
older version of Lustre could definitely benefit from newer user space
features. Patrick's parallel migrate work comes to mind.

> That would probably be a question to get answered, whether LNet is
> "too Lustre specific" to be in net/ and should live in the Lustre tree?
>
> That would make it harder to backport patches to maintenance releases,
> but I'm hoping a script to rework pathnames in patches would be enough.

Newer versions of git are pretty good at finding directories after a relocation.
For example, if you pull down the upstream client branch and Lustre master
into the same repo, you can git cherry-pick a patch from one to the other and it
will automatically find the directories that contain the changed files. This
works even though the two repos don't share a common history.

If I created a patch to reorg Lustre, we could easily test some cherry-picks to
make sure they work right.
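
For illustration, a test could look something like this (the remote and
branch names are made up):

    # one repo holding both trees, even though they share no history
    git remote add upstream-client https://git.example.org/lustre-upstream.git
    git fetch upstream-client

    # apply a fix across the relocation; git's rename detection maps
    # lustre/llite/... to fs/lustre/llite/... on its own
    git cherry-pick upstream-client/master~3

    # if the heuristics need help, lower the similarity threshold
    git cherry-pick -Xfind-renames=40% upstream-client/master~3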

Tim Day

_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-01-20  4:38         ` Day, Timothy
@ 2025-01-20  5:37           ` Oleg Drokin
  2025-01-23  9:00           ` Alexey Lyahkov
  1 sibling, 0 replies; 61+ messages in thread
From: Oleg Drokin @ 2025-01-20  5:37 UTC (permalink / raw)
  To: timday@amazon.com, neilb@suse.de; +Cc: lustre-devel@lists.lustre.org

On Mon, 2025-01-20 at 04:38 +0000, Day, Timothy wrote:

> I think every organization that uses Lustre has a model similar to
> this. But I don't think this is uncommon for other subsystems. The
> various OFED flavors come to mind (I think MOFED was mentioned
> in another thread). Everything is ultimately rebased on the
> community version, AFAIK.

My understanding of MOFED is that they are in exactly the same boat we are
hoping to avoid: "remove whatever pitiful stuff there is in the Linux
kernel and then plug in our own superior stuff".

> > Both DDN and HPE significantly diverge with new features and such.
> > There's also a (now mostly dormant) Fujitsu "FEFS" fork that they
> > got
> > tired of maintaining and tried to fold back in, but could not.
> > (also
> > Cray's secure data appliance that seems to have met a similar fate:
> > https://github.com/Cray/lustre-sda <
> > https://github.com/Cray/lustre-sda> )
> > 
> > 
> > Yes, maintenance burden consideration is always there of course, so
> > there's some coordination nowadays (like reserving feature flags
> > ahead
> > of time and such), but it's not outside the realm of possibility
> > that if
> > what's perceived as "tip of the community tree" becomes
> > inconvenient,
> > it'll be dropped.
> > In fact a similar thing happened to the staging lustre in the past
> > I
> > guess, only before it even became the perceived tip (for a variety
> > of
> > reasons).
> 
> Both DDN and HPE regularly contribute fixes/features back to the
> community
> branch from their respective production branches. HPE seems to rebase
> their branches fairly often on community Lustre [1]. You would have
> more
> context if that's true for DDN - I couldn't find much online.

Yes, these merges and rebases are relatively common for as long as it
remains convenient. At your Cray link you might also notice the rebases
are not on top of master.

> But Fujitsu and the SDA team in HPE were not contributing back as
> much and eventually abandoned their forks. So based on those
> examples,
> it seems most sustainable for organizations to contribute to the
> community
> release. So I think the risk of contributions being lessened because
> Lustre
> moves towards upstream is low, IMHO.
> 
> But I agree with your fundamental point - we can't make submitting
> patches
> to community Lustre arduous.

I skipped the preceding parts, but this is probably going to be the
main point of contention.
The reason FEFS dropped out is because they did their development in
secret, without talking to anyone, making choices that we (the "mainline"
Lustre people) found unwise or questionable.
So once Fujitsu came to us with "hey, we have this whole bunch of
awesome stuff", a lot of it had to be rejected because it was not done
in a good way or there was a competing implementation.

Now the tables are turning as I explained. We are doing development "in
secret" (as far as kernel maintainers are concerned, anyway).
> 
> [1] https://github.com/Cray/lustre
> 
> > > > Sure, only one of those trees is considered "community Lustre",
> > > > but
> > > > if
> > > > it will detach too much from what majority of developers really
> > > > runs
> > > > and gets paid to do - the "community Lustre" contributions
> > > > probably
> > > > would diminish greatly, I am afraid.
> > > 
> > > As long as the community Lustre development process is sane, I
> > > think
> > > most organizations will opt to continue deriving their releases
> > > from
> > > it and opt to continue contributing releases upstream. We just
> > > need
> > > to make sure we get buy-in from the people contributing to
> > > Lustre.
> > 
> > 
> > Well, there's another half of it, the kernel side. Previous run-ins
> > with other kernel maintainers had left a bit of a sour taste in
> > people's mouths.
> > Of course they have their own reasons to dictate whatever they want
> > to
> > newcomers (and all incoming patches), but on the other hand Lustre
> > is a mature product that could not just drop everything and rewrite
> > significant chunks of the code (several times at that) to better
> > align with the ever-changing demands (bcachefs, I think, was a
> > highly paraded-around example of that, and they could accommodate
> > those often conflicting demands because there were not many
> > deployments in the wild).
> > I don't know how possible it is to overcome. Kernel maintainers
> > don't
> > really care about Lustre (and rightfully so, we are but a blip to
> > them)
> > and then we also have our own priorities.
> 
> LSF/MM could be a good opportunity to improve our
> relationship with the upstream maintainers. :)

Absolutely. Though we did go there in the past, and had the discussions
and all, and there's no incentive for the kernel maintainers to accept
our way because obviously for them it's a bad process (and I don't
blame them!)

> > And while for Lustre developers there's a benefit of "the adjusting
> > to
> > new interfaces comes for free", there's no benefit to the kernel
> > maintainers, so they don't have much incentive.
> > (and again we saw this in the previous attempt)
> > 
> > 
> > And even imagine by some magic the actual inclusion and all the
> > relevant rework happened. Now HPE or DDN wants to add a new
> > feature,
> > they implement it, then submit it, and are met with the usual
> > "now rework it in these other ways" demands.
> > Of course, again, from the kernel maintainers' perspective this is
> > entirely reasonable and it's not their problem the development
> > process
> > is wrong and backwards and instead of developing everything in the
> > open
> > on the public branch with input from all parties interested there's
> > this closed development going on. But good luck convincing
> > respective
> > management of those companies to agree.
> 
> Backporting from production branches to the community release
> already takes some work. Especially if the feature is based on an
> older LTS. So I don't think porting to upstream Linux would be a huge
> amount of extra work.

Depends on how much extra friction the kernel acceptance adds (from
extra reviews by fs/mm maintainers) and I estimate it to be high
initially.
Don't forget all the "proprietary" features that are not in mainline,
but are fully developed otherwise. How many of those implementation
details would not be liked by the kernel maintainers is a big unknown.

> On the other hand, if Lustre was included in mainline properly rather
> than in staging - I think we’d have more leverage to implement things
> the way we want to. After all, the kernel maintainers don't really
> care
> about Lustre. :)

They care about the way interfaces are used, that was a pretty big
point of contention in the past, and I'm sure it still remains.
Have you seen all the "nice" string matching/userspace memory parsing
we do for the jobid determination for example? Yeah, I don't like it
either.
But they also don't like other things too.
Definitely talk to hch, he has choice words about many parts of Lustre
(if he hasn't forgotten).

I really want to be optimistic about it, but I also still remember the
previous attempt vividly, and the majority of objections raised back then
are still pretty valid.


_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-01-20  3:57       ` Day, Timothy
@ 2025-01-21 17:02         ` Patrick Farrell
  2025-01-22  6:57           ` Andreas Dilger
  0 siblings, 1 reply; 61+ messages in thread
From: Patrick Farrell @ 2025-01-21 17:02 UTC (permalink / raw)
  To: Day, Timothy, NeilBrown; +Cc: lustre-devel@lists.lustre.org


[-- Attachment #1.1: Type: text/plain, Size: 3704 bytes --]

I agree strongly here, and I think going upstream with both makes some things much easier.  It forces us to deal with ldiskfs, but there's all of that shared code reorg, etc., which this can let us partially skip.  While there's probably some value in fully separating client and server code, it would be a fair bit of work, and then there's the keeping in sync, etc...  All at once seems nicer to me.
________________________________
From: lustre-devel <lustre-devel-bounces@lists.lustre.org> on behalf of Day, Timothy <timday@amazon.com>
Sent: Sunday, January 19, 2025 9:57 PM
To: NeilBrown <neilb@suse.de>
Cc: lustre-devel@lists.lustre.org <lustre-devel@lists.lustre.org>
Subject: Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming



> On 1/18/25, 5:21 PM, "NeilBrown" <neilb@suse.de <mailto:neilb@suse.de>> wrote:
> > On Sun, 19 Jan 2025, Day, Timothy wrote:
> >
> > On the other hand, I wonder if we upstream the whole thing all at once. Besides
> > the code being a bit nicer, the client isn't really that much closer to being upstream
> > than the server is. And no one else can test the client without having a Lustre
> > server on-hand. So no-one can easily run xfstests or similar. And doing everything
> > all at once would preempt questions of client/server split or the server upstreaming
> > timeline. But upstreaming so much all at once is probably more unrealistic.
>
>
> The main difference I see between server and client in upstreaming terms
> is the storage backend. It would need to use un-patched ext4 - ideally
> using VFS interfaces though we might be able to negotiate with the ext4
> team to get some exports. I don't know much about the delta between
> ldiskfs and ext4 and understand it is much smaller than it once was, but
> it would need to be zero. I'm working towards getting the pdirop patch
> upstreamable. Andreas would know what else is needed better than I.

I've been working on a third storage backend [1]. It'll likely be done
well before we submit anything upstream. It's just a memory-only
target. That might be justification enough to keep the OSD APIs.

[1] https://review.whamcloud.com/c/fs/lustre-release/+/55594

> The other difference is that a lot of the "revise code to match upstream
> style" work has focused on client and ignored server-only code.
>
>
> It might be sensible to set the goal as "client and server" including
> only the ext4 backend and possibly only the socklnd network interface.
> It will be a big code drop either way. People aren't going to go over
> every line with a fine-tooth-comb. They will mostly look at whichever
> bit particularly interests them, and look at the process and community
> behind the code.
>
>
> Being able to build a pure upstream kernel, add a user-space tools
> package, and test would certainly be a plus. That would be something
> worth canvassing at LSF - is there any value in landing the client
> without the server?

Yeah, I'm leaning towards setting the goal as both client/server and
gathering opinions from LSF. The client and server are still pretty
intertwined. I think having the client go upstream and then basing
the server on top of an in-tree client would make server development
noticeably more difficult. Thinking on it more - I don't think
upstreaming the server is more ambitious than the client. We
have more of a process problem than a code problem. And I don't
think the server is in particularly bad shape.

>
> NeilBrown
>

Tim Day

_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

[-- Attachment #1.2: Type: text/html, Size: 5121 bytes --]

[-- Attachment #2: Type: text/plain, Size: 165 bytes --]

_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-01-16 21:25 [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming Day, Timothy
       [not found] ` <C9513675-3287-4784-90B7-AD133328C42A@ddn.com>
  2025-01-18  0:45 ` NeilBrown
@ 2025-01-22  6:35 ` Day, Timothy
  2025-01-22  7:09   ` Andreas Dilger
                     ` (2 more replies)
  2 siblings, 3 replies; 61+ messages in thread
From: Day, Timothy @ 2025-01-22  6:35 UTC (permalink / raw)
  To: lustre-devel@lists.lustre.org

I've created a second draft of the topic for LSF/MM. I tried
to include everyone's feedback. It's at the end of the email.

Before that, I wanted to elaborate on Neil's idea about updating
our development model to an upstream-focused model. For upstreaming
to work, the normal development flow has to generate patches to mainline
Linux - while still supporting the distro kernels that most people use
to run Lustre. I think we can get to this point in stages. I've provided
a high-level overview in the next section. This won't be without
challenges - but the majority of the transition could happen without
interrupting feature work or normal development.

[I] Separate the kernel code, compatibility code, and userspace code

We should reorganize the Lustre tree to have a clear separation
of concerns:

fs/lustre/
net/lnet/
net/libcfs/
lustre_compat/
tests/
utils/

The functional components of libcfs/ would stay in that directory
and the compatibility components would live in lustre_compat/.
Centralizing the compatibility code makes it easier to maintain and
update and allows us to start removing the compatibility code from
the modules themselves. lustre_compat/ could still be compiled into
libcfs.ko, if we want to avoid creating even more modules.

[II] Get fs/ and net/ to compile on a mainline kernel

Once the compatibility code is isolated, we must get fs/ and net/
to compile on a mainline kernel - without any configuration or
lustre_compat/ layer.

We would validate this by adding build validation to each patch
submitted to Gerrit. The kernel version would be pinned (similar
to how we pin ZFS version) and we'd periodically update it and fix
any new build failures.

Once this is achieved, we'll have a native Linux client/server
that can be run on older distros via a compatibility layer.
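
A minimal sketch of what that build gate might run - the pinned tag,
config symbols, and paths here are illustrative, and the Kconfig/Makefile
wiring is part of the reorg in [I]:

    #!/bin/sh -e
    PINNED=v6.12    # pinned mainline tag, bumped periodically like the ZFS pin
    git clone --depth 1 --branch "$PINNED" \
        https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git kernel
    # overlay the Lustre sources from the patch under review
    cp -a fs/lustre kernel/fs/lustre
    cp -a net/lnet  kernel/net/lnet
    cd kernel
    make defconfig
    ./scripts/config -m LUSTRE_FS -m LNET    # hypothetical config names
    make olddefconfig
    make -j"$(nproc)" fs/lustre/ net/lnet/   # any failure fails the check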

[III] Move fs/ and net/ to a separate kernel tree

Transition to maintaining fs/ and net/ as a series of patches
on top of a mainline kernel release. At this point, we'll be generating
patches to mainline Linux while retaining the ability to support
older distro kernels via lustre_compat/. Similar to the previous
step, we periodically rebase our Lustre patch series - fixing
lustre_compat/ as needed.

This is the only step that requires a change to the Lustre development
workflow - patches would have to be split and sent to two
different repos. We can delay this step until we have some
confidence that Lustre has a path to be accepted to mainline.
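
To make the mechanics concrete, the periodic rebase could look roughly
like this (branch names are hypothetical):

    # our Lustre series currently sits on top of v6.12
    git checkout -b lustre-6.13 lustre-6.12

    # move the whole series onto the new mainline release
    git rebase --onto v6.13 v6.12 lustre-6.13

    # regenerate the patch series consumed by the compat/release builds
    git format-patch -o series/ v6.13..lustre-6.13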

[IV] Submit the patch series for inclusion

Once we are comfortable with the above process, we can submit the
initial patches to add Lustre support to the kernel. Our normal
development flow will generate a batch of patches to be submitted
during each merge window. After the merge window, we can focus
on testing and making sure that our backport to older distro
kernels is still working.

FAQ:

Q: Who will actually run the Lustre code in mainline Linux?
A: Everyone. Releases for older distros will be a combination
   of the upstream Lustre, lustre_compat/, and
   whatever stuff the kernel won't allow (like GPUDirect).

Q: What does a Lustre release look like?
A: We can generate a tarball by combining an upstream Lustre
   release from mainline along with lustre_compat/ and the
   userspace stuff. Vendors and third-parties can base
   their versions of Lustre on those tarballs. Every time a
   new kernel is released, a new Lustre release tarball will
   be created. LTS releases can center around the LTS kernel
   releases. (A sketch of this assembly follows after this FAQ.)

Q: How will we validate that fs/ and net/ build on mainline?
A: It would probably be easiest to create a minimalist mainline
   kernel build in Jenkins. This would allow us to reuse most
   of the existing lbuild scripting. The build would be
   non-enforced at first. Testing would remain on distro
   kernels, since most people use those.

Q: Will you create a wiki project tracking page for upstreaming
   Lustre?
A: Yes

Q: Does anyone else have a similar model? Does this even work?
A: AMD GPU seems to have a similar approach, at least [1]. I'm
   looking to get more feedback at LSF. We should talk to other
   developers working in a model similar to this.
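
To sketch the release-tarball answer above (the names and layout are
hypothetical):

    # export the in-kernel parts verbatim from the chosen kernel release
    KREL=v6.13
    mkdir -p lustre-release
    git -C linux archive "$KREL" fs/lustre net/lnet | tar -x -C lustre-release

    # add the pieces that never go into the kernel tree
    cp -a lustre_compat utils tests lustre-release/

    tar -czf "lustre-$KREL.tar.gz" lustre-release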

This is still a high-level sketch, but I think this is a feasible
path to upstreaming Lustre. We need to define a clear roadmap
with tangible milestones to have a hope of upstreaming succeeding.

But it's important that we don't disrupt developers' established
workflows. We don't want to complicate contributing to Lustre
and we don't want to discourage people from contributing their
changes upstream.

Please give me any feedback or criticisms on this proposal. If we
think this is workable, I'm going to create a wiki project page for
this and attach it to the LSF/MM email.

[1] AMD GPU DKMS: https://github.com/geohot/amdgpu-dkms

--------------------------------------------------------------------------------

Lustre is a high-performance parallel filesystem used for HPC
and AI/ML compute clusters available under GPLv2. Lustre is
currently used by 65% of the Top-500 (9 of Top-10) systems in
HPC [7]. Outside of HPC, Lustre is used by many of the largest
AI/ML clusters in the world, and is commercially supported by
numerous vendors and cloud service providers [1].

After 21 years and an ill-fated stint in staging, Lustre is still
maintained as an out-of-tree module [6]. The previous upstreaming
effort suffered from a lack of developer focus and user adoption,
which eventually led to Lustre being removed from staging
altogether [2].

However, the work to improve Lustre has continued regardless. In
the intervening years, the code issues that previously
prevented a return to mainline have been steadily addressed. At
least 25% of patches accepted for Lustre 2.16 were related to the
upstreaming effort [3]. And all of the remaining work is
in-flight [4][5]. Our eventual goal is to get both the Lustre
client and server (on ext4) along with at least TCP/IP networking to
an acceptable quality before submitting to mainline. The remaining
network support would follow soon afterwards.

I propose to discuss:

- As we alter our development model to support upstream development,
  what is a sufficient demonstration of commitment that our model works? [8]
- Should the client and server be submitted together? Or split?
- Expectations for a new filesystem to be accepted to mainline
- How to manage inclusion of a large code base (the client alone is
  200kLoC) without increasing the burden on fs/net maintainers

Lustre has already received a plethora of feedback in the past.
While much of that has been addressed since - the kernel is a
moving target. Several filesystems have been merged (or removed)
since Lustre left staging. We're aiming to avoid the mistakes of
the past and hope to address as many concerns as possible before
submitting for inclusion.

Thanks!

Timothy Day (Amazon Web Services - AWS)
James Simmons (Oak Ridge National Labs - ORNL)

[1] Wikipedia: https://en.wikipedia.org/wiki/Lustre_(file_system)#Commercial_technical_support
[2] Kicked out of staging: https://lwn.net/Articles/756565/
[3] This is a heuristic, based on the combined commit counts of
    ORNL, Aeon, SUSE, and AWS - which have been primarily working
    on upstreaming issues: https://youtu.be/BE--ySVQb2M?si=YMHitJfcE4ASWQcE&t=960
[4] LUG24 Upstreaming Update: https://www.depts.ttu.edu/hpcc/events/LUG24/slides/Day1/LUG_2024_Talk_02-Native_Linux_client_status.pdf
[5] Lustre Jira Upstream Progress: TODO
[6] Out-of-tree codebase: https://git.whamcloud.com/?p=fs/lustre-release.git;a=tree
[7] I couldn't find a link to this? TODO
[8] Include a link to a project wiki: TODO


_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-01-21 17:02         ` Patrick Farrell
@ 2025-01-22  6:57           ` Andreas Dilger
  2025-01-22 17:33             ` Day, Timothy
  2025-01-22 20:48             ` NeilBrown
  0 siblings, 2 replies; 61+ messages in thread
From: Andreas Dilger @ 2025-01-22  6:57 UTC (permalink / raw)
  To: Patrick Farrell; +Cc: lustre-devel@lists.lustre.org


[-- Attachment #1.1: Type: text/plain, Size: 5331 bytes --]

IMHO, there would be objections to Lustre changes to ext4 to allow it to be used
like ldiskfs.

We cannot use the VFS interface as-is, since Lustre needs to have
compound journaled transactions that are atomically committed.  Also, there are
some operations (e.g. DNE namespace operations) which do not have VFS
equivalents, so they would require poking through the VFS, and in general the
VFS does a lot of things we *don't* want it to do for Lustre.

If there are objections to patching ext4 to allow osd-ldiskfs to access transactions
directly, then the alternative would be to copy it to ldiskfs and patch it as we do
today, but I suspect that would also be frowned upon.

Don't get me wrong, it's not that I *want* to maintain ldiskfs forever out of tree,
but pretty much every patch we try to upstream to ext4 is rejected for one
reason or another, so I've stopped holding my breath that this will move forward.

Running osd-zfs doesn't need any kernel/ext4 patches, but that is an even
larger can of worms, and will never fly in a million years.

I think upstreaming the client is a realistic goal, but I think tying this to the
upstreaming of the server with ldiskfs support will derail the whole project.

Cheers, Andreas

On Jan 21, 2025, at 10:02, Patrick Farrell <pfarrell@ddn.com> wrote:

I agree strongly here, and I think going upstream with both makes some things much easier.  It forces us to deal with ldiskfs, but there's all of that shared code reorg, etc., which this can let us partially skip.  While there's probably some value in fully separating client and server code, it would be a fair bit of work, and then there's the keeping in sync, etc...  All at once seems nicer to me.
________________________________
From: lustre-devel <lustre-devel-bounces@lists.lustre.org> on behalf of Day, Timothy <timday@amazon.com>
Sent: Sunday, January 19, 2025 9:57 PM
To: NeilBrown <neilb@suse.de>
Cc: lustre-devel@lists.lustre.org <lustre-devel@lists.lustre.org>
Subject: Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming



> On 1/18/25, 5:21 PM, "NeilBrown" <neilb@suse.de <mailto:neilb@suse.de>> wrote:
> > On Sun, 19 Jan 2025, Day, Timothy wrote:
> >
> > On the other hand, I wonder if we upstream the whole thing all at once. Besides
> > the code being a bit nicer, the client isn't really that much closer to being upstream
> > than the server is. And no one else can test the client without having a Lustre
> > server on-hand. So no-one can easily run xfstests or similar. And doing everything
> > all at once would preempt questions of client/server split or the server upstreaming
> > timeline. But upstreaming so much all at once is probably more unrealistic.
>
>
> The main difference I see between server and client in upstreaming terms
> is the storage backend. It would need to use un-patched ext4 - ideally
> using VFS interfaces though we might be able to negotiate with the ext4
> team to get some exports. I don't know much about the delta between
> ldiskfs and ext4 and understand it is much smaller than it once was, but
> it would need to be zero. I'm working towards getting the pdirop patch
> upstreamable. Andreas would know what else is needed better than I.

I've been working on a third storage backend [1]. It'll likely be done
well before we submit anything upstream. It's just a memory-only
target. That might be justification enough to keep the OSD APIs.

[1] https://review.whamcloud.com/c/fs/lustre-release/+/55594

> The other difference is that a lot of the "revise code to match upstream
> style" work has focused on client and ignored server-only code.
>
>
> It might be sensible to set the goal as "client and server" including
> only the ext4 backend and possibly only the socklnd network interface.
> It will be a big code drop either way. People aren't going to go over
> every line with a fine-tooth-comb. They will mostly look at whichever
> bit particularly interests them, and look at the process and community
> behind the code.
>
>
> Being able to build a pure upstream kernel, add a user-space tools
> package, and test would certainly be a plus. That would be something
> worth canvassing at LSF - is there any value in landing the client
> without the server?

Yeah, I'm leaning towards setting the goal as both client/server and
gathering opinions from LSF. The client and server are still pretty
intertwined. I think having the client go upstream and then basing
the server on top of an in-tree client would make server development
noticeably more difficult. Thinking on it more - I don't think
upstreaming the server is more ambitious than the client. We
have more of a process problem than a code problem. And I don't
think the server is in particularly bad shape.

>
> NeilBrown
>

Tim Day

_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org
_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org<mailto:lustre-devel@lists.lustre.org>
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

Cheers, Andreas
—
Andreas Dilger
Lustre Principal Architect
Whamcloud/DDN





[-- Attachment #1.2: Type: text/html, Size: 12624 bytes --]

[-- Attachment #2: Type: text/plain, Size: 165 bytes --]

_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-01-22  6:35 ` Day, Timothy
@ 2025-01-22  7:09   ` Andreas Dilger
  2025-01-22 11:12   ` Alexey Lyahkov
  2025-01-24 15:53   ` Day, Timothy
  2 siblings, 0 replies; 61+ messages in thread
From: Andreas Dilger @ 2025-01-22  7:09 UTC (permalink / raw)
  To: Day, Timothy; +Cc: lustre-devel@lists.lustre.org


[-- Attachment #1.1: Type: text/plain, Size: 9083 bytes --]

On Jan 21, 2025, at 23:35, Day, Timothy <timday@amazon.com> wrote:

I've created a second draft of the topic for LSF/MM. I tried
to include everyone's feedback. It's at the end of the email.

Before that, I wanted to elaborate on Neil's idea about updating
our development model to an upstream-focused model. For upstreaming
to work, the normal development flow has to generate patches to mainline
Linux - while still supporting the distro kernels that most people use
to run Lustre. I think we can get to this point in stages. I've provided
a high-level overview in the next section. This won't be without
challenges - but the majority of the transition could happen without
interrupting feature work or normal development.

[I] Separate the kernel code, compatibility code, and userspace code

We should reorganize the Lustre tree to have a clear separation
of concerns:

fs/lustre/
net/lnet/
net/libcfs/
lustre_compat/
tests/
utils/

The functional components of libcfs/ would stay in that directory
and the compatibility components would live in lustre_compat/.
Centralizing the compatibility code makes it easier to maintain and
update and allows us to start removing the compatibility code from
the modules themselves. lustre_compat/ could still be compiled into
libcfs.ko, if we want to avoid creating even more modules.
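
As a sketch of the kind of shim that would live in lustre_compat/
(assuming the usual HAVE_* macros generated by configure; whether this
particular helper actually needs a shim is just for illustration):

#ifndef HAVE_LIST_IS_FIRST
/* list_is_first() only exists in newer kernels; supply it for old ones */
static inline int list_is_first(const struct list_head *list,
                                const struct list_head *head)
{
        return list->prev == head;
}
#endif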

I think this proposal is pretty reasonable.  If the directory renaming
is done in the Lustre repo (say in the 2.17 timeframe) without also
restructuring all of the files and code at the same time,  then it
_should_ be reasonable to backport patches to older maintenance
branches without having to rewrite them completely.

This would essentially form an "overlay" to the existing kernel tree,
and should allow it to be built as a standalone project until such a
time that it is accepted.

Conversely, if the consensus from LSF is that Lustre will never get
into the upstream kernel (which will probably be Christoph Hellwig's
preference regardless of what we change) then this reorg won't have
broken the whole tree or current development process, with the
exception of years of muscle-memory for typing the old pathnames.
Maybe symlinks could be used to ease the transition?

Cheers, Andreas

[II] Get fs/ and net/ to compile on a mainline kernel

Once the compatibility code is isolated, we must get fs/ and net/
to compile on a mainline kernel - without any configuration or
lustre_compat/ layer.

We would validate this by adding build validation to each patch
submitted to Gerrit. The kernel version would be pinned (similar
to how we pin ZFS version) and we'd periodically update it and fix
any new build failures.

Once this is achieved, we'll have a native Linux client/server
that can be run on older distros via a compatibility layer.

[III] Move fs/ and net/ to a separate kernel tree

Transition to maintaining fs/ and net/ as a series of patches
on top of a mainline kernel release. At this point, we'll be generating
patches to mainline Linux while retaining the ability to support
older distro kernels via lustre_compat/. Similar to the previous
step, we periodically rebase our Lustre patch series - fixing
lustre_compat/ as needed.

This is the only step that requires a change to the Lustre development
workflow - patches would have to be split and sent to two
different repos. We can delay this step until we have some
confidence that Lustre has a path to be accepted to mainline.

[IV] Submit the patch series for inclusion

Once we are comfortable with the above process, we can submit the
initial patches to add Lustre support to the kernel. Our normal
development flow will generate a batch of patches to be submitted
during each merge window. After the merge window, we can focus
on testing and making sure that our backport to older distro
kernels is still working.

FAQ:

Q: Who will actually run the Lustre code in mainline Linux?
A: Everyone. Releases for older distros will combine the
   upstream Lustre code with lustre_compat/ and whatever
   the kernel won't allow (like GPUDirect).

Q: What does a Lustre release look like?
A: We can generate a tarball by combining an upstream Lustre
   release from mainline along with lustre_compat/ and the
   userspace stuff. Vendors and third-parties can base
   their versions of Lustre on those tarballs. Every time a
   new kernel is released, a new Lustre release tarball will
   be created. LTS releases can center around the LTS kernel
   releases.

Q: How will we validate that fs/ and net/ build on mainline?
A: It would probably be easiest to create a minimalist mainline
   kernel build in Jenkins. This would allow us to reuse most
   of the existing lbuild scripting. The build would be
   non-enforced at first. Testing would remain on distro
   kernels, since most people use those.

Q: Will you create a wiki project tracking page for upstreaming
   Lustre?
A: Yes

Q: Does anyone else have a similar model? Does this even work?
A: AMD GPU seems to have a similar approach, at least [1]. I'm
   looking to get more feedback at LSF. We should talk to other
   developers working in a model similar to this.

This is still a high level sketch, but I think this is a feasible
path to upstreaming Lustre. We need to define a clear roadmap
with tangible milestones for upstreaming to have any hope of working.

But it's important that we don't disrupt developers' established
workflows. We don't want to complicate contributing to Lustre
and we don't want to discourage people from contributing their
changes upstream.

Please give me any feedback or criticisms on this proposal. If we
think this is workable, I'm going to create a wiki project page for
this and attach it to the LSF/MM email.

[1] AMD GPU DKMS: https://github.com/geohot/amdgpu-dkms

--------------------------------------------------------------------------------

Lustre is a high-performance parallel filesystem used for HPC
and AI/ML compute clusters available under GPLv2. Lustre is
currently used by 65% of the Top-500 (9 of Top-10) systems in
HPC [7]. Outside of HPC, Lustre is used by many of the largest
AI/ML clusters in the world, and is commercially supported by
numerous vendors and cloud service providers [1].

After 21 years and an ill-fated stint in staging, Lustre is still
maintained as an out-of-tree module [6]. The previous upstreaming
effort suffered from a lack of developer focus and user adoption,
which eventually led to Lustre being removed from staging
altogether [2].

However, the work to improve Lustre has continued regardless. In
the intervening years, the code improvements that previously
prevented a return to mainline have been steadily progressing. At
least 25% of patches accepted for Lustre 2.16 were related to the
upstreaming effort [3]. And all of the remaining work is
in-flight [4][5]. Our eventual goal is to get both the Lustre
client and server (on ext4) along with at least TCP/IP networking to
an acceptable quality before submitting to mainline. The remaining
network support would follow soon afterwards.

I propose to discuss:

- As we alter our development model to support upstream development,
  what is a sufficient demonstration of commitment that our model works? [8]
- Should the client and server be submitted together? Or split?
- Expectations for a new filesystem to be accepted to mainline
- How to manage inclusion of a large code base (the client alone is
  200kLoC) without increasing the burden on fs/net maintainers

Lustre has already received a plethora of feedback in the past.
While much of that has been addressed since - the kernel is a
moving target. Several filesystems have been merged (or removed)
since Lustre left staging. We're aiming to avoid the mistakes of
the past and hope to address as many concerns as possible before
submitting for inclusion.

Thanks!

Timothy Day (Amazon Web Services - AWS)
James Simmons (Oak Ridge National Labs - ORNL)

[1] Wikipedia: https://en.wikipedia.org/wiki/Lustre_(file_system)#Commercial_technical_support
[2] Kicked out of staging: https://lwn.net/Articles/756565/
[3] This is a heuristic, based on the combined commit counts of
    ORNL, Aeon, SuSe, and AWS - which have been primarily working
    on upstreaming issues: https://youtu.be/BE--ySVQb2M?si=YMHitJfcE4ASWQcE&t=960
[4] LUG24 Upstreaming Update: https://www.depts.ttu.edu/hpcc/events/LUG24/slides/Day1/LUG_2024_Talk_02-Native_Linux_client_status.pdf
[5] Lustre Jira Upstream Progress: TODO
[6] Out-of-tree codebase: https://git.whamcloud.com/?p=fs/lustre-release.git;a=tree
[7] I couldn't find a link to this? TODO
[8] Include a link to a project wiki: TODO


_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

Cheers, Andreas
—
Andreas Dilger
Lustre Principal Architect
Whamcloud/DDN





[-- Attachment #1.2: Type: text/html, Size: 11525 bytes --]

[-- Attachment #2: Type: text/plain, Size: 165 bytes --]

_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-01-22  6:35 ` Day, Timothy
  2025-01-22  7:09   ` Andreas Dilger
@ 2025-01-22 11:12   ` Alexey Lyahkov
  2025-01-22 17:17     ` Day, Timothy
  2025-01-24 15:53   ` Day, Timothy
  2 siblings, 1 reply; 61+ messages in thread
From: Alexey Lyahkov @ 2025-01-22 11:12 UTC (permalink / raw)
  To: Day, Timothy; +Cc: lustre-devel@lists.lustre.org

Timothy,


> On 22 Jan 2025, at 09:35, Day, Timothy <timday@amazon.com> wrote:
> 
> I've created a second draft of the topic for LSF/MM. I tried
> to include everyone's feedback. It's at the end of the email.
> 
> Before that, I wanted to elaborate on Neil's idea about updating
> our development model to an upstream-focused model. For upstreaming
> to work, the normal development flow has to generate patches to mainline
> Linux - while still supporting the distro kernels that most people use
> to run Lustre. I think we can get to this point in stages. I've provided
> a high-level overview in the next section. This won't be without
> challenges - but the majority of the transition could happen without
> interrupting feature work or normal development.

Can you explain how Lustre platform fragmentation will be avoided?

I posted an example earlier:
A distro locks in a Lustre version at release time. But a Lustre server has limited compatibility - in most cases only +/- 1-2 releases are guaranteed to connect. So a stale, aged client will live in the distribution kernel, and it won't work with modern servers.
This happens easily, since a distribution's lifetime is ~8 years. So clients will need to drop the in-kernel Lustre client support and install a Lustre client from external sources - which is no different from the current state.
The next step is an assortment of distributions carrying different Lustre versions that are incompatible with each other.
Both of these increase support costs - with a large number of versions to support, development drops off and all the time is spent on support.

If this is not enough, here's one more: the kernel API isn't stable, so significant resources will need to be spent handling each kernel change in Lustre. Currently that work happens in the background and doesn't interrupt the primary work of supporting and developing new Lustre features.

So those are the problems for the Lustre world - what are the benefits?


Alex
_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-01-22 11:12   ` Alexey Lyahkov
@ 2025-01-22 17:17     ` Day, Timothy
  2025-01-22 17:48       ` Alexey Lyahkov
  0 siblings, 1 reply; 61+ messages in thread
From: Day, Timothy @ 2025-01-22 17:17 UTC (permalink / raw)
  To: Alexey Lyahkov; +Cc: lustre-devel@lists.lustre.org

> On 1/22/25, 6:14 AM, "Alexey Lyahkov" <alexey.lyashkov@gmail.com> wrote:
>
> Timothy,
>
> > On 22 Jan 2025, at 09:35, Day, Timothy <timday@amazon.com> wrote:
> >
> > I've created a second draft of the topic for LSF/MM. I tried
> > to include everyone's feedback. It's at the end of the email.
> >
> > Before that, I wanted to elaborate on Neil's idea about updating
> > our development model to an upstream-focused model. For upstreaming
> > to work, the normal development flow has to generate patches to mainline
> > Linux - while still supporting the distro kernels that most people use
> > to run Lustre. I think we can get to this point in stages. I've provided
> > a high-level overview in the next section. This won't be without
> > challenges - but the majority of the transition could happen without
> > interrupting feature work or normal development.
> > 
>
> Can you explain how Lustre platform fragmentation will be avoided?
>
>
> I posted an example earlier:
> A distro locks in a Lustre version at release time. But a Lustre server has limited compatibility - in most cases only +/- 1-2 releases are guaranteed to connect. So a stale, aged client will live in the distribution kernel, and it won't work with modern servers.
> This happens easily, since a distribution's lifetime is ~8 years. So clients will need to drop the in-kernel Lustre client support and install a Lustre client from external sources - which is no different from the current state.
> The next step is an assortment of distributions carrying different Lustre versions that are incompatible with each other.
> Both of these increase support costs - with a large number of versions to support, development drops off and all the time is spent on support.

I think that's a reasonable concern. I spend a lot of time doing customer
support for Lustre; I definitely don't want to make that part of my job any
harder than it has to be.

In my personal experience, I've seen 2.10 and 2.15 interoperate well together.
That covers a gap of around ~6 years at least. If someone stuck with RHEL7, the
first client they could use is 2.7.0 and the last client they could use is 2.16.0 [1].
So if a customer didn't update either their distro or filesystem, they could use an
up-to-date Lustre version for around 10 years covering 9 versions. So I think these
large version gaps are possible today.

There is an issue if distros don't want to update their clients. That's why we'll
still support running latest Lustre on older distros. Specifically, it'll be the Lustre
code from a mainline kernel combined with our lustre_compat/ compatibility
code. So normal Lustre releases will be derived directly from the in-tree kernel
code. This provides a path for vendors to deploy bug fixes, custom features, and
allows users to optionally run the latest and greatest Lustre code.

[1] Lustre changelog: https://git.whamcloud.com/?p=fs/lustre-release.git;a=blob_plain;f=lustre/ChangeLog;hb=HEAD

> If this is not enough, here's one more: the kernel API isn't stable, so significant resources will need to be spent handling each kernel change in Lustre. Currently that work happens in the background and doesn't interrupt the primary work of supporting and developing new Lustre features.
>
> So those are the problems for the Lustre world - what are the benefits?

By upstreaming Lustre, we'll benefit from developers updating the kernel
API "for free". When Lustre was in staging/, there wasn't as much obligation
to keep Lustre in a working state. But if we get Lustre merged properly,
developers will not be able to merge changes that break Lustre. So we'll
get support for the latest and greatest kernels with less effort. That's one
of the main benefits of this effort.

We also benefit from having more say over the future of the kernel. A lot
of the difficulty with updating Lustre for new kernels comes when upstream
kernel developers lock down symbols or features to in-tree modules. This
could get even worse in the future, with stuff like symbol namespaces
getting more use [2].

Even if most users use the out-of-tree backported-from-mainline-Linux
Lustre release, I think we'll still be in a stronger position after
upstreaming.

[2] https://lwn.net/Articles/760045/

>
> Alex
>

Tim Day


_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-01-22  6:57           ` Andreas Dilger
@ 2025-01-22 17:33             ` Day, Timothy
  2025-01-22 20:48             ` NeilBrown
  1 sibling, 0 replies; 61+ messages in thread
From: Day, Timothy @ 2025-01-22 17:33 UTC (permalink / raw)
  To: Andreas Dilger, Patrick Farrell; +Cc: lustre-devel@lists.lustre.org


[-- Attachment #1.1: Type: text/plain, Size: 6581 bytes --]

Lustre itself going upstream might be strong enough justification for
ext4 to accept the needed ldiskfs patches. I agree that we shouldn’t
block upstreaming the client on upstreaming the server. But I think
we should advocate upstreaming both client and server together,
and then see what feedback we get. We should at least document
the objections. I don’t think we have to commit to one path or another
right now.

Regarding ZFS, I think the OSD will eventually have to live in the
normal openZFS repo. And it will have to live with whatever
interface ldiskfs/ext4 gets. But I don’t think we have to prioritize
that work until we have more confidence that Lustre is on a path
to upstream.

Tim Day

From: Andreas Dilger <adilger@ddn.com>
Date: Wednesday, January 22, 2025 at 1:58 AM
To: Patrick Farrell <pfarrell@ddn.com>
Cc: "Day, Timothy" <timday@amazon.com>, NeilBrown <neilb@suse.de>, "lustre-devel@lists.lustre.org" <lustre-devel@lists.lustre.org>
Subject: RE: [EXTERNAL] [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming


IMHO, there would be objections to Lustre changes to ext4 to allow it to be used
like ldiskfs.

We cannot use the VFS interface as-is, since Lustre needs to have
compound journaled transactions that are atomically committed.  Also, there are
some operations (e.g. DNE namespace operations) which do not have VFS
equivalents, so they would require poking through the VFS, and in general the
VFS does a lot of things we *don't* want it to do for Lustre.
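
As a rough sketch (not actual osd-ldiskfs code) of why this matters:
both updates below share one journal handle, so they commit atomically
or not at all. The update_* helpers and the credit count are
placeholders for this illustration.

static int osd_compound_update(struct inode *inode)
{
        handle_t *handle;
        int credits = 8;        /* placeholder credit estimate */
        int rc;

        /* one handle spans every piece of the compound operation */
        handle = ext4_journal_start(inode, EXT4_HT_MISC, credits);
        if (IS_ERR(handle))
                return PTR_ERR(handle);
        rc = update_object_data(handle, inode);         /* placeholder */
        if (rc == 0)
                rc = update_lustre_metadata(handle, inode); /* placeholder */
        ext4_journal_stop(handle);      /* both land in one transaction */
        return rc;
}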

If there are objections to patching ext4 to allow osd-ldiskfs to access transactions
directly, then the alternative would be to copy it to ldiskfs and patch it as we do
today, but I suspect that would also be frowned upon.

Don't get me wrong, it's not that I *want* to maintain ldiskfs forever out of tree,
but pretty much every patch we try to upstream to ext4 is rejected for one
reason or another, so I've stopped holding my breath that this will move forward.

Running osd-zfs doesn't need any kernel/ext4 patches, but that is an even
larger can of worms, and will never fly in a million years.

I think upstreaming the client is a realistic goal, but I think tying this to the
upstreaming of the server with ldiskfs support will derail the whole project.

Cheers, Andreas


On Jan 21, 2025, at 10:02, Patrick Farrell <pfarrell@ddn.com> wrote:

I agree strongly here, and I think going upstream with both makes some things much easier.  It forces us to deal with ldiskfs but there's all of that shared code reorg, etc, which this can let us partially skip.  While there's probably some value in fully separating client and server code, it would be a fair bit of work and then the keeping in sync, etc...  All at once seems nicer to me.
________________________________
From: lustre-devel <lustre-devel-bounces@lists.lustre.org> on behalf of Day, Timothy <timday@amazon.com>
Sent: Sunday, January 19, 2025 9:57 PM
To: NeilBrown <neilb@suse.de>
Cc: lustre-devel@lists.lustre.org <lustre-devel@lists.lustre.org>
Subject: Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming



> On 1/18/25, 5:21 PM, "NeilBrown" <neilb@suse.de> wrote:
> > On Sun, 19 Jan 2025, Day, Timothy wrote:
> >
> > On the other hand, I wonder if we upstream the whole thing all at once. Besides
> > the code being a bit nicer, the client isn't really that much closer to being upstream
> > than the server is. And no one else can test the client without having a Lustre
> > server on-hand. So no one can easily run xfstests or similar. And doing everything
> > all at once would preempt questions of client/server split or the server upstreaming
> > timeline. But upstreaming so much all at once is probably more unrealistic.
>
>
> The main difference I see between server and client in upstreaming terms
> is the storage backend. It would need to use un-patched ext4 - ideally
> using VFS interfaces though we might be able to negotiate with the ext4
> team to get some exports. I don't know much about the delta between
> ldiskfs and ext4 and understand it is much smaller than it once was, but
> it would need to be zero. I'm working towards getting the pdirop patch
> upstreamable. Andreas would know what else is needed better than I.

I've been working on a third storage backend [1]. It'll likely be done
well before we submit anything upstream. It's just a memory-only
target. That might be justification enough to keep the OSD APIs.

[1] https://review.whamcloud.com/c/fs/lustre-release/+/55594

> The other difference is that a lot of the "revise code to match upstream
> style" work has focused on client and ignored server-only code.
>
>
> It might be sensible to set the goal as "client and server" including
> only the ext4 backend and possibly only the socklnd network interface.
> It will be a big code drop either way. People aren't going to go over
> every line with a fine-tooth-comb. They will mostly look at whichever
> bit particularly interests them, and look at the process and community
> behind the code.
>
>
> Being able to build a pure upstream kernel, add a user-space tools
> package, and test would certainly be a plus. That would be something
> worth canvassing at LSF - is there any value in landing the client
> without the server?

Yeah, I'm leaning towards setting the goal as both client/server and
gathering opinions from LSF. The client and server are still pretty
intertwined. I think having the client go upstream and then basing
the server on top of an in-tree client would make server development
noticeably more difficult. Thinking on it more - I don't think
upstreaming the server is more ambitious than the client. We
have more of a process problem than a code problem. And I don't
think the server is in particularly bad shape.

>
> NeilBrown
>

Tim Day

_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

Cheers, Andreas
—
Andreas Dilger
Lustre Principal Architect
Whamcloud/DDN




[-- Attachment #1.2: Type: text/html, Size: 16989 bytes --]

[-- Attachment #2: Type: text/plain, Size: 165 bytes --]

_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-01-22 17:17     ` Day, Timothy
@ 2025-01-22 17:48       ` Alexey Lyahkov
  2025-01-24 17:06         ` Day, Timothy
  0 siblings, 1 reply; 61+ messages in thread
From: Alexey Lyahkov @ 2025-01-22 17:48 UTC (permalink / raw)
  To: Day, Timothy; +Cc: lustre-devel@lists.lustre.org



> On 22 Jan 2025, at 20:17, Day, Timothy <timday@amazon.com> wrote:
> 
>> On 1/22/25, 6:14 AM, "Alexey Lyahkov" <alexey.lyashkov@gmail.com> wrote:
>> 
>> Timothy,
>> 
>>> On 22 Jan 2025, at 09:35, Day, Timothy <timday@amazon.com> wrote:
>>> 
>>> I've created a second draft of the topic for LSF/MM. I tried
>>> to include everyone's feedback. It's at the end of the email.
>>> 
>>> Before that, I wanted to elaborate on Neil's idea about updating
>>> our development model to an upstream-focused model. For upstreaming
>>> to work, the normal development flow has to generate patches to mainline
>>> Linux - while still supporting the distro kernels that most people use
>>> to run Lustre. I think we can get to this point in stages. I've provided
>>> a high-level overview in the next section. This won't be without
>>> challenges - but the majority of the transition could happen without
>>> interrupting feature work or normal development.
>>> 
>> 
>> Can you explain how Lustre platform fragmentation will be avoided?
>> 
>> 
>> I posted an example earlier:
>> A distro locks in a Lustre version at release time. But a Lustre server has limited compatibility - in most cases only +/- 1-2 releases are guaranteed to connect. So a stale, aged client will live in the distribution kernel, and it won't work with modern servers.
>> This happens easily, since a distribution's lifetime is ~8 years. So clients will need to drop the in-kernel Lustre client support and install a Lustre client from external sources - which is no different from the current state.
>> The next step is an assortment of distributions carrying different Lustre versions that are incompatible with each other.
>> Both of these increase support costs - with a large number of versions to support, development drops off and all the time is spent on support.
> 
> I think that's a reasonable concern. I spend a lot of time doing customer
> support for Lustre; I definitely don't want to make that part of my job any
> harder than it has to be.
> 
> I'm my personal experience, I've seen 2.10 and 2.15 interoperate well together.
> That covers a gap of around ~6 years at least. If someone stuck with RHEL7, the
> first client they could use is 2.7.0 and the last client they could use is 2.16.0 [1].
> So if a customer didn't update either their distro or filesystem, they could use an
> up-to-date Lustre version for around 10 years covering 9 versions. So I think these
> large version gaps are possible today.
> 
Customers expect to update the server-side part, but that's not always true for the client side.
They expect to stick with their RHEL7 version until EOL, because old HW may not be supported by a new version.
(Look at the RHEL HW support reduction between releases: RHEL7->RHEL8 dropped many RAID cards from support.)

> There is an issue if distros don't want to update their clients.
It is not a matter of "if they don't want to update" - Ubuntu didn't update its own Lustre code in the past.
I don't expect that to change, because the distro owner would need to hire more developers for the extra support,
but gets no money from it.


> That's why we'll
> still support running latest Lustre on older distros. Specifically, it'll be the Lustre
> code from a mainline kernel combined with our lustre_compat/ compatibility
> code. So normal Lustre releases will be derived directly from the in-tree kernel
> code. This provides a path for vendors to deploy bug fixes, custom features, and
> allows users to optionally run the latest and greatest Lustre code.
And oops - both code bases (in-kernel and out-of-tree) have the same sort of defines in config.h, which conflict when building the out-of-tree Lustre.
Some examples of the MOFED hacks that solve the same problem can be seen in o2iblnd:
>>>
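/*
 * My reading of the hack, for context: MOFED ships its own
 * ib_dma_map_sg(), and when building against that external OFED the
 * in-kernel virt-DMA config must be undefined so the two definitions
 * don't conflict.
 */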
#if defined(EXTERNAL_OFED_BUILD) && !defined(HAVE_OFED_IB_DMA_MAP_SG_SANE)
#undef CONFIG_INFINIBAND_VIRT_DMA
#endif
>>>
As I remember, this problem broke the ability to build Lustre as an out-of-tree module on Ubuntu 18.06 while Lustre was in staging/.


> 
> [1] Lustre changelog: https://git.whamcloud.com/?p=fs/lustre-release.git;a=blob_plain;f=lustre/ChangeLog;hb=HEAD
> 
>> If this is not enough, here's one more: the kernel API isn't stable, so significant resources will need to be spent handling each kernel change in Lustre. Currently that work happens in the background and doesn't interrupt the primary work of supporting and developing new Lustre features.
>> 
>> So those are the problems for the Lustre world - what are the benefits?
> 
> By upstreaming Lustre, we'll benefit from developers updating the kernel
> API "for free".
It's not "for free" - do you really think any kernel developers have a cluster with a Lustre client to test their changes?
I think not, so the testing will be "just compile with the proposed/default config".
Since that lacks proper testing (don't forget a full run of the Lustre test suite takes ~12-24h), Lustre developers will need to review each change to the Lustre code.
And all these changes need to be backported to the out-of-tree version, since the Lustre part needs changes as well.
The best example is 'folio' - it needed changes on both sides.
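
As a sketch of the dual-sided change folio forces (HAVE_READ_FOLIO and
the ll_* names are stand-ins in the usual compat style, not the exact
Lustre code):

#ifdef HAVE_READ_FOLIO
/* newer kernels call ->read_folio(); wrap the old page-based path */
static int ll_read_folio(struct file *file, struct folio *folio)
{
        return ll_readpage(file, &folio->page);
}
#endif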


> When Lustre was in staging/, there wasn't as much obligation
> to keep Lustre in a working state. But if we get Lustre merged properly,
> developers will not be able to merge changes that break Lustre. So we'll
> get support for the latest and greatest kernels with less effort. That's one
> of the main benefits of this effort.

> 
> We also benefit from having more say over the future of the kernel. A lot
> of the difficulty with updating Lustre for new kernels comes when upstream
> kernel developers lock down symbols or features to in-tree modules. This
> could get even worse in the future, with stuff like symbol namespaces
> getting more use [2].
> 
> Even if most users use the out-of-tree backported-from-mainline-Linux
> Lustre release, I think we'll still be in a stronger position after
> upstreaming.
> 
> [2] https://lwn.net/Articles/760045/
> 
>> 
>> Alex
>> 
> 
> Tim Day
> 

PS. Lustre is able to run a server with very lightly modified ext4 code - mostly some exports/callbacks from the core.

Alex
> 

_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-01-22  6:57           ` Andreas Dilger
  2025-01-22 17:33             ` Day, Timothy
@ 2025-01-22 20:48             ` NeilBrown
  1 sibling, 0 replies; 61+ messages in thread
From: NeilBrown @ 2025-01-22 20:48 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: lustre-devel@lists.lustre.org

On Wed, 22 Jan 2025, Andreas Dilger wrote:
> 
> Don't get me wrong, it's not that I *want* to maintain ldiskfs forever
> out of tree,
> but pretty much every patch we try to upstream to ext4 is rejected for
> one
> reason or another, so I've stopped holding my breath that this will
> move forward.

This might be a useful subtopic to raise at LSF if an invite is
obtained. A succinct list of the needs and approaches would be useful to
whoever tries to lead the discussion.
There are 50 EXPORT_SYMBOLs added to ext4.  Maybe grouping and
explaining those might be a good place to start.
I don't suppose there is a document already somewhere that explains the
extension in ldiskfs?
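
Purely as an illustration of the grouping I mean (the symbol names
below are invented, not the actual ldiskfs export list):

/* group 1: journalling - osd-ldiskfs drives jbd2 transactions itself */
EXPORT_SYMBOL(ldiskfs_journal_start_sb);        /* hypothetical */

/* group 2: on-disk access - object/index I/O that bypasses the VFS */
EXPORT_SYMBOL(ldiskfs_map_blocks);              /* hypothetical */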

Thanks,
NeilBrown
_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-01-19 16:12           ` Alexey Lyahkov
@ 2025-01-22 20:54             ` NeilBrown
  2025-01-22 21:44               ` Oleg Drokin
  2025-01-23  4:51               ` Alexey Lyahkov
  0 siblings, 2 replies; 61+ messages in thread
From: NeilBrown @ 2025-01-22 20:54 UTC (permalink / raw)
  To: Alexey Lyahkov; +Cc: lustre-devel@lists.lustre.org

On Mon, 20 Jan 2025, Alexey Lyahkov wrote:
> 
> > On 19 Jan 2025, at 11:03, NeilBrown <neilb@suse.de> wrote:
> > 
> > On Sun, 19 Jan 2025, Alexey Lyahkov wrote:
> >> Neil,
> >>> 
> >>> 
> >>>> 
>>>> It does not help that there are what 3? 4? trees, not "dual-tree" by any
> >>>> stretch of imagination.
> >>>> 
> >>>> There's DDN/whamcloud (that's really two trees), there's HPE, LLNL
> >>>> keeps their fork still I think (thought it's mostly backports?). There
> >>>> are likely others I am less exposed to.
> >>> 
> >>> "dual-tree" maybe isn't the best way of describing what was wrong with
> >>> the previous approach.  "upstream-first" is one way of describing how it
>>> should be run, though that needs to be understood correctly.
> >>> 
> >>> Patches should always flow upstream first, then flow downstream into
> >>> distro.  So I write a patch in my own devel tree.  I post it or submit a
> >>> pull request and eventually it is accepted into the maintainers
> >>> "testing" tree (upsream from me).  There it gets more testing and moves
> >>> to the maintainers "next" tree from which it is pulled into linux-next
> >>> for integration testing.  Then it goes upstream to Linus (possibly
> >>> through an intermediary).  From Linus it goes to -stable and to various
> >>> distros etc.  Individual patches are selected for further backporting to
> >>> all sorts of different LTS tree.
> >>> 
> >>> Occasionally there are short-cuts.  I might submit a patch from my tree
> >>> to a SUSE kernel before it is accepted upstream, or maybe even before it
> >>> is sent if it is urgent.  But these are not the norm.
> >>> 
> >>> But you know all this I expect.  It isn't about the total number of
> >>> trees. It is about the flow of patches which must all flow through Linus.
> >>> And developers must develop against current linus, or something very
> >>> close to that.  Developing against an older kernel is simply making more
> >>> work for yourself.
> >> 
>>>> This won't work. Let me explain a situation from the past.
> > 
> > No.  I'm not at all interested in explanations of why it won't work.
> > 
> > I'm only interested in suggestions of how to make it work, and offers of
> > help.
> > 
> If you have a good way to solve this kind of situation, which drove people crazy in the past,
> please share how to avoid Lustre source code fragmentation caused by the code being frozen at different stages in different distributions.
> Ubuntu might have a modern Lustre with a 6.5 kernel, while Redhat froze a Lustre version three releases back, and these clients are not compatible with each other.
> And not compatible with the installed server.
> So the question is - who will do that support? Do you have any ideas how to solve this problem?
> 

sorry - I didn't mean to go quiet on you - I've been busy :-)

It's not entirely clear to me what the problem is.
You talk about clients not being compatible with each other or with the
server, but Andreas has said that there is good compatibility between
different versions so I wonder if that is really (still) an issue.

Keeping different kernels up to date with new updates is something that
the linux-stable team does all the time.  We do it at SUSE too.  It isn't
that hard.
You identify which patches *need* to be backported (ideally when the
patch is created but that isn't always easy) and you use tools to help
you backport them.

Certainly there is effort involved, but maintaining a package that works
on a large set of kernels also involves effort.  It isn't clear to me
that it is *more* effort, just *different* effort.

NeilBrown
_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-01-22 20:54             ` NeilBrown
@ 2025-01-22 21:44               ` Oleg Drokin
  2025-01-23  4:51               ` Alexey Lyahkov
  1 sibling, 0 replies; 61+ messages in thread
From: Oleg Drokin @ 2025-01-22 21:44 UTC (permalink / raw)
  To: neilb@suse.de, alexey.lyashkov@gmail.com; +Cc: lustre-devel@lists.lustre.org

On Thu, 2025-01-23 at 07:54 +1100, NeilBrown wrote:

> It's not entirely clear to me what the problem is.
> You talk about clients not being compatible with each other or with
> the
> server, but Andreas has said that there is good compatibility between
> different versions so I wonder if that is really (still) an issue.

Problems are multifaceted.
One is that yes we have all sorts of compatibility in the protocol, but
as you pointed out elsewhere, there's no formal protocol definition, so
it's just "whatever the code does" which does differ from version to
version at times and we do have occasional interop issues. We tend to
catch those in testing (sometimes late, and sometimes the patch is client
side too), but the obvious limitation here is we only see it where we
test, and the official guarantee is a pretty narrow window, and we
cannot have an unlimited test matrix explosion to test against every
mainline kernel going back 7 or whatever years.
 
Then there are all the features people want and stuff, and having an
outdated in tree client interferes with an up to date out of tree
client building. And creating patches to random old kernels to bring
them up to date is not very exciting, and of course there's the extra
fun of then getting distro people to even accept those patches.

> Keeping different kernels up to date with new updates is something
> that
> the linux-stable team does all the time.  We do it at SUSE to.  It
> isn't
> that hard.
> You identify which patches *need* to be backported (ideally when the
> patch is created but that isn't always easy) and you use tools to
> help
> you backport them.

This probably makes backporting features not very convenient (and
distro people would push back with "oh no, stability/bug fixes only
please!"
(there are benefits too of course, once you manage to push something
through to the distro people, people are often much better at following
the distro kernel updates)

> Certainly there is effort involved, but maintaining a package that
> works
> on a large set of kernels also involves effort.  It isn't clear to me
> that it is *more* effort, just *different* effort.

yes.
Also "works" is such a nebulous word in this context as I just got
rhel8.10 and rhel9.5 with all sorts of extra debug not normally enabled
by distro people into my test rigs and fireworks ensued (different ones
for those different versions).
_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-01-22 20:54             ` NeilBrown
  2025-01-22 21:44               ` Oleg Drokin
@ 2025-01-23  4:51               ` Alexey Lyahkov
  2025-01-24 23:24                 ` NeilBrown
  1 sibling, 1 reply; 61+ messages in thread
From: Alexey Lyahkov @ 2025-01-23  4:51 UTC (permalink / raw)
  To: NeilBrown; +Cc: lustre-devel@lists.lustre.org


[-- Attachment #1.1: Type: text/plain, Size: 4322 bytes --]



> On 22 Jan 2025, at 23:54, NeilBrown <neilb@suse.de> wrote:
> 
> On Mon, 20 Jan 2025, Alexey Lyahkov wrote:
>> 
>>> On 19 Jan 2025, at 11:03, NeilBrown <neilb@suse.de> wrote:
>>> 
>>> On Sun, 19 Jan 2025, Alexey Lyahkov wrote:
>>>> Neil,
>>>>> 
>>>>> 
>>>>>> 
>>>>>> It does not help that there are what 3? 4? trees, not "dual-tree" by any
>>>>>> stretch of imagination.
>>>>>> 
>>>>>> There's DDN/whamcloud (that's really two trees), there's HPE, LLNL
>>>>>> keeps their fork still I think (thought it's mostly backports?). There
>>>>>> are likely others I am less exposed to.
>>>>> 
>>>>> "dual-tree" maybe isn't the best way of describing what was wrong with
>>>>> the previous approach.  "upstream-first" is one way of describing how it
>>>>> should be run, though that needs to be understood correctly.
>>>>> 
>>>>> Patches should always flow upstream first, then flow downstream into
>>>>> distro.  So I write a patch in my own devel tree.  I post it or submit a
>>>>> pull request and eventually it is accepted into the maintainers
>>>>> "testing" tree (upsream from me).  There it gets more testing and moves
>>>>> to the maintainers "next" tree from which it is pulled into linux-next
>>>>> for integration testing.  Then it goes upstream to Linus (possibly
>>>>> through an intermediary).  From Linus it goes to -stable and to various
>>>>> distros etc.  Individual patches are selected for further backporting to
>>>>> all sorts of different LTS tree.
>>>>> 
>>>>> Occasionally there are short-cuts.  I might submit a patch from my tree
>>>>> to a SUSE kernel before it is accepted upstream, or maybe even before it
>>>>> is sent if it is urgent.  But these are not the norm.
>>>>> 
>>>>> But you know all this I expect.  It isn't about the total number of
>>>>> trees. It is about the flow of patches which must all flow through Linus.
>>>>> And developers must develop against current linus, or something very
>>>>> close to that.  Developing against an older kernel is simply making more
>>>>> work for yourself.
>>>> 
>>>> This won't work. Let me explain a situation from the past.
>>> 
>>> No.  I'm not at all interested in explanations of why it won't work.
>>> 
>>> I'm only interested in suggestions of how to make it work, and offers of
>>> help.
>>> 
>> If you have a good way to solve this kind of situation, which drove people crazy in the past,
>> please share how to avoid Lustre source code fragmentation caused by the code being frozen at different stages in different distributions.
>> Ubuntu might have a modern Lustre with a 6.5 kernel, while Redhat froze a Lustre version three releases back, and these clients are not compatible with each other.
>> And not compatible with the installed server.
>> So the question is - who will do that support? Do you have any ideas how to solve this problem?
>> 
> 
> sorry - I didn't mean to go quiet on you - I've been busy :-)
> 
> It's not entirely clear to me what the problem is.
> You talk about clients not being compatible with each other or with the
> server, but Andreas has said that there is good compatibility between
> different versions so I wonder if that is really (still) an issue.

Lustre has good compatibility, but not in all cases. It is only tested with +/-1 release.
I remember several cases where this caused problems with clients more than one release old.
As for Andreas, Andreas hasn't worked on client support for a long time, so he doesn't know about all the problems in this area.

> 
> Keeping different kernels up to date with new updates is something that
> the linux-stable team does all the time.  We do it at SUSE too.  It isn't
> that hard.
> You identify which patches *need* to be backported (ideally when the
> patch is created but that isn't always easy) and you use tools to help
> you backport them.
So Lustre developers would need to track all stable kernels, decide which patches need backporting, and send them to the distro owner -
and do that for each LTS kernel on kernel.org. I think that increases the work dramatically.

> 
> Certainly there is effort involved, but maintaining a package that works
> on a large set of kernels also involves effort.  It isn't clear to me
> that it is *more* effort, just *different* effort.
> 
> NeilBrown

Alex

[-- Attachment #1.2: Type: text/html, Size: 5000 bytes --]

[-- Attachment #2: Type: text/plain, Size: 165 bytes --]

_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-01-20  4:38         ` Day, Timothy
  2025-01-20  5:37           ` Oleg Drokin
@ 2025-01-23  9:00           ` Alexey Lyahkov
  1 sibling, 0 replies; 61+ messages in thread
From: Alexey Lyahkov @ 2025-01-23  9:00 UTC (permalink / raw)
  To: Day, Timothy; +Cc: lustre-devel@lists.lustre.org



> On 20 Jan 2025, at 07:38, Day, Timothy <timday@amazon.com> wrote:
> 
> 
> 
> > On 1/19/25, 3:46 PM, "Oleg Drokin" <green@whamcloud.com> wrote:
>>> On Sat, 2025-01-18 at 21:46 +0000, Day, Timothy wrote:
>>> 
>>> 
>>>> On 1/17/25, 10:17 PM, "Oleg Drokin" <green@whamcloud.com> wrote:
>>>>> On Sat, 2025-01-18 at 11:45 +1100, NeilBrown wrote:
>>>>> We need to demonstrate a process for, and commitment to, moving
>>>>> away
>>>>> from the dual-tree model. We need patches to those parts of
>>>>> Lustre
>>>>> that are upstream to land in upstream first (mostly).
>>>> 
>>>> 
>>>> I think this is not very realistic.
>>>> Large chunk (100%?) of users do not run not only the latest kernel
>>>> release, they don't run the latest LTS either.
>>>> 
>>>> 
>>>> When we were in staging last this manifested in random patches
>>>> being
>>>> landed and breaking the client completely and nobody noticing for
>>>> months.
>>>> 
>>>> 
>>>> Of course some automatic infrastructure could be built up to make
>>>> it
>>>> somewhat better, but it does not remove the problem of "nobody
>>>> would
>>>> run this mainline tree", I am afraid.
>>> 
>>> I think there's a decent chunk of users on newer kernels. Ubuntu
>>> 22/24 is
>>> on (a bit past latest) LTS 6.8 kernel [1], AL2023 is on previous LTS
>>> 6.1 [2], and
>>> working on upcoming LTS 6.12 [3].
>> 
>> 
>> Well, I mostly mean in context of Lustre client use and sure there's
>> some 6.8 LTS in use on those ubuntu clients, though I cannot assess the
>> real numbers, majority of reports I see are still on 5.x even on
>> Ubuntu.
> 
> Yeah, I'm not sure of the real numbers. It's just my personal experience
> that newer kernels are getting a lot of traction.


https://www.eofs.eu/wp-content/uploads/2024/02/1.1-community_release_update.pdf

https://wiki.opensfs.org/images/1/1d/OpenSFS_Survey_Results_March_2024.pdf

And similar presentations from LAD/LUG.


Alex
_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-01-22  6:35 ` Day, Timothy
  2025-01-22  7:09   ` Andreas Dilger
  2025-01-22 11:12   ` Alexey Lyahkov
@ 2025-01-24 15:53   ` Day, Timothy
  2 siblings, 0 replies; 61+ messages in thread
From: Day, Timothy @ 2025-01-24 15:53 UTC (permalink / raw)
  To: lustre-devel@lists.lustre.org

I've created a wiki version of the outline below: https://wiki.lustre.org/Upstream_contributing. I'm
trying to consolidate all material related to the upstreaming effort on that page. If you know of anything,
feel free to add it to that page - or respond to this thread and I can add it.

Tim Day

On 1/22/25, 1:35 AM, "Day, Timothy" <timday@amazon.com> wrote:


I've created a second draft of the topic for LSF/MM. I tried
to include everyone's feedback. It's at the end of the email.


Before that, I wanted to elaborate on Neil's idea about updating
our development model to an upstream-focused model. For upstreaming
to work, the normal development flow has to generate patches to mainline
Linux - while still supporting the distro kernels that most people use
to run Lustre. I think we can get to this point in stages. I've provided
a high-level overview in the next section. This won't be without
challenges - but the majority of the transition could happen without
interrupting feature work or normal development.


[I] Separate the kernel code, compatibility code, and userspace code


We should reorganize the Lustre tree to have a clear separation
of concerns:


fs/lustre/
net/lnet/
net/libcfs/
lustre_compat/
tests/
utils/


The functional components of libcfs/ would stay in that directory
and the compatibility components would live in lustre_compat/.
Centralizing the compatibility code makes it easier to maintain and
update and allows us to start removing the compatibility code from
the modules themselves. lustre_compat/ could still be compiled into
libcfs.ko, if we want to avoid creating even more modules.


[II] Get fs/ and net/ to compile on a mainline kernel


Once the compatibility code is isolated, we must get fs/ and net/
to compile on a mainline kernel - without any configuration or
lustre_compat/ layer.


We would validate this by adding build validation to each patch
submitted to Gerrit. The kernel version would be pinned (similar
to how we pin ZFS version) and we'd periodically update it and fix
any new build failures.


Once this is achieved, we'll have a native Linux client/server
that can be run on older distros via a compatibility layer.


[III] Move fs/ and net/ to a separate kernel tree


Transition to maintaining fs/ and net/ as a series of patches
on top of a mainline kernel release. At this point, we'll be generating
patches to mainline Linux while retaining the ability to support
older distro kernels via lustre_compat/. Similar to the previous
step, we periodically rebase our Lustre patch series - fixing
lustre_compat/ as needed.


This is the only step that requires a change to the Lustre development
workflow - patches would have to be split and sent to two
different repos. We can delay this step until we have some
confidence that Lustre has a path to be accepted to mainline.


[IV] Submit the patch series for inclusion


Once we are comfortable with the above process, we can submit the
initial patches to add Lustre support to the kernel. Our normal
development flow will generate a batch of patches to be submitted
during each merge window. After the merge window, we can focus
on testing and making sure that our backport to older distro
kernels is still working.


FAQ:


Q: Who will actually run the Lustre code in mainline Linux?
A: Everyone. Releases for older distros will combine the
upstream Lustre code with lustre_compat/ and whatever
the kernel won't allow (like GPUDirect).


Q: What does a Lustre release look like?
A: We can generate a tarball by combining an upstream Lustre
release from mainline along with lustre_compat/ and the
userspace stuff. Vendors and third-parties can base
their versions of Lustre on those tarballs. Every time a
new kernel is released, a new Lustre release tarball will
be created. LTS releases can center around the LTS kernel
releases.


Q: How will we validate that fs/ and net/ build on mainline?
A: It would probably be easiest to create a minimalist mainline
kernel build in Jenkins. This would allow us to reuse most
of the existing lbuild scripting. The build would be
non-enforced at first. Testing would remain on distro
kernels, since most people use those.


Q: Will you create a wiki project tracking page for upstreaming
Lustre?
A: Yes


Q: Does anyone else have a similar model? Does this even work?
A: AMD GPU seems to have a similar approach, at least [1]. I'm
looking to get more feedback at LSF. We should talk to other
developers working in a model similar to this.


This is still a high level sketch, but I think this is a feasible
path to upstreaming Lustre. We need to define a clear roadmap
with tangible milestones for upstreaming to have any hope of working.


But it's important that we don't disrupt developers' established
workflows. We don't want to complicate contributing to Lustre
and we don't want to discourage people from contributing their
changes upstream.


Please give me any feedback or criticisms on this proposal. If we
think this is workable, I'm going to create a wiki project page for
this and attach it to the LSF/MM email.


[1] AMD GPU DKMS: https://github.com/geohot/amdgpu-dkms


--------------------------------------------------------------------------------


Lustre is a high-performance parallel filesystem used for HPC
and AI/ML compute clusters available under GPLv2. Lustre is
currently used by 65% of the Top-500 (9 of Top-10) systems in
HPC [7]. Outside of HPC, Lustre is used by many of the largest
AI/ML clusters in the world, and is commercially supported by
numerous vendors and cloud service providers [1].


After 21 years and an ill-fated stint in staging, Lustre is still
maintained as an out-of-tree module [6]. The previous upstreaming
effort suffered from a lack of developer focus and user adoption,
which eventually led to Lustre being removed from staging
altogether [2].


However, the work to improve Lustre has continued regardless. In
the intervening years, the code improvements that previously
prevented a return to mainline have been steadily progressing. At
least 25% of patches accepted for Lustre 2.16 were related to the
upstreaming effort [3]. And all of the remaining work is
in-flight [4][5]. Our eventual goal is to get both the Lustre
client and server (on ext4) along with at least TCP/IP networking to
an acceptable quality before submitting to mainline. The remaining
network support would follow soon afterwards.


I propose to discuss:


- As we alter our development model to support upstream development,
what is a sufficient demonstration of commitment that our model works? [8]
- Should the client and server be submitted together? Or split?
- Expectations for a new filesystem to be accepted to mainline
- How to manage inclusion of a large code base (the client alone is
200kLoC) without increasing the burden on fs/net maintainers


Lustre has already received a plethora of feedback in the past.
While much of that has been addressed since - the kernel is a
moving target. Several filesystems have been merged (or removed)
since Lustre left staging. We're aiming to avoid the mistakes of
the past and hope to address as many concerns as possible before
submitting for inclusion.


Thanks!


Timothy Day (Amazon Web Services - AWS)
James Simmons (Oak Ridge National Labs - ORNL)


[1] Wikipedia: https://en.wikipedia.org/wiki/Lustre_(file_system)#Commercial_technical_support
[2] Kicked out of staging: https://lwn.net/Articles/756565/
[3] This is a heuristic, based on the combined commit counts of
ORNL, Aeon, SuSe, and AWS - which have been primarily working
on upstreaming issues: https://youtu.be/BE--ySVQb2M?si=YMHitJfcE4ASWQcE&t=960
[4] LUG24 Upstreaming Update: https://www.depts.ttu.edu/hpcc/events/LUG24/slides/Day1/LUG_2024_Talk_02-Native_Linux_client_status.pdf
[5] Lustre Jira Upstream Progress: TODO
[6] Out-of-tree codebase: https://git.whamcloud.com/?p=fs/lustre-release.git;a=tree
[7] I couldn't find a link to this? TODO
[8] Include a link to a project wiki: TODO







_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-01-22 17:48       ` Alexey Lyahkov
@ 2025-01-24 17:06         ` Day, Timothy
  2025-01-24 19:23           ` Alexey Lyahkov
  0 siblings, 1 reply; 61+ messages in thread
From: Day, Timothy @ 2025-01-24 17:06 UTC (permalink / raw)
  To: Alexey Lyahkov; +Cc: lustre-devel@lists.lustre.org



> On 1/22/25, 12:48 PM, "Alexey Lyahkov" <alexey.lyashkov@gmail.com> wrote:
>> On 22 Jan 2025, at 20:17, Day, Timothy <timday@amazon.com> wrote:
>>> On 1/22/25, 6:14 AM, "Alexey Lyahkov" <alexey.lyashkov@gmail.com> wrote:
>>>
>>> Timothy,
>>>
>>>> On 22 Jan 2025, at 09:35, Day, Timothy <timday@amazon.com> wrote:
>>>>
>>>> I've created a second draft of the topic for LSF/MM. I tried
>>>> to include everyone's feedback. It's at the end of the email.
>>>>
>>>> Before that, I wanted to elaborate on Neil's idea about updating
>>>> our development model to an upstream-focused model. For upstreaming
>>>> to work, the normal development flow has to generate patches to mainline
>>>> Linux - while still supporting the distro kernels that most people use
>>>> to run Lustre. I think we can get this point in stages. I've provided
>>>> a high-level overview in the next section. This won't be without
>>>> challenges - but the majority of the transition could happen without
>>>> interrupting feature work or normal development.
>>>>
>>>
>>> Can you explain how Lustre platform fragmentation will be avoided?
>>>
>>>
>>> I posted an example earlier.
>>> A distro locks in a Lustre version at release time. But Lustre servers have limited compatibility - in most cases only +/- 1-2 releases are guaranteed to connect. So a stale, aged client will live on in the distribution kernel, and it won't work with modern servers.
>>> That happens very easily once a distribution's lifetime is ~8 years. So clients will need to drop the in-kernel Lustre client and install a Lustre client from external sources - which is no different from the current state.
>>> The next step is an assortment of distributions carrying Lustre versions that are incompatible with each other.
>>> Both of these increase support costs - once a large number of versions needs supporting, development drops off and all the time is spent on support.
>>
>> I think that's a reasonable concern. I spend a lot of time doing customer
>> support for Lustre; I definitely don't want to make that part of my job any
>> harder than it has to be.
>>
>> In my personal experience, I've seen 2.10 and 2.15 interoperate well together.
>> That covers a gap of around ~6 years at least. If someone stuck with RHEL7, the
>> first client they could use is 2.7.0 and the last client they could use is 2.16.0 [1].
>> So if a customer didn't update either their distro or filesystem, they could use an
>> up-to-date Lustre version for around 10 years covering 9 versions. So I think these
>> large version gaps are possible today.
>
> Customers expect to update the server side, but that is not always true for the client side.
> They expect to stick with their RHEL7 version until EOL, because old HW may not be supported by a new version.
> (Look at the RHEL HW support reduction between releases: with RHEL7->RHEL8, many RAID cards were dropped from support.)
>
>
>> There is an issue if distros don't want to update their clients.
>
> It is not a matter of "if they don't want to update" - Ubuntu didn't update its own Lustre code in the past.
> I don't expect that to change, because the distro owner would need to hire more developers to provide the extra support,
> but gets no money from it.
>
>
>
>
>> That's why we'll
>> still support running latest Lustre on older distros. Specifically, it'll be the Lustre
>> code from a mainline kernel combined with our lustre_compat/ compatibility
>> code. So normal Lustre releases will be derived directly from the in-tree kernel
>> code. This provides a path for vendors to deploy bug fixes, custom features, and
>> allows users to optionally run the latest and greatest Lustre code.
>
>And oops - both codebases (in-kernel and out-of-tree) have the same sort of defines in config.h, which conflict when building an out-of-tree Lustre.
>Some examples of the MOFED hacks used to solve the same problem can be seen in o2iblnd:
>>>>
>#if defined(EXTERNAL_OFED_BUILD) && !defined(HAVE_OFED_IB_DMA_MAP_SG_SANE)
>#undef CONFIG_INFINIBAND_VIRT_DMA
>#endif
>>>>
>As I remember, this problem broke the ability to build Lustre as an out-of-tree module on Ubuntu 18.04 while Lustre was in staging/.

I think we should be able to validate that Lustre still builds as an
out-of-tree module by re-using a lot of the testing we already
do today in Jenkins/Maloo. All we'd need to do is kick off test/build
sessions once the merge window closes. Based on the MOFED
example you gave, it seems like this is solvable.
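
As a rough sketch, the out-of-tree build check could be as simple as
an external kbuild invocation against each target kernel's headers
(simplified - the real build goes through autoconf, and the paths
here are placeholders):

    # Build Lustre as an external module against an installed kernel;
    # a non-zero exit status fails the build session.
    make -C /lib/modules/$(uname -r)/build M=$PWD/lustre modules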

>>
>> [1] Lustre changelog: https://git.whamcloud.com/?p=fs/lustre-release.git;a=blob_plain;f=lustre/ChangeLog;hb=HEAD
>>
>>> If this is not enough - let's have one more. The kernel API isn't stable enough, so a large amount of resources will need to be spent to handle each kernel change in Lustre. Currently, that work happens in the background and doesn't interrupt the primary work of supporting and developing new Lustre features.
>>>
>>> So those are the problems for the Lustre world - what are the benefits?
>>
>> By upstreaming Lustre, we'll benefit from developers updating the kernel
>> API "for free".
> It's not "for free" - do you really think any of the kernel developers have a cluster to run a Lustre client on to test their changes?
> I think not, so testing will be "just compile with the proposed/default config".
> Since that would be a lack of proper testing (remember, a full run of the Lustre test suite takes ~12-24h) - Lustre developers would need to review each change to the Lustre code.

That's why I put "for free" in quotes. We need to make it easier for
upstream developers to test their changes so they don't completely
break Lustre. If we upstream the client and server concurrently, we
can implement xfstests support [1]. This would provide at least basic
validation. NFS does something similar. We could even copy over a
subset of Lustre specific tests from sanity.sh into xfstests.

It's not perfect - but it'd be a much better situation compared to the
previous attempt in staging.

[1] https://github.com/kdave/xfstests
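
For illustration, the xfstests side could start with a local.config
along these lines (FSTYP=lustre support is hypothetical until we wire
it up, and the filesystem and mount point names are placeholders):

    # Sketch of an xfstests local.config for a Lustre client.
    export FSTYP=lustre
    export TEST_DEV=mgs@tcp:/testfs      # long-lived test filesystem
    export TEST_DIR=/mnt/test
    export SCRATCH_DEV=mgs@tcp:/scratch  # filesystem xfstests may recreate
    export SCRATCH_MNT=/mnt/scratch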

> And all of these changes need to be backported to the out-of-tree version, since the Lustre side needs changes too.
> The best example is 'folio' - that needed changes on both sides.

If the out-of-tree version is derived from the in-tree version of
Lustre - I don't think the backporting will be that burdensome.
We're essentially doing the same work now, but in reverse. Instead
of porting an upstream driver to old kernels, we are porting an
older driver to new kernels.

>> When Lustre was in staging/, there wasn't as much obligation
>> to keep Lustre in a working state. But if we get Lustre merged properly, 
>> developers will not be able to merge changes that break Lustre. So we'll
>> get support for the latest and greatest kernels with less effort. That's one
>> of the main benefits of this effort.
>>
>>
>> We also benefit from having more say over the future of the kernel. A lot
>> of difficulty with updating Lustre for new kernels comes when upstream
>> kernel developers lock down symbols or features to in-tree modules. This
>> could get even worse in the future, as stuff like symbol namespaces gets
>> more use [2].
>>
>> Even if most users use the out-of-tree backported-from-mainline-Linux
>> Lustre release, I think we'll still be in a stronger position after
>> upstreaming.
>>
>> [2] https://lwn.net/Articles/760045/
>>

>
> PS. Lustre is able to run a server on very, very lightly modified ext4 code. Mostly some exports / callbacks from the core.
>

That's good to hear - I think that'll make it easier to convince
upstream to accept the ext4 patches needed to run the server.

>
> Alex
>

Tim Day

_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-01-24 17:06         ` Day, Timothy
@ 2025-01-24 19:23           ` Alexey Lyahkov
  2025-01-29 19:00             ` Day, Timothy
  0 siblings, 1 reply; 61+ messages in thread
From: Alexey Lyahkov @ 2025-01-24 19:23 UTC (permalink / raw)
  To: Day, Timothy; +Cc: lustre-devel@lists.lustre.org



> On 24 Jan 2025, at 20:06, Day, Timothy <timday@amazon.com> wrote:
> 
>> 
>> 
>>> There is an issue if distros don't want to update their clients.
>> 
>> It is not a matter of "if they don't want to update" - Ubuntu didn't update its own Lustre code in the past.
>> I don't expect that to change, because the distro owner would need to hire more developers to provide the extra support,
>> but gets no money from it.
>> 
>> 
>> 
>> 
>>> That's why we'll
>>> still support running latest Lustre on older distros. Specifically, it'll be the Lustre
>>> code from a mainline kernel combined with our lustre_compat/ compatibility
>>> code. So normal Lustre releases will be derived directly from the in-tree kernel
>>> code. This provides a path for vendors to deploy bug fixes, custom features, and
>>> allows users to optionally run the latest and greatest Lustre code.
>> 
>> And oops - both codebases (in-kernel and out-of-tree) have the same sort of defines in config.h, which conflict when building an out-of-tree Lustre.
>> Some examples of the MOFED hacks used to solve the same problem can be seen in o2iblnd:
>>>>> 
>> #if defined(EXTERNAL_OFED_BUILD) && !defined(HAVE_OFED_IB_DMA_MAP_SG_SANE)
>> #undef CONFIG_INFINIBAND_VIRT_DMA
>> #endif
>>>>> 
>> As I remember, this problem broke the ability to build Lustre as an out-of-tree module on Ubuntu 18.04 while Lustre was in staging/.
> 
> I think we should be able to validate that Lustre still builds as an
> out-of-tree module by re-using a lot of the testing we already
> do today in Jenkins/Maloo.
Yes, I do. But it needs many extra resources. Is Amazon ready to provide such HW resources for it?
Or who will pay for it? That is the cost of moving into the kernel.


> All we'd need to do is kick off test/build
> sessions once the merge window closes. Based on the MOFED
> example you gave, it seems like this is solvable.
> 
Sure, all of it can be solved. But what is the cost of this, and the cost of supporting these changes?
And the next question - who will pay for this cost? Who will provide the HW for the extra testing?
So the second face of "no cost for kernel API changes" is the problem of backporting these changes and the extra testing.



>>> 
>>> [1] Lustre changelog: https://git.whamcloud.com/?p=fs/lustre-release.git;a=blob_plain;f=lustre/ChangeLog;hb=HEAD
>>> 
>>>> If this is not enough - let's have one more. The kernel API isn't stable enough, so a large amount of resources will need to be spent to handle each kernel change in Lustre. Currently, that work happens in the background and doesn't interrupt the primary work of supporting and developing new Lustre features.
>>>>
>>>> So those are the problems for the Lustre world - what are the benefits?
>>> 
>>> By upstreaming Lustre, we'll benefit from developers updating the kernel
>>> API "for free".
>> It's not "for free" - do you really think any of the kernel developers have a cluster to run a Lustre client on to test their changes?
>> I think not, so testing will be "just compile with the proposed/default config".
>> Since that would be a lack of proper testing (remember, a full run of the Lustre test suite takes ~12-24h) - Lustre developers would need to review each change to the Lustre code.
> 
> That's why I put "for free" in quotes. We need to make it easier for
> upstream developers to test their changes so they don't completely
> break Lustre.
Ah.. so Lustre will have a vote to stop anything landing in the kernel until Lustre testing is done?
Do you understand how many tests will need to run?
Full testing needs ~24h of run time for a single node.
How many HW resources can Amazon share to run these tests?
Do you understand - if the Lustre code is changed by someone upstream, that change may not be backportable to the main tree because the compatibility code can't handle it?
Sometimes we need to stay with old behavior that has been re-implemented with new kernel code.


> If we upstream the client and server concurrently, we
> can implement xfstests support [1]. This would provide at least basic
> validation. NFS does something similar. We could even copy over a
> subset of Lustre specific tests from sanity.sh into xfstests.
The NFS server doesn't have many of Lustre's features, and it isn't expected to be built as an out-of-tree module for different kernels.

> 
> It's not perfect - but it'd be a much better situation compared to the
> previous attempt in staging.
> 
> [1] https://github.com/kdave/xfstests
I'm sorry, but these are very simple test cases. Lustre is a much more complex FS.


> 
>> And all of these changes need to be backported to the out-of-tree version, since the Lustre side needs changes too.
>> The best example is 'folio' - that needed changes on both sides.
> 
> If the out-of-tree version is derived from the in-tree version of
> Lustre - I don't think the backporting will be that burdensome.
> We're essentially doing the same work now, but in reverse. Instead
> of porting an upstream driver to old kernels, we are porting an
> older driver to new kernels.
Except for some notes.
1) Lustre release cycles. Right now they are not aligned with the kernel's. There is no situation today where a senior developer must stop their own work to review kernel changes because they might affect Lustre's stability. But with Lustre in the kernel, any change in the kernel affects Lustre - and needs urgent review / testing.
So extra developer positions/HW are needed.

2) There is no problem with having custom patches upstream.
Someone may think something in the Lustre code needs cleaning up, and that patch will be accepted.
That generates a conflict with code changed in the same place in the main Lustre repository.
Moving the whole of Lustre development into the kernel is not possible because there is no server part, but servers sometimes carry "client" code on their own side.

Not such a small cost for "updates for free"?

> 
>>> When Lustre was in staging/, there wasn't as much obligation
>>> to keep Lustre in a working state. But if we get Lustre merged properly, 
>>> developers will not be able to merge changes that break Lustre. So we'll
>>> get support for the latest and greatest kernels with less effort. That's one
>>> of the main benefits of this effort.
>>> 
>>> 
>>> We also benefit from having more say over the future of the kernel. A lot
>>> of difficulty with updating Lustre for new kernels comes when upstream
>>> kernel developers lock down symbols or features to in-tree modules. This
>>> could get even worse in the future, as stuff like symbol namespaces gets
>>> more use [2].
>>> 
>>> Even if most users use the out-of-tree backported-from-mainline-Linux
>>> Lustre release, I think we'll still be in a stronger position after
>>> upstreaming.
>>> 
>>> [2] https://lwn.net/Articles/760045/
>>> 
> 
>> 
>> PS. Lustre is able to run a server on very, very lightly modified ext4 code. Mostly some exports / callbacks from the core.
>> 
> 
> That's good to hear - I think that'll make it easier to convince
> upstream to accept the ext4 patches needed to run the server.
> 
>> 
>> Alex
>> 
> 
> Tim Day
> 

_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-01-19 21:20       ` Oleg Drokin
@ 2025-01-24 23:12         ` NeilBrown
  2025-01-25  6:40           ` Oleg Drokin
  0 siblings, 1 reply; 61+ messages in thread
From: NeilBrown @ 2025-01-24 23:12 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: lustre-devel@lists.lustre.org

On Mon, 20 Jan 2025, Oleg Drokin wrote:
> On Sun, 2025-01-19 at 09:48 +1100, NeilBrown wrote:
> > 
> > Once the transition completes there will still be process
> > difficulties,
> > but there are plenty of process difficulties now (gerrit: how do I
> > hate thee, let me count the ways...) but people seem to simply
> > include
> > that in the cost of doing business.
> 
> it's been awhile since I did patch reviews by emails, but I think
> gerrit is much more user-friendly (if you have internet, anyway)

I guess it isn't exactly the gerrit interface but more the workflow that
it encourages, or at least enables.
The current workflow seems to be "patch at a time" rather than "patchset
at a time".
The fact that you cherry-pick patches into master is completely
different to how most (all?) of the upstream community works.  It means
that whole series isn't visible in the final git tree so we lose
context.  And it seems to mean that a long series takes a loooooong time
to land as it dribbles into master.

I would MUCH rather that a whole series was accepted or rejected as a
whole - and was merged rather than cherry-picked to keep commit ids
stable.
There are times when I would like the first few patches of a series to
land earlier, but that should be up to the submitter to split the
series. 

And the automatic testing is a real pain.  Certainly it is valuable but
it has a real cost too.  The false positives are a major pain.  I would
rather any test that wasn't reliable were disabled (or fixed) as a
priority.  Or at least made non-fatal.
Also, it would be much nicer if the last in a series were tested first
and if that failed then we don't waste resources testing all the others.
Bonus points for a "bisect" to find where the failure starts, but that
can be up to the developer to explicitly request testing at some points in
the series.

NeilBrown
_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-01-23  4:51               ` Alexey Lyahkov
@ 2025-01-24 23:24                 ` NeilBrown
  2025-01-25  9:09                   ` Alexey Lyahkov
  0 siblings, 1 reply; 61+ messages in thread
From: NeilBrown @ 2025-01-24 23:24 UTC (permalink / raw)
  To: Alexey Lyahkov; +Cc: lustre-devel@lists.lustre.org

On Thu, 23 Jan 2025, Alexey Lyahkov wrote:
> 
> 
>    
>     Keeping different kernels up to date with new updates is something
>     that
>     the linux-stable team does all the time.  We do it at SUSE too.  It
>     isn't
>     that hard.
>     You identify which patches *need* to be backported (ideally when
>     the
>     patch is created but that isn't always easy) and you use tools to
>     help
>     you backport them.
> 
> So Lustre developers need to keep track of all the stable kernels, think about which
> patches need backporting, and send them to the distro owner -
> and for each of the LTS kernels on kernel.org.. I think that increases the work
> dramatically.

No.  Lustre developers don't need to care about the stable kernels at
all.  The stable team does that and explicitly says they don't want it to
be a burden on maintainers.
The lustre team *can* decide to have some involvement - adding Fixes
tags, adding Cc: stable, even submitting backports which don't apply
trivially.  But there is no requirement from anywhere.
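
For what it's worth, that opt-in is just a couple of commit trailers -
the hash and subject below are made up for illustration:

    Fixes: 123456789abc ("lustre: some earlier change")
    Cc: stable@vger.kernel.org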

The lustre community only needs to focus on one upstream.
Lustre developers who work for employers who sell support for older
kernels might need to handle backports to those kernels and it is in
everybody's interest not to make that unduly difficult e.g.  by
separating bug fixes from features etc.

The lustre community may well choose to host and share those backports,
and maybe even include them in testing.  But I suspect that would be
driven by vendors who sell support.  It certainly wouldn't be imposed by
the upstream community.

Exactly how we work with distros like Redhat, SUSE, Ubuntu would depend
on what can be negotiated with them.
Some might be willing to accept backports and release them in
maintenance updates.  Some might not.
In that case the way to support their kernel for your customers would be
to start with the source for a particular maint update, add the missing
patches, build, and distribute the result.
You probably would only need to do this for each service-pack, not for
each update.
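
As a sketch of that vendor flow, assuming a git-based distro source
tree (the directory and patch names here are placeholders):

    # Start from the distro's maintenance-update kernel source,
    # layer the missing lustre patches on top, then rebuild.
    cd distro-kernel-source
    git am ~/lustre-backports/*.patch
    make modules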

It isn't really different from what is done today, but it would be done
in a different way.

Thanks,
NeilBrown
_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-01-24 23:12         ` NeilBrown
@ 2025-01-25  6:40           ` Oleg Drokin
  2025-02-01 22:19             ` NeilBrown
  0 siblings, 1 reply; 61+ messages in thread
From: Oleg Drokin @ 2025-01-25  6:40 UTC (permalink / raw)
  To: neilb@suse.de; +Cc: lustre-devel@lists.lustre.org

On Sat, 2025-01-25 at 10:12 +1100, NeilBrown wrote:
> On Mon, 20 Jan 2025, Oleg Drokin wrote:
> > On Sun, 2025-01-19 at 09:48 +1100, NeilBrown wrote:
> > > 
> > > Once the transition completes there will still be process
> > > difficulties,
> > > but there are plenty of process difficulties now (gerrit: how
> > > do I
> > > hate thee, let me count the ways...) but people seem to simply
> > > include
> > > that in the cost of doing business.
> > 
> > it's been awhile since I did patch reviews by emails, but I think
> > gerrit is much more user-friendly (if you have internet, anyway)
> 
> I guess it isn't exactly the gerrit interface but more the workflow
> that
> it encourages, or at least enables.
> The current workflow seems to be "patch at a time" rather than
> "patchset
> at a time".
> The fact that you cherry-pick patches into master is completely
> different to how most (all?) of the upstream community works.  It
> means
> that whole series isn't visible in the final git tree so we lose
> context.  And it seems to mean that a long series takes a loooooong
> time
> > to land as it dribbles into master.

In fact the whole series is visible in gerrit if you submit it as such.
But as you noted later, the testing is unfortunately much less reliable
than we want, and because we only land things in the order they are
submitted in the series - if you bunch unrelated stuff all together,
suddenly it might take much longer to land.

> I would MUCH rather that a whole series was accepted or rejected as a
> whole - and was merged rather than cherry-picked to keep commit ids
> stable.

Gerrit has such a mode, but we decided it does not work well for us for
a variety of reasons.

> There are times when I would like the first few patches of a series
> to
> land earlier, but that should be up to the submitter to split the
> series. 

But you cannot if these later patches do depend on the earlier ones?

> And the automatic testing is a real pain.  Certainly it is valuable
> but
> it has a real cost too.  The false positives are a major pain.  I
> would
> rather any test that wasn't reliable were disabled (or fixed) as a
> priority.  Or at least made non-fatal.
> Also, it would be much nicer if the last in a series were tested
> first
> and if that failed then we don't waste resources testing all the
> others.
> Bonus points for a "bisect" to find where the failure starts, but
> that
> can be up to the developer to explicitly request testing at some points
> in
> the series.

Yes, I agree there's much to be improved testing-wise. It's not as if I
came up with my own parallel testing system because I was happy with the
default one, after all.
And then my own system deteriorated (at least it does not set -1s left
and right, though that means people totally ignore those results too).

_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-01-24 23:24                 ` NeilBrown
@ 2025-01-25  9:09                   ` Alexey Lyahkov
  2025-01-25 23:25                     ` NeilBrown
  0 siblings, 1 reply; 61+ messages in thread
From: Alexey Lyahkov @ 2025-01-25  9:09 UTC (permalink / raw)
  To: NeilBrown; +Cc: lustre-devel@lists.lustre.org


[-- Attachment #1.1: Type: text/plain, Size: 3817 bytes --]



> On 25 Jan 2025, at 02:24, NeilBrown <neilb@suse.de> wrote:
> 
> On Thu, 23 Jan 2025, Alexey Lyahkov wrote:
>> 
>> 
>> 
>>    Keeping different kernels up to date with new updates is something
>>    that
>>    the linux-stable team does all the time.  We do it at SUSE too.  It
>>    isn't
>>    that hard.
>>    You identify which patches *need* to be backported (ideally when
>>    the
>>    patch is created but that isn't always easy) and you use tools to
>>    help
>>    you backport them.
>> 
>> So Lustre developers need to keep track of all the stable kernels, think about which
>> patches need backporting, and send them to the distro owner -
>> and for each of the LTS kernels on kernel.org.. I think that increases the work
>> dramatically.
> 
> No.  Lustre developers don't need to care about the stable kernels at
> all.  The stable team does that and explicitly says they don't want it to
> be a burden on maintainers.
Lustre maintainers don't need to review code which affects Lustre? That's something new for me.
I understand that driver changes don't need to be reviewed by the Lustre team.
But arch/… fs/ .. mm/... kernel/ need attention.
Lack of review will cause very large quality degradation after a short time.
As I pointed out earlier - I think none of the Linux maintainers have a Lustre cluster to test patches on before landing them.
They can test their own part and that it builds at all. But how does it affect Lustre? Especially in the performance area.
Some small examples from the past.
A small optimisation for page_accessed() and the LRU lists fixed a problem with keeping ext4 bitmaps in memory and improved Lustre performance by 10%, due to the lack of reads during writes. (https://lwn.net/Articles/548830/)
A small change in the jbd2 code - like replacing list_add with list_add_tail - improved performance by 5-15% because it solved journal handle starvation.
(https://www.spinics.net/lists/linux-ext4/msg84888.html)

So yes, Lustre developers can treat LTS kernels as an unsupported area and, if something is broken, just suggest installing an out-of-tree module on a supported kernel. But does the Linux kernel really need broken code in tree?


> The lustre team *can* decide to have some involvement - adding Fixes
> tags, adding Cc: stable, even submitting backports which don't apply
> trivially.  But there is no requirement from anywhere.
> 
> The lustre community only needs to focus on one upstream.
And have a broken Lustre client once it isn't tested. Or the Lustre client will hit a performance degradation.


Neil, Tim, 

> Lustre developers who work for employers who sell support for older
> kernels might need to handle backports to those kernels and it is in
> everybody's interest not to make that unduly difficult e.g.  by
> separating bug fixes from features etc.
> 
Lustre's primary area is 'older' kernels. As I pointed out earlier, half of customers use RHEL7 and another 30% use RHEL8.
And just 2% use modern kernels.


> The lustre community may well choose to host and share those backports,
> and maybe even include them in testing.  But I suspect that would be
> driven by vendors who sell support.  It certainly wouldn't be imposed by
> the upstream community.
> 
> Exactly how we work with distros like Redhat, SUSE, Ubuntu would depend
> on what can be negotiated with them.
> Some might be willing to accept backports and release them in
> maintenance updates.  Some might not.
> In that case the way to support their kernel for your customers would be
> to start with the source for a particular maint update, add the missing
> patches, build, and distribute the result.
> You probably would only need to do this for each servie-pack, not for
> each update.
> 
> It isn't really different from what it done today, but it would be done
> in a different way.
> 
> Thanks,
> NeilBrown


[-- Attachment #1.2: Type: text/html, Size: 4987 bytes --]

[-- Attachment #2: Type: text/plain, Size: 165 bytes --]

_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-01-25  9:09                   ` Alexey Lyahkov
@ 2025-01-25 23:25                     ` NeilBrown
  0 siblings, 0 replies; 61+ messages in thread
From: NeilBrown @ 2025-01-25 23:25 UTC (permalink / raw)
  To: Alexey Lyahkov; +Cc: lustre-devel@lists.lustre.org

On Sat, 25 Jan 2025, Alexey Lyahkov wrote:
> 
> 
>     On 25 Jan 2025, at 02:24, NeilBrown <neilb@suse.de> wrote:
>    
>     On Thu, 23 Jan 2025, Alexey Lyahkov wrote:
>    
>        
>        
>        
>            Keeping different kernels up to date with new updates is
>            something that the linux-stable team does all the time.
>            We do it at SUSE too.  It isn't that hard.
>            You identify which patches *need* to be backported (ideally
>            when the patch is created but that isn't always easy) and
>            you use tools to help you backport them.
>        
>         So Lustre developers need to keep track of all the stable kernels,
>         think about which patches need backporting, and send them to the
>         distro owner - and for each of the LTS kernels on kernel.org..
>         I think that increases the work dramatically.
>    
>    
>     No.  Lustre developers don't need to care about the stable kernels
>     at
>     all.  The stable team does that and explicitly says they don't want
>     it to
>     be a burden on maintainers.
> 
> Lustre maintainers don't need to review code which affects Lustre? That's
> something new for me.

It may be new, but it is still true - partly.
The stable team only apply patches which have already landed in
upstream, so they have already been reviewed by upstream maintainers.
They do sometimes apply other patches when a fix is needed but it cannot
be achieved with a simple backport - but those will only be accepted
from maintainers.

So the patches *have* been reviewed.  They've been reviewed in a different
context so the review might not still apply and certainly patches do
land in -stable which break things in all sorts of different ways.  It
is generally thought that this cost is small compared to the benefit
of getting lots of fixes.

> I understand that driver changes don't need to be reviewed by the lustre team.
> But arch/… fs/ .. mm/... kernel/ need attention.
> Lack of review will cause very large quality degradation after a short
> time.
> As I pointed out earlier - I think none of the linux maintainers have a lustre
> cluster to test patches on before landing them.

They don't today, partly because lustre is not mainline.  There are a
number of testing efforts around the kernel which run all sorts of
different tests.  If we made it easy to spin up a virtual lustre cluster
for testing and publicised that, I think there is a reasonable chance
that some people will run it.  It doesn't need to be the upstream
maintainers.  It can be anyone with the relevant resources.
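
As an illustration of how low that bar could be - the lustre tree
already carries single-node test scripts, so something along these
lines (the exact invocation is a sketch) might be all a tester needs:

    # Format and mount a single-node MGS/MDT/OST plus a client on
    # loopback devices - enough for basic smoke testing.
    cd lustre/tests
    sh llmount.sh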

> They can test their own part and that it builds at all. But how does it affect
> Lustre? Especially in the performance area.
> Some small examples from the past.
> A small optimisation for page_accessed() and the LRU lists fixed a problem
> with keeping ext4 bitmaps in memory and improved lustre performance by 10%,
> due to the lack of reads during writes. (https://lwn.net/Articles/548830/)
> A small change in the jbd2 code - like replacing list_add with list_add_tail -
> improved performance by 5-15% because it solved journal handle starvation.
> (https://www.spinics.net/lists/linux-ext4/msg84888.html)
> 
> So yes, Lustre developers can treat LTS kernels as an unsupported area and,
> if something is broken, just suggest installing an out-of-tree module on a
> supported kernel. But does the Linux kernel really need broken code in tree?

If someone reports that an LTS kernel is broken they should be directed
to whoever supports it, not just told to rip out the code.
If they cannot find anyone to support it they could be guided to use a
different kernel that does have support.

> 
> 
> 
>     The lustre team *can* decide to have some involvement - adding
>     Fixes
>     tags, adding Cc: stable, even submitting backports which don't
>     apply
>     trivially.  But there is no requirement from anywhere.
>    
> 
>     The lustre community only needs to focus on one upstream.
> 
> And have a broken lustre client once it isn't tested. Or the lustre client
> will hit a performance degradation.

Yes, code that is not maintained will suffer regressions.  This is how
various vendors make money - by selling support and having expertise to
fix regressions.

> 
> 
> Neil, Tim, 
> 
> 
>     Lustre developers who work for employers who sell support for older
>     kernels might need to handle backports to those kernels and it is
>     in
>     everybody's interest not to make that unduly difficult e.g.  by
>     separating bug fixes from features etc.
>    
> 
> Lustre's primary area is 'older' kernels. As I pointed out earlier, half of
> customers use RHEL7 and another 30% use RHEL8.
> And just 2% use modern kernels.

RHEL's primary area is older kernels.  SLES's primary area is older kernels.
We still contribute primarily upstream.  
We know upstream is not suitable for our customers.  Before we choose a
kernel for a new release we run a lot of testing on a range of
candidates and pick the one that seems to have the least problems.  Then
we work to identify and fix the problems that are most likely to affect
our customers.

There is no reason that lustre vendors couldn't use and benefit from the
same model.  New work goes upstream, Backport the bits needed by your
customers to the kernels that your customers are using.

It isn't clear to me that the "Community edition" of lustre needs to
support older kernels at all (though I don't object to that).  Each
vendor can choose the kernel or kernels that they want to support and
select the relevant patches from upstream to make it fit their needs.
Bugfixes should be easy as they should be tagged as bug fixes.  Features
might be a little harder but not enormously so.

Thanks,
NeilBrown


> 
> 
> 
>     The lustre community may well choose to host and share those
>     backports,
>     and maybe even include them in testing.  But I suspect that would
>     be
>     driven by vendors who sell support.  It certainly wouldn't be
>     imposed by
>     the upstream community.
>    
>     Exactly how we work with distros like Redhat, SUSE, Ubuntu would
>     depend
>     on what can be negotiated with them.
>     Some might be willing to accept backports and release them in
>     maintenance updates.  Some might not.
>     In that case the way to support their kernel for your customers
>     would be
>     to start with the source for a particular maint update, add the
>     missing
>     patches, build, and distribute the result.
>     You probably would only need to do this for each service-pack, not
>     for
>     each update.
>    
>     It isn't really different from what is done today, but it would be
>     done
>     in a different way.
>    
>     Thanks,
>     NeilBrown
> 
> 
> 
> 
> 

_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-01-24 19:23           ` Alexey Lyahkov
@ 2025-01-29 19:00             ` Day, Timothy
  2025-01-29 19:32               ` Alexey Lyahkov
  0 siblings, 1 reply; 61+ messages in thread
From: Day, Timothy @ 2025-01-29 19:00 UTC (permalink / raw)
  To: Alexey Lyahkov; +Cc: lustre-devel@lists.lustre.org

>>>> That's why we'll
>>>> still support running latest Lustre on older distros. Specifically, it'll be the Lustre
>>>> code from a mainline kernel combined with our lustre_compat/ compatibility
>>>> code. So normal Lustre releases will be derived directly from the in-tree kernel
>>>> code. This provides a path for vendors to deploy bug fixes, custom features, and
>>>> allows users to optionally run the latest and greatest Lustre code.
>>>
>>> And oops - both codebases (in-kernel and out-of-tree) have the same sort of defines in config.h, which conflict when building an out-of-tree Lustre.
>>> Some examples of the MOFED hacks used to solve the same problem can be seen in o2iblnd:
>>>>>>
>>> #if defined(EXTERNAL_OFED_BUILD) && !defined(HAVE_OFED_IB_DMA_MAP_SG_SANE)
>>> #undef CONFIG_INFINIBAND_VIRT_DMA
>>> #endif
>>>>>>
>>> As I remember, this problem broke the ability to build Lustre as an out-of-tree module on Ubuntu 18.04 while Lustre was in staging/.
>>
>> I think we should be able to validate that Lustre still builds as an
>> out-of-tree module by re-using a lot of the testing we already
>> do today in Jenkins/Maloo.
>
> Yes, I do. But it needs many extra resources. Is Amazon ready to provide such HW resources for it?
> Or who will pay for it? That is the cost of moving into the kernel.

I suppose I disagree that this testing requires many extra
resources. This just validates the same things we validate
today (i.e. that Lustre is functional on RHEL kernels). But the
build process looks different.

>> All we'd need to do is kick off test/build
>> sessions once the merge window closes. Based on the MOFED
>> example you gave, it seems like this is solvable.
>
> Sure, all of it can be solved. But what is the cost of this, and the cost of supporting these changes?
> And the next question - who will pay for this cost? Who will provide the HW for the extra testing?
> So the second face of "no cost for kernel API changes" is the problem of backporting these changes and the extra testing.

I don't think the backporting will be more burdensome
than porting Lustre to new kernels. And we don't have to
urgently backport each upstream release to older kernels.

>>>>
>>>> [1] Lustre changelog: https://git.whamcloud.com/?p=fs/lustre-release.git;a=blob_plain;f=lustre/ChangeLog;hb=HEAD
>>>>
>>>>> If this is not enough - let's have one more. The kernel API isn't stable enough, so a large amount of resources will need to be spent to handle each kernel change in Lustre. Currently, that work happens in the background and doesn't interrupt the primary work of supporting and developing new Lustre features.
>>>>>
>>>>> So those are the problems for the Lustre world - what are the benefits?
>>>>
>>>> By upstreaming Lustre, we'll benefit from developers updating the kernel
>>>> API "for free".
>>>> It's not "for free" - do you really think any of the kernel developers have a cluster to run a Lustre client on to test their changes?
>>>> I think not, so testing will be "just compile with the proposed/default config".
>>>> Since that would be a lack of proper testing (remember, a full run of the Lustre test suite takes ~12-24h) - Lustre developers would need to review each change to the Lustre code.
>>
>> That's why I put "for free" in quotes. We need to make it easier for
>> upstream developers to test their changes so they don't completely
>> break Lustre.
>
>Ah.. so Lustre will have a vote to stop anything landing in the kernel until Lustre testing is done?
>Do you understand how many tests will need to run?
>Full testing needs ~24h of run time for a single node.
>How many HW resources can Amazon share to run these tests?

We can't stop vendors from breaking Lustre with kernel updates
either. This seems to happen with some regularity in my
experience [1].

[1] Recent example with sockets: https://review.whamcloud.com/c/fs/lustre-release/+/56737

>Do you understand - if the Lustre code is changed by someone upstream, that change may not be backportable to the main tree because the compatibility code can't handle it?
>Sometimes we need to stay with old behavior that has been re-implemented with new kernel code.

I'm not sure what you mean. We can't backport a change
because compatibility code can’t handle it? So we have to
re-implement old behavior with compatibility code? Do you
have a specific example?

>> If we upstream the client and server concurrently, we
>> can implement xfstests support [1]. This would provide at least basic
>> validation. NFS does something similar. We could even copy over a
>> subset of Lustre specific tests from sanity.sh into xfstests.
>
> The NFS server doesn't have many of Lustre's features, and it isn't expected to be built as an out-of-tree module for different kernels.
>
>> It's not perfect - but it'd be a much better situation compared to the
>> previous attempt in staging.
>>
>>> [1] https://github.com/kdave/xfstests
>
> I'm sorry, but these are very simple test cases. Lustre is a much more complex FS.

Yeah, I know. But we can easily enough replicate "Test-Parameters: trivial"
with xfstests. It's something I plan to do. Ideally I'll be able to
draft up something before LSF.
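
As a sketch, the rough equivalent of a trivial run would be the quick
group - assuming a working Lustre local.config, which is exactly the
part that still has to be written:

    # Run only xfstests' quick smoke-test group against the
    # configured test and scratch filesystems.
    ./check -g quick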

>>> And all of these changes need to be backported to the out-of-tree version, since the Lustre side needs changes too.
>>> The best example is 'folio' - that needed changes on both sides.
>>
>> If the out-of-tree version is derived from the in-tree version of
>> Lustre - I don't think the backporting will be that burdensome.
>> We're essentially doing the same work now, but in reverse. Instead
>> of porting an upstream driver to old kernels, we are porting an
>> older driver to new kernels.
>
> Except for some notes.
> 1) Lustre release cycles. Right now they are not aligned with the kernel's. There is no situation today where a senior developer must stop their own work to review kernel changes because they might affect Lustre's stability. But with Lustre in the kernel, any change in the kernel affects Lustre - and needs urgent review / testing.
> So extra developer positions/HW are needed.

Changes to Lustre itself can be delayed (to some extent) until
reviewers have time to review. And if we provide some easy way
for developers to test their own changes, the demand on our
side to test everything will lessen, IMO.

> 2) There is no problem with having custom patches upstream.
> Someone may think something in the Lustre code needs cleaning up, and that patch will be accepted.
> That generates a conflict with code changed in the same place in the main Lustre repository.
> Moving the whole of Lustre development into the kernel is not possible because there is no server part, but servers sometimes carry "client" code on their own side.
>
> Not such a small cost for "updates for free"?

Ideally, both client and server will go upstream together. Then
we don't have to deal with client/server separation issues.

In another thread, you mention that Lustre is primarily used with
older kernels. While that's definitely true for many sectors, in my
experience - the demand for the latest kernel is robust and the
production usage of 6.x series kernels (with Lustre) is real. If no
one was using Lustre with up-to-date kernels - I'd be less enthusiastic
about upstreaming Lustre. But that's not the case.

Tim Day

_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-01-29 19:00             ` Day, Timothy
@ 2025-01-29 19:32               ` Alexey Lyahkov
  2025-02-01 22:58                 ` NeilBrown
  0 siblings, 1 reply; 61+ messages in thread
From: Alexey Lyahkov @ 2025-01-29 19:32 UTC (permalink / raw)
  To: Day, Timothy; +Cc: lustre-devel@lists.lustre.org



> On 29 Jan 2025, at 22:00, Day, Timothy <timday@amazon.com> wrote:
> 
>>>>> That's why we'll
>>>>> still support running latest Lustre on older distros. Specifically, it'll be the Lustre
>>>>> code from a mainline kernel combined with our lustre_compat/ compatibility
>>>>> code. So normal Lustre releases will be derived directly from the in-tree kernel
>>>>> code. This provides a path for vendors to deploy bug fixes, custom features, and
>>>>> allows users to optionally run the latest and greatest Lustre code.
>>>> 
>>>> And oops - both codebases (in-kernel and out-of-tree) have the same sort of defines in config.h, which conflict when building an out-of-tree Lustre.
>>>> Some examples of the MOFED hacks used to solve the same problem can be seen in o2iblnd:
>>>>>>> 
>>>> #if defined(EXTERNAL_OFED_BUILD) && !defined(HAVE_OFED_IB_DMA_MAP_SG_SANE)
>>>> #undef CONFIG_INFINIBAND_VIRT_DMA
>>>> #endif
>>>>>>> 
>>>> As I remember, this problem broke the ability to build Lustre as an out-of-tree module on Ubuntu 18.04 while Lustre was in staging/.
>>> 
>>> I think we should be able to validate that Lustre still builds as an
>>> out-of-tree module by re-using a lot of the testing we already
>>> do today in Jenkins/Maloo.
>> 
>> Yes, I do. But it needs many extra resources. Is Amazon ready to provide such HW resources for it?
>> Or who will pay for it? That is the cost of moving into the kernel.
> 
> I suppose I disagree that this testing requires many extra
> resources. This just validates the same things we validate
> today (i.e. that Lustre is functional on RHEL kernels). But the
> build process looks different.
> 
Ah. So you don't expect to do any performance testing?
Performance testing needs, at minimum, a 20-node cluster with an IB HDR network (400G) and an E1000 with NVMe drives.
Otherwise the servers / network will be the bottleneck.
And a week or so of load to be sure no regression exists. Some problems can only be found after 48h of continuous load.
And that is minimal performance testing.
I'm not even talking about scale testing with 100+ client nodes.
Do you think we need to drop it? If not - who will provide the HW for such testing?


>>> All we'd need to do is kick off test/build
>>> sessions once the merge window closes. Based on the MOFED
>>> example you gave, it seems like this is solvable.
>> 
>> Sure, all of it can be solved. But what is the cost of this, and the cost of supporting these changes?
>> And the next question - who will pay for this cost? Who will provide the HW for the extra testing?
>> So the second face of "no cost for kernel API changes" is the problem of backporting these changes and the extra testing.
> 
> I don't think the backporting will be more burdensome
> than porting Lustre to new kernels. And we don't have to
> urgently backport each upstream release to older kernels.
Neil B. says we need to move all development to the mainline. That means the upstream kernel will be the same as the 'master' branch is now.
So each change needs to be backported to older kernels to stay in sync with the server work and be ready for a Lustre release.
Otherwise we will have a ton of changes that need backporting for each Lustre release.
I see no difference from porting to upstream, except that porting from mainline to old kernels has to be handled ASAP to avoid delaying a Lustre release, while porting to mainline can be delayed as it is not critical for customers.

> 
>>>>> 
>>>>> [1] Lustre changelog: https://git.whamcloud.com/?p=fs/lustre-release.git;a=blob_plain;f=lustre/ChangeLog;hb=HEAD
>>>>> 
>>>>>> If this is not enough - let's have one more. The kernel API isn't stable enough, so a large amount of resources will need to be spent to handle each kernel change in Lustre. Currently, that work happens in the background and doesn't interrupt the primary work of supporting and developing new Lustre features.
>>>>>>
>>>>>> So those are the problems for the Lustre world - what are the benefits?
>>>>> 
>>>>> By upstreaming Lustre, we'll benefit from developers updating the kernel
>>>>> API "for free".
>>>> It's not "for free" - do you really think any of the kernel developers have a cluster to run a Lustre client on to test their changes?
>>>> I think not, so testing will be "just compile with the proposed/default config".
>>>> Since that would be a lack of proper testing (remember, a full run of the Lustre test suite takes ~12-24h) - Lustre developers would need to review each change to the Lustre code.
>>> 
>>> That's why I put "for free" in quotes. We need to make it easier for
>>> upstream developers to test their changes so they don't completely
>>> break Lustre.
>> 
>> Ah.. so Lustre will have a vote to stop anything landing in the kernel until Lustre testing is done?
>> Do you understand how many tests will need to run?
>> Full testing needs ~24h of run time for a single node.
>> How many HW resources can Amazon share to run these tests?
> 
> We can't stop vendors from breaking Lustre with kernel updates
> either. This seems to happen with some regularity in my
> experience [1].
> 
> [1] Recent example with sockets: https://review.whamcloud.com/c/fs/lustre-release/+/56737
> 
Kernel updates happen much more rarely than updates in the kernel mainline, and are much more controlled.


>> Do you understand - if the Lustre code is changed by someone upstream, that change may not be backportable to the main tree because the compatibility code can't handle it?
>> Sometimes we need to stay with old behavior that has been re-implemented with new kernel code.
> 
> I'm not sure what you mean. We can't backport a change
> because compatibility code can’t handle it? So we have to
> re-implement old behavior with compatibility code? Do you
> have a specific example?
> 
>>> If we upstream the client and server concurrently, we
>>> can implement xfstests support [1]. This would provide at least basic
>>> validation. NFS does something similar. We could even copy over a
>>> subset of Lustre specific tests from sanity.sh into xfstests.
>> 
>> The NFS server doesn't have many of Lustre's features, and it isn't expected to be built as an out-of-tree module for different kernels.
>> 
>>> It's not perfect - but it'd be a much better situation compared to the
>>> previous attempt in staging.
>>> 
>>> [1] https://github.com/kdave/xfstests
>> 
>> I'm sorry, but these are very simple test cases. Lustre is a much more complex FS.
> 
> Yeah, I know. But we can easily enough replicate "Test-Parameters: trivial"
> with xfstests. It's something I plan to do. Ideally I'll be able to
> draft up something before LSF.
> 
And kill Lustre code quality completely by removing a large amount of testing.
Did you know that "Test-Parameters: trivial" shouldn't be used except for compile-time-only changes?
It looks like you really mean to kill Lustre and add more and more obstacles to creating a good product.


>>>> And all of these changes need to be backported to the out-of-tree version, since the Lustre side needs changes too.
>>>> The best example is 'folio' - that needed changes on both sides.
>>> 
>>> If the out-of-tree version is derived from the in-tree version of
>>> Lustre - I don't think the backporting will be that burdensome.
>>> We're essentially doing the same work now, but in reverse. Instead
>>> of porting an upstream driver to old kernels, we are porting an
>>> older driver to new kernels.
>> 
>> Except for some notes.
>> 1) Lustre release cycles. Right now they are not aligned with the kernel's. There is no situation today where a senior developer must stop their own work to review kernel changes because they might affect Lustre's stability. But with Lustre in the kernel, any change in the kernel affects Lustre - and needs urgent review / testing.
>> So extra developer positions/HW are needed.
> 
> Changes to Lustre itself can be delayed (to some extent) until
> reviewers have time to review. And if we provide some easy way
> for developers to test their own changes, the demand on our
> side to test everything will lessen, IMO.
So Lustre customers should wait, and Lustre needs to take on more HW for testing. OK. What is the benefit?

> 
>> 2) There is no problem with having custom patches upstream.
>> Someone may think something in the Lustre code needs cleaning up, and that patch will be accepted.
>> That generates a conflict with code changed in the same place in the main Lustre repository.
>> Moving the whole of Lustre development into the kernel is not possible because there is no server part, but servers sometimes carry "client" code on their own side.
>>
>> Not such a small cost for "updates for free"?
> 
> Ideally, both client and server will go upstream together. Then
> we don't have to deal with client/server separation issues.
> 
> In another thread, you mention that Lustre is primarily used with
> older kernels. While that's definitely true for many sectors, in my
> experience - the demand for the latest kernel is robust and the
> production usage of 6.x series kernels (with Lustre) is real.
I would say it differently - 6.x kernels are used on very small installations, where a failure is not critical.
But the TOP500 systems and some oil companies' systems use an RHEL kernel.


Alex
_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-01-25  6:40           ` Oleg Drokin
@ 2025-02-01 22:19             ` NeilBrown
  2025-02-01 23:25               ` Oleg Drokin
  0 siblings, 1 reply; 61+ messages in thread
From: NeilBrown @ 2025-02-01 22:19 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: lustre-devel@lists.lustre.org

On Sat, 25 Jan 2025, Oleg Drokin wrote:
> On Sat, 2025-01-25 at 10:12 +1100, NeilBrown wrote:
> > On Mon, 20 Jan 2025, Oleg Drokin wrote:
> > > On Sun, 2025-01-19 at 09:48 +1100, NeilBrown wrote:
> > > > 
> > > > Once the transition completes there will still be process
> > > > difficulties,
> > > > but there are plenty of process difficulties now (gerrit: how
> > > > do I
> > > > hate thee, let me count the ways...) but people seem to simply
> > > > include
> > > > that in the cost of doing business.
> > > 
> > > it's been awhile since I did patch reviews by emails, but I think
> > > gerrit is much more user-friendly (if you have internet, anyway)
> > 
> > I guess it isn't exactly the gerrit interface but more the workflow
> > that
> > it encourages, or at least enables.
> > The current workflow seems to be "patch at a time" rather than
> > "patchset
> > at a time".
> > The fact that you cherry-pick patches into master is completely
> > different to how most (all?) of the upstream community works.  It
> > means
> > that whole series isn't visible in the final git tree so we lose
> > context.  And it seems to mean that a long series takes a loooooong
> > time
> > > to land as it dribbles into master.
> 
> In fact the whole series is visible in gerrit if you submit it as such.

True, but not as helpful as I might like.  I cannot see a way to add
someone as a reviewer for a whole series, and there is no way for them
to then give a +1 for the whole series.  These are trivial actions when
using unstructured email.

> But as you noted later, the testing is unfortunately much less reliable
> than we want, and because we only land things in the order they are
> submitted in the series - if you bunch unrelated stuff all together,
> suddenly it might take much longer to land.

Certainly only land them in the order they appear in the series, but my
memory is that even when they are related they don't
all land in master at once.  This might be due to the constant fight
against false positives with testing, and the need to get people to
re-review every patch when a small update is needed in an earlier patch.
But it seems to be more than that.

> 
> > I would MUCH rather that a whole series was accepted or rejected as a
> > whole - and was merged rather than cherry-picked to keep commit ids
> > stable.
> 
> Gerrit has such a mode, but we decided it does not work well for us for
> a variety of reasons.

I wonder if it might be time to review those reasons, particularly as we
need to review a lot of process if we are to move toward working
upstream.
We don't *have* to follow what other subsystems do (and they aren't all
the same), but we would want to have clear reasons, understood by all,
for deviating significantly.

> 
> > There are times when I would like the first few patches of a series
> > to
> > land earlier, but that should be up to the submitter to split the
> > series. 
> 
> But you cannot if these later patches do depend on the earlier ones?

Only because gerrit cannot cope.
Using email, I might submit a set of 10 patches, get some discussion,
decide that the first 5 really are good-to-go but the later ones need
some work. So I resubmit the first 5.  They can easily be applied by the
maintainer.
With gerrit I could "abandon" the latter 5 but I might not want to do
that - I might want to revise.  But probably abandoning would be ok.
Still not quite as easy as simply not resubmitting them.


> 
> > And the automatic testing is a real pain.  Certainly it is valuable
> > but
> > it has a real cost too.  The false positives are a major pain.  I
> > would
> > rather any test that wasn't reliable were disabled (or fixed) as a
> > priority.  Or at least made non-fatal.
> > Also, it would be much nicer if the last in a series were tested
> > first
> > and if that failed then don't waste resources testing all the
> > others.
> > Bonus points for a "bisect" to find where the failure starts, but
> > that
> > can be up to the developer to explicitly request testing at some points
> > in
> > the series.
> 
> Yes, I agree there's much to be improved testing-wise. It's not as if I
> came up with my own parallel testing system because I was happy with
> the default one, after all.
> And then my own system deteriorated (at least it does not set -1s left
> and right, though that means people totally ignore those results too).
> 

I think your parallel system that adds comments to gerrit is sometimes
very helpful.  What bothers me is that you have another test setup that
you run on master-next beforehand and won't land things until that passes.
It means that I can pass all the auto testing and get all the reviews
and still there is a question mark over if/when it will land.  For me
that created a sense of helplessness that was quite demotivating.
Obviously it is good to find bugs, and those bugs need to be fixed, but
they don't need to prevent landing.
I would feel more motivated if there were a clear process that was
followed most of the time whereby once a series passed auto-testing and
had sufficient reviews it would be expected to land e.g.  the next
Monday.  Maybe it doesn't land in master, maybe in something else that
gets merged after a stabilisation window, but it should land.

If asynchronous testing then reports a problem, that needs to be
addressed.  Worst case, the patch might be reverted, but that would be
rare.  Often a small fix will make the problem go away.

Thanks,
NeilBrown
_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-01-29 19:32               ` Alexey Lyahkov
@ 2025-02-01 22:58                 ` NeilBrown
  2025-02-01 23:23                   ` NeilBrown
                                     ` (2 more replies)
  0 siblings, 3 replies; 61+ messages in thread
From: NeilBrown @ 2025-02-01 22:58 UTC (permalink / raw)
  To: Alexey Lyahkov; +Cc: lustre-devel@lists.lustre.org

On Thu, 30 Jan 2025, Alexey Lyahkov wrote:
> 
> 
> > On 29 Jan 2025, at 22:00, Day, Timothy <timday@amazon.com> wrote:
> > 
> >>>>> That's why we'll
> >>>>> still support running latest Lustre on older distros. Specifically, it'll be the Lustre
> >>>>> code from a mainline kernel combined with our lustre_compat/ compatibility
> >>>>> code. So normal Lustre releases will be derived directly from the in-tree kernel
> >>>>> code. This provides a path for vendors to deploy bug fixes, custom features, and
> >>>>> allows users to optionally run the latest and greatest Lustre code.
> >>>> 
> >>>> And OOPS. Both codebases (in-kernel and out-of-tree) have the same sort of defines in config.h, which conflict when building the out-of-tree Lustre.
> >>>> Some examples for MOFED hacks to solve same problem you can see in the o2iblnd:
> >>>>>>> 
> >>>> #if defined(EXTERNAL_OFED_BUILD) && !defined(HAVE_OFED_IB_DMA_MAP_SG_SANE)
> >>>> #undef CONFIG_INFINIBAND_VIRT_DMA
> >>>> #endif
> >>>>>>> 
> >>>> As I remember, this problem broke the ability to build Lustre as an out-of-tree module on Ubuntu 18.06 with Lustre in staging/.
> >>> 
> >>> I think we should be able to validate that Lustre still builds as an
> >>> out-of-tree module by re-using a lot of the testing we already
> >>> do today in Jenkins/Maloo.
> >> 
> >> Yes. We do. But it needs many extra resources. Is Amazon ready to provide such HW resources for it?
> >> Or who will pay for it? It’s the cost of moving to the kernel.
> > 
> > I suppose I disagree that this testing requires many extra
> > resources. This just validates the same things we validate
> > today (i.e. that Lustre is functional on RHEL kernels). But the
> > build process looks different.
> > 
> Ah. So you don’t expect to do any performance testing?
> Performance testing needs a 20-node cluster with an IB HDR network (400G) and an E1000 with NVMe drives as a minimum.
> Otherwise the servers / network will be the bottleneck.
> And a week or so of load to be sure no regression exists. Some problems can be found only with 48h of continuous load.
> And that is minimal performance testing.
> I’m not even talking about scale testing with 100+ client nodes.
> Do you think we need to drop it? If not - who will provide HW for such testing?

We at SUSE have a performance team.  We do some testing on upstream
because that guards our future, but (I believe) we do most testing on
our own releases to ensure we don't regress and to find problems before
our customers do.  The key observation is that Linus' upstream kernel
doesn't have to be perfect.  There are regressions all the time.  That
is why we have the -stable trees.  That is how distros like SUSE and
Red Hat make money.

Yes, we want to be doing correctness and performance testing on
mainline, but we don't need to ensure we block any regressions.  We only
need to eventually find regressions and then fix them.  Hopefully we
find and fix before the regression gets to any of our customers (though
in reality our customers find quite a few of our regressions).

> 
> 
> >>> All we'd need to do is kick off test/build
> >>> sessions once the merge window closes. Based on the MOFED
> >>> example you gave, it seems like this is solvable.
> >> 
> >> Sure, all of this can be solved. But what is the cost of this, and the cost of supporting these changes?
> >> And the next question - who will pay for this? Who will provide the HW for extra testing?
> >> So the other face of “no cost for kernel API changes” is the problem of backporting these changes and the extra testing.
> > 
> > I don't think the backporting will be more burdensome
> > than porting Lustre to new kernels. And we don't have to
> > urgently backport each upstream release to older kernels.
> Neil B. says we need to move all development to mainline.  That
> means the upstream kernel will be the same as the ‘master’ branch is now.
> So each change needs to be backported to older kernels to stay in sync
> with the server work and be ready for a Lustre release.
> Otherwise we will have a ton of changes that need to be backported for
> each Lustre release.
> I see no difference from porting to upstream, except that this porting
> from mainline to old kernels has to be handled ASAP to avoid delaying a
> Lustre release, while porting to mainline may be delayed as it is not
> critical for customers.

Porting to upstream doesn't work.  The motivation isn't strong enough
and people leave it then forget it and you get too much divergence and
it becomes harder so people do it even less.  People have tried.  People
have failed.

Backporting from upstream to an older kernel isn't that hard.  I do a
lot of it and with the right tools it is mostly easy.  One of the
biggest difficulties is when we try to backport only a selection of
patches because we might miss an important dependency.  Sometimes it is
worth it to avoid churn, sometimes it is best to apply everything
relevant.  I assume that for the selection of kernels that whamcloud (or
whoever) want to support, they would backport everything that could
apply.  I think that would be largely mechanical.

Maybe it would be good for me to paint a more detailed picture of what I
imagine would happen - assuming we do take the path of landing all of
lustre, both client and server, upstream.

- we would change the kernel code in lustre-release so that it was
  exactly what we plan to submit upstream.
- we submit it and once accepted we have identical code in upstream
  linux and lustre-release
- we fork lustre-release to a new package called (e.g.) lustre-tools and 
  remove all kernel code leaving just utils and documentation and 
  test code.  The kinode.c kernel module that is in lustre/tests/kernel/
  would need to go upstream with the rest of the kernel code I think.
  lustre-tools would be easily accessible and buildable by anyone who
  wants to test lustre
- we fork lustre-release to another new package lustre-backports
  and remove all non-kernel code from there.  We configure it to build
  out-of-tree modules with names like "backport-lustre" and "backport-lnet"
  and provide modprobe.conf files that alias the standard names to
  these.  That should allow the distro-shipped modules (if any) to be
  over-ridden when people choose to use backports.
- upstream commits which touch lustre or lnet are automatically added to
  lustre-backports and someone is notified to help when they don't apply
  
With this:
 Anyone who wants to test or use the lustre included with a particular
 kernel can do so with only the lustre-tools package.  Anyone who
 wants to use the latest lustre code with an older kernel can build and
 use lustre-backports.

There are probably rough-edges with this but I suspect they can be filed
down.
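
For illustration only - the aliasing might look something like the
following (module names here are just the sketch above, with ksocklnd
following the same pattern):

  # /etc/modprobe.d/lustre-backports.conf (hypothetical)
  # Make the standard module names resolve to the backported modules,
  # so the backports are used in place of any distro-shipped ones.
  alias lustre backport-lustre
  alias lnet backport-lnet
  alias ksocklnd backport-ksocklnd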

thanks,
NeilBrown

_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-02-01 22:58                 ` NeilBrown
@ 2025-02-01 23:23                   ` NeilBrown
  2025-02-02  7:33                   ` Alexey Lyahkov
  2025-02-03 17:33                   ` Day, Timothy
  2 siblings, 0 replies; 61+ messages in thread
From: NeilBrown @ 2025-02-01 23:23 UTC (permalink / raw)
  To: Alexey Lyahkov; +Cc: lustre-devel@lists.lustre.org

On Sun, 02 Feb 2025, NeilBrown wrote:
> 
> Maybe it would be good for me to paint a more detailed picture of what I
> imagine would happen - assuming we do take the path of landing all of
> lustre, both client and server, upstream.
> 
> - we would change the kernel code in lustre-release so that it was
>   exactly what we plan to submit upstream.
> - we submit it and once accepted we have identical code in upstream
>   linux and lustre-release
> - we fork lustre-release to a new package called (e.g.) lustre-tools and 
>   remove all kernel code leaving just utils and documentation and 
>   test code.  The kinode.c kernel module that is in lustre/tests/kernel/
>   would need to go upstream with the rest of the kernel code I think.
>   lustre-tools would be easily accessible and buildable by anyone who
>   wants to test lustre
> - we fork lustre-release to another new package lustre-backports
>   and remove all non-kernel code from there.  We configure it to build
>   out-of-tree modules with names like "backport-lustre" and "backport-lnet"
>   and provide modprobe.conf files that alias the standard names to
>   these.  That should allow the distro-shipped modules (if any) to be
>   over-ridden when people choose to use backports.
> - upstream commits which touch lustre or lnet are automatically added to
>   lustre-backports and someone is notified to help when they don't apply
>   
> With this:
>  Anyone who wants to test or use the lustre included with a particular
>  kernel can do so with only the lustre-tools package.  Anyone who
>  wants to use the latest lustre code with an older kernel can build and
>  use lustre-backports.
> 

Sorry, I completely forgot the other part of the picture.

We would have a git tree with upstream linux and branches called
"lustre-next" and "lustre-fixes" and "lustre-testing" or whatever the
lead maintainers preferred.
Developers would be encouraged to work against lustre-next (or one of the
others when necessary) and submit patches or pull-requests against that.
Patches destined for the next upstream merge window would be pulled
into lustre-next once they have enough reviews and verification.  Bug
fixes that need to go before the merge window would go to lustre-fixes.
Any patches that are reasonably mature might be merged into
lustre-testing which is where the more heavy-weight testing might focus.
lustre-testing would likely be rebased often.
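
A developer's day-to-day flow against that tree might then be as simple
as (the repo URL is of course hypothetical):

  git clone https://git.example.org/lustre/linux.git
  cd linux
  git checkout -b my-feature lustre-next
  # ... hack, commit, test ...
  git format-patch lustre-next   # mail the series, or open a pull request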

Obviously if people want to develop features against exactly the kernel
their paying customer is using that would be fine.  But the features
would need to go to lustre-next and then upstream before they could land
in the lustre-backports release.

One consequence of all this that is worth highlighting is that when a
feature needs changes to the kernel and to tools, it will need separate
patches for separate packages.  I think we already aim to ensure new
tools work with old kernels and vice-versa.  Having this split would
help keep the focus on that.

NeilBrown
_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-02-01 22:19             ` NeilBrown
@ 2025-02-01 23:25               ` Oleg Drokin
  2025-02-03 17:24                 ` Day, Timothy
  0 siblings, 1 reply; 61+ messages in thread
From: Oleg Drokin @ 2025-02-01 23:25 UTC (permalink / raw)
  To: neilb@suse.de; +Cc: lustre-devel@lists.lustre.org

On Sun, 2025-02-02 at 09:19 +1100, NeilBrown wrote:
> On Sat, 25 Jan 2025, Oleg Drokin wrote:
> > On Sat, 2025-01-25 at 10:12 +1100, NeilBrown wrote:
> > > On Mon, 20 Jan 2025, Oleg Drokin wrote:
> > > > On Sun, 2025-01-19 at 09:48 +1100, NeilBrown wrote:
> > > > > 
> > > > > Once the transition completes there will still be process
> > > > > difficulties,
> > > > > but there are plenty of process difficulties now (gerrit:
> > > > > how
> > > > > do I
> > > > > hate thee, let me count the ways...) but people seem to
> > > > > simply
> > > > > include
> > > > > that in the cost of doing business.
> > > > 
> > > > it's been a while since I did patch reviews by email, but I
> > > > think
> > > > gerrit is much more user-friendly (if you have internet,
> > > > anyway)
> > > 
> > > I guess it isn't exactly the gerrit interface but more the
> > > workflow
> > > that
> > > it encourages, or at least enables.
> > > The current workflow seems to be "patch at a time" rather than
> > > "patchset
> > > at a time".
> > > The fact that you cherry-pick patches into master is completely
> > > different to how most (all?) of the upstream community works.  It
> > > means
> > > that the whole series isn't visible in the final git tree so we lose
> > > context.  And it seems to mean that a long series takes a
> > > loooooong
> > > time
> > > to land as it dribbles in to master.
> > 
> > In fact the whole series is visible in gerrit if you submit it as
> > such.
> 
> True, but not as helpful as I might like.  I cannot see a way to add
> someone as a reviewer for a whole series, and there is no way for
> them
> to then give a +1 for the whole series.  These are trivial actions
> when
> using unstructured email.

But on the other hand it's a bit easier to assume they have seen all
the patches at least superficially instead of rubberstamping the whole
series for whatever reason (I remember Linus complaining about people
just rubberstamping rebased series that obviously did not work).

> > But as you noted later the testing is unfortunately much less
> > reliable
> > than we want, and because we only land things in order they are
> > submitted in the series - if you bunch unrelated stuff all together
> > suddenly it might take much longer to land.
> 
> Certainly they should only land in the order they appear in the series,
> but my
> recollection is that even when they are related they don't
> all land in master at once.  This might be due to the constant fight
> against false positives with testing, and the need to get people to
> re-review every patch when a small update is needed in an earlier
> patch.
> But it seems to be more than that.

Basically what could happen is you submit a long series of, say, 20
patches. Patch 7 fails testing but the first 6 patches are fine.
So the "list of patches ready to land" would only show the first 6 and
they might get landed (sure, it's not the whole series, which might be
less ideal, but as we ensure nothing breaks individually - not
fatal?)

> > > I would MUCH rather that a whole series was accepted or rejected
> > > as a
> > > whole - and was merged rather than cherry-picked to keep commit
> > > ids
> > > stable.
> > 
> > Gerrit has such a mode, but we decided it does not work well for us
> > for
> > a variety of reasons.
> 
> I wonder if it might be time to review those reasons, particularly as
> we
> need to review a lot of process if we are to move toward working
> upstream.
> We don't *have* to follow what other subsystems do (and they aren't
> all
> the same), but we would want to have clear reasons, understood by
> all,
> for deviating significantly.

One of the bigger problems I remember was merges. You merge a series of
patches, and the merge commit itself is not empty, but might contain
"invisible" unreviewed code.
The other is all those automatically added commit bits - where the
patch was reviewed, by whom, and so on - you cannot do this with a merge
commit.
 
> > > There are times when I would like the first few patches of a
> > > series
> > > to
> > > land earlier, but that should be up to the submitter to split the
> > > series. 
> > 
> > But you cannot if these later patches do depend on the earlier
> > ones?
> 
> Only because gerrit cannot cope.
> Using email, I might submit a set of 10 patches, get some discussion,
> decide that the first 5 really are good-to-go but the later ones need
> some work. So I resubmit the first 5.  They can easily be applied by
> the
> maintainer.

This already works perfectly and in fact even in unintended ways as you
noted earlier.

> With gerrit I could "abandon" the latter 5 but I might not want to do
> that - I might want to revise.  But probably abandoning would be ok.
> Still not quite as easy as simply not resubmitting them.

I think it's the other way around that requires resubmitting - when you
abandon the first 5 patches. (You don't have to resubmit the latter 5
patches, but there needs to be some way of communicating to the
gatekeeper that it's ok that these patches have unmet dependencies,
because the automatic tooling does not have this knowledge (and the
tooling is not even in gerrit, it's homegrown stuff; gerrit would
happily let you land out-of-order patches even when they totally break
everything, as long as they cleanly merge).)

> > > And the automatic testing is a real pain.  Certainly it is
> > > valuable
> > > but
> > > it has a real cost too.  The false positives are a major pain.  I
> > > would
> > > rather any test that wasn't reliable were disabled (or fixed) as
> > > a
> > > priority.  Or at least made non-fatal.
> > > Also, it would be much nicer if the last in a series were tested
> > > first
> > > and if that failed then don't waste resources testing all the
> > > others.
> > > Bonus points for a "bisect" to find where the failure starts, but
> > > that
> > > can be up to the developer to explicitly request testing at some
> > > points
> > > in
> > > the series.
> > 
> > Yes, I agree there's much to be improved testing-wise. It's not as
> > if I
> > came up with my own parallel testing system because I was happy with
> > the default one, after all.
> > And then my own system deteriorated (at least it does not set -1s
> > left
> > and right, though that means people totally ignore those results
> > too).
> > 
> 
> I think your parallel system that adds comments to gerrit is
> sometimes
> very helpful.  What bothers me is that you have another test setup
> that
> you run on master-next beforehand and won't land things until that
> passes.

That's integration testing to make sure all patches work nicely together;
that's why the whole -next thing exists in Linux as well.

Now I do have an additional testsuite for master-next - the boilpot.
I fully realize rejecting things late is super expensive for everybody,
and that's why I do all the automatic feedback I can on
individual patches super early on (the gerrit janitor and such).

But the boilpot is expensive. It takes days to weeks of intensive load
testing. While I'd love to subject every patch to this stress testing,
I only have one (now two) systems capable of this.

> It means that I can pass all the auto testing and get all the reviews
> and still there is a question mark over if/when it will land.  For me
> that created a sense of helplessness that was quite demotivating.
> Obviously it is good to find bugs, and those bugs need to be fixed,
> but
> they don't need to prevent landing.

Landing known buggy code sounds counter-productive. It increases the
number of failed tests and that makes new breakage harder to see and
hides some further breakage that would happen later in the already
broken codepath but ends up not being reached due to the earlier known
problem.

The other problem (you also see it in spades with static code analysis)
is that once the buggy patch has landed, the motivation on the part of
the original developer suddenly wanes as they have other things to move
to now (not always the case, of course, but a very real concern).
Sure, if there's no action for some time the patch could be reverted,
but that's even more expensive, esp. as there could be other patches
building up on top, preventing clean reverts and such.

> I would feel more motivated if there were a clear process that was
> followed most of the time whereby once a series passed auto-testing
> and
> had sufficient reviews it would be expected to land e.g.  the next
> Monday.  Maybe it doesn't land in master, maybe in something else
> that
> gets merged after a stabilisation window, but it should land.

that's master-next, right? Granted we don't do specific stabilization
there and I do not have a good solution for predictable boilpot runs.

If you want to avoid surprises for your patches, I can publish my
boilpot scripts and you can run your own instance if you have the
hardware.
Or we can find some sponsors to have some sort of a shared public
instance where people could drop their patches to?

> If asynchronous testing then reports a problem that needs to be
> addressed.  Worst-case the patch might be reverted but that would be
> rare.  Often a small fix will make the problem go away.

The "small fix" better be reviewed in the full context though. Sure,
usually it's fine, but usually all patches are fine, the kind of
problems the boilpot finds are convoluted though.

_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-02-01 22:58                 ` NeilBrown
  2025-02-01 23:23                   ` NeilBrown
@ 2025-02-02  7:33                   ` Alexey Lyahkov
  2025-02-03 17:33                   ` Day, Timothy
  2 siblings, 0 replies; 61+ messages in thread
From: Alexey Lyahkov @ 2025-02-02  7:33 UTC (permalink / raw)
  To: NeilBrown; +Cc: lustre-devel@lists.lustre.org


[-- Attachment #1.1: Type: text/plain, Size: 9127 bytes --]



> On 2 Feb 2025, at 01:58, NeilBrown <neilb@suse.de> wrote:
> 
> On Thu, 30 Jan 2025, Alexey Lyahkov wrote:
>> 
>> 
>>> On 29 Jan 2025, at 22:00, Day, Timothy <timday@amazon.com> wrote:
>>> 
>>>>>>> That's why we'll
>>>>>>> still support running latest Lustre on older distros. Specifically, it'll be the Lustre
>>>>>>> code from a mainline kernel combined with our lustre_compat/ compatibility
>>>>>>> code. So normal Lustre releases will be derived directly from the in-tree kernel
>>>>>>> code. This provides a path for vendors to deploy bug fixes, custom features, and
>>>>>>> allows users to optionally run the latest and greatest Lustre code.
>>>>>> 
>>>>>> And OOPS. Both codebases (in-kernel and out-of-tree) have the same sort of defines in config.h, which conflict when building the out-of-tree Lustre.
>>>>>> Some examples for MOFED hacks to solve same problem you can see in the o2iblnd:
>>>>>>>>> 
>>>>>> #if defined(EXTERNAL_OFED_BUILD) && !defined(HAVE_OFED_IB_DMA_MAP_SG_SANE)
>>>>>> #undef CONFIG_INFINIBAND_VIRT_DMA
>>>>>> #endif
>>>>>>>>> 
>>>>>> As I remember, this problem broke the ability to build Lustre as an out-of-tree module on Ubuntu 18.06 with Lustre in staging/.
>>>>> 
>>>>> I think we should be able to validate that Lustre still builds as an
>>>>> out-of-tree module by re-using a lot of the testing we already
>>>>> do today in Jenkins/Maloo.
>>>> 
>>>> Yes. We do. But it needs many extra resources. Is Amazon ready to provide such HW resources for it?
>>>> Or who will pay for it? It’s the cost of moving to the kernel.
>>> 
>>> I suppose I disagree that this testing requires many extra
>>> resources. This just validates the same things we validate
>>> today (i.e. that Lustre is functional on RHEL kernels). But the
>>> build process looks different.
>>> 
>> Ah. So you don’t expect to do any performance testing?
>> Performance testing needs a 20-node cluster with an IB HDR network (400G) and an E1000 with NVMe drives as a minimum.
>> Otherwise the servers / network will be the bottleneck.
>> And a week or so of load to be sure no regression exists. Some problems can be found only with 48h of continuous load.
>> And that is minimal performance testing.
>> I’m not even talking about scale testing with 100+ client nodes.
>> Do you think we need to drop it? If not - who will provide HW for such testing?
> 
> We at SUSE have a performance team.  We do some testing on upstream
> because that guards our future, but (I believe) we do most testing on
> our own releases to ensure we don't regress and to find problems before
> our customers do.  The key observation is that Linus' upstream kernel
> doesn't have to be perfect.  There are regressions all the time.  That
> is why we have the -stable trees.  That is how distros like SUSE and
> Red Hat make money.
> 
Thanks, I know. Red Hat and SuSE make an unusable kernel upstream and take money to make it better.


> Yes, we want to be doing correctness and performance testing on
> mainline, but we don't need to ensure we block any regressions.  We only
> need to eventually find regressions and then fix them.  Hopefully we
> find and fix before the regression gets to any of our customers (though
> in reality our customers find quite a few of our regressions).

So:
1) The Lustre perf team gets more work - running performance testing on mainline and the LTS branches to find regressions.
2) Lustre developers need to look at these regressions and fix them from time to time.
3) Since none of this gates landing, the Lustre client quality in the Linux kernel will be poor - bugs and performance issues - and that blocks using the client from mainline.
4) And the Lustre client/server separation introduces problems for development.

(1) means extra people need to be hired and extra HW needs to be involved.
(2, 4) mean extra people need to be hired.
(3) means any real customer still needs to use Lustre from a repository outside the Linux kernel, so the Lustre code in the kernel isn't used in real production.

If not - what are the benefits of spending this money? Just avoiding some non-priority porting to a new kernel that needs to be done once every several years
(when a new SuSE or Red Hat release is created).

It seems no one understands - porting Lustre to a new kernel is not such hard work, and not a priority. The single problem I remember from the beginning is just the change from page to folio.
Lustre has its own MM stack inside, since it was designed to work on different platforms - macOS, Windows (yes, some private branch had a native Windows client, but that was for Lustre 2.1 as I remember), FreeBSD... A large portion of the compatibility and portability code has been removed after the first step of moving to the upstream kernel.
But again - Lustre has its own page tree and its own paging daemon; on the server side it has its own ‘VFS’ stack on the MD servers, and its own data path with preallocated page buffers on the OSTs. It has its own network stack (LNet) with its own routing / forwarding and protocol conversion (the LNet router).
But the problems related to cache coherency between clients and distributed transactions for MD are much harder.


>> 
>> 
>>>>> All we'd need to do is kick off test/build
>>>>> sessions once the merge window closes. Based on the MOFED
>>>>> example you gave, it seems like this is solvable.
>>>> 
>>>> Sure, all of this can be solved. But what is the cost of this, and the cost of supporting these changes?
>>>> And the next question - who will pay for this? Who will provide the HW for extra testing?
>>>> So the other face of “no cost for kernel API changes” is the problem of backporting these changes and the extra testing.
>>> 
>>> I don't think the backporting will be more burdensome
>>> than porting Lustre to new kernels. And we don't have to
>>> urgently backport each upstream release to older kernels.
>> Neil B. says we need to move all development to mainline.  That
>> means the upstream kernel will be the same as the ‘master’ branch is now.
>> So each change needs to be backported to older kernels to stay in sync
>> with the server work and be ready for a Lustre release.
>> Otherwise we will have a ton of changes that need to be backported for
>> each Lustre release.
>> I see no difference from porting to upstream, except that this porting
>> from mainline to old kernels has to be handled ASAP to avoid delaying a
>> Lustre release, while porting to mainline may be delayed as it is not
>> critical for customers.
> 
> Porting to upstream doesn't work.  The motivation isn't strong enough
> and people leave it then forget it and you get too much divergence and
> it becomes harder so people do it even less.  People have tried.  People
> have failed.
> 
Porting to upstream has worked for the last 20 years, since the product started.
Yes, not for each kernel release, but that isn't needed.



> Backporting from upstream to an older kernel isn't that hard.  
> I do a
> lot of it and with the right tools it is mostly easy.  One of the
> biggest difficulties is when we try to backport only a selection of
> patches because we might miss an important dependency.  Sometimes it is
> worth it to avoid churn, sometimes it is best to apply everything
> relevant.  I assume that for the selection of kernels that whamcloud (or
> whoever) want to support, they would backport everything that could
> apply.  I think that would be largely mechanical.
> 
> Maybe it would be good for me to paint a more detailed picture of what I
> imagine would happen - assuming we do take the path of landing all of
> lustre, both client and server, upstream.
> 
> - we would change the kernel code in lustre-release so that it was
>  exactly what we plan to submit upstream.
> - we submit it and once accepted we have identical code in upstream
>  linux and lustre-release
> - we fork lustre-release to a new package called (e.g.) lustre-tools and 
>  remove all kernel code leaving just utils and documentation and 
>  test code.  The kinode.c kernel module that is in lustre/tests/kernel/
>  would need to go upstream with the rest of the kernel code I think.
>  lustre-tools would be easily accessible and buildable by anyone who
>  wants to test lustre
> - we fork lustre-release to another new package lustre-backports
>  and remove all non-kernel code from there.  We configure it to build
>  out-of-tree modules with names like "backport-lustre" and "backport-lnet"
>  and provide modprobe.conf files that alias the standard names to
>  these.  That should allow the distro-shipped modules (if any) to be
>  over-ridden when people choose to use backports.
> - upstream commits which touch lustre or lnet are automatically added to
>  lustre-backports and someone is notified to help when they don't apply
> 
> With this:
> Anyone who wants to test or use the lustre included with a particular
> kernel can do so with only the lustre-tools package.  Anyone who
> wants to use the latest lustre code with an older kernel can build and
> use lustre-backports.
> 
> There are probably rough-edges with this but I suspect they can be filed
> down.
> 
> thanks,
> NeilBrown


[-- Attachment #1.2: Type: text/html, Size: 51749 bytes --]

[-- Attachment #2: Type: text/plain, Size: 165 bytes --]

_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-02-01 23:25               ` Oleg Drokin
@ 2025-02-03 17:24                 ` Day, Timothy
  2025-02-03 19:42                   ` Oleg Drokin
  0 siblings, 1 reply; 61+ messages in thread
From: Day, Timothy @ 2025-02-03 17:24 UTC (permalink / raw)
  To: Oleg Drokin, neilb@suse.de; +Cc: lustre-devel@lists.lustre.org

> If you want to avoid surprises for your patches, I can publish my
> boilpot scripts and you can run your own instance if you have the
> hardware.
> Or we can find some sponsors to have some sort of a shared public
> instance where people could drop their patches to?

If you could publish the boilpot scripts, I think that'd be super helpful.
It'd be a lot easier to understand how to reproduce these failures.
Plus, writing the orchestration to run it in the cloud would be
straightforward, I think.

Tim Day

_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-02-01 22:58                 ` NeilBrown
  2025-02-01 23:23                   ` NeilBrown
  2025-02-02  7:33                   ` Alexey Lyahkov
@ 2025-02-03 17:33                   ` Day, Timothy
  2025-02-03 17:43                     ` Alexey Lyahkov
  2 siblings, 1 reply; 61+ messages in thread
From: Day, Timothy @ 2025-02-03 17:33 UTC (permalink / raw)
  To: NeilBrown, Alexey Lyahkov; +Cc: lustre-devel@lists.lustre.org

> Porting to upstream doesn't work. The motivation isn't strong enough
> and people leave it then forget it and you get too much divergence and
> it becomes harder so people do it even less. People have tried. People
> have failed.
>
> Backporting from upstream to an older kernel isn't that hard. I do a
> lot of it and with the right tools it is mostly easy. One of the
> biggest difficulties is when we try to backport only a selection of
> patches because we might miss an important dependency. Sometimes it is
> worth it to avoid churn, sometimes it is best to apply everything
> relevant. I assume that for the selection of kernels that whamcloud (or
> whoever) want to support, they would backport everything that could
> apply. I think that would be largely mechanical.
>
> Maybe it would be good for me to paint a more detailed picture of what I
> imagine would happen - assuming we do take the path of landing all of
> lustre, both client and server, upstream.
>
> - we would change the kernel code in lustre-release so that it was
> exactly what we plan to submit upstream.
> - we submit it and once accepted we have identical code in upstream
> linux and lustre-release
> - we fork lustre-release to a new package called (e.g.) lustre-tools and
> remove all kernel code leaving just utils and documentation and
> test code. The kinode.c kernel module that is in lustre/tests/kernel/
> would need to go upstream with the rest of the kernel code I think.
> lustre-tools would be easily accessible and buildable by anyone who
> wants to test lustre
> - we fork lustre-release to another new package lustre-backports
> and remove all non-kernel code from there. We configure it to build
> out-of-tree modules with names like "backport-lustre" and "backport-lnet"
> and provide modprobe.conf files that alias the standard names to
> these. That should allow the distro-shipped modules (if any) to be
> over-ridden when people choose to use backports.
> - upstream commits which touch lustre or lnet are automatically added to
> lustre-backports and someone is notified to help when they don't apply
>
> With this:
> Anyone who wants to test or use the lustre included with a particular
> kernel can do so with only the lustre-tools package. Anyone who
> wants to use the latest lustre code with an older kernel can build and
> use lustre-backports.
>
> There are probably rough-edges with this but I suspect they can be filed
> down.

I found an interesting data point. VAST seems to use an upstream NFS
client from an LTS kernel [1]. They have a compat layer to run that client
on older kernels. That's essentially what Lustre would be doing. They
also support Mellanox/GDS with this client. You can see exactly how
they did it by downloading the tarball. Even a large change like folio
didn’t seem to have a huge impact on the code. Just a little bit of
#ifdef'ing.

So an approach like this is feasible.

[1] https://vastnfs.vastdata.com/docs/4.0/download.html

_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-02-03 17:33                   ` Day, Timothy
@ 2025-02-03 17:43                     ` Alexey Lyahkov
  0 siblings, 0 replies; 61+ messages in thread
From: Alexey Lyahkov @ 2025-02-03 17:43 UTC (permalink / raw)
  To: Day, Timothy; +Cc: lustre-devel@lists.lustre.org



> On 3 Feb 2025, at 20:33, Day, Timothy <timday@amazon.com> wrote:
> 
>>  Even a large change like folio
> didn’t seem to have a huge impact on the code. Just a little bit of
> #ifdef'ing.
> 

Please don’t forget - NFS doesn’t have coherency control between nodes.
Better said - its coherency control is lazy and is based on LEASE locks.
And, the much harder part - NFS doesn’t have a network RAID for files, so it doesn’t need to split a multi-page folio on a lock conflict.

Alex
_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-02-03 17:24                 ` Day, Timothy
@ 2025-02-03 19:42                   ` Oleg Drokin
  2025-02-03 20:10                     ` Day, Timothy
  0 siblings, 1 reply; 61+ messages in thread
From: Oleg Drokin @ 2025-02-03 19:42 UTC (permalink / raw)
  To: timday@amazon.com, neilb@suse.de; +Cc: lustre-devel@lists.lustre.org

On Mon, 2025-02-03 at 17:24 +0000, Day, Timothy wrote:
> > If you want to avoid surprises for your patches, I can publish my
> > boilpot scripts and you can run your own instance if you have the
> > hardware.
> > Or we can find some sponsors to have some sort of a shared public
> > instance where people could drop their patches to?
> 
> If you could publish the boilpot scripts, I think that'd be super
> helpful.
> It'd be a lot easier to understand how to reproduce these failures.
> Plus, writing the orchestration to run it in the cloud would be
> straightforward, I think.

Unfortunately the cloud is not very conducive to the way boilpot operates;
the whole idea is to instantiate a gazillion virtual machines that
are run on a single physical host to overcommit the cpu (a lot!)

so I have this 2T RAM AMD box and I instantiate 240 virtual machines on
it, each gets 15G RAM and 15 CPU cores (this is the important part, if
you do not have cpu overcommit, nothing works)

inside, based on the node id (hostnames are just numbered for
simplicity), one of several scripts is run:

LOCALNUM=$(basename $(hostname) .localnet | sed 's/^centos-//')


if [ $LOCALNUM -eq 300 ] ; then # impossible to hit
	FSTYPE=zfs
	MDSSIZE=600000
	MDSCOUNT=3
	OSTCOUNT=4

	export FSTYPE
	export MDSSIZE
	export MDSCOUNT
	export OSTCOUNT
	
	export ONLY=300
#	exec /etc/rc.d/tests-sanity
exit
fi

FSTYPE=ldiskfs
MDSSIZE=400000
MDSCOUNT=1
OSTCOUNT=4
# 50% probability - ZFS
test $((RANDOM % 2)) -eq 0 && FSTYPE=zfs MDSSIZE=600000

# 33% probability - DNE
test $((RANDOM % 3)) -eq 0 && MDSCOUNT=3

export FSTYPE
export MDSSIZE
export MDSCOUNT
export OSTCOUNT

#if [ $LOCALNUM -eq 100 ] ; then
#	exec /etc/rc.d/zfs-only-mount
#fi

case $((LOCALNUM % 5)) in

0) exec /etc/rc.d/tests-racer $LOCALNUM ;;
1) exec /etc/rc.d/tests-replay $LOCALNUM ;;
2) exec /etc/rc.d/tests-recovery $LOCALNUM ;;
3) exec /etc/rc.d/tests-sanity $LOCALNUM ;;
4) exec /etc/rc.d/tests-confsanity $LOCALNUM ;;

esac


and then each tests-* does what it says.

They all begin the same:
#!/bin/bash

. /etc/rc.d/tests-config
TESTDIR=${TESTDIR:-"/home/green/git/lustre-release/lustre/tests"}
cd "$TESTDIR"
while [ ! -e ../utils/mount.lustre ] ; do sleep 10 ; done

bash /etc/rc.d/tests-common &

and then split as below:

screen -d -m bash -c 'while :; do rm -rf /tmp/* ; TIMEST=$(date +'%s') ; SLOW=yes REFORMAT=yes DURATION=$((900*3)) PTLDEBUG="vfstrace rpctrace dlmtrace neterror ha config ioctl super cache" DEBUG_SIZE=100 bash racer.sh ; TIMEEN=$(date +'%s') ; if [ $((TIMEEN - TIMEST)) -le 60 ] ; then echo Cycling too fast > /dev/kmsg ; echo c >/proc/sysrq-trigger ; fi ; sh llmountcleanup.sh ; done'

screen -d -m bash -c 'while :; do rm -rf /tmp/* ; TIMEST=$(date +'%s') ; EXCEPT="51f 60a 101 200l 300k" SLOW=yes REFORMAT=yes bash sanity.sh ; TIMEEN=$(date +'%s') ; if [ $((TIMEEN - TIMEST)) -le 60 ] ; then echo Cycling too fast > /dev/kmsg ; echo c >/proc/sysrq-trigger ; fi ; bash llmountcleanup.sh ; rm -rf /tmp/* ; SLOW=yes REFORMAT=yes bash sanityn.sh ; bash llmountcleanup.sh ; SLOW=yes REFORMAT=yes bash sanity-pfl.sh ; bash llmountcleanup.sh ; SLOW=yes REFORMAT=yes bash sanity-flr.sh ; bash llmountcleanup.sh ; SLOW=yes REFORMAT=yes bash sanity-dom.sh ; bash llmountcleanup.sh ; done'

screen -d -m bash -c 'while :; do rm -rf /tmp/* ; TIMEST=$(date +'%s') ; EXCEPT="32 36 67 76 78 102 69 106" SLOW=yes REFORMAT=yes bash conf-sanity.sh ; TIMEEN=$(date +'%s') ; if [ $((TIMEEN - TIMEST)) -le 60 ] ; then echo Cycling too fast > /dev/kmsg ; echo c >/proc/sysrq-trigger ; fi ; bash llmountcleanup.sh ; for i in `seq 0 7` ; do losetup -d /dev/loop$i ; done ; done'

screen -d -m bash -c 'while :; do rm -rf /tmp/* ; TIMEST=$(date +'%s') ; EXCEPT=101 SLOW=yes REFORMAT=yes bash recovery-small.sh ; TIMEEN=$(date +'%s') ; if [ $((TIMEEN - TIMEST)) -le 60 ] ; then echo Cycling too fast > /dev/kmsg ; echo c >/proc/sysrq-trigger ; fi ; bash llmountcleanup.sh ; done'

screen -d -m bash -c 'while :; do rm -rf /tmp/* ; TIMEST=$(date +'%s') ; SLOW=yes REFORMAT=yes bash replay-single.sh ; TIMEEN=$(date +'%s') ; if [ $((TIMEEN - TIMEST)) -le 60 ] ; then echo Cycling too fast > /dev/kmsg ; echo c >/proc/sysrq-trigger ; fi ; bash llmountcleanup.sh ; SLOW=yes REFORMAT=yes bash replay-ost-single.sh ; bash llmountcleanup.sh ; SLOW=yes REFORMAT=yes bash replay-dual.sh ; bash llmountcleanup.sh ; done'

The common scaffolding is to just catch stuck tests that have no
progress for too long:
# Seconds
TMOUT=3600
TMOUT_SHORT=2400 # 40 minutes - for ldiskfs
TMOUT_LONG=3600 # 60 minutes - for zfs
WDFILE=/tmp/watchdog.file
TOUTFILE=/tmp/test_output_file_rnd

# Initial rampup
sleep 10

while :; do
	touch ${WDFILE}
	sleep ${TMOUT}

	if [ -e ${WDFILE} ] ; then
		# Just a long test? Give it another try
		dmesg | grep 'DEBUG MARKER: ==' | tail -1 > ${TOUTFILE}_1
		if [ $FSTYPE = zfs ] ; then
			sleep ${TMOUT_LONG}
		else
			sleep ${TMOUT_SHORT}
		fi

		if [ -e ${TOUTFILE}_1 ] ; then
			dmesg | grep 'DEBUG MARKER: ==' | tail -1 > ${TOUTFILE}_2

			# If no subtest changed - force crash
			if cmp ${TOUTFILE}_1 ${TOUTFILE}_2 ; then
				# extra zfs debug
				if [ $FSTYPE = zfs ] ; then
					(echo "zpool stats on hang" ; zpool iostat 1 10 ) >/dev/kmsg 2>&1
				fi

				# and crash
				echo c >/proc/sysrq-trigger
			fi

			# We only get here if the test was different.
			# Since the progress is there - just keep monitoring
		fi
	fi
done

_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-02-03 19:42                   ` Oleg Drokin
@ 2025-02-03 20:10                     ` Day, Timothy
  2025-02-03 20:24                       ` Oleg Drokin
  0 siblings, 1 reply; 61+ messages in thread
From: Day, Timothy @ 2025-02-03 20:10 UTC (permalink / raw)
  To: Oleg Drokin, neilb@suse.de; +Cc: lustre-devel@lists.lustre.org

> Unfortunately the cloud is not very conducive to the way boilpot operates;
> the whole idea is to instantiate a gazillion virtual machines that
> are run on a single physical host to overcommit the cpu (a lot!)
>
> so I have this 2T RAM AMD box and I instantiate 240 virtual machines on
> it, each gets 15G RAM and 15 CPU cores (this is the important part, if
> you do not have cpu overcommit, nothing works)

You can do a similar thing in the cloud with bare metal instances. Normally,
you can't do nested virtualization (i.e. QEMU/KVM inside EC2). But a bare
metal instance avoids that issue. That's how I run ktest [1], which uses
QEMU/KVM. Something like m7a.metal-48xl has 192 CPU and 768G of
memory, so similar to the size you mention. What ratio of overcommit
do you have? For RAM, it seems to be 2:1. What about for CPU?
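
Spinning one up is a one-liner (the AMI id and key name below are
placeholders):

  aws ec2 run-instances --instance-type m7a.metal-48xl \
      --image-id ami-XXXXXXXX --key-name my-key --count 1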

Tim Day

[1] https://github.com/koverstreet/ktest/tree/master

_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-02-03 20:10                     ` Day, Timothy
@ 2025-02-03 20:24                       ` Oleg Drokin
  2025-02-03 20:29                         ` Oleg Drokin
  2025-02-06 18:24                         ` Day, Timothy
  0 siblings, 2 replies; 61+ messages in thread
From: Oleg Drokin @ 2025-02-03 20:24 UTC (permalink / raw)
  To: timday@amazon.com, neilb@suse.de; +Cc: lustre-devel@lists.lustre.org

On Mon, 2025-02-03 at 20:10 +0000, Day, Timothy wrote:
> > Unfortunately the cloud is not very conducive to the way boilpot
> > operates;
> > the whole idea is to instantiate a gazillion virtual machines
> > that
> > are run on a single physical host to overcommit the cpu (a lot!)
> > 
> > so I have this 2T RAM AMD box and I instantiate 240 virtual
> > machines on
> > it, each gets 15G RAM and 15 CPU cores (this is the important part,
> > if
> > you do not have cpu overcommit, nothing works)
> 
> You can do a similar thing in the cloud with bare metal instances.
> Normally,
> you can't do nested virtualization (i.e. QEMU/KVM inside EC2). But a
> bare
> metal instance avoids that issue. That's how I run ktest [1], which
> uses
> QEMU/KVM. Something like m7a.metal-48xl has 192 CPU and 768G of
> memory, so similar to the size you mention. What ratio of overcommit
> do you have? For RAM, it seems to be 2:1. What about for CPU?

don't really need memory overcommit (in fact it's somewhat
counterproductive), but since VMs typically don't use all RAM I wing it
and run somewhat more VMs than what memory permits.

as for CPU - the more overcommit the better (my box has 96 cores, so 240 VMs x 15 vCPUs = 3600 vCPUs, roughly a 37:1 ratio).

if this is to be deployed in the cloud at will, some robust
orchestration is needed host-side - I create 240 libvirt-driven VMs
with their own storage in LVM, dhcp-driven autoconf, and an NFS export
host-side with the right distro - just once per box lifetime - and
compile lustre every time I run testing (so a fresh checkout of
master-next usually).
Then configure crashdumping and an inotifywatch-based script to catch
cores, do some light processing, and ship results to the central data
collector. (might be more efficient to do using in-vm crashdumping
instead?)

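A rough sketch of the kind of catcher I mean (the paths and the
collector host here are made up, not my actual script):

#!/bin/bash
# Watch the crashdump directory; whenever a new file is fully written,
# note which VM produced it and ship it to the central collector.
WATCHDIR=/var/crash
COLLECTOR=collector.example.org:/srv/boilpot-cores
inotifywait -m -r -e close_write --format '%w%f' "$WATCHDIR" |
while read -r core ; do
	vm=$(basename "$(dirname "$core")")
	echo "$(date -Is) $vm $core" >> /var/log/boilpot-cores.log
	rsync -a "$core" "$COLLECTOR/$vm-$(basename "$core")" && rm -f "$core"
done
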
at $11/hour the m7a.metal-48xl would cost $264 to run for just one day,
and a week is an eye-watering $1848, so running this for every patch is
not super economical, I'd say.
_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-02-03 20:24                       ` Oleg Drokin
@ 2025-02-03 20:29                         ` Oleg Drokin
  2025-02-04 17:33                           ` Andreas Dilger
  2025-02-06 18:24                         ` Day, Timothy
  1 sibling, 1 reply; 61+ messages in thread
From: Oleg Drokin @ 2025-02-03 20:29 UTC (permalink / raw)
  To: timday@amazon.com, neilb@suse.de; +Cc: lustre-devel@lists.lustre.org

On Mon, 2025-02-03 at 20:24 +0000, Oleg Drokin wrote:

> at $11/hour the m7a.metal-48xl would take $264 to run for just one
> day,
> a week is an eye-watering $1848, so running this for every patch is
> not
> super economical I'd say.

x2gd metal at $5.34 per hour makes more sense as it has more RAM (and
64 CPUs is adequate I'd say) but still quite pricey if you want to run
this at any sort of scale.
_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-02-03 20:29                         ` Oleg Drokin
@ 2025-02-04 17:33                           ` Andreas Dilger
  2025-02-04 18:38                             ` Oleg Drokin
  0 siblings, 1 reply; 61+ messages in thread
From: Andreas Dilger @ 2025-02-04 17:33 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: lustre-devel@lists.lustre.org

You overlook that Tim works for AWS, so he would not actually pay to run these nodes. He could run during machine idle time while no external customer is paying for them.

I suspect with the random nature of the boilpot that it is the total number of hours of runtime that matters, not whether they are contiguous or not.  So running 24x boilpot nodes for 1h during off-peak times would likely produce the same result as 24h continuous on one node.

Cheers, Andreas

> On Feb 3, 2025, at 15:30, Oleg Drokin <green@whamcloud.com> wrote:
> 
> On Mon, 2025-02-03 at 20:24 +0000, Oleg Drokin wrote:
> 
>> at $11/hour the m7a.metal-48xl would take $264 to run for just one
>> day,
>> a week is an eye-watering $1848, so running this for every patch is
>> not
>> super economical I'd say.
> 
> x2gd metal at $5.34 per hour makes more sense as it has more RAM (and
> 64 CPUs is adequate I'd say) but still quite pricey if you want to run
> this at any sort of scale.
> _______________________________________________
> lustre-devel mailing list
> lustre-devel@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org
_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-02-04 17:33                           ` Andreas Dilger
@ 2025-02-04 18:38                             ` Oleg Drokin
  2025-02-04 23:43                               ` Patrick Farrell
  2025-02-05 12:05                               ` Andreas Dilger
  0 siblings, 2 replies; 61+ messages in thread
From: Oleg Drokin @ 2025-02-04 18:38 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: lustre-devel@lists.lustre.org

On Tue, 2025-02-04 at 17:33 +0000, Andreas Dilger wrote:
> You overlook that Tim works for AWS, so he would not actually pay to
> run these nodes. He could run during machine idle time while no external
> customer is paying for them. 

If this could be arranged that would be great of course, but I don't
want to assume something of this nature unless explicitly stated. And
who knows what sort of internal accounting there might be in place to
keep track of (and approve) uses like this too.

> I suspect with the random nature of the boilpot that it is the total
> number of hours of runtime that matters, not whether they are contiguous
> or not.  So running 24x boilpot nodes for 1h during off-peak times
> would likely produce the same result as 24h continuous on one node. 

Well, not exactly true. There need to be continuous chunks of at least
1x the longest testrun and preferably much more (2x is better as the
minimum?).
If conf-sanity takes 5 hours in this setup (cpu overcommit making
things slow and whatnot) and you always only run for an hour - we never
get to try most of conf-sanity.

Also 50 sessions of conf-sanity running in parallel 1x vs
10 sessions running conf-sanity in parallel 5x - the latter probably
wins coverage-wise because over time the other conflicting VMs would
deviate more, so the stress points in the code would fall more and more
differently, I suspect (but we can probably test this by running both
setups for long enough in parallel on the same code and see how much of
a crash rate difference it makes)

> 
> Cheers, Andreas
> 
> > On Feb 3, 2025, at 15:30, Oleg Drokin <green@whamcloud.com> wrote:
> > 
> > On Mon, 2025-02-03 at 20:24 +0000, Oleg Drokin wrote:
> > 
> > > at $11/hour the m7a.metal-48xl would take $264 to run for just
> > > one
> > > day,
> > > a week is an eye-watering $1848, so running this for every patch
> > > is
> > > not
> > > super economical I'd say.
> > 
> > x2gd metal at $5.34 per hour makes more sense as it has more RAM
> > (and
> > 64 CPUs is adequate I'd say) but still quite pricey if you want to
> > run
> > this at any sort of scale.
> > _______________________________________________
> > lustre-devel mailing list
> > lustre-devel@lists.lustre.org
> > http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-02-04 18:38                             ` Oleg Drokin
@ 2025-02-04 23:43                               ` Patrick Farrell
  2025-02-05 12:05                               ` Andreas Dilger
  1 sibling, 0 replies; 61+ messages in thread
From: Patrick Farrell @ 2025-02-04 23:43 UTC (permalink / raw)
  To: Oleg Drokin, Andreas Dilger; +Cc: lustre-devel@lists.lustre.org


[-- Attachment #1.1: Type: text/plain, Size: 3354 bytes --]

Obviously Tim would have to speak to this if he can, but that's not the way things worked at OCI and I would think it's the same at all the hyperscalers - there's no such thing as idle time, not really, or at least not like this.  They work very hard to minimize idle across the (many, many) datacenters/nodes and time is absolutely charged for internal use (perhaps charged differently, but still).  Plenty of people would love "idle" time, so there isn't any.

-Patrick
________________________________
From: lustre-devel <lustre-devel-bounces@lists.lustre.org> on behalf of Oleg Drokin <green@whamcloud.com>
Sent: Tuesday, February 4, 2025 12:38 PM
To: Andreas Dilger <adilger@ddn.com>
Cc: lustre-devel@lists.lustre.org <lustre-devel@lists.lustre.org>
Subject: Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming

On Tue, 2025-02-04 at 17:33 +0000, Andreas Dilger wrote:
> You overlook that Tim works for AWS, so he would not actually pay to
> run these nodes. He could run in machine idle times while no external
> customer is paying for them.

If this could be arranged, that would be great of course, but I don't
want to assume something of this nature unless explicitly stated. And
who knows what sort of internal accounting there might be in place to
keep track of (and approve) uses like this too.

> I suspect with the random nature of the boilpot that it is the total
> number of hours runtime that matter, not whether they are contiguous
> or not.  So running 24x boilpot nodes for 1h during off-peak times
> would likely produce the same result as 24h continuous on one node.

Well, not exactly true. The runs need to come in continuous chunks of
at least 1x the longest testrun, and preferably much more (2x as the
minimum is better?).
If conf-sanity takes 5 hours in this setup (CPU overcommit making
things slow and whatnot) and you only ever run for an hour, we never
get to try most of conf-sanity.

Also, 50 sessions running conf-sanity in parallel 1x vs.
10 sessions running conf-sanity in parallel 5x: the latter probably
wins coverage-wise, because over time the other conflicting VMs would
deviate more, so the stress points in the code would fall more and more
differently, I suspect (but we can probably test this by running both
setups in parallel on the same code for long enough and seeing how much
of a crash rate difference it makes).

>
> Cheers, Andreas
>
> > On Feb 3, 2025, at 15:30, Oleg Drokin <green@whamcloud.com> wrote:
> >
> > On Mon, 2025-02-03 at 20:24 +0000, Oleg Drokin wrote:
> >
> > > at $11/hour the m7a.metal-48xl would take $264 to run for just
> > > one
> > > day,
> > > a week is an eye-watering $1848, so running this for every patch
> > > is
> > > not
> > > super economical I'd say.
> >
> > x2gd metal at $5.34 per hour makes more sense as it has more RAM
> > (and
> > 64 CPUs is adequate I'd say) but still quite pricey if you want to
> > run
> > this at any sort of scale.
> > _______________________________________________
> > lustre-devel mailing list
> > lustre-devel@lists.lustre.org
> > http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

[-- Attachment #1.2: Type: text/html, Size: 5103 bytes --]

[-- Attachment #2: Type: text/plain, Size: 165 bytes --]

_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-02-04 18:38                             ` Oleg Drokin
  2025-02-04 23:43                               ` Patrick Farrell
@ 2025-02-05 12:05                               ` Andreas Dilger
  2025-02-06 18:36                                 ` Day, Timothy
  1 sibling, 1 reply; 61+ messages in thread
From: Andreas Dilger @ 2025-02-05 12:05 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: lustre-devel@lists.lustre.org

To better cover the skew between different VMs running different subtests, we could change the test-framework code to run the subtests in a different order (either starting at a random offset, or in a fully random order).

This would also expose some hidden assumptions and dependencies in the subtests themselves, so those would need to be fixed to avoid false test failures; but the main goal of the boilpot testing is finding crashes/deadlocks, so if a few tests fail because of minor test issues I don't think that is a blocker.
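
A rough, untested sketch of the random-order part, assuming the subtests
are still selected through the usual ONLY= mechanism and that the
run_test calls sit at the start of lines in the test script (the seed is
printed so a failing order can be reproduced):

  SEED=${SEED:-$RANDOM}
  echo "subtest order seed: $SEED"
  # collect the declared subtest numbers and shuffle them reproducibly
  ONLY=$(grep -oE '^run_test [0-9]+[a-z]*' lustre/tests/sanity.sh |
         awk '{print $2}' |
         shuf --random-source=<(yes "$SEED") | tr '\n' ' ')
  ONLY="$ONLY" bash lustre/tests/sanity.sh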

Cheers, Andreas

> On Feb 4, 2025, at 13:38, Oleg Drokin <green@whamcloud.com> wrote:
> 
> On Tue, 2025-02-04 at 17:33 +0000, Andreas Dilger wrote:
>> You overlook that Tim works for AWS, so he would not actually pay to
>> run these nodes. He could run in machine idle times while no external
>> customer is paying for them.
> 
> If this could be arranged, that would be great of course, but I don't
> want to assume something of this nature unless explicitly stated. And
> who knows what sort of internal accounting there might be in place to
> keep track of (and approve) uses like this too.
> 
>> I suspect with the random nature of the boilpot that it is the total
>> number of hours runtime that matter, not whether they are contiguous
>> or not.  So running 24x boilpot nodes for 1h during off-peak times
>> would likely produce the same result as 24h continuous on one node.
> 
> Well, not exactly true. The runs need to come in continuous chunks of
> at least 1x the longest testrun, and preferably much more (2x as the
> minimum is better?).
> If conf-sanity takes 5 hours in this setup (CPU overcommit making
> things slow and whatnot) and you only ever run for an hour, we never
> get to try most of conf-sanity.
> 
> Also, 50 sessions running conf-sanity in parallel 1x vs.
> 10 sessions running conf-sanity in parallel 5x: the latter probably
> wins coverage-wise, because over time the other conflicting VMs would
> deviate more, so the stress points in the code would fall more and more
> differently, I suspect (but we can probably test this by running both
> setups in parallel on the same code for long enough and seeing how much
> of a crash rate difference it makes).
> 
>> 
>> Cheers, Andreas
>> 
>>>> On Feb 3, 2025, at 15:30, Oleg Drokin <green@whamcloud.com> wrote:
>>> 
>>> On Mon, 2025-02-03 at 20:24 +0000, Oleg Drokin wrote:
>>> 
>>>> at $11/hour the m7a.metal-48xl would take $264 to run for just
>>>> one
>>>> day,
>>>> a week is an eye-watering $1848, so running this for every patch
>>>> is
>>>> not
>>>> super economical I'd say.
>>> 
>>> x2gd metal at $5.34 per hour makes more sense as it has more RAM
>>> (and
>>> 64 CPUs is adequate I'd say) but still quite pricey if you want to
>>> run
>>> this at any sort of scale.
>>> _______________________________________________
>>> lustre-devel mailing list
>>> lustre-devel@lists.lustre.org
>>> http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org
> 
_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-02-03 20:24                       ` Oleg Drokin
  2025-02-03 20:29                         ` Oleg Drokin
@ 2025-02-06 18:24                         ` Day, Timothy
  2025-02-06 18:47                           ` Oleg Drokin
  1 sibling, 1 reply; 61+ messages in thread
From: Day, Timothy @ 2025-02-06 18:24 UTC (permalink / raw)
  To: Oleg Drokin, neilb@suse.de; +Cc: lustre-devel@lists.lustre.org

> don't really need memory overcommit (in fact it's somewhat
> counterproductive), but since VMs typically don't use all RAM I wing it
> and run somewhat more VMs than what memory permits.
>
> as for CPU - the more overcommit, the better (my box has 96 cores).

Thanks for the pointers. Not sure which instance type I'd use, but
it's easy enough to try a bunch and see what works best.

> if this is to be deployed in the cloud at will, some robust
> orchestration is needed host-side - I create 240 libvirt driven VMs
> with their own storage in LVM, dhcp-driven autoconf, nfs export host-
> side with the right distro - just once per box lifetime - and compile
> lustre every time I run testing (so a fresh checkout of master-next
> usually).
> Then I configure crashdumping and an inotifywait-based script to catch
> cores, do some light processing, and ship results to the central data
> collector. (Might it be more efficient to do this using in-VM
> crashdumping instead?)
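
For the catching part, I'm picturing something as small as an
inotifywait loop like this (hypothetical paths and collector host,
untested):

  # watch the host-side crashdump directory and ship anything new
  inotifywait -m -r -e close_write --format '%w%f' /var/crash |
  while read -r dump; do
      xz -T0 "$dump"                  # light processing: just compress
      scp "$dump.xz" collector.example.net:/srv/boilpot/incoming/
  done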

I wrote a parallel ktest runner [1] a while back that probably does
the needed orchestration on the host side. It was originally intended
to run sanity tests faster (mostly for the OSD stuff I was working on).
But I think it could be adapted to run boilpot without much work.
It'd probably need some daemonize mode and I'd need to validate
that ktest actually captures all of the error modes we care about.

Ideally, the boilpot part would be platform agnostic. The cloud
orchestration part would just create the VM, run boilpot, and shuffle
the crash dumps off the box. My main goal (right now) is to get
something easily reproducible and get a sense of the signal/noise
ratio on boilpot. Plus, it might be interesting to try and flush out bugs
in my OSD as well [2]. It's hard to say how often I'd run it without
first seeing how effective it is.
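
The cloud part could stay nearly trivial - launch, run, pull dumps,
terminate (real AWS CLI subcommands, but hypothetical AMI/key names and
a hypothetical run-boilpot.sh entry point; untested):

  ID=$(aws ec2 run-instances --image-id ami-0123456789abcdef0 \
         --instance-type x2gd.metal --key-name boilpot --count 1 \
         --query 'Instances[0].InstanceId' --output text)
  aws ec2 wait instance-running --instance-ids "$ID"
  IP=$(aws ec2 describe-instances --instance-ids "$ID" \
         --query 'Reservations[0].Instances[0].PublicIpAddress' \
         --output text)
  ssh "ec2-user@$IP" ./run-boilpot.sh
  rsync -a "ec2-user@$IP:crashdumps/" "./dumps/$ID/"
  aws ec2 terminate-instances --instance-ids "$ID"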

Tim Day

[1] https://github.com/tim-day-387/ktest/tree/pktest
[2] https://review.whamcloud.com/c/fs/lustre-release/+/55594


_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-02-05 12:05                               ` Andreas Dilger
@ 2025-02-06 18:36                                 ` Day, Timothy
  2025-02-06 19:08                                   ` Oleg Drokin
  0 siblings, 1 reply; 61+ messages in thread
From: Day, Timothy @ 2025-02-06 18:36 UTC (permalink / raw)
  To: Andreas Dilger, Oleg Drokin; +Cc: lustre-devel@lists.lustre.org

> To better cover the skew between different VMs running different subtests, we could change the test-framework code to run the subtests in a different order (either starting at a random offset, or in a fully random order).
>
> This would also expose some hidden assumptions and dependencies in the subtests themselves, so those would need to be fixed to avoid false test failures; but the main goal of the boilpot testing is finding crashes/deadlocks, so if a few tests fail because of minor test issues I don't think that is a blocker.
>
>
> Cheers, Andreas

We could also limit each VM to running a subset of the sanity tests.
Then we could cap the length of the longest test run. I already have
some scripting magic to divide up the subtests between VMs.
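
The split itself is just a round-robin over the declared subtests,
roughly like this (assuming the ONLY= mechanism and line-leading
run_test calls again, with VM_ID in 0..N-1; untested):

  N=${N:-10}
  ONLY=$(grep -oE '^run_test [0-9]+[a-z]*' lustre/tests/sanity.sh |
         awk '{print $2}' |
         awk -v n="$N" -v id="$VM_ID" 'NR % n == id' | tr '\n' ' ')
  ONLY="$ONLY" bash lustre/tests/sanity.sh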

Tim Day

_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-02-06 18:24                         ` Day, Timothy
@ 2025-02-06 18:47                           ` Oleg Drokin
  0 siblings, 0 replies; 61+ messages in thread
From: Oleg Drokin @ 2025-02-06 18:47 UTC (permalink / raw)
  To: timday@amazon.com, neilb@suse.de; +Cc: lustre-devel@lists.lustre.org

On Thu, 2025-02-06 at 18:24 +0000, Day, Timothy wrote:
> 
> I wrote a parallel ktest runner [1] a while back that probably does
> the needed orchestration on the host side. It was originally intended
> to run sanity tests faster (mostly for the OSD stuff I was working
> on).
> But I think it could be adapted to run boilpot without much work.
> It'd probably need some daemonize mode and I'd need to validate
> that ktest actually captures all of the error modes we care about.

Aha, thanks, I'll try to look into that.

> 
> Ideally, the boilpot part would be platform agnostic. The cloud
> orchestration part would just create the VM, run boilpot, and shuffle
> the crash dumps off the box. My main goal (right now) is to get

In fact I pre-process crashdumps on the boilpot and then feed a server
with that data, and if the crash is deemed "new" or interesting enough
for some other reason, it will request more data that the boilpot will
then provide.
After all, there are only so many identical known crashes one needs to
store.
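
Conceptually the first round can be as simple as hashing the top of the
backtrace and asking the server whether it is new (made-up endpoint and
paths here, nothing like the real code):

  # hash the top call-trace frames from vmcore-dmesg, ask the server
  # before shipping the whole (large) vmcore
  SIG=$(grep -A 8 'Call Trace' vmcore-dmesg.txt |
        sed 's/\[[^]]*\]//g' | sha256sum | awk '{print $1}')
  if curl -fs "https://crashdb.example/known?sig=$SIG" | grep -q new; then
      curl -fs -F "core=@vmcore" "https://crashdb.example/upload?sig=$SIG"
  fi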
 
> something easily reproducible and get a sense of the signal/noise
> ratio on boilpot. Plus, it might be interesting to try and flush out
> bugs

After I filter out all the known and "invalid" failures, I get probably
on the order of maybe 1 crash a day, sometimes less, sometimes more.
The last one out of current master-next was totally unknown:

https://knox.linuxhacker.ru/crashdb_ui_external.py.cgi?newid=72768

This allows much higher visibility when something breaks; with the
recent https://review.whamcloud.com/c/fs/lustre-release/+/55724, all the
procfs failures were really visible (and when I changed recovery-small
to only run tests 55, 56 and 57, the frequency shot up to several
crashes every other hour).

You can also see time-sorted crashes from all sources that report to my
server here: https://knox.linuxhacker.ru/crashdb_ui_external.py.cgi

(add ?count=XXX if you want more than the default number. It also only
shows "unvetted" crashes, which is something I probably need to change
eventually, but those are the most important ones I guess)

> in my OSD as well [2]. It's hard to say how often I'd run it without
> first seeing how effective it is.
> 
> Tim Day
> 
> [1] https://github.com/tim-day-387/ktest/tree/pktest
> [2] https://review.whamcloud.com/c/fs/lustre-release/+/55594
> 
> 

_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming
  2025-02-06 18:36                                 ` Day, Timothy
@ 2025-02-06 19:08                                   ` Oleg Drokin
  0 siblings, 0 replies; 61+ messages in thread
From: Oleg Drokin @ 2025-02-06 19:08 UTC (permalink / raw)
  To: timday@amazon.com, Andreas Dilger; +Cc: lustre-devel@lists.lustre.org

On Thu, 2025-02-06 at 18:36 +0000, Day, Timothy wrote:
> > To better cover the skew between different VMs running different
> > subtests, we could change the test-framework code to run the
> > subtests in a different order (either starting at a random offset,
> > or in a fully random order).
> > 
> > This would also expose some hidden assumptions and dependencies in
> > the subtests themselves, so those would need to be fixed to avoid
> > false test failures; but the main goal of the boilpot testing is
> > finding crashes/deadlocks, so if a few tests fail because of minor
> > test issues I don't think that is a blocker.
> > 
> > 
> > Cheers, Andreas
> 
> We could also limit each VM to running a subset of the sanity tests.
> Then we could cap the length of the longest test run. I already have
> some scripting magic to divide up the subtests between VMs.

That probably means you need more VMs to cover everything well enough
(the process is probabilistic as it is).
_______________________________________________
lustre-devel mailing list
lustre-devel@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org

^ permalink raw reply	[flat|nested] 61+ messages in thread

end of thread, other threads:[~2025-02-06 19:08 UTC | newest]

Thread overview: 61+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-01-16 21:25 [lustre-devel] [LSF/MM/BPF TOPIC] [DRAFT] Lustre client upstreaming Day, Timothy
     [not found] ` <C9513675-3287-4784-90B7-AD133328C42A@ddn.com>
2025-01-17 22:46   ` Day, Timothy
2025-01-18  0:45 ` NeilBrown
2025-01-18  3:16   ` Oleg Drokin
2025-01-18 21:46     ` Day, Timothy
2025-01-19 20:46       ` Oleg Drokin
2025-01-20  4:38         ` Day, Timothy
2025-01-20  5:37           ` Oleg Drokin
2025-01-23  9:00           ` Alexey Lyahkov
2025-01-18 22:48     ` NeilBrown
2025-01-19  6:37       ` Alexey Lyahkov
2025-01-19  8:03         ` NeilBrown
2025-01-19 16:12           ` Alexey Lyahkov
2025-01-22 20:54             ` NeilBrown
2025-01-22 21:44               ` Oleg Drokin
2025-01-23  4:51               ` Alexey Lyahkov
2025-01-24 23:24                 ` NeilBrown
2025-01-25  9:09                   ` Alexey Lyahkov
2025-01-25 23:25                     ` NeilBrown
2025-01-19 21:20       ` Oleg Drokin
2025-01-24 23:12         ` NeilBrown
2025-01-25  6:40           ` Oleg Drokin
2025-02-01 22:19             ` NeilBrown
2025-02-01 23:25               ` Oleg Drokin
2025-02-03 17:24                 ` Day, Timothy
2025-02-03 19:42                   ` Oleg Drokin
2025-02-03 20:10                     ` Day, Timothy
2025-02-03 20:24                       ` Oleg Drokin
2025-02-03 20:29                         ` Oleg Drokin
2025-02-04 17:33                           ` Andreas Dilger
2025-02-04 18:38                             ` Oleg Drokin
2025-02-04 23:43                               ` Patrick Farrell
2025-02-05 12:05                               ` Andreas Dilger
2025-02-06 18:36                                 ` Day, Timothy
2025-02-06 19:08                                   ` Oleg Drokin
2025-02-06 18:24                         ` Day, Timothy
2025-02-06 18:47                           ` Oleg Drokin
2025-01-18 17:51   ` Day, Timothy
2025-01-18 22:21     ` NeilBrown
2025-01-20  3:57       ` Day, Timothy
2025-01-21 17:02         ` Patrick Farrell
2025-01-22  6:57           ` Andreas Dilger
2025-01-22 17:33             ` Day, Timothy
2025-01-22 20:48             ` NeilBrown
     [not found]   ` <E4481869-E21A-4941-9A97-8C59B7104528@ddn.com>
2025-01-18 22:25     ` NeilBrown
2025-01-20  4:54     ` Day, Timothy
2025-01-22  6:35 ` Day, Timothy
2025-01-22  7:09   ` Andreas Dilger
2025-01-22 11:12   ` Alexey Lyahkov
2025-01-22 17:17     ` Day, Timothy
2025-01-22 17:48       ` Alexey Lyahkov
2025-01-24 17:06         ` Day, Timothy
2025-01-24 19:23           ` Alexey Lyahkov
2025-01-29 19:00             ` Day, Timothy
2025-01-29 19:32               ` Alexey Lyahkov
2025-02-01 22:58                 ` NeilBrown
2025-02-01 23:23                   ` NeilBrown
2025-02-02  7:33                   ` Alexey Lyahkov
2025-02-03 17:33                   ` Day, Timothy
2025-02-03 17:43                     ` Alexey Lyahkov
2025-01-24 15:53   ` Day, Timothy

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).