linux-xfs.vger.kernel.org archive mirror
* [QUESTION] zig build systems fails on XFS V4 volumes
@ 2024-02-03 17:50 Donald Buczek
  2024-02-04 21:56 ` Dave Chinner
  0 siblings, 1 reply; 5+ messages in thread
From: Donald Buczek @ 2024-02-03 17:50 UTC (permalink / raw)
  To: linux-xfs

Dear Experts,

I'm encountering consistent failures when building the Zig language from source on certain systems, and I'm seeking insights into the issue.

Issue Summary:

     Build fails on XFS volumes with V4 format (crc=0).
     Build succeeds on XFS volumes with V5 format (crc=1), regardless of bigtime value.

Observations:

     The failure occurs silently during Zig's native build process.
     The build system relies on timestamps for dependencies and employs parallelism.
     Debugging is challenging without debug support at this stage, and strace output hasn't been illuminating.

Speculation:

     The issue may be related to timestamp handling, although I'm not aware of significant differences between V4 and V5 formats in this regard.

Questions:

     Why might a dependency build system behave differently on XFS V4 vs. V5 volumes? Could this be a race condition, despite consistent failure on V4 and success on V5 in repeated tests?

Any guidance or suggestions would be greatly appreciated.

Thank you for your time and expertise.

Please let me know if you need any further information or clarification.

Best regards,

   Donald

-- 
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433


* Re: [QUESTION] zig build systems fails on XFS V4 volumes
  2024-02-03 17:50 [QUESTION] zig build systems fails on XFS V4 volumes Donald Buczek
@ 2024-02-04 21:56 ` Dave Chinner
  2024-02-05 13:12   ` Donald Buczek
  0 siblings, 1 reply; 5+ messages in thread
From: Dave Chinner @ 2024-02-04 21:56 UTC (permalink / raw)
  To: Donald Buczek; +Cc: linux-xfs

On Sat, Feb 03, 2024 at 06:50:31PM +0100, Donald Buczek wrote:
> Dear Experts,
> 
> I'm encountering consistent failures when building the Zig language from source on certain systems, and I'm seeking insights into the issue.
> 
> Issue Summary:
> 
>     Build fails on XFS volumes with V4 format (crc=0).
>     Build succeeds on XFS volumes with V5 format (crc=1), regardless of bigtime value.

mkfs.xfs output for a successful build vs a broken build, please!

Also a description of the hardware and storage stack configuration
would be useful.

> 
> Observations:
> 
>     The failure occurs silently during Zig's native build process.

What is the actual failure? What are the symptoms of this "silent
failure"? Please give output showing how the failure occurs, how it
is detected, etc. From there we can work to identify what to look
at next.

Everything remaining in the bug report is pure speculation, but
there's no information provided that allows us to do anything other
than speculate in return, so I'm just going to ignore it. Document
the evidence of the problem so we can understand it - speculation
about causes in the absence of evidence is simply not helpful....

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [QUESTION] zig build systems fails on XFS V4 volumes
  2024-02-04 21:56 ` Dave Chinner
@ 2024-02-05 13:12   ` Donald Buczek
  2024-02-05 21:13     ` Dave Chinner
  0 siblings, 1 reply; 5+ messages in thread
From: Donald Buczek @ 2024-02-05 13:12 UTC (permalink / raw)
  To: Dave Chinner, linux-xfs

On 2/4/24 22:56, Dave Chinner wrote:
> On Sat, Feb 03, 2024 at 06:50:31PM +0100, Donald Buczek wrote:
>> Dear Experts,
>>
>> I'm encountering consistent failures when building the Zig language from source on certain systems, and I'm seeking insights into the issue.
>>
>> Issue Summary:
>>
>>     Build fails on XFS volumes with V4 format (crc=0).
>>     Build succeeds on XFS volumes with V5 format (crc=1), regardless of bigtime value.
> 
> mkfs.xfs output for a successful build vs a broken build, please!
> 
> Also a description of the hardware and storage stack configuration
> would be useful.
> 
>>
>> Observations:
>>
>>     The failure occurs silently during Zig's native build process.
> 
> What is the actual failure? What are the symptoms of this "silent
> failure"? Please give output showing how the failure occurs, how it
> is detected, etc. From there we can work to identify what to look
> at next.
> 
> Everything remaining in the bug report is pure speculation, but
> there's no information provided that allows us to do anything other
> than speculate in return, so I'm just going to ignore it. Document
> the evidence of the problem so we can understand it - speculation
> about causes in the absence of evidence is simply not helpful....

I was actually just hoping that someone could confirm that the functionality, as visible from userspace, should be identical, apart from timing. Or, that someone might have an idea based on experience what could be causing the different behavior. This was not intended as a bug report for XFS.

But I'm grateful, of course, if you want to look deeper into it.

Through further investigation, I've pinpointed the discrepancy between functional and non-functional filesystems, narrowing it down from "V5 vs. V4" to "ftype=1 vs. ftype=0".

Detailed filesystem configurations and a comparison are available at https://owww.molgen.mpg.de/~buczek/zt/

Included are:

    test.sh: The main script setting up two XFS volumes and initiating the build process using build.sh.
    test.log: The output log from test.sh.
    build.log: Located in xfs_{ok,fail} directories, containing the build process outputs.
    cmp.sh and cmp.log: A script and its output for comparing xfs_ok and xfs_fail directories.

In build.sh there is a test which demonstrates the build failure: "stage3/lib" is not produced. After that test, the command which should produce stage3/lib is run again, this time under strace. The traces are in xfs_{ok,fail}/traces and the complete build directories are in xfs_{ok,fail}/zig-0.11.0.

cmp.sh also produces traces.cmp.txt, a (width 200) side-by-side comparison of the strace files.

The outcome has been replicated across various CPU architectures and kernel versions.

Best regards,

  Donald


> 
> -Dave.

-- 
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433



* Re: [QUESTION] zig build systems fails on XFS V4 volumes
  2024-02-05 13:12   ` Donald Buczek
@ 2024-02-05 21:13     ` Dave Chinner
  2024-02-06  6:45       ` Donald Buczek
  0 siblings, 1 reply; 5+ messages in thread
From: Dave Chinner @ 2024-02-05 21:13 UTC (permalink / raw)
  To: Donald Buczek; +Cc: linux-xfs

On Mon, Feb 05, 2024 at 02:12:43PM +0100, Donald Buczek wrote:
> On 2/4/24 22:56, Dave Chinner wrote:
> > On Sat, Feb 03, 2024 at 06:50:31PM +0100, Donald Buczek wrote:
> >> Dear Experts,
> >>
> >> I'm encountering consistent failures when building the Zig
> >> language from source on certain systems, and I'm seeking
> >> insights into the issue.
> >>
> >> Issue Summary:
> >>
> >>     Build fails on XFS volumes with V4 format (crc=0).  Build
> >>     succeeds on XFS volumes with V5 format (crc=1), regardless
> >>     of bigtime value.
> > 
> > mkfs.xfs output for a successful build vs a broken build,
> > please!
> > 
> > Also a description of the hardware and storage stack
> > configuration would be useful.
> > 
> >>
> >> Observations:
> >>
> >>     The failure occurs silently during Zig's native build
> >>     process.
> > 
> > What is the actual failure? What are the symptoms of this "silent
> > failure"? Please give output showing how the failure occurs,
> > how it is detected, etc. From there we can work to identify what
> > to look at next.
> > 
> > Everything remaining in the bug report is pure speculation, but
> > there's no information provided that allows us to do anything
> > other than speculate in return, so I'm just going to ignore it.
> > Document the evidence of the problem so we can understand it -
> > speculation about causes in the absence of evidence is simply
> > not helpful....
> 
> I was actually just hoping that someone could confirm that the
> functionality, as visible from userspace, should be identical,
> apart from timing. Or, that someone might have an idea based on
> experience what could be causing the different behavior. This was
> not intended as a bug report for XFS.

Maybe not, but as a report of "weird unexpected behaviour on XFS"
it could be an XFS issue....

[....]

> There is also a script cmp.sh and its output cmp.log, which
> compares the xfs_ok and xfs_fail directories. It also produces
> traces.cmp.txt which is a (width 200) side by side comparison of
> the strace files.

I think this one contains a smoking gun w.r.t. whatever code is
running. Near the end of the first trace comaprison, there is an
iteration of test/cases via getdents64(). They have different
behaviour, yet the directory structure is the same.

Good:

openat(3, "test/cases", O_RDONLY|O_CLOEXEC|O_DIRECTORY) = 7
lseek(7, 0, SEEK_SET)                   = 0
getdents64(7, 0x7f8d106b69b8 /* 21 entries */, 1024) = 1000
getdents64(7, 0x7f8d106b69b8 /* 21 entries */, 1024) = 1016
getdents64(7, 0x7f8d106b69b8 /* 21 entries */, 1024) = 1000
getdents64(7, 0x7f8d106b69b8 /* 23 entries */, 1024) = 1016
openat(7, "compile_errors", O_RDONLY|O_CLOEXEC|O_DIRECTORY) = 8
getdents64(8, 0x7f8d106b6de0 /* 16 entries */, 1024) = 968
getdents64(8, 0x7f8d106b6de0 /* 17 entries */, 1024) = 1008
getdents64(8, 0x7f8d106b6de0 /* 17 entries */, 1024) = 1016
getdents64(8, 0x7f8d106b6de0 /* 14 entries */, 1024) = 968
openat(8, "async", O_RDONLY|O_CLOEXEC|O_DIRECTORY) = 9
getdents64(9, 0x7f8d106b7208 /* 16 entries */, 1024) = 1000
......


Bad:

openat(3, "test/cases", O_RDONLY|O_CLOEXEC|O_DIRECTORY) = 7
lseek(7, 0, SEEK_SET)                   = 0
getdents64(7, 0x7f2593eb89b8 /* 21 entries */, 1024) = 1000
getdents64(7, 0x7f2593eb89b8 /* 21 entries */, 1024) = 1016
getdents64(7, 0x7f2593eb89b8 /* 21 entries */, 1024) = 1000
getdents64(7, 0x7f2593eb89b8 /* 23 entries */, 1024) = 1016
getdents64(7, 0x7f2593eb89b8 /* 25 entries */, 1024) = 1016
getdents64(7, 0x7f2593eb89b8 /* 20 entries */, 1024) = 1016
getdents64(7, 0x7f2593eb89b8 /* 19 entries */, 1024) = 992
getdents64(7, 0x7f2593eb89b8 /* 22 entries */, 1024) = 1016
getdents64(7, 0x7f2593eb89b8 /* 22 entries */, 1024) = 992
getdents64(7, 0x7f2593eb89b8 /* 17 entries */, 1024) = 760
getdents64(7, 0x7f2593eb89b8 /* 0 entries */, 1024) = 0

In the good case, we see test/cases being read, then the first
subdir test/cases/compile_errors being opened and read, and then a
subdir test/cases/compile_errors/async being opened and read.

IOWs, in the good case it's doing a depth first directory traversal.

In the bad case, there's no subdirectories being opened and read.

I see the same difference in other traces that involve directory
traversal.

The reason for this difference seems obvious: there's a distinct
lack of stat() calls in the ftype=0 (bad) case. dirent->d_type in
this situation will be reporting DT_UNKNOWN for all entries except
'.' and '..'. It is the application's responsibility to handle this,
as the only way to determine if a DT_UNKNOWN entry is a directory is
to stat() the pathname and look at the st_mode returned.

The code is clearly not doing this, and so I'm guessing that the zig
people have rolled their own nftw() function and didn't pay
attention to the getdents() man page:

	Currently,  only some filesystems (among them: Btrfs, ext2,
	ext3, and ext4) have full support for returning the file
	type in d_type.  All applications must properly handle a
	return of DT_UNKNOWN.

So, yeah, looks like someone didn't read the getdents man page
completely and it's not a filesystem issue.

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [QUESTION] zig build systems fails on XFS V4 volumes
  2024-02-05 21:13     ` Dave Chinner
@ 2024-02-06  6:45       ` Donald Buczek
  0 siblings, 0 replies; 5+ messages in thread
From: Donald Buczek @ 2024-02-06  6:45 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On 2/5/24 22:13, Dave Chinner wrote:
> On Mon, Feb 05, 2024 at 02:12:43PM +0100, Donald Buczek wrote:
>> On 2/4/24 22:56, Dave Chinner wrote:
>>> On Sat, Feb 03, 2024 at 06:50:31PM +0100, Donald Buczek wrote:
>>>> Dear Experts,
>>>>
>>>> I'm encountering consistent build failures with the Zig
>>>> language from source on certain systems, and I'm seeking
>>>> insights into the issue.
>>>>

> The reason for this difference seems obvious: there's a distinct
> lack of stat() calls in the ftype=0 (bad) case. dirent->d_type in
> this situation will be reporting DT_UNKNOWN for all entries except
> '.' and '..'. It is the application's responsibility to handle this,
> as the only way to determine if a DT_UNKNOWN entry is a directory is
> to stat() the pathname and look at the st_mode returned.

You've nailed it. [1][2]

I'll take this over to the zig community.

Thanks!

   Donald

[1]: https://github.com/ziglang/zig/blob/39ec3d311673716e145957d6d81f9d4ec7848471/lib/std/fs/Dir.zig#L372
[2]: https://github.com/ziglang/zig/blob/39ec3d311673716e145957d6d81f9d4ec7848471/lib/std/fs/Dir.zig#L669

> 
> The code is clearly not doing this, and so I'm guessing that the zig
> people have rolled their own nftw() function and didn't pay
> attention to the getdents() man page:
> 
> 	Currently,  only some filesystems (among them: Btrfs, ext2,
> 	ext3, and ext4) have full support for returning the file
> 	type in d_type.  All applications must properly handle a
> 	return of DT_UNKNOWN.
> 
> So, yeah, looks like someone didn't read the getdents man page
> completely and it's not a filesystem issue.
> 
> -Dave.

-- 
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433

