From: Mike Snitzer <snitzer@kernel.org>
To: Chuck Lever <cel@kernel.org>
Cc: linux-nfs@vger.kernel.org, Tom Talpey <tom@talpey.com>,
NeilBrown <neil@brown.name>, Jeff Layton <jlayton@kernel.org>,
Olga Kornievskaia <okorniev@redhat.com>,
Dai Ngo <dai.ngo@oracle.com>
Subject: Re: [PATCH v6 4/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
Date: Fri, 24 Oct 2025 14:30:02 -0400 [thread overview]
Message-ID: <aPvFqvtObywC2Bpt@kernel.org> (raw)
In-Reply-To: <d23cf628-50f4-441b-af65-1553e09ba153@kernel.org>
[-- Attachment #1: Type: text/plain, Size: 3740 bytes --]
On Fri, Oct 24, 2025 at 10:24:41AM -0400, Chuck Lever wrote:
> On 10/23/25 5:23 PM, Mike Snitzer wrote:
> > On Thu, Oct 23, 2025 at 03:37:36PM -0400, Chuck Lever wrote:
> >> On 10/22/25 3:22 PM, Chuck Lever wrote:
> >>> +struct nfsd_write_dio {
> >>> + ssize_t start_len; /* Length for misaligned first extent */
> >>> + ssize_t middle_len; /* Length for DIO-aligned middle extent */
> >>> + ssize_t end_len; /* Length for misaligned last extent */
> >>> +};
> >>> +
> >>> +static bool
> >>> +nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
> >>> + struct nfsd_file *nf,
> >>> + struct nfsd_write_dio *write_dio)
> >>> +{
> >>> + const u32 dio_blocksize = nf->nf_dio_offset_align;
> >>> + loff_t start_end, orig_end, middle_end;
> >>> +
> >>> + if (unlikely(dio_blocksize > PAGE_SIZE || len < dio_blocksize))
> >>> + return false;
> >>> +
> >>> + start_end = round_up(offset, dio_blocksize);
> >>> + orig_end = offset + len;
> >>> + middle_end = round_down(orig_end, dio_blocksize);
> >>> +
> >>> + write_dio->start_len = start_end - offset;
> >>> + write_dio->middle_len = middle_end - start_end;
> >>> + write_dio->end_len = orig_end - middle_end;
> >>> +
> >>> + return true;
> >>> +}
> >>> +
> >>> +static bool
> >>> +nfsd_iov_iter_aligned_bvec(const struct iov_iter *i, unsigned int addr_mask,
> >>> + unsigned int len_mask)
> >>> +{
> >>> + const struct bio_vec *bvec = i->bvec;
> >>> + size_t skip = i->iov_offset;
> >>> + size_t size = i->count;
> >>> +
> >>> + if (size & len_mask)
> >>> + return false;
> >>> + do {
> >>> + size_t len = bvec->bv_len;
> >>> +
> >>> + if (len > size)
> >>> + len = size;
> >>> + if ((unsigned long)(bvec->bv_offset + skip) & addr_mask)
> >>> + return false;
> >>> + bvec++;
> >>> + size -= len;
> >>> + skip = 0;
> >>> + } while (size);
> >>> +
> >>> + return true;
> >>> +}
> >>> +
> >>
> >> Hey Mike, I'm trying to understand when nfsd_is_write_dio_possible()
> >> would return true but nfsd_iov_iter_aligned_bvec() on the middle segment
> >> would return false.
> >
> > It is always due to memory alignment (addr_mask check), never due to
> > logical alignment (len_mask check).
> >
> > So we could remove the len_mask arg and the 'if (size & len_mask)'
> > check from nfsd_iov_iter_aligned_bvec
>
> Fair enough.
>
> Another question: why does nfsd_iov_iter_aligned_bvec need to walk the
> entire bvec? For current NFSD and SunRPC, the elements of the bvec
> array will be contiguous pages. After nfsd_is_write_dio_possible says
> "OK", seems like you need to check only the first element in the bvec?
I haven't analyzed if simply checking the first element of the bvec is
sufficient, that'd be a very welcomed efficiency improvement!.
I just know the checking to important and so when
iov_iter_aligned_bvec() removed by Keith I knew we still needed the
benefit of that checking.
> Or, maybe the check isn't necessary at all?
No, the memory alignment checking is certainly needed. I've attached
the python reproducer that Jonathan Flynn coded (with ChatGPT assist,
hence the emoji like output ;) ) that really exposes the need for
memory alignment checking.
This python code effectively is the preliminary work done by the
MLperf benchmark to extract npz files as executed using MLperf's
buffered IO mode at the client. (MLperf _can_ be executed in DIO mode
and when that is used there is no resulting misaligned memory).
Run it with:
./mlperf_npz_tool.py --npz-path /mnt/share1/test.npz
> I'm just wondering why nfsd_is_write_dio_possible by itself isn't
> sufficient. Our network transports aren't going to hand us arbitrary
> blobs in the bvec elements.
Sure, would be great if simply checking the first element sufficient.
Thanks,
Mike
[-- Attachment #2: mlperf_npz_tool.py --]
[-- Type: text/plain, Size: 7770 bytes --]
#!/usr/bin/env python3
import argparse, math, os, struct, zlib
from pathlib import Path
import numpy as np
import zipfile
# -----------------------
# Defaults (from your YAML)
# -----------------------
DEFAULT_MEAN_BYTES = 146_600_628
DEFAULT_STDEV_BYTES = 68_341_808
DEFAULT_RESIZE = 2_097_152 # 2 MiB
DEFAULT_SAMPLES = 1
DEFAULT_DTYPE = "uint8"
DEFAULT_SEED = 10
# -----------------------
# Helpers
# -----------------------
DTYPE_SIZES = {
"uint8": 1, "int8": 1,
"uint16": 2, "int16": 2,
"uint32": 4, "int32": 4,
"float32": 4, "float64": 8,
}
def choose_target_bytes(mean_b, stdev_b, resize, randomize):
if randomize and stdev_b > 0:
draw = int(np.random.normal(loc=mean_b, scale=stdev_b))
draw = max(draw, 1)
else:
draw = int(mean_b)
# Round to nearest multiple of resize
return int(round(draw / resize) * resize)
def choose_hw_for_bytes(total_bytes, samples, dtype_size):
"""
Choose H, W making H*W*samples*dtype_size == total_bytes.
We factor total elements and spread powers of two across H and W
to avoid super-skinny arrays.
"""
total_elems = total_bytes // (dtype_size * samples)
if total_elems == 0:
raise ValueError("Total elements computed as 0; check inputs.")
n = total_elems
# Factor out powers of two
exp2 = (n & -n).bit_length() - 1
odd = n >> exp2
h = 1 << (exp2 // 2)
w = (1 << (exp2 - exp2 // 2)) * odd
return int(h), int(w)
def save_npz(out_path: Path, *, mean_bytes, stdev_bytes, resize_bytes,
samples, dtype_name, seed, compress, randomize):
dtype = getattr(np, dtype_name)
dtype_size = DTYPE_SIZES[dtype_name]
np.random.seed(seed)
target_bytes = choose_target_bytes(mean_bytes, stdev_bytes, resize_bytes, randomize)
# Ensure divisibility:
elems_per_sample = target_bytes // dtype_size // samples
if elems_per_sample * dtype_size * samples != target_bytes:
raise ValueError("Target bytes not divisible by dtype_size*samples; adjust params.")
h, w = choose_hw_for_bytes(target_bytes, samples, dtype_size)
x = np.random.randint(255, size=(h, w, samples), dtype=dtype if dtype_name == "uint8" else np.uint8)
if dtype_name != "uint8":
x = x.astype(dtype, copy=False)
y = np.zeros((samples,), dtype=np.uint8) # matches DLIO NPZ generator convention
out_path.parent.mkdir(parents=True, exist_ok=True)
if compress:
np.savez_compressed(out_path, x=x, y=y)
else:
np.savez(out_path, x=x, y=y)
print(f"✅ Wrote {out_path}")
try:
sz = out_path.stat().st_size
print(f" size={sz} bytes, x.shape={x.shape}, dtype={x.dtype}, samples={samples}")
except FileNotFoundError:
pass
def list_and_crc(npz_path: Path, deep=False):
print(f"📂 File: {npz_path}")
with zipfile.ZipFile(npz_path, "r") as zf:
names = zf.namelist()
print(f"📦 Files in archive: {names}\n")
for name in names:
info = zf.getinfo(name)
print(f"--- {name} ---")
print(f"Stored CRC32 : 0x{info.CRC:08x}")
print(f"Compressed Size : {info.compress_size}")
print(f"Uncompressed Size : {info.file_size}")
try:
with zf.open(info) as f:
_ = f.read() # will raise if CRC mismatch
print("✅ CRC verified by zipfile.\n")
except zipfile.BadZipFile as e:
print(f"⚠️ CRC error via zipfile: {e}")
if deep:
ok = deep_crc_check(npz_path, info)
print("🔎 Deep check :", "✅ OK\n" if ok else "❌ Mismatch\n")
else:
print("ℹ️ Re-run with --deep-check to diagnose.\n")
except Exception as e:
print(f"❌ Unexpected error: {e}\n")
def deep_crc_check(npz_path: Path, info: zipfile.ZipInfo) -> bool:
"""
Manual CRC of the *uncompressed* payload.
Parse the local file header to find the compressed bytes, then
decompress and compute CRC32 of the uncompressed stream.
"""
with npz_path.open("rb") as fh:
fh.seek(info.header_offset)
local = fh.read(30) # fixed part of local header
# local file header sig 'PK\x03\x04'
if local[:4] != b'PK\x03\x04':
return False
# filename length, extra length
name_len, extra_len = struct.unpack("<HH", local[26:30])
fh.seek(info.header_offset + 30 + name_len + extra_len)
comp = fh.read(info.compress_size)
# Decompress if needed
if info.compress_type == zipfile.ZIP_STORED:
data = comp
elif info.compress_type == zipfile.ZIP_DEFLATED:
# ZIP uses raw DEFLATE stream (no zlib header): wbits = -15
try:
data = zlib.decompress(comp, -15)
except zlib.error:
# Some writers include zlib headers; try normal
data = zlib.decompress(comp)
else:
# Other methods not handled here
return False
crc = zlib.crc32(data) & 0xFFFFFFFF
print(f"Computed CRC32 : 0x{crc:08x}")
return crc == info.CRC
# -----------------------
# CLI
# -----------------------
def parse_args():
p = argparse.ArgumentParser(description="MLPerf-style NPZ save & check tool")
mode = p.add_mutually_exclusive_group()
mode.add_argument("--save-only", action="store_true", help="Only save the NPZ")
mode.add_argument("--check-only", action="store_true", help="Only verify/show the NPZ")
p.add_argument("--npz-path", type=Path, help="Full output/check path to NPZ (overrides --out-dir/--name)")
p.add_argument("--out-dir", type=Path, default=Path("/mnt/hs_test"), help="Directory for output NPZ")
p.add_argument("--name", default="sample_000000.npz", help="Filename for output NPZ")
p.add_argument("--compress", action="store_true", help="Use compressed NPZ (deflate)")
# Size/dtype controls
p.add_argument("--mean-bytes", type=int, default=DEFAULT_MEAN_BYTES, help="record_length_bytes")
p.add_argument("--stdev-bytes", type=int, default=DEFAULT_STDEV_BYTES, help="record_length_bytes_stdev")
p.add_argument("--resize-bytes", type=int, default=DEFAULT_RESIZE, help="record_length_bytes_resize multiple")
p.add_argument("--randomize", action="store_true", help="Draw size from N(mean,stdev) before rounding")
p.add_argument("--samples", type=int, default=DEFAULT_SAMPLES, help="num_samples_per_file")
p.add_argument("--dtype", choices=DTYPE_SIZES.keys(), default=DEFAULT_DTYPE, help="Data dtype for 'x'")
p.add_argument("--seed", type=int, default=DEFAULT_SEED, help="RNG seed for reproducibility")
# Check controls
p.add_argument("--deep-check", action="store_true", help="When checking, manually parse & CRC the member data")
return p.parse_args()
def main():
args = parse_args()
out_path = args.npz_path or (args.out_dir / args.name)
did_save = False
if not args.check_only:
save_npz(
out_path=out_path,
mean_bytes=args.mean_bytes,
stdev_bytes=args.stdev_bytes,
resize_bytes=args.resize_bytes,
samples=args.samples,
dtype_name=args.dtype,
seed=args.seed,
compress=args.compress,
randomize=args.randomize,
)
did_save = True
if not args.save_only:
if not out_path.exists():
raise SystemExit(f"File not found for check: {out_path}")
list_and_crc(out_path, deep=args.deep_check)
elif did_save:
# echo path for easy piping
print(out_path)
if __name__ == "__main__":
main()
next prev parent reply other threads:[~2025-10-24 18:30 UTC|newest]
Thread overview: 17+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-10-22 19:22 [PATCH v6 0/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
2025-10-22 19:22 ` [PATCH v6 1/5] NFSD: Make FILE_SYNC WRITEs comply with spec Chuck Lever
2025-10-22 19:22 ` [PATCH v6 2/5] NFSD: Enable return of an updated stable_how to NFS clients Chuck Lever
2025-10-22 19:22 ` [PATCH v6 3/5] NFSD: Refactor nfsd_vfs_write() Chuck Lever
2025-10-22 19:22 ` [PATCH v6 4/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
2025-10-22 19:27 ` Chuck Lever
2025-10-22 21:09 ` Mike Snitzer
2025-10-22 20:58 ` Mike Snitzer
2025-10-22 21:09 ` Chuck Lever
2025-10-23 19:37 ` Chuck Lever
2025-10-23 21:23 ` Mike Snitzer
2025-10-24 14:24 ` Chuck Lever
2025-10-24 18:30 ` Mike Snitzer [this message]
2025-10-22 19:22 ` [PATCH v6 5/5] svcrdma: Mark Read chunks Chuck Lever
2025-10-22 21:25 ` Mike Snitzer
2025-10-23 13:00 ` Chuck Lever
2025-10-22 21:56 ` [PATCH v6 0/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Mike Snitzer
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aPvFqvtObywC2Bpt@kernel.org \
--to=snitzer@kernel.org \
--cc=cel@kernel.org \
--cc=dai.ngo@oracle.com \
--cc=jlayton@kernel.org \
--cc=linux-nfs@vger.kernel.org \
--cc=neil@brown.name \
--cc=okorniev@redhat.com \
--cc=tom@talpey.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).