linux-nfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Mike Snitzer <snitzer@kernel.org>
To: Chuck Lever <cel@kernel.org>
Cc: linux-nfs@vger.kernel.org, Tom Talpey <tom@talpey.com>,
	NeilBrown <neil@brown.name>, Jeff Layton <jlayton@kernel.org>,
	Olga Kornievskaia <okorniev@redhat.com>,
	Dai Ngo <dai.ngo@oracle.com>
Subject: Re: [PATCH v6 4/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE
Date: Fri, 24 Oct 2025 14:30:02 -0400	[thread overview]
Message-ID: <aPvFqvtObywC2Bpt@kernel.org> (raw)
In-Reply-To: <d23cf628-50f4-441b-af65-1553e09ba153@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 3740 bytes --]

On Fri, Oct 24, 2025 at 10:24:41AM -0400, Chuck Lever wrote:
> On 10/23/25 5:23 PM, Mike Snitzer wrote:
> > On Thu, Oct 23, 2025 at 03:37:36PM -0400, Chuck Lever wrote:
> >> On 10/22/25 3:22 PM, Chuck Lever wrote:
> >>> +struct nfsd_write_dio {
> >>> +	ssize_t	start_len;	/* Length for misaligned first extent */
> >>> +	ssize_t	middle_len;	/* Length for DIO-aligned middle extent */
> >>> +	ssize_t	end_len;	/* Length for misaligned last extent */
> >>> +};
> >>> +
> >>> +static bool
> >>> +nfsd_is_write_dio_possible(loff_t offset, unsigned long len,
> >>> +			   struct nfsd_file *nf,
> >>> +			   struct nfsd_write_dio *write_dio)
> >>> +{
> >>> +	const u32 dio_blocksize = nf->nf_dio_offset_align;
> >>> +	loff_t start_end, orig_end, middle_end;
> >>> +
> >>> +	if (unlikely(dio_blocksize > PAGE_SIZE || len < dio_blocksize))
> >>> +		return false;
> >>> +
> >>> +	start_end = round_up(offset, dio_blocksize);
> >>> +	orig_end = offset + len;
> >>> +	middle_end = round_down(orig_end, dio_blocksize);
> >>> +
> >>> +	write_dio->start_len = start_end - offset;
> >>> +	write_dio->middle_len = middle_end - start_end;
> >>> +	write_dio->end_len = orig_end - middle_end;
> >>> +
> >>> +	return true;
> >>> +}
> >>> +
> >>> +static bool
> >>> +nfsd_iov_iter_aligned_bvec(const struct iov_iter *i, unsigned int addr_mask,
> >>> +			   unsigned int len_mask)
> >>> +{
> >>> +	const struct bio_vec *bvec = i->bvec;
> >>> +	size_t skip = i->iov_offset;
> >>> +	size_t size = i->count;
> >>> +
> >>> +	if (size & len_mask)
> >>> +		return false;
> >>> +	do {
> >>> +		size_t len = bvec->bv_len;
> >>> +
> >>> +		if (len > size)
> >>> +			len = size;
> >>> +		if ((unsigned long)(bvec->bv_offset + skip) & addr_mask)
> >>> +			return false;
> >>> +		bvec++;
> >>> +		size -= len;
> >>> +		skip = 0;
> >>> +	} while (size);
> >>> +
> >>> +	return true;
> >>> +}
> >>> +
> >>
> >> Hey Mike, I'm trying to understand when nfsd_is_write_dio_possible()
> >> would return true but nfsd_iov_iter_aligned_bvec() on the middle segment
> >> would return false.
> > 
> > It is always due to memory alignment (addr_mask check), never due to
> > logical alignment (len_mask check).
> > 
> > So we could remove the len_mask arg and the 'if (size & len_mask)'
> > check from nfsd_iov_iter_aligned_bvec
> 
> Fair enough.
> 
> Another question: why does nfsd_iov_iter_aligned_bvec need to walk the
> entire bvec? For current NFSD and SunRPC, the elements of the bvec
> array will be contiguous pages. After nfsd_is_write_dio_possible says
> "OK", seems like you need to check only the first element in the bvec?

I haven't analyzed if simply checking the first element of the bvec is
sufficient, that'd be a very welcomed efficiency improvement!.

I just know the checking to important and so when
iov_iter_aligned_bvec() removed by Keith I knew we still needed the
benefit of that checking.

> Or, maybe the check isn't necessary at all?

No, the memory alignment checking is certainly needed.  I've attached
the python reproducer that Jonathan Flynn coded (with ChatGPT assist,
hence the emoji like output ;) ) that really exposes the need for
memory alignment checking.

This python code effectively is the preliminary work done by the
MLperf benchmark to extract npz files as executed using MLperf's
buffered IO mode at the client.  (MLperf _can_ be executed in DIO mode
and when that is used there is no resulting misaligned memory).

Run it with:
./mlperf_npz_tool.py --npz-path /mnt/share1/test.npz

> I'm just wondering why nfsd_is_write_dio_possible by itself isn't
> sufficient. Our network transports aren't going to hand us arbitrary
> blobs in the bvec elements.

Sure, would be great if simply checking the first element sufficient.

Thanks,
Mike

[-- Attachment #2: mlperf_npz_tool.py --]
[-- Type: text/plain, Size: 7770 bytes --]

#!/usr/bin/env python3
import argparse, math, os, struct, zlib
from pathlib import Path
import numpy as np
import zipfile

# -----------------------
# Defaults (from your YAML)
# -----------------------
DEFAULT_MEAN_BYTES  = 146_600_628
DEFAULT_STDEV_BYTES = 68_341_808
DEFAULT_RESIZE      = 2_097_152     # 2 MiB
DEFAULT_SAMPLES     = 1
DEFAULT_DTYPE       = "uint8"
DEFAULT_SEED        = 10

# -----------------------
# Helpers
# -----------------------
DTYPE_SIZES = {
    "uint8": 1, "int8": 1,
    "uint16": 2, "int16": 2,
    "uint32": 4, "int32": 4,
    "float32": 4, "float64": 8,
}

def choose_target_bytes(mean_b, stdev_b, resize, randomize):
    if randomize and stdev_b > 0:
        draw = int(np.random.normal(loc=mean_b, scale=stdev_b))
        draw = max(draw, 1)
    else:
        draw = int(mean_b)
    # Round to nearest multiple of resize
    return int(round(draw / resize) * resize)

def choose_hw_for_bytes(total_bytes, samples, dtype_size):
    """
    Choose H, W making H*W*samples*dtype_size == total_bytes.
    We factor total elements and spread powers of two across H and W
    to avoid super-skinny arrays.
    """
    total_elems = total_bytes // (dtype_size * samples)
    if total_elems == 0:
        raise ValueError("Total elements computed as 0; check inputs.")
    n = total_elems
    # Factor out powers of two
    exp2 = (n & -n).bit_length() - 1
    odd  = n >> exp2
    h = 1 << (exp2 // 2)
    w = (1 << (exp2 - exp2 // 2)) * odd
    return int(h), int(w)

def save_npz(out_path: Path, *, mean_bytes, stdev_bytes, resize_bytes,
             samples, dtype_name, seed, compress, randomize):
    dtype = getattr(np, dtype_name)
    dtype_size = DTYPE_SIZES[dtype_name]

    np.random.seed(seed)
    target_bytes = choose_target_bytes(mean_bytes, stdev_bytes, resize_bytes, randomize)
    # Ensure divisibility:
    elems_per_sample = target_bytes // dtype_size // samples
    if elems_per_sample * dtype_size * samples != target_bytes:
        raise ValueError("Target bytes not divisible by dtype_size*samples; adjust params.")

    h, w = choose_hw_for_bytes(target_bytes, samples, dtype_size)

    x = np.random.randint(255, size=(h, w, samples), dtype=dtype if dtype_name == "uint8" else np.uint8)
    if dtype_name != "uint8":
        x = x.astype(dtype, copy=False)
    y = np.zeros((samples,), dtype=np.uint8)  # matches DLIO NPZ generator convention

    out_path.parent.mkdir(parents=True, exist_ok=True)
    if compress:
        np.savez_compressed(out_path, x=x, y=y)
    else:
        np.savez(out_path, x=x, y=y)

    print(f"✅ Wrote {out_path}")
    try:
        sz = out_path.stat().st_size
        print(f"   size={sz} bytes, x.shape={x.shape}, dtype={x.dtype}, samples={samples}")
    except FileNotFoundError:
        pass

def list_and_crc(npz_path: Path, deep=False):
    print(f"📂 File: {npz_path}")
    with zipfile.ZipFile(npz_path, "r") as zf:
        names = zf.namelist()
        print(f"📦 Files in archive: {names}\n")
        for name in names:
            info = zf.getinfo(name)
            print(f"--- {name} ---")
            print(f"Stored CRC32       : 0x{info.CRC:08x}")
            print(f"Compressed Size    : {info.compress_size}")
            print(f"Uncompressed Size  : {info.file_size}")
            try:
                with zf.open(info) as f:
                    _ = f.read()                 # will raise if CRC mismatch
                print("✅ CRC verified by zipfile.\n")
            except zipfile.BadZipFile as e:
                print(f"⚠️ CRC error via zipfile: {e}")
                if deep:
                    ok = deep_crc_check(npz_path, info)
                    print("🔎 Deep check      :", "✅ OK\n" if ok else "❌ Mismatch\n")
                else:
                    print("ℹ️  Re-run with --deep-check to diagnose.\n")
            except Exception as e:
                print(f"❌ Unexpected error: {e}\n")

def deep_crc_check(npz_path: Path, info: zipfile.ZipInfo) -> bool:
    """
    Manual CRC of the *uncompressed* payload.
    Parse the local file header to find the compressed bytes, then
    decompress and compute CRC32 of the uncompressed stream.
    """
    with npz_path.open("rb") as fh:
        fh.seek(info.header_offset)
        local = fh.read(30)  # fixed part of local header
        # local file header sig 'PK\x03\x04'
        if local[:4] != b'PK\x03\x04':
            return False
        # filename length, extra length
        name_len, extra_len = struct.unpack("<HH", local[26:30])
        fh.seek(info.header_offset + 30 + name_len + extra_len)
        comp = fh.read(info.compress_size)

    # Decompress if needed
    if info.compress_type == zipfile.ZIP_STORED:
        data = comp
    elif info.compress_type == zipfile.ZIP_DEFLATED:
        # ZIP uses raw DEFLATE stream (no zlib header): wbits = -15
        try:
            data = zlib.decompress(comp, -15)
        except zlib.error:
            # Some writers include zlib headers; try normal
            data = zlib.decompress(comp)
    else:
        # Other methods not handled here
        return False

    crc = zlib.crc32(data) & 0xFFFFFFFF
    print(f"Computed CRC32     : 0x{crc:08x}")
    return crc == info.CRC

# -----------------------
# CLI
# -----------------------
def parse_args():
    p = argparse.ArgumentParser(description="MLPerf-style NPZ save & check tool")
    mode = p.add_mutually_exclusive_group()
    mode.add_argument("--save-only", action="store_true", help="Only save the NPZ")
    mode.add_argument("--check-only", action="store_true", help="Only verify/show the NPZ")

    p.add_argument("--npz-path", type=Path, help="Full output/check path to NPZ (overrides --out-dir/--name)")
    p.add_argument("--out-dir", type=Path, default=Path("/mnt/hs_test"), help="Directory for output NPZ")
    p.add_argument("--name", default="sample_000000.npz", help="Filename for output NPZ")
    p.add_argument("--compress", action="store_true", help="Use compressed NPZ (deflate)")

    # Size/dtype controls
    p.add_argument("--mean-bytes", type=int, default=DEFAULT_MEAN_BYTES, help="record_length_bytes")
    p.add_argument("--stdev-bytes", type=int, default=DEFAULT_STDEV_BYTES, help="record_length_bytes_stdev")
    p.add_argument("--resize-bytes", type=int, default=DEFAULT_RESIZE, help="record_length_bytes_resize multiple")
    p.add_argument("--randomize", action="store_true", help="Draw size from N(mean,stdev) before rounding")
    p.add_argument("--samples", type=int, default=DEFAULT_SAMPLES, help="num_samples_per_file")
    p.add_argument("--dtype", choices=DTYPE_SIZES.keys(), default=DEFAULT_DTYPE, help="Data dtype for 'x'")
    p.add_argument("--seed", type=int, default=DEFAULT_SEED, help="RNG seed for reproducibility")

    # Check controls
    p.add_argument("--deep-check", action="store_true", help="When checking, manually parse & CRC the member data")
    return p.parse_args()

def main():
    args = parse_args()
    out_path = args.npz_path or (args.out_dir / args.name)

    did_save = False
    if not args.check_only:
        save_npz(
            out_path=out_path,
            mean_bytes=args.mean_bytes,
            stdev_bytes=args.stdev_bytes,
            resize_bytes=args.resize_bytes,
            samples=args.samples,
            dtype_name=args.dtype,
            seed=args.seed,
            compress=args.compress,
            randomize=args.randomize,
        )
        did_save = True

    if not args.save_only:
        if not out_path.exists():
            raise SystemExit(f"File not found for check: {out_path}")
        list_and_crc(out_path, deep=args.deep_check)
    elif did_save:
        # echo path for easy piping
        print(out_path)

if __name__ == "__main__":
    main()


  reply	other threads:[~2025-10-24 18:30 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-22 19:22 [PATCH v6 0/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
2025-10-22 19:22 ` [PATCH v6 1/5] NFSD: Make FILE_SYNC WRITEs comply with spec Chuck Lever
2025-10-22 19:22 ` [PATCH v6 2/5] NFSD: Enable return of an updated stable_how to NFS clients Chuck Lever
2025-10-22 19:22 ` [PATCH v6 3/5] NFSD: Refactor nfsd_vfs_write() Chuck Lever
2025-10-22 19:22 ` [PATCH v6 4/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Chuck Lever
2025-10-22 19:27   ` Chuck Lever
2025-10-22 21:09     ` Mike Snitzer
2025-10-22 20:58   ` Mike Snitzer
2025-10-22 21:09     ` Chuck Lever
2025-10-23 19:37   ` Chuck Lever
2025-10-23 21:23     ` Mike Snitzer
2025-10-24 14:24       ` Chuck Lever
2025-10-24 18:30         ` Mike Snitzer [this message]
2025-10-22 19:22 ` [PATCH v6 5/5] svcrdma: Mark Read chunks Chuck Lever
2025-10-22 21:25   ` Mike Snitzer
2025-10-23 13:00     ` Chuck Lever
2025-10-22 21:56 ` [PATCH v6 0/5] NFSD: Implement NFSD_IO_DIRECT for NFS WRITE Mike Snitzer

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aPvFqvtObywC2Bpt@kernel.org \
    --to=snitzer@kernel.org \
    --cc=cel@kernel.org \
    --cc=dai.ngo@oracle.com \
    --cc=jlayton@kernel.org \
    --cc=linux-nfs@vger.kernel.org \
    --cc=neil@brown.name \
    --cc=okorniev@redhat.com \
    --cc=tom@talpey.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).