From: Nix <nix@esperi.demon.co.uk>
To: ultralinux@vger.kernel.org
Cc: Linux Kernel Development <linux-kernel@vger.kernel.org>,
Trond Myklebust <trond.myklebust@fys.uio.no>
Subject: Re: strange sparc64 -> i586 intermittent but reproducible NFS write errors to one and only one fs
Date: 19 Jan 2003 20:21:57 +0000
Message-ID: <87iswkx53u.fsf@amaterasu.srvr.nix>
In-Reply-To: <200301100658.h0A6vxs14580@Port.imtp.ilyichevsk.odessa.ua>
[Large amount of quoting to provide useful context; see below.]
On Fri, 10 Jan 2003, Denis Vlasenko recommended:
> On 10 January 2003 00:56, Nix wrote:
>> When I rebooted my systems into 2.4.20 (from 2.4.19), I started
>> seeing EIO write() errors to files in my ext3 home directory
>> (NFS-mounted, exported async).
>>
>> So I knocked up a test program (included below) to try to track the
>> failing writes down, and got more confused.
>>
>> The properties of the failing writes that I've been able to determine
>> thus far are as follows; look out, they're weird as hell:
>>
>> - the failures are definitely from write(), not open().
>>
>> - writes from sparc64 to one filesystem, and only one filesystem, on
>> i586, both running 2.4.20, UDP NFSv3; rquotad and quotas are on,
>> but I am well within my quota. (quota 3.06, nfs-utils 1.0). Writes to
>> other filesystems on the same machine work, even if they too are
>> using ext3 and have user quotas for the same user.
>>
>> What differs between filesystems that work and the one that fails
>> I can't tell; other FSen *on the same block device* work... (the
>> block device is an un-RAIDed SCSI disk.)
>>
>> - local writes to the same filesystem, with the same test program,
>> never fail.
>>
>> - writes from another IA32 box (all these boxes are near-clones of
>> each other as far as software is concerned) to the NFS server box
>> never fail.
>>
>> - It happens if I mount the fs with -o soft (my default for all NFS
>> mounts for robustness-in-the-presence-of-machine-failure reasons),
>> but also if I mount with -o hard :(( Besides, the failures happen
>> far too fast for major timeout expiry to be what causes the EIOs.
>>
>> - The failure always occurs for writes that cross the 2^21 byte
>> boundary, but not all such writes fail. You seem to need to have
>> done a lot of write()s before, perhaps even starting with O_TRUNC
>> and write()ing like mad from there on up (the WRITES_PER_OPEN
>> #define is a way to test that; I've never had a failure for a file
>> opened with O_APPEND, even if it crossed the 2^21 byte boundary).
>>
>> - It happens whether _LARGEFILE_SOURCE / _FILE_OFFSET_BITS are
>> defined or not (I'd be amazed if this affected it, actually, but
>> it never hurts to check).
>>
>> - Despite the EIO, the write actually *succeeds* most of the time
>> (perhaps not all the time; again, I'm not sure yet). In fact...
>>
>> - It is quite thoroughly inconsistent. If you #define REPRODUCE to 1
>> in the test program and fill out sizes_to_reproduce[] with a set
>> of write() sizes that have caused the error in the past, the error
>> happens again, but not always:
[and more; see <http://www.uwsg.iu.edu/hypermail/linux/kernel/0301.1/0597.html>;
note that I have seen errors on writes to files of total size <2MB --- e.g.,
my mail overview databases --- but I can't reproduce them with a test program
yet.]
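
For anyone who wants to poke at this without digging out the original
test program, here is a minimal sketch of the same kind of test. It is
not the program from my first mail: the names crosses_boundary(),
count_failed_writes() and BOUNDARY are made up for illustration, and it
only covers the open-with-O_TRUNC-then-write-like-mad case, flagging
any failing write() that spans the 2^21 byte offset.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* 2^21 bytes: the file offset named in the report (hypothetical macro name). */
#define BOUNDARY ((off_t)1 << 21)

/* Return 1 if a write of len bytes starting at offset start spans BOUNDARY. */
int crosses_boundary(off_t start, size_t len)
{
    return start < BOUNDARY && start + (off_t)len > BOUNDARY;
}

/*
 * Truncate path, then issue nwrites write() calls of chunk bytes each.
 * Returns the number of failed write()/close() calls, or -1 if the
 * buffer could not be allocated or the file could not be opened.
 */
int count_failed_writes(const char *path, size_t chunk, int nwrites)
{
    assert(chunk > 0 && nwrites >= 0);

    char *buf = malloc(chunk);
    if (buf == NULL)
        return -1;
    memset(buf, 'x', chunk);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        free(buf);
        return -1;
    }

    off_t pos = 0;      /* our view of the current file offset */
    int failures = 0;
    for (int i = 0; i < nwrites; i++) {
        ssize_t n = write(fd, buf, chunk);
        if (n >= 0) {
            pos += n;   /* a short write still advances by n bytes */
            continue;
        }
        failures++;
        fprintf(stderr, "write %d at offset %lld failed: %s%s\n",
                i, (long long)pos, strerror(errno),
                crosses_boundary(pos, chunk) ? " (crosses 2^21)" : "");
        /* Ask the kernel where the offset is rather than guessing. */
        off_t cur = lseek(fd, 0, SEEK_CUR);
        if (cur < 0)
            break;
        pos = cur;
    }

    /* On NFS, deferred write errors can also surface at close(). */
    if (close(fd) < 0)
        failures++;
    free(buf);
    return failures;
}
```

Calling count_failed_writes() with a path on the affected NFS mount, a
4096-byte chunk and 600 writes takes the file a little past 2^21 bytes;
a nonzero return means at least one write() or the close() failed.
Since the failures are intermittent, a clean run proves nothing, so
expect to repeat it.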
> This beast is most probably Sparc64 or 64-bit arch specific.
That seems very likely.
> Try to pin down the first 2.4.20-preN where it appears.
> Then inform NFS and Sparc64 folks.
Done. Sorry for the delay; this box is quite hard to arrange to reboot :(
Anyway, the problem first appears in 2.4.20-pre10; I suspect this
changelog entry:

  Trond Myklebust <trond.myklebust@fys.uio.no>:
    o Workaround NFS hangs introduced in 2.4.20-pre

(so Cc:ed)
Does anyone have a pointer to this patch so I can try reversing it from
2.4.20-pre10? (I can't see it on l-k, but since I don't know what it
looks like it's hard to find it in the archives; I don't have BitKeeper
on this machine, and can't, as one of my current projects involves
version-control filesystems).
Anyone got any other ideas, suggestions, or anything?
Thanks!
--
`I knew that there had to be aliens somewhere in the universe. What I
did not know until now was that they read USENET.' --- Mark Hughes,
on those who unaccountably fail to like _A Fire Upon The Deep_