public inbox for linux-rdma@vger.kernel.org
From: Spelic <spelic-9AbUPqfR1/2XDw4h08c5KA@public.gmane.org>
To: Tom Tucker <tom-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Subject: Re: NFS-RDMA hangs: connection closed (-103)
Date: Thu, 02 Dec 2010 13:16:14 +0100
Message-ID: <4CF78E0E.2040308@shiftmail.org>
In-Reply-To: <4CF6E144.1080200-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>

On 12/02/2010 12:59 AM, Tom Tucker wrote:
> Spelic,
>
> I have seen this problem before, but have not been able to reliably 
> reproduce it. When I saw the problem, there were no transport errors 
> and it appeared as if the I/O had actually completed, but that the 
> waiter was not being awoken. I was not able to reliably reproduce the 
> problem and was not able to determine if the problem was a latent bug 
> in NFS in general or a bug in the RDMA transport in particular.
>
> I will try your setup here, but I don't have a system like yours so 
> I'll have to settle for a smaller ramdisk, however, I have a few 
> questions:
>
> - Does the FS matter? For example, can you use ext[2-4] on the ramdisk 
> and not still reproduce
> - As I mentioned earlier NFS v3 vs. NFS v4
> - RAMDISK size, i.e. 2G vs. 14G
>
> Thanks,
> Tom

Hello Tom, thanks for replying.

- The FS matters to some extent: as I wrote, with ext4 it's not possible
to reproduce the bug this way, i.e. immediately and reliably. However,
ext4 will also hang eventually if you work on it for hours, so I had to
switch to IPoIB for our real work; see my previous post.

- NFS3: not tried yet. I have never tried RDMA on NFS3... do you have a
pointer to instructions?


- RAMDISK size: I am testing it.

OK, I confirm it's reproducible with a 1.5GB ramdisk.
Boot option: ramdisk_size=1572864
(1.5*1024**2=1572864.0)
Confirmed with: blockdev --getsize64 /dev/ram0 == 1610612736
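
As a sanity check on the numbers above, here is a minimal Python sketch
of the same arithmetic. It assumes that ramdisk_size= is expressed in
KiB (1024-byte units), which is consistent with the two values reported
above; the variable names are only illustrative.

```python
# Assumption: the ramdisk_size= boot option is given in KiB (1024-byte units).
KIB = 1024

ramdisk_size_kib = int(1.5 * 1024**2)       # 1.5 GiB expressed in KiB
ramdisk_size_bytes = ramdisk_size_kib * KIB  # same size expressed in bytes

print(ramdisk_size_kib)    # 1572864, the value passed as ramdisk_size=
print(ramdisk_size_bytes)  # 1610612736, the value from blockdev --getsize64
```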

Now at server side, mkfs and mount with defaults:
mkfs.xfs /dev/ram0
mount /dev/ram0 /mnt/ram
(This is a simplification over my previous email, and it's needed with a
smaller ramdisk, otherwise mkfs.xfs refuses to work. The bug is still
reproducible like this.)


DOH! Another bug:
Strangely, at the end of the test,
     ls -lh /mnt/ram
at server side shows a zerofile larger than 1.5GB: sometimes it's 3GB,
sometimes it's 2.3GB... but it's always larger than the ramdisk size.

# ll -h /mnt/ram
total 1.5G
drwxr-xr-x 2 root root   21 2010-12-02 12:54 ./
drwxr-xr-x 3 root root 4.0K 2010-11-29 23:51 ../
-rw-r--r-- 1 root root 2.3G 2010-12-02 12:59 zerofile
# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1             294G  4.1G  275G   2% /
devtmpfs              7.9G  184K  7.9G   1% /dev
none                  7.9G     0  7.9G   0% /dev/shm
none                  7.9G  100K  7.9G   1% /var/run
none                  7.9G     0  7.9G   0% /var/lock
none                  7.9G     0  7.9G   0% /lib/init/rw
/dev/ram0             1.5G  1.5G   20K 100% /mnt/ram

# dd if=/mnt/ram/zerofile | wc -c
4791480+0 records in
4791480+0 records out
2453237760
2453237760 bytes (2.5 GB) copied, 8.41821 s, 291 MB/s

It seems there is also an XFS bug here...
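
To put numbers on the discrepancy, here is a small Python sketch. The
constants are copied from the output above; apparent_and_allocated() is
a hypothetical helper (not an existing tool) that compares a file's
apparent size with the space actually allocated to it, which is one way
to see whether the extra bytes are backed by real blocks. It assumes
Linux, where st_blocks is counted in 512-byte units.

```python
import os

DEVICE_BYTES = 1610612736   # blockdev --getsize64 /dev/ram0
DD_RECORDS = 4791480        # "records out" reported by dd
DD_BLOCK = 512              # dd's default block size in bytes

def apparent_and_allocated(path):
    """Return (apparent size in bytes, allocated bytes) for path.

    Assumption: st_blocks is in 512-byte units, as on Linux.
    """
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512

read_bytes = DD_RECORDS * DD_BLOCK
print(read_bytes)                  # 2453237760, matches wc -c
print(read_bytes - DEVICE_BYTES)   # 842625024 bytes beyond the device size
```

On the server, apparent_and_allocated("/mnt/ram/zerofile") would show
whether the allocated bytes stay within the 1.5GB device while the
apparent size does not.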

This might help in triggering the bug. However, please note that ext4
(with nfs-rdma over it) also hung on us, and that was real work on HDDs
that were not full... after switching to IPoIB it didn't hang anymore.

The size problem also shows up on IPoIB: the final file is 2.3GB instead
of < 1.5GB, but nothing hangs:

# echo begin; dd if=/dev/zero of=/mnt/nfsram/zerofile bs=1M ; echo syncing now ; time sync ; echo finished
begin
dd: writing `/mnt/nfsram/zerofile': Input/output error
2497+0 records in
2496+0 records out
2617245696 bytes (2.6 GB) copied, 10.4 s, 252 MB/s
syncing now

real    0m0.057s
user    0m0.000s
sys     0m0.000s
finished
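
For anyone who wants to script this test, the following is a rough
Python equivalent of the dd step, offered as a sketch only.
fill_with_zeros() is a hypothetical helper, not part of any existing
tool: it writes 1 MiB zero-filled chunks until the kernel returns an
error and reports how many bytes were accepted before that. The limit
parameter is an addition of mine so the loop can be tried safely on a
local filesystem.

```python
import os

MIB = 1024 * 1024
DD_BYTES_OUT = 2496 * MIB   # 2617245696, the byte count dd reported above

def fill_with_zeros(path, chunk=MIB, limit=None):
    """Write zero-filled chunks to path until an OSError or until at
    least `limit` bytes have been accepted.

    Returns (bytes_accepted, error); error is None if the limit was reached.
    """
    written = 0
    buf = bytes(chunk)  # a chunk-sized buffer of zero bytes
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        while limit is None or written < limit:
            try:
                written += os.write(fd, buf)
            except OSError as exc:   # e.g. EIO or ENOSPC
                return written, exc
        return written, None
    finally:
        try:
            os.close(fd)
        except OSError:
            pass  # deferred write errors may surface at close time

print(DD_BYTES_OUT)  # 2617245696
```

Called as fill_with_zeros("/mnt/nfsram/zerofile") on the client, the
first element of the result can be compared with the 1610612736-byte
device size on the server.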

I think I noticed the same problem with a 14GB ramdisk: the file ended
up being about 15GB, but at that time I thought I had made a computation
mistake. Now with a smaller ramdisk it's more obvious.

Sooner or later someone should notify the XFS developers of the "size" bug.
For now, though, it's a good thing: the size bug might help us fix
the RDMA bug.

Thanks for your help
