From: Tao Ma <tao.ma@oracle.com>
To: ocfs2-devel@oss.oracle.com
Subject: [Ocfs2-devel] 40TB RAID and OCFS2 woes (inode64, JDB2, huge partition support, Volume might try to write to blocks beyond what jbd can address in 32 bits)
Date: Mon, 04 Jan 2010 10:36:24 +0800
Message-ID: <4B415428.3060700@oracle.com>
In-Reply-To: <760849C3-6C5E-43D2-8536-C0F47E7AFDC9@wansecurity.com>

Hi Robert,
Robert Smith wrote:
> #
> # strace tail -n 2
> #
> root@s2-replay02:~# time strace -ttt tail -n 2 /data/storage/ReplayDataVolume001/biggest_yet_file 2> strace_tail_biggest_yet_file.out
> ^C                
> real    9m22.950s
> user    0m44.300s
> sys     1m24.470s
> root@s2-replay02:~# ls -aFl strace_tail_biggest_yet_file.out
> -rw-r--r-- 1 root root 187251511 2010-01-03 19:07 strace_tail_biggest_yet_file.out
> root@s2-replay02:~# tail -f strace_tail_biggest_yet_file.out
> 1262567242.186934 lseek(3, 20962998501376, SEEK_SET) = 20962998501376
> 1262567242.187022 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
> 1262567242.187433 lseek(3, 20962998493184, SEEK_SET) = 20962998493184
> 1262567242.187524 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
> 1262567242.187928 lseek(3, 20962998484992, SEEK_SET) = 20962998484992
> 1262567242.188017 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
> 1262567242.188424 lseek(3, 20962998476800, SEEK_SET) = 20962998476800
> 1262567242.188513 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
> 1262567242.188946 lseek(3, 20962998468608, SEEK_SET) = 20962998468608
> 1262567242.189034 read(3,  <unfinished ...>
> ^C
> root@s2-replay02:~# tail -f strace_tail_biggest_yet_file.out
	Thanks for the test.
	I just tested it in my environment and found the root cause. Joel's
guess is right: the problem lies partly in tail itself and partly in the
way the file was created. ;)

IIRC, you created the test file with dd if=/dev/zero, so its content is
all zero bytes from the beginning.
After you appended a line at the end, I guess "tail" wanted to find the
line feed that precedes the requested lines. It cannot find one, so it
reads 8192 bytes again and again, working backward from the end of the
file toward the head (you can see this in the strace log). On a file of
this size that lasts a long time.
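
A minimal sketch of that backward search, to make the cost visible. It
is a simplified model and not GNU tail's actual code; the function name
backward_scan is made up for illustration, and only the 8192-byte read
size is taken from the strace log.

```python
import os

CHUNK = 8192  # read size visible in the strace log


def backward_scan(path, n_lines, chunk=CHUNK):
    """Return (reads, offset): how many chunk reads a backward newline
    search needs, and the byte offset where the last n_lines begin."""
    with open(path, "rb") as f:
        size = f.seek(0, os.SEEK_END)
        if size == 0:
            return 0, 0
        # A trailing newline ends the last line; it does not start another.
        f.seek(size - 1)
        pos = size - 1 if f.read(1) == b"\n" else size
        reads = 0
        needed = n_lines
        while pos > 0:
            start = max(0, pos - chunk)
            f.seek(start)
            buf = f.read(pos - start)
            reads += 1
            # Walk the newlines in this chunk from right to left.
            idx = len(buf)
            while True:
                idx = buf.rfind(b"\n", 0, idx)
                if idx < 0:
                    break
                needed -= 1
                if needed == 0:
                    return reads, start + idx + 1
            pos = start
        return reads, 0  # ran out of newlines: output starts at offset 0
```

On a file of zero bytes followed by two short lines, asking for one line
is satisfied by the first read, while asking for two lines forces a read
of every chunk back to offset 0, because the zero-filled region contains
no line feed.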

You can easily test what I have said.
Since you have appended 2 lines to this file, just run:

tail -n 1 /data/storage/ReplayDataVolume001/biggest_yet_file

I think it will return as fast as you expect.
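
For contrast, a rough estimate of what a scan of the whole file costs
when no line feed is found. The file size comes from the ls listing and
the per-read time from the strace timestamps quoted in this thread, so
the figure includes strace overhead and is only an order-of-magnitude
guess. The function name full_scan_estimate is made up for illustration.

```python
FILE_SIZE = 20971520000006  # bytes, from the ls listing of biggest_yet_file
CHUNK = 8192                # bytes per read() in the strace log
# Four lseek+read iterations span 1262567242.186934 .. 1262567242.188946.
SECONDS_PER_READ = (0.188946 - 0.186934) / 4


def full_scan_estimate(size=FILE_SIZE, chunk=CHUNK, per_read=SECONDS_PER_READ):
    """Return (reads, days) for reading the whole file backward in chunks."""
    reads = -(-size // chunk)  # ceiling division without floats
    days = reads * per_read / 86400
    return reads, days


if __name__ == "__main__":
    reads, days = full_scan_estimate()
    print(f"{reads} reads, roughly {days:.1f} days at the traced rate")
```

That is about 2.56 billion reads, which at the traced rate of about
0.5 ms per read works out to roughly two weeks, so neither 10 minutes
nor 159 minutes is enough for the scan to finish.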

Regards,
Tao

> 
> 
> That was 10 minutes of seek time. I can let it run longer if you want.
> 
> 
> -Robert
> 
> 
> On Jan 2, 2010, at 4:33 PM, Tao Ma wrote:
> 
>> Hi Robert,
>> 	Great thanks for your test. I haven't met with such a big volume ever before. ;)
>>
>> Robert Smith wrote:
>>> Just thought I would let you guys know that creating a 20TB file was successful. I even appended data to the end of it. Any operations on the file are completely useless because they take way too long. I appended "hello" to the end of the file with no problem, but tail -n 1 {filename} yielded nothing except a lot of disk reads after 159 minutes of waiting.
>> I don't think 159 minutes is a good number.
>> So could you please use "strace -ttt tail -n 1 biggest_yet_file"? I just want to find out which system call lasts such a long time.
>> And could you also please run this command to find the disk layout of that file:
>> echo 'stat biggest_yet_file'|debugfs.ocfs2 /dev/sdx
>> sdx is your device of course.
>>
>> btw, you said appending has no problem, so how long does it take?
>> Please also run "strace -ttt" on it.
>>
>> Regards,
>> Tao
>>
>>> I don't really even know if this is good information or common knowledge.
>>> dd bs=1000M count=20000 if=/dev/zero of=/data/storage/ReplayDataVolume001/biggest_yet_file
>>> root@s2-replay02:/data/storage/ReplayDataVolume001# ls -aFl
>>> total 1266312192
>>> drwxr-xr-x 3 root root           3896 2009-12-31 23:31 ./
>>> drwxr-xr-x 3 root root             88 2010-01-01 11:02 ../
>>> -rw-r--r-- 1 root root     1048576000 2009-12-31 23:23 big_file
>>> -rw-r--r-- 1 root root    10485760000 2009-12-31 23:24 bigger_file
>>> -rw-r--r-- 1 root root   104857600000 2009-12-31 23:28 biggest_file
>>> -rw-r--r-- 1 root root 20971520000006 2010-01-01 11:03 biggest_yet_file
>>> drwxr-xr-x 2 root root           3896 2009-12-31 11:53 lost+found/
>>> root@s2-replay02:/data/storage/ReplayDataVolume001#
>>> -Robert
>>> On Jan 1, 2010, at 5:08 AM, Joel Becker wrote:
>>>> On Fri, Jan 01, 2010 at 04:36:02AM +0900, Robert Smith wrote:
>>>>> Oh, I found it at line #2163 of fs/ocfs2/super.c.
>>>>>
>>>>> I imagine that something as simple as the following would work, but perhaps I'll wait for your feedback.
>>>>>
>>>>>
>>>>> /*
>>>>>       if (ocfs2_clusters_to_blocks(osb->sb, le32_to_cpu(di->i_clusters) - 1)
>>>>>           > (u32)~0UL) {
>>>>>               mlog(ML_ERROR, "Volume might try to write to blocks beyond "
>>>>>                    "what jbd can address in 32 bits.\n");
>>>>>               status = -EINVAL;
>>>>>               goto bail;
>>>>>       }
>>>>> */
>>>> 	That should work.  The real solution will check based on the
>>>> journal flags.  Be warned, there be tygers in here.
>>>>
>>>> Joel
>>>>
>>>> -- 
>>>>
>>>> "But all my words come back to me
>>>> In shades of mediocrity.
>>>> Like emptiness in harmony
>>>> I need someone to comfort me."
>>>>
>>>> Joel Becker
>>>> Principal Software Developer
>>>> Oracle
>>>> E-mail: joel.becker at oracle.com
>>>> Phone: (650) 506-8127
> 
