* large files
@ 2004-05-17 19:48 Bernd Schubert
2004-05-17 20:12 ` Chris Mason
0 siblings, 1 reply; 19+ messages in thread
From: Bernd Schubert @ 2004-05-17 19:48 UTC (permalink / raw)
To: Reiserfs mail-list
[-- Attachment #1: signed data --]
[-- Type: text/plain, Size: 983 bytes --]
Hello,
I'm currently testing our new server and though it will primarily not serve
really large files (about 40-60 users will have a quota of 25GB each on a 2TB
array), I'm still testing the performance for large files.
So I created a file of about 300GB, and the problem now is removing it.
Removing it took much more than 15 minutes. Here's the relevant top line:
5012 root 18 0 368 368 312 D 21.9 0.0 5:48 rm
Since I didn't expect it to take so much time, I didn't measure the time to
delete this file.
system specifications:
- dual opteron 242 (1600 MHz)
- linux-2.4.26 with all patches from Chris, no further patches
- reiserfs-3.6 format
The partition with the 300GB file has a size of 1.7TB.
Any ideas what's going on?
Thanks,
Bernd
--
Bernd Schubert
Physikalisch Chemisches Institut / Theoretische Chemie
Universität Heidelberg
INF 229
69120 Heidelberg
e-mail: bernd.schubert@pci.uni-heidelberg.de
[-- Attachment #2: signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: large files
2004-05-17 19:48 large files Bernd Schubert
@ 2004-05-17 20:12 ` Chris Mason
2004-05-17 20:25 ` Bernd Schubert
2004-05-18 13:42 ` Bernd Schubert
0 siblings, 2 replies; 19+ messages in thread
From: Chris Mason @ 2004-05-17 20:12 UTC (permalink / raw)
To: Bernd Schubert; +Cc: Reiserfs mail-list
On Mon, 2004-05-17 at 15:48, Bernd Schubert wrote:
> Hello,
>
> I'm currently testing our new server and though it will primarily not serve
> really large files (about 40-60 users will have a quota of 25GB each on a 2TB
> array), I'm still testing the performance for large files.
>
> So I created a file of about 300GB, and the problem now is removing it.
> Removing it took much more than 15 minutes. Here's the relevant top line:
>
> 5012 root 18 0 368 368 312 D 21.9 0.0 5:48 rm
>
> Since I didn't expect it to take so much time, I didn't measure the time to
> delete this file.
>
> system specifications:
> - dual opteron 242 (1600 MHz)
> - linux-2.4.26 with all patches from Chris, no further patches
> - reiserfs-3.6 format
>
> The partition with the 300GB file has a size of 1.7TB.
This is most likely a combination of metadata fragmentation and the fact
that during deletes, 2.4.x reiserfs ends up reading one block at a time.
As a comparison data point, could you please try 2.6.6-mm3? I realize
you don't want to run this kernel in production, but it would tell us if
I understand the problems at hand.
-chris
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: large files
2004-05-17 20:12 ` Chris Mason
@ 2004-05-17 20:25 ` Bernd Schubert
2004-05-18 13:42 ` Bernd Schubert
1 sibling, 0 replies; 19+ messages in thread
From: Bernd Schubert @ 2004-05-17 20:25 UTC (permalink / raw)
To: Reiserfs mail-list
[-- Attachment #1: signed data --]
[-- Type: text/plain, Size: 536 bytes --]
> As a comparison data point, could you please try 2.6.6-mm3? I realize
> you don't want to run this kernel in production, but it would tell us if
> I understand the problems at hand.
I will do this during the next days. Currently the system is not running in
production yet, so rebooting other kernel versions is no problem.
Thanks,
Bernd
--
Bernd Schubert
Physikalisch Chemisches Institut / Theoretische Chemie
Universität Heidelberg
INF 229
69120 Heidelberg
e-mail: bernd.schubert@pci.uni-heidelberg.de
[-- Attachment #2: signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: large files
2004-05-17 20:12 ` Chris Mason
2004-05-17 20:25 ` Bernd Schubert
@ 2004-05-18 13:42 ` Bernd Schubert
2004-05-18 13:57 ` Chris Mason
1 sibling, 1 reply; 19+ messages in thread
From: Bernd Schubert @ 2004-05-18 13:42 UTC (permalink / raw)
To: Chris Mason; +Cc: Reiserfs mail-list
[-- Attachment #1: signed data --]
[-- Type: text/plain, Size: 1963 bytes --]
Hello Chris,
>
> As a comparison data point, could you please try 2.6.6-mm3? I realize
> you don't want to run this kernel in production, but it would tell us if
> I understand the problems at hand.
The results for 2.6.6-mm3 are below; we are almost considering running this
kernel version.
Here are two other interesting facts:
1.) During the file creation in 2.4.26 the load on the system was around 3-4,
whereas in 2.6.6-mm3 the load was at about 8-9.
2.) When the dd file creation process finished (2.4.26 was running) the system
became so unresponsive that the drbd connection timed out and a resync
process automatically started when the system became responsive again. I
don't have any comparison to 2.6.6-mm3 since we would need another drbd
version. Also, I don't know if this happened when dd finished or when
rm started, since both were running from a script.
Here are the measured times for file creation and file deletion:
=====> 2.4.26:
taylor:~# cat test.out-2.4.26
time dd if=/dev/zero of=/worka/testfile.dd bs=1M count=300000
300000+0 records in
300000+0 records out
314572800000 bytes transferred in 5746.266841 seconds (54743855 bytes/sec)
real 95m46.275s
user 0m0.760s
sys 29m57.800s
time rm -fr /worka/testfile.dd
real 11m20.589s
user 0m0.000s
sys 4m59.850s
=====> 2.6.6-mm3
taylor:~# cat test.out-2.6.6-mm3
time dd if=/dev/zero of=/worka/testfile.dd bs=1M count=300000
300000+0 records in
300000+0 records out
314572800000 bytes transferred in 4902.873869 seconds (64160900 bytes/sec)
real 81m46.211s
user 0m1.172s
sys 22m26.010s
time rm -fr /worka/testfile.dd
real 1m38.000s
user 0m0.000s
sys 1m5.872s
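To put the two rm timings side by side, here is a minimal C sketch that converts the "real" values printed by `time` into seconds and takes their ratio. The helper names `real_to_seconds` and `speedup` are made up for this illustration and are not part of any tool mentioned in the thread.

```c
#include <assert.h>
#include <stdio.h>

/* Convert a `time`-style "real" value such as "11m20.589s" to seconds.
 * Returns a negative value if the string does not have that shape. */
static double real_to_seconds(const char *s)
{
    int minutes;
    double seconds;

    if (sscanf(s, "%dm%lfs", &minutes, &seconds) != 2)
        return -1.0;
    return minutes * 60.0 + seconds;
}

/* Ratio of two wall-clock times: how many times shorter `after` is
 * than `before`. Both strings must be valid `time` output. */
static double speedup(const char *before, const char *after)
{
    double b = real_to_seconds(before);
    double a = real_to_seconds(after);

    assert(b > 0.0 && a > 0.0);
    return b / a;
}
```

With the figures above, `speedup("11m20.589s", "1m38.000s")` comes out at roughly 6.9, while the same calculation for the two dd runs gives only about 1.17, so the difference between the kernels is concentrated in the delete path.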
Do you have any ideas how we could improve 2.4.x?
Thanks,
Bernd
--
Bernd Schubert
Physikalisch Chemisches Institut / Theoretische Chemie
Universität Heidelberg
INF 229
69120 Heidelberg
e-mail: bernd.schubert@pci.uni-heidelberg.de
[-- Attachment #2: signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: large files
2004-05-18 13:42 ` Bernd Schubert
@ 2004-05-18 13:57 ` Chris Mason
2004-05-18 14:49 ` Bernd Schubert
0 siblings, 1 reply; 19+ messages in thread
From: Chris Mason @ 2004-05-18 13:57 UTC (permalink / raw)
To: Bernd Schubert; +Cc: Reiserfs mail-list
On Tue, 2004-05-18 at 09:42, Bernd Schubert wrote:
> Hello Chris,
>
> >
> > As a comparison data point, could you please try 2.6.6-mm3? I realize
> > you don't want to run this kernel in production, but it would tell us if
> > I understand the problems at hand.
>
> The results for 2.6.6-mm3 are below; we are almost considering running this
> kernel version.
>
> Here are two other interesting facts:
>
> 1.) During the file creation in 2.4.26 the load on the system was around 3-4,
> whereas in 2.6.6-mm3 the load was at about 8-9.
>
Which procs contributed to this load? The simple dd should have kept
the load at one.
> 2.) When the dd file creation process finished (2.4.26 was running) the system
> became so unresponsive that the drbd connection timed out and a resync
> process automatically started when the system became responsive again. I
> don't have any comparison to 2.6.6-mm3 since we would need another drbd
> version. Also, I don't know if this happened when dd finished or when
> rm started, since both were running from a script.
>
Probably the rm.
[ 2.6.6-mm3 is much faster ]
> Do you have any ideas how we could improve 2.4.x?
>
2.6.6-mm has a few key improvements. There's less metadata
fragmentation thanks to some block allocator fixes. More importantly,
during the rm, metadata blocks are read in 16 at a time instead of 1 at
a time. I'd be happy to give someone pointers on porting the metadata
readahead bits back to 2.4.
-chris
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: large files
2004-05-18 13:57 ` Chris Mason
@ 2004-05-18 14:49 ` Bernd Schubert
2004-05-18 15:07 ` Chris Mason
0 siblings, 1 reply; 19+ messages in thread
From: Bernd Schubert @ 2004-05-18 14:49 UTC (permalink / raw)
To: Chris Mason; +Cc: Reiserfs mail-list
[-- Attachment #1: signed data --]
[-- Type: text/plain, Size: 2736 bytes --]
> > 1.) During the file creation in 2.4.26 the load on the system was around
> > 3-4, whereas in 2.6.6-mm3 the load was at about 8-9.
>
> Which procs contributed to this load? The simple dd should have kept
> the load at one.
That's all I can see from top (2.4.26):
top - 16:45:14 up 4:47, 1 user, load average: 3.30, 2.80, 2.09
Tasks: 80 total, 1 running, 79 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0% user, 24.0% system, 0.0% nice, 76.0% idle
Mem: 3104428k total, 3018816k used, 85612k free, 228936k buffers
Swap: 1951888k total, 0k used, 1951888k free, 2662272k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1043 root 19 0 1392 364 320 D 40.7 0.0 3:11.55 dd
7 root 9 0 0 0 0 D 3.7 0.0 4:02.48 kupdated
6 root 9 0 0 0 0 D 1.7 0.0 2:15.18 bdflush
5 root 9 0 0 0 0 S 1.0 0.0 2:47.13 kswapd
17 root 9 0 0 0 0 D 0.3 0.0 0:46.62 kreiserfsd
1052 root 9 0 1040 1040 820 R 0.3 0.0 0:00.02 top
taylor:~# cat /proc/stat
cpu 402 0 538233 2938065
cpu0 221 0 267698 1470431
cpu1 181 0 270535 1467634
page 215506778 453499588
swap 1 0
intr 200179775 1738350 2 0 9 4 0 2 0 4 2 0 0 0 0 13 5 0 0 0 0 0 0 0 0 5017728
187006 0 0 0 193236650 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0
disk_io: (3,0):(4,4,32,0,0) (8,0):
(5058268,3638518,431013524,1419750,906999184)
ctxt 179608469
btime 1084874257
processes 1054
Unfortunately, I have no idea how to interpret those numbers.
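As a rough aid for reading the first line of that output: on 2.4 kernels the aggregate `cpu` line carries four counters (user, nice, system, idle) in clock ticks since boot. The C sketch below turns such a line into percentages. The struct and function names are hypothetical, and the code assumes the four-field 2.4 layout shown above; later kernels append more fields.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Share of total CPU time since boot, in percent. */
struct cpu_share {
    double user, nice, system, idle;
};

/* Parse a 2.4-style aggregate "cpu" line (user, nice, system, idle ticks).
 * Per-CPU lines such as "cpu0 ..." are rejected.
 * Returns 0 on success, -1 on failure. */
static int parse_cpu_line(const char *line, struct cpu_share *out)
{
    unsigned long long user, nice, system, idle, total;

    assert(line != NULL && out != NULL);
    if (strncmp(line, "cpu ", 4) != 0)
        return -1;
    if (sscanf(line + 4, "%llu %llu %llu %llu",
               &user, &nice, &system, &idle) != 4)
        return -1;

    total = user + nice + system + idle;
    if (total == 0)
        return -1;

    out->user   = 100.0 * (double)user   / (double)total;
    out->nice   = 100.0 * (double)nice   / (double)total;
    out->system = 100.0 * (double)system / (double)total;
    out->idle   = 100.0 * (double)idle   / (double)total;
    return 0;
}
```

Fed the line `cpu 402 0 538233 2938065`, this gives roughly 15.5% system and 84.5% idle, with user time close to zero. These are averages over both CPUs since boot, which is why they differ from the instantaneous 24% system that top reports.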
> > Do you have any ideas how we could improve 2.4.x?
>
> 2.6.6-mm has a few key improvements. There's less metadata
> fragmentation thanks to some block allocator fixes. More importantly,
> during the rm, metadata blocks are read in 16 at a time instead of 1 at
> a time. I'd be happy to give someone pointers on porting the metadata
> readahead bits back to 2.4.
I certainly have neither the knowledge nor the time to do that.
Cheers,
Bernd
[-- Attachment #2: signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: large files
2004-05-18 14:49 ` Bernd Schubert
@ 2004-05-18 15:07 ` Chris Mason
2004-05-18 15:19 ` Hans Reiser
0 siblings, 1 reply; 19+ messages in thread
From: Chris Mason @ 2004-05-18 15:07 UTC (permalink / raw)
To: Bernd Schubert; +Cc: Reiserfs mail-list
On Tue, 2004-05-18 at 10:49, Bernd Schubert wrote:
> Unfortunately, I have no idea how to interpret those numbers.
>
Well, on 2.6, you've got one or more pdflush daemons that might
contribute to the load as well.
>
> > > Do you have any ideas how we could improve 2.4.x?
> >
> > 2.6.6-mm has a few key improvements. There's less metadata
> > fragmentation thanks to some block allocator fixes. More importantly,
> > during the rm, metadata blocks are read in 16 at a time instead of 1 at
> > a time. I'd be happy to give someone pointers on porting the metadata
> > readahead bits back to 2.4.
>
> I certainly have neither the knowledge nor the time to do that.
>
I'd suggest contacting namesys. Really all the patch needs is some
reject / fuzz resolution and testing.
-chris
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: large files
2004-05-18 15:07 ` Chris Mason
@ 2004-05-18 15:19 ` Hans Reiser
2004-05-18 15:40 ` Chris Mason
0 siblings, 1 reply; 19+ messages in thread
From: Hans Reiser @ 2004-05-18 15:19 UTC (permalink / raw)
To: Bernd Schubert; +Cc: Chris Mason, Reiserfs mail-list
Chris Mason wrote:
>On Tue, 2004-05-18 at 10:49, Bernd Schubert wrote:
>
>
>
>>Unfortunately, I have no idea how to interpret those numbers.
>>
>>
>>
>
>Well, on 2.6, you've got one or more pdflush daemons that might
>contribute to the load as well.
>
>
>
>>>>Do you have any ideas how we could improve 2.4.x?
>>>>
>>>>
>>>2.6.6-mm has a few key improvements. There's less metadata
>>>fragmentation thanks to some block allocator fixes. More importantly,
>>>during the rm, metadata blocks are read in 16 at a time instead of 1 at
>>>a time. I'd be happy to give someone pointers on porting the metadata
>>>readahead bits back to 2.4.
>>>
>>>
>>I certainly have neither the knowledge nor the time to do that.
>>
>>
>>
>
>I'd suggest contacting namesys. Really all the patch needs is some
>reject / fuzz resolution and testing.
>
>-chris
>
>
>
>
>
>
Patch? Tell me more please.;-)
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: large files
2004-05-18 15:19 ` Hans Reiser
@ 2004-05-18 15:40 ` Chris Mason
0 siblings, 0 replies; 19+ messages in thread
From: Chris Mason @ 2004-05-18 15:40 UTC (permalink / raw)
To: Hans Reiser; +Cc: Bernd Schubert, Reiserfs mail-list
[-- Attachment #1: Type: text/plain, Size: 1229 bytes --]
On Tue, 2004-05-18 at 11:19, Hans Reiser wrote:
> >I'd suggest contacting namesys. Really all the patch needs is some
> >reject / fuzz resolution and testing.
> >
> Patch? Tell me more please.;-)
>
>
;-) I believe the slow down during rm comes from
prepare_for_delete_or_cut reading in one btree leaf at a time. I
changed the (previously disabled) btree readahead code to automatically
do readahead for directory reads and unlinks. I think I've got this
solved in 2.6.x reiserv3, Bernd needs it against 2.4.x + data logging.
In the directory read case, it does forward looking readahead, meaning
that when you read offset N in the object, it assumes you're going to
next read offset N+1.
For unlinks, it does backwards looking readahead. If you read offset N
in the object, it assumes you're going to next read offset N-1. This is
because we delete bytes in a file starting from the last byte and keep
decrementing until the file is empty. You should be able to increase
performance further by doing readahead in 32 or 64 block chunks instead
of 16.
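The direction logic described above can be modelled in user space with a short C sketch. This is a simplified illustration, not the kernel code: `pick_reada_blocks`, `enum reada_dir` and the flat `children` array are assumptions made for the example, and the check that the neighbouring key still belongs to the same object is left out.

```c
#include <assert.h>
#include <stddef.h>

#define READA_MAX 16   /* batch size; the patch uses 16 for SEARCH_BY_KEY_READA */

enum reada_dir { READA_FORWARD = 1, READA_BACKWARD = -1 };

/* Collect up to READA_MAX block numbers, starting at children[pos] and
 * walking in direction dir until the batch is full or the edge of the
 * array is reached. Returns how many block numbers were stored in out. */
static size_t pick_reada_blocks(const unsigned long *children, size_t n,
                                size_t pos, enum reada_dir dir,
                                unsigned long *out)
{
    size_t count = 0;

    assert(children != NULL && out != NULL && pos < n);
    while (count < READA_MAX) {
        out[count++] = children[pos];
        if (dir == READA_FORWARD) {
            if (pos + 1 == n)
                break;          /* no further child to the right */
            pos++;
        } else {
            if (pos == 0)
                break;          /* no further child to the left */
            pos--;
        }
    }
    return count;
}
```

A delete that starts at the last child of a node would call this with `READA_BACKWARD` and get the 16 blocks it is about to visit, highest offset first. Raising `READA_MAX` to 32 or 64 is the tuning knob mentioned above.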
The patch is attached. This one was in 2.6.6-mm3, but the 2.4.x data
logging code is fairly close to 2.6.6, especially code outside
fs/reiserfs/journal.c
-chris
[-- Attachment #2: reiserfs-search_reada-5.patch --]
[-- Type: text/x-patch, Size: 6850 bytes --]
From: Chris Mason <mason@suse.com>
Walking the btree can trigger a number of single block synchronous reads.
This patch does btree readahead during operations that are likely to be long
and sequential. So far, that only includes directory reads and truncates, but
it can make both much faster.
---
25-akpm/fs/reiserfs/dir.c | 1
25-akpm/fs/reiserfs/stree.c | 93 ++++++++++++++++++++++++++----------
25-akpm/include/linux/reiserfs_fs.h | 6 +-
3 files changed, 74 insertions(+), 26 deletions(-)
diff -puN fs/reiserfs/dir.c~reiserfs-search_reada-5 fs/reiserfs/dir.c
--- 25/fs/reiserfs/dir.c~reiserfs-search_reada-5 Fri Apr 23 14:38:39 2004
+++ 25-akpm/fs/reiserfs/dir.c Fri Apr 23 14:38:39 2004
@@ -64,6 +64,7 @@ static int reiserfs_readdir (struct file
/* reiserfs_warning (inode->i_sb, "reiserfs_readdir 1: f_pos = %Ld", filp->f_pos);*/
+ path_to_entry.reada = PATH_READA;
while (1) {
research:
/* search the directory item, containing entry with specified key */
diff -puN fs/reiserfs/stree.c~reiserfs-search_reada-5 fs/reiserfs/stree.c
--- 25/fs/reiserfs/stree.c~reiserfs-search_reada-5 Fri Apr 23 14:38:39 2004
+++ 25-akpm/fs/reiserfs/stree.c Fri Apr 23 14:38:39 2004
@@ -596,26 +596,29 @@ static int is_tree_node (struct buffer_h
-#ifdef SEARCH_BY_KEY_READA
+#define SEARCH_BY_KEY_READA 16
/* The function is NOT SCHEDULE-SAFE! */
-static void search_by_key_reada (struct super_block * s, int blocknr)
+static void search_by_key_reada (struct super_block * s,
+ struct buffer_head **bh,
+ unsigned long *b, int num)
{
- struct buffer_head * bh;
+ int i,j;
- if (blocknr == 0)
- return;
-
- bh = sb_getblk (s, blocknr);
-
- if (!buffer_uptodate (bh)) {
- ll_rw_block (READA, 1, &bh);
+ for (i = 0 ; i < num ; i++) {
+ bh[i] = sb_getblk (s, b[i]);
+ }
+ for (j = 0 ; j < i ; j++) {
+ /*
+ * note, this needs attention if we are getting rid of the BKL
+ * you have to make sure the prepared bit isn't set on this buffer
+ */
+ if (!buffer_uptodate(bh[j]))
+ ll_rw_block(READA, 1, bh + j);
+ brelse(bh[j]);
}
- bh->b_count --;
}
-#endif
-
/**************************************************************************
* Algorithm SearchByKey *
* look for item in the Disk S+Tree by its key *
@@ -657,6 +660,9 @@ int search_by_key (struct super_block *
int n_node_level, n_retval;
int right_neighbor_of_leaf_node;
int fs_gen;
+ struct buffer_head *reada_bh[SEARCH_BY_KEY_READA];
+ unsigned long reada_blocks[SEARCH_BY_KEY_READA];
+ int reada_count = 0;
#ifdef CONFIG_REISERFS_CHECK
int n_repeat_counter = 0;
@@ -691,19 +697,25 @@ int search_by_key (struct super_block *
p_s_last_element = PATH_OFFSET_PELEMENT(p_s_search_path, ++p_s_search_path->path_length);
fs_gen = get_generation (p_s_sb);
-#ifdef SEARCH_BY_KEY_READA
- /* schedule read of right neighbor */
- search_by_key_reada (p_s_sb, right_neighbor_of_leaf_node);
-#endif
-
/* Read the next tree node, and set the last element in the path to
have a pointer to it. */
- if ( ! (p_s_bh = p_s_last_element->pe_buffer =
- sb_bread(p_s_sb, n_block_number)) ) {
+ if ((p_s_bh = p_s_last_element->pe_buffer =
+ sb_getblk(p_s_sb, n_block_number)) ) {
+ if (!buffer_uptodate(p_s_bh) && reada_count > 1) {
+ search_by_key_reada (p_s_sb, reada_bh,
+ reada_blocks, reada_count);
+ }
+ ll_rw_block(READ, 1, &p_s_bh);
+ wait_on_buffer(p_s_bh);
+ if (!buffer_uptodate(p_s_bh))
+ goto io_error;
+ } else {
+io_error:
p_s_search_path->path_length --;
pathrelse(p_s_search_path);
return IO_ERROR;
}
+ reada_count = 0;
if (expected_level == -1)
expected_level = SB_TREE_HEIGHT (p_s_sb);
expected_level --;
@@ -784,11 +796,36 @@ int search_by_key (struct super_block *
position in the node. */
n_block_number = B_N_CHILD_NUM(p_s_bh, p_s_last_element->pe_position);
-#ifdef SEARCH_BY_KEY_READA
- /* if we are going to read leaf node, then calculate its right neighbor if possible */
- if (n_node_level == DISK_LEAF_NODE_LEVEL + 1 && p_s_last_element->pe_position < B_NR_ITEMS (p_s_bh))
- right_neighbor_of_leaf_node = B_N_CHILD_NUM(p_s_bh, p_s_last_element->pe_position + 1);
-#endif
+ /* if we are going to read leaf nodes, try for read ahead as well */
+ if ((p_s_search_path->reada & PATH_READA) &&
+ n_node_level == DISK_LEAF_NODE_LEVEL + 1)
+ {
+ int pos = p_s_last_element->pe_position;
+ int limit = B_NR_ITEMS(p_s_bh);
+ struct key *le_key;
+
+ if (p_s_search_path->reada & PATH_READA_BACK)
+ limit = 0;
+ while(reada_count < SEARCH_BY_KEY_READA) {
+ if (pos == limit)
+ break;
+ reada_blocks[reada_count++] = B_N_CHILD_NUM(p_s_bh, pos);
+ if (p_s_search_path->reada & PATH_READA_BACK)
+ pos--;
+ else
+ pos++;
+
+ /*
+ * check to make sure we're in the same object
+ */
+ le_key = B_N_PDELIM_KEY(p_s_bh, pos);
+ if (le32_to_cpu(le_key->k_objectid) !=
+ p_s_key->on_disk_key.k_objectid)
+ {
+ break;
+ }
+ }
+ }
}
}
@@ -1778,6 +1815,12 @@ void reiserfs_do_truncate (struct reiser
space, this file would have this file size */
n_file_size = offset + bytes - 1;
}
+ /*
+ * are we doing a full truncate or delete, if so
+ * kick in the reada code
+ */
+ if (n_new_file_size == 0)
+ s_search_path.reada = PATH_READA | PATH_READA_BACK;
if ( n_file_size == 0 || n_file_size < n_new_file_size ) {
goto update_and_out ;
diff -puN include/linux/reiserfs_fs.h~reiserfs-search_reada-5 include/linux/reiserfs_fs.h
--- 25/include/linux/reiserfs_fs.h~reiserfs-search_reada-5 Fri Apr 23 14:38:39 2004
+++ 25-akpm/include/linux/reiserfs_fs.h Fri Apr 23 14:38:39 2004
@@ -1238,8 +1238,12 @@ excessive effort to avoid disturbing the
gods only know how we are going to SMP the code that uses them.
znodes are the way! */
+#define PATH_READA 0x1 /* do read ahead */
+#define PATH_READA_BACK 0x2 /* read backwards */
+
struct path {
int path_length; /* Length of the array above. */
+ int reada;
struct path_element path_elements[EXTENDED_MAX_HEIGHT]; /* Array of the path elements. */
int pos_in_item;
};
@@ -1247,7 +1251,7 @@ struct path {
#define pos_in_item(path) ((path)->pos_in_item)
#define INITIALIZE_PATH(var) \
-struct path var = {.path_length = ILLEGAL_PATH_ELEMENT_OFFSET,}
+struct path var = {.path_length = ILLEGAL_PATH_ELEMENT_OFFSET, .reada = 0,}
/* Get path element by path and path position. */
#define PATH_OFFSET_PELEMENT(p_s_path,n_offset) ((p_s_path)->path_elements +(n_offset))
_
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Large files
@ 2003-06-10 22:38 Ray Lee
0 siblings, 0 replies; 19+ messages in thread
From: Ray Lee @ 2003-06-10 22:38 UTC (permalink / raw)
To: root, Linux Kernel
[-- Attachment #1: Type: text/plain, Size: 1286 bytes --]
> With 32 bit return values, ix86 Linux has a file-size limitation
> which is currently about 0x7fffffff. Unfortunately, instead of
> returning from a write() with a -1 and errno being set, so that
> a program can do something about it, write() executes a signal(25)
> which kills the task even if trapped
Works For Me(tm)
ray:~/work/test/signals$ ls
sig.c
ray:~/work/test/signals$ tcc -run sig.c
write errored
seem to have gone overboard, switching to next log file...
write errored
seem to have gone overboard, switching to next log file...
write errored
seem to have gone overboard, switching to next log file...
ray:~/work/test/signals$ ls -l
total 4
-rw------- 1 ray ray 2147483647 Jun 10 15:35 log.0
-rw------- 1 ray ray 2147483647 Jun 10 15:35 log.1
-rw------- 1 ray ray 2147483647 Jun 10 15:35 log.2
-rw------- 1 ray ray 259 Jun 10 15:35 log.3
-rw-r--r-- 1 ray ray 2119 Jun 10 15:33 sig.c
ray:~/work/test/signals$
Test code attached. Please excuse the somewhat haphazard structure, it
was tossed together from code I'd written for other projects.
Ray
[-- Attachment #2: sig.c --]
[-- Type: text/x-c, Size: 2119 bytes --]
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/select.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <termios.h>
#include <errno.h>
#include <signal.h>

typedef void sigfunc_t(int);

/*
 * Doubles as setup hook and signal handler:
 *   param < 0  : remember -param as the write end of the pipe
 *   param == 0 : close that descriptor
 *   param > 0  : a signal number; forward it down the pipe as one byte
 */
void signal_handler(int param) {
    static int fd;
    unsigned char sig;

    if (fd && param==0) {
        close(fd);
        return;
    }
    if (param < 1) {
        fd = -param;
        return;
    }
    if (fd) {
        sig=param;
        while (-1 == write(fd, &sig, 1) && EINTR == errno)
            ;
    }
}

sigfunc_t *connect_signal(int signo, sigfunc_t *func) {
    struct sigaction act, oact;

    act.sa_handler = func;
    sigemptyset(&act.sa_mask);
    act.sa_flags = 0;
    if (sigaction(signo, &act, &oact) < 0)
        return SIG_ERR;
    return oact.sa_handler;
}

int open_signal_file(void) {
    int fd[2];

    if (pipe(fd) < 0)
        return 0; // main() treats 0 as failure
    signal_handler(-fd[1]); // hand off the write side of the pipe
    connect_signal(SIGHUP, signal_handler);
    connect_signal(SIGINT, signal_handler);
    connect_signal(SIGXFSZ, signal_handler);
    return fd[0]; //return the read side of the pipe
}

void close_signal_file(int fd) {
    signal_handler(0);
    close(fd);
}

/*
 * Note: this relies on a 32-bit off_t. Where off_t is 64 bits, the write
 * past 0x7fffffff succeeds, no SIGXFSZ arrives, and select() blocks.
 */
int main(void) {
    fd_set read_fds;
    char buf[256] = {0}, fname[]="log.0";
    int sigfd, fd, i;
    unsigned long long bytes_left=3ull * (1ull<<31) + 256ull;

    sigfd = open_signal_file();
    if (!sigfd)
        return 1;
    while (bytes_left) {
        int len;

        fd = creat(fname, S_IRUSR | S_IWUSR);
        if (fd == -1)
            return 2;
        len = 0x7fffffff;
        if (len > bytes_left)
            len = bytes_left;
        ftruncate(fd, len);
        bytes_left -= len;
        if (!bytes_left)
            break;
        lseek(fd, 0, SEEK_END);
        i = write(fd, buf, 256);
        if (i>0)
            bytes_left -= i;
        if (i<0)
            puts("write errored");
        FD_ZERO(&read_fds); // the set must be cleared before FD_SET
        FD_SET(sigfd, &read_fds);
        select(sigfd + 1, &read_fds, NULL, NULL, NULL);
        if (FD_ISSET(sigfd, &read_fds)) {
            int sigs = read(sigfd, buf, 256);
            if (sigs > 0)
                for (i=0; i<sigs; i++)
                    switch(buf[i]) {
                    case SIGINT:
                        return 0;
                    case SIGXFSZ:
                        puts("seem to have gone overboard, switching to next log file...");
                        break;
                    default:
                        break;
                    }
        }
        close(fd);
        fname[4]++;
    }
    close_signal_file(sigfd);
    return 0;
}
^ permalink raw reply [flat|nested] 19+ messages in thread
* Large files
@ 2003-06-10 13:57 Richard B. Johnson
2003-06-10 14:16 ` ZCane, Ed (Test Purposes)
` (2 more replies)
0 siblings, 3 replies; 19+ messages in thread
From: Richard B. Johnson @ 2003-06-10 13:57 UTC (permalink / raw)
To: Linux kernel
With 32 bit return values, ix86 Linux has a file-size limitation
which is currently about 0x7fffffff. Unfortunately, instead of
returning from a write() with a -1 and errno being set, so that
a program can do something about it, write() executes a signal(25)
which kills the task even if trapped. Is this one of those <expletive
deleted> POSIX requirements or is somebody going to fix it?
Cheers,
Dick Johnson
Penguin : Linux version 2.4.20 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Large files
2003-06-10 13:57 Richard B. Johnson
@ 2003-06-10 14:16 ` ZCane, Ed (Test Purposes)
2003-06-10 14:17 ` Matti Aarnio
2003-06-10 20:14 ` David Schwartz
2 siblings, 0 replies; 19+ messages in thread
From: ZCane, Ed (Test Purposes) @ 2003-06-10 14:16 UTC (permalink / raw)
To: Linux kernel
Dear All,
I'm allocating a large buffer at boot-time, from the kernel, using
alloc_bootmem_low_pages, which I wish to use for DMA from a device driver.
For example, the bootmem returns an address of 0xc0006000. This all works
fine, but...
What is the mechanism for communicating this address to user-space
processes, and mapping it to a virtual address, so that they can use my
buffer?
I want user-space processes to be able to read and write this block of
memory without having to be suid root (if possible).
Cheers,
Ed
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Large files
2003-06-10 13:57 Richard B. Johnson
2003-06-10 14:16 ` ZCane, Ed (Test Purposes)
@ 2003-06-10 14:17 ` Matti Aarnio
2003-06-10 15:12 ` Richard B. Johnson
2003-06-10 20:14 ` David Schwartz
2 siblings, 1 reply; 19+ messages in thread
From: Matti Aarnio @ 2003-06-10 14:17 UTC (permalink / raw)
To: Richard B. Johnson; +Cc: Linux kernel
On Tue, Jun 10, 2003 at 09:57:57AM -0400, Richard B. Johnson wrote:
> With 32 bit return values, ix86 Linux has a file-size limitation
> which is currently about 0x7fffffff. Unfortunately, instead of
> returning from a write() with a -1 and errno being set, so that
> a program can do something about it, write() executes a signal(25)
> which kills the task even if trapped. Is this one of those <expletive
> deleted> POSIX requirements or is somebody going to fix it?
http://www.sas.com/standards/large.file/
#define SIGXFSZ 25 /* File size limit exceeded (4.2 BSD). */
from fs/buffer.c:
err = -EFBIG;
limit = current->rlim[RLIMIT_FSIZE].rlim_cur;
if (limit != RLIM_INFINITY && size > (loff_t)limit) {
send_sig(SIGXFSZ, current, 0);
goto out;
}
if (size > inode->i_sb->s_maxbytes)
goto out;
> Cheers,
> Dick Johnson
> Penguin : Linux version 2.4.20 on an i686 machine (797.90 BogoMips).
/Matti Aarnio
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Large files
2003-06-10 14:17 ` Matti Aarnio
@ 2003-06-10 15:12 ` Richard B. Johnson
2003-06-10 17:25 ` Martin Mares
` (2 more replies)
0 siblings, 3 replies; 19+ messages in thread
From: Richard B. Johnson @ 2003-06-10 15:12 UTC (permalink / raw)
To: Matti Aarnio; +Cc: Linux kernel
On Tue, 10 Jun 2003, Matti Aarnio wrote:
> On Tue, Jun 10, 2003 at 09:57:57AM -0400, Richard B. Johnson wrote:
> > With 32 bit return values, ix86 Linux has a file-size limitation
> > which is currently about 0x7fffffff. Unfortunately, instead of
> > returning from a write() with a -1 and errno being set, so that
> > a program can do something about it, write() executes a signal(25)
> > which kills the task even if trapped. Is this one of those <expletive
> > deleted> POSIX requirements or is somebody going to fix it?
>
> http://www.sas.com/standards/large.file/
>
> #define SIGXFSZ 25 /* File size limit exceeded (4.2 BSD). */
>
> from fs/buffer.c:
>
> err = -EFBIG;
> limit = current->rlim[RLIMIT_FSIZE].rlim_cur;
> if (limit != RLIM_INFINITY && size > (loff_t)limit) {
> send_sig(SIGXFSZ, current, 0);
> goto out;
> }
> if (size > inode->i_sb->s_maxbytes)
> goto out;
>
>
On the system that fails, there are no ulimits and it's the root
account, therefore I don't know how to set the above limit to
RLIM_INFINITY (~0LU). It's also version 2.4.20. I don't think
it has anything to do with 'rlim' shown above. In any event
sending a signal when the file-size exceeds some level is preposterous.
The write should return -1 and errno should have been set to EFBIG
(in user space). That allows the user's database to create another
file and keep on trucking instead of blowing up and destroying the
user's inventory or whatever else was in process.
FYI, this caused the failure of a samba server for M$ stuff. It
gives the impression of Linux being defective. This is not good.
Cheers,
Dick Johnson
Penguin : Linux version 2.4.20 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Large files
2003-06-10 15:12 ` Richard B. Johnson
@ 2003-06-10 17:25 ` Martin Mares
2003-06-10 18:14 ` Andreas Dilger
2003-06-10 22:12 ` Rob Landley
2 siblings, 0 replies; 19+ messages in thread
From: Martin Mares @ 2003-06-10 17:25 UTC (permalink / raw)
To: Richard B. Johnson; +Cc: Matti Aarnio, Linux kernel
Hello!
> On the system that fails, there are no ulimits and it's the root
> account, therefore I don't know how to set the above limit to
> RLIM_INFINITY (~0LU). It's also version 2.4.20. I don't think
> it has anything to do with 'rlim' shown above.
I think it has -- login (or PAM) in most distributions sets the
file size limit to 2GB instead of RLIM_INFINITY. If you are root,
try `ulimit -f unlimited' to see if it helps.
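The same check can be made from inside a program rather than from the shell. The following is a minimal sketch using the standard getrlimit() call; the wrapper name `soft_fsize_limit` and its return convention are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>

/* Query the soft RLIMIT_FSIZE of the calling process.
 * Returns 1 if the limit is RLIM_INFINITY, 0 if it is finite (the value
 * in bytes is stored in *bytes), or -1 if getrlimit() fails. */
static int soft_fsize_limit(unsigned long long *bytes)
{
    struct rlimit rl;

    assert(bytes != NULL);
    if (getrlimit(RLIMIT_FSIZE, &rl) != 0)
        return -1;
    if (rl.rlim_cur == RLIM_INFINITY)
        return 1;
    *bytes = (unsigned long long)rl.rlim_cur;
    return 0;
}
```

A daemon could call this once at startup and log the result, which would make it obvious when login or PAM has applied a finite limit without the administrator knowing.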
> sending a signal when the file-size exceeds some level is preposterous.
No, it's just the definition of the rlimits. Not leaving them at
RLIM_INFINITY by default is preposterous.
Have a nice fortnight
--
Martin `MJ' Mares <mj@ucw.cz> http://atrey.karlin.mff.cuni.cz/~mj/
Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth
COBOL -- Compiles Only Because Of Luck
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Large files
2003-06-10 15:12 ` Richard B. Johnson
2003-06-10 17:25 ` Martin Mares
@ 2003-06-10 18:14 ` Andreas Dilger
2003-06-10 22:12 ` Rob Landley
2 siblings, 0 replies; 19+ messages in thread
From: Andreas Dilger @ 2003-06-10 18:14 UTC (permalink / raw)
To: Richard B. Johnson; +Cc: Matti Aarnio, Linux kernel
On Jun 10, 2003 11:12 -0400, Richard B. Johnson wrote:
> > On Tue, Jun 10, 2003 at 09:57:57AM -0400, Richard B. Johnson wrote:
> > > With 32 bit return values, ix86 Linux has a file-size limitation
> > > which is currently about 0x7fffffff. Unfortunately, instead of
> > > returning from a write() with a -1 and errno being set, so that
> > > a program can do something about it, write() executes a signal(25)
> > > which kills the task even if trapped. Is this one of those <expletive
> > > deleted> POSIX requirements or is somebody going to fix it?
> >
> > http://www.sas.com/standards/large.file/
> >
> > #define SIGXFSZ 25 /* File size limit exceeded (4.2 BSD). */
> >
> > from fs/buffer.c:
> >
> > err = -EFBIG;
> > limit = current->rlim[RLIMIT_FSIZE].rlim_cur;
> > if (limit != RLIM_INFINITY && size > (loff_t)limit) {
> > send_sig(SIGXFSZ, current, 0);
> > goto out;
> > }
> > if (size > inode->i_sb->s_maxbytes)
> > goto out;
> >
> >
>
> On the system that fails, there are no ulimits and it's the root
> account, therefore I don't know how to set the above limit to
> RLIM_INFINITY (~0LU). It's also version 2.4.20. I don't think
> it has anything to do with 'rlim' shown above. In any event
> sending a signal when the file-size exceeds some level is preposterous.
> The write should return -1 and errno should have been set to EFBIG
> (in user space). That allows the user's database to create another
> file and keep on trucking instead of blowing up and destroying the
> user's inventory or whatever else was in process.
>
> FYI, this caused the failure of a samba server for M$ stuff. It
> gives the impression of Linux being defective. This is not good.
If your application is not compiled with O_LARGEFILE, you will also
get SIGXFSZ if you try to write past the 2GB limit. This is to avoid
your application corrupting data by trying to store a 64-bit file
size in an (apparently) 32-bit data value (32-bit because you didn't
specify O_LARGEFILE).
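
One common way to get large-file behaviour on a 32-bit build with glibc is to define _FILE_OFFSET_BITS as 64 before any header is included, which makes off_t 64 bits wide and lets open() request large-file semantics without naming O_LARGEFILE explicitly. The sketch below assumes glibc on Linux; seek_past_2gb and the scratch path passed to it are hypothetical.

```c
#define _FILE_OFFSET_BITS 64   /* must come before every #include */

#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create a scratch file, seek to 3 GiB (beyond
 * 0x7fffffff) and return the resulting offset, or -1 on failure.
 * The scratch file is removed again before returning. */
long long seek_past_2gb(const char *path)
{
    const off_t target = (off_t)3 << 30;   /* needs a 64-bit off_t */
    off_t pos;
    int fd = open(path, O_WRONLY | O_CREAT, 0600);

    if (fd < 0)
        return -1;
    pos = lseek(fd, target, SEEK_SET);     /* no data is written */
    close(fd);
    unlink(path);
    return (long long)pos;
}
```

On a 64-bit system off_t is already 64 bits wide and the define changes nothing; on 32-bit x86 it is what separates a 2GB ceiling from a usable large offset.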
I don't see anything in signal(7) which says that SIGXFSZ(25) can't be
caught and handled by the application, but at that point you may as
well just fix the app to just open the file with O_LARGEFILE and handle
64-bit file offsets properly.
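
The behaviour described here can be tried without a 2GB file by lowering RLIMIT_FSIZE to a few bytes. This is a minimal sketch, assuming a current Linux or other POSIX system where write() fails with EFBIG once SIGXFSZ is ignored; it makes no claim about what 2.4.20 does. The name write_past_limit and its parameters are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/resource.h>
#include <unistd.h>

/* Hypothetical helper: temporarily set the soft file-size limit to
 * `limit` bytes, ignore SIGXFSZ, and try to write `total` bytes to a
 * scratch file. Returns the errno of the first failing call, or 0 if
 * every write succeeded. `limit` must not exceed the hard limit. */
int write_past_limit(const char *path, rlim_t limit, size_t total)
{
    struct rlimit saved, lowered;
    void (*old_handler)(int);
    char buf[512];
    size_t done = 0;
    int fd, err = 0;

    if (getrlimit(RLIMIT_FSIZE, &saved) != 0)
        return errno;
    lowered = saved;
    lowered.rlim_cur = limit;               /* soft limit only */
    if (setrlimit(RLIMIT_FSIZE, &lowered) != 0)
        return errno;
    old_handler = signal(SIGXFSZ, SIG_IGN); /* default action would kill us */

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) {
        err = errno;
    } else {
        memset(buf, 'x', sizeof buf);
        while (done < total) {
            ssize_t n = write(fd, buf, sizeof buf);
            if (n < 0) {                    /* expected: EFBIG at the limit */
                err = errno;
                break;
            }
            done += (size_t)n;
        }
        close(fd);
        unlink(path);
    }

    signal(SIGXFSZ, old_handler);           /* restore previous state */
    setrlimit(RLIMIT_FSIZE, &saved);
    return err;
}
```

With a 1024-byte limit and 4096 bytes requested, the first two 512-byte writes fit and the third is the one that fails; the caller sees an ordinary error code and can close the file or switch to another one.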
Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Large files
2003-06-10 15:12 ` Richard B. Johnson
2003-06-10 17:25 ` Martin Mares
2003-06-10 18:14 ` Andreas Dilger
@ 2003-06-10 22:12 ` Rob Landley
2 siblings, 0 replies; 19+ messages in thread
From: Rob Landley @ 2003-06-10 22:12 UTC (permalink / raw)
To: root, Matti Aarnio; +Cc: Linux kernel
On Tuesday 10 June 2003 11:12, Richard B. Johnson wrote:
> On Tue, 10 Jun 2003, Matti Aarnio wrote:
> > On Tue, Jun 10, 2003 at 09:57:57AM -0400, Richard B. Johnson wrote:
> > > With 32 bit return values, ix86 Linux has a file-size limitation
> > > which is currently about 0x7fffffff. Unfortunately, instead of
> > > returning from a write() with a -1 and errno being set, so that
> > > a program can do something about it, write() executes a signal(25)
> > > which kills the task even if trapped. Is this one of those <expletive
> > > deleted> POSIX requirements or is somebody going to fix it?
> >
> > http://www.sas.com/standards/large.file/
Is anybody indexing these suckers? I've got a directory full of downloaded
PDFs of things like the el-torito spec and bits of posix and sus, and I was
just wondering if there's some kind of master list of all these things that
Linux actually implements.
I suspect the answer is "probably not", but I thought I'd ask...
Rob
^ permalink raw reply [flat|nested] 19+ messages in thread
* RE: Large files
2003-06-10 13:57 Richard B. Johnson
2003-06-10 14:16 ` ZCane, Ed (Test Purposes)
2003-06-10 14:17 ` Matti Aarnio
@ 2003-06-10 20:14 ` David Schwartz
2003-06-10 20:31 ` Richard B. Johnson
2 siblings, 1 reply; 19+ messages in thread
From: David Schwartz @ 2003-06-10 20:14 UTC (permalink / raw)
To: root, Linux kernel
> With 32 bit return values, ix86 Linux has a file-size limitation
> which is currently about 0x7fffffff. Unfortunately, instead of
> returning from a write() with a -1 and errno being set, so that
> a program can do something about it, write() executes a signal(25)
> which kills the task even if trapped. Is this one of those <expletive
> deleted> POSIX requirements or is somebody going to fix it?
If the program were smart enough to do something sane about it, it should
be smart enough to handle the signal correctly. What do you think should
happen if a program compiled today calls 'time' in 2039? You want to shut
down the program as quickly as possible before it does something insane.
DS
^ permalink raw reply [flat|nested] 19+ messages in thread
* RE: Large files
2003-06-10 20:14 ` David Schwartz
@ 2003-06-10 20:31 ` Richard B. Johnson
0 siblings, 0 replies; 19+ messages in thread
From: Richard B. Johnson @ 2003-06-10 20:31 UTC (permalink / raw)
To: David Schwartz; +Cc: Linux kernel
On Tue, 10 Jun 2003, David Schwartz wrote:
>
> > With 32 bit return values, ix86 Linux has a file-size limitation
> > which is currently about 0x7fffffff. Unfortunately, instead of
> > returning from a write() with a -1 and errno being set, so that
> > a program can do something about it, write() executes a signal(25)
> > which kills the task even if trapped. Is this one of those <expletive
> > deleted> POSIX requirements or is somebody going to fix it?
>
> If the program were smart enough to do something sane about it, it should
> be smart enough to handle the signal correctly. What do you think should
> happen if a program compiled today calls 'time' in 2039? You want to shut
> down the program as quickly as possible before it does something insane.
>
> DS
>
A trap on that signal doesn't even allow a longjmp() to recover!
The signal can be trapped, but the kernel kills the task anyway.
All you can do is make the program print something other than
the "File too large" default. It's sick, very sick. The file-too-
big problem should have been handled properly by the kernel. The
kernel has no business making a policy decision. If the file is
getting too big, the kernel should fail to write any more than
the maximum allowable and return the correct information in the
defined API. It must not make a policy decision and kill the
task.
This has far-reaching consequences.
Even opening the file with large file attributes can result in
the file getting too large eventually. The kernel must not blow
away a task because it "thinks" something. It is not allowed to
"think". It is not allowed to generate policy.
Cheers,
Dick Johnson
Penguin : Linux version 2.4.20 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.
^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2004-05-18 15:40 UTC | newest]
Thread overview: 19+ messages
2004-05-17 19:48 large files Bernd Schubert
2004-05-17 20:12 ` Chris Mason
2004-05-17 20:25 ` Bernd Schubert
2004-05-18 13:42 ` Bernd Schubert
2004-05-18 13:57 ` Chris Mason
2004-05-18 14:49 ` Bernd Schubert
2004-05-18 15:07 ` Chris Mason
2004-05-18 15:19 ` Hans Reiser
2004-05-18 15:40 ` Chris Mason
-- strict thread matches above, loose matches on Subject: below --
2003-06-10 22:38 Large files Ray Lee
2003-06-10 13:57 Richard B. Johnson
2003-06-10 14:16 ` ZCane, Ed (Test Purposes)
2003-06-10 14:17 ` Matti Aarnio
2003-06-10 15:12 ` Richard B. Johnson
2003-06-10 17:25 ` Martin Mares
2003-06-10 18:14 ` Andreas Dilger
2003-06-10 22:12 ` Rob Landley
2003-06-10 20:14 ` David Schwartz
2003-06-10 20:31 ` Richard B. Johnson