* [Cluster-devel] seq_file: Use larger buffer to reduce time traversing lists
@ 2012-06-01 10:39 Steven Whitehouse
From: Steven Whitehouse @ 2012-06-01 10:39 UTC (permalink / raw)
To: cluster-devel.redhat.com
I've just been taking a look at the seq_read() code, since we've noticed
that dumping files with large numbers of records can take some
considerable time. This is due to seq_read() using a buffer which is, at
most, a single page in size, and to its having to find its place again
on every call to seq_read(). That makes it rather inefficient.
As an example, I created a GFS2 filesystem with 100k inodes in it, and
then ran ls -l to get a decent number of cached inodes. This resulted in
there being approx 400k lines in the debugfs file containing GFS2's
glocks. I then timed how long it takes to read this file:
[root@chywoon mnt]# time dd if=/sys/kernel/debug/gfs2/unity\:myfs/glocks
of=/dev/null bs=1M
0+5769 records in
0+5769 records out
23273958 bytes (23 MB) copied, 63.3681 s, 367 kB/s
real 1m3.371s
user 0m0.005s
sys 1m3.317s
I tried this several times and it seems pretty consistent. Then I tried
again with the below patch applied:
[root@chywoon mnt]# time dd if=/sys/kernel/debug/gfs2/unity\:myfs/glocks
of=/dev/null bs=1M
0+23 records in
0+23 records out
23107575 bytes (23 MB) copied, 1.64175 s, 14.1 MB/s
real 0m1.644s
user 0m0.000s
sys 0m1.643s
So that is a speed-up of approx 38x, which isn't too bad, I think. I
noticed that the current code caches the buffer page rather than doing a
free/alloc on each seq_read() call. Since my patch can make that buffer
rather large, I was not sure that this was a good idea in general, as it
could open the door to DoS attacks pinning down large amounts of
kmalloc'ed memory. So my patch frees the buffer on each cycle unless it
is only a single page in size. That way we won't
pin down any more memory than the current code does, even when using a
larger buffer. I suspect that any time spent doing alloc/free will be
more than made up by the reduction in time traversing the records in the
file.
Also, since multi-page kmalloc allocations are likely to fail in low
memory conditions, I've made that failure silent: the code falls back to
a single-page buffer if a larger buffer cannot be allocated.
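In userspace terms, the fallback idea looks like the sketch below (my illustration, not the patch itself: malloc() stands in for kmalloc(), and in the kernel the first attempt uses GFP_KERNEL | __GFP_NOWARN so that no allocation-failure warning is logged):

```c
#include <assert.h>
#include <stdlib.h>

#define SIM_PAGE_SIZE 4096UL	/* assumed example page size */

/* Try the large buffer first, and quietly drop back to a single
 * page if that fails; report the size actually obtained. */
static void *alloc_seq_buf(size_t want, size_t *got)
{
	void *buf = malloc(want);

	if (!buf) {
		want = SIM_PAGE_SIZE;
		buf = malloc(want);
	}
	if (buf)
		*got = want;
	return buf;
}
```

Either way the read path proceeds with whatever buffer it got, which is why a failed large allocation costs only speed, not correctness.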
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
diff --git a/fs/seq_file.c b/fs/seq_file.c
index 0cbd049..feb86f2 100644
--- a/fs/seq_file.c
+++ b/fs/seq_file.c
@@ -153,7 +153,9 @@ ssize_t seq_read(struct file *file, char __user *buf, size_t size, loff_t *ppos)
 	size_t n;
 	void *p;
 	int err = 0;
+	unsigned bsize = clamp(size, PAGE_SIZE, KMALLOC_MAX_SIZE);
+	bsize = round_up(bsize, PAGE_SIZE);
 
 	mutex_lock(&m->lock);
 
 	/*
@@ -169,6 +171,16 @@ ssize_t seq_read(struct file *file, char __user *buf, size_t size, loff_t *ppos)
 	 */
 	m->version = file->f_version;
 
+	if ((m->size < bsize) || !m->buf) {
+		kfree(m->buf);
+		m->size = 0;
+		m->buf = kmalloc(m->size = bsize, GFP_KERNEL|__GFP_NOWARN);
+		if (!m->buf)
+			m->buf = kmalloc(m->size = PAGE_SIZE, GFP_KERNEL);
+		if (!m->buf)
+			goto Enomem;
+	}
+
 	/* Don't assume *ppos is where we left it */
 	if (unlikely(*ppos != m->read_pos)) {
 		while ((err = traverse(m, *ppos)) == -EAGAIN)
@@ -277,6 +289,11 @@ Done:
 		m->read_pos += copied;
 	}
 	file->f_version = m->version;
+	if (m->size > PAGE_SIZE) {
+		kfree(m->buf);
+		m->buf = NULL;
+		m->size = 0;
+	}
 	mutex_unlock(&m->lock);
 	return copied;
 Enomem:
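For reference, the sizing logic at the top of the patch (clamp, then round up to a page multiple) behaves like this userspace sketch. The PAGE_SIZE and KMALLOC_MAX_SIZE values here are assumed example values (4 KB and 4 MB); the real ones depend on the architecture:

```c
#include <assert.h>

#define SIM_PAGE_SIZE	4096UL
#define SIM_KMALLOC_MAX	(4UL << 20)

static unsigned long clamp_ul(unsigned long v, unsigned long lo,
			      unsigned long hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/* Round up to a power-of-two boundary, as the kernel macro does. */
static unsigned long round_up_ul(unsigned long x, unsigned long to)
{
	return (x + to - 1) & ~(to - 1);
}

/* Mirrors: bsize = clamp(size, PAGE_SIZE, KMALLOC_MAX_SIZE);
 *	    bsize = round_up(bsize, PAGE_SIZE); */
static unsigned long buf_size_for_read(unsigned long user_size)
{
	unsigned long bsize = clamp_ul(user_size, SIM_PAGE_SIZE,
				       SIM_KMALLOC_MAX);
	return round_up_ul(bsize, SIM_PAGE_SIZE);
}
```

So a 1 MB dd read gets a 1 MB buffer, a tiny read still gets a full page, and an enormous read is capped at the upper limit.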
^ permalink raw reply related [flat|nested] 8+ messages in thread
* [Cluster-devel] seq_file: Use larger buffer to reduce time traversing lists
@ 2012-06-01 12:24 ` Steven Whitehouse
From: Steven Whitehouse @ 2012-06-01 12:24 UTC (permalink / raw)
To: cluster-devel.redhat.com
Hi,
On Fri, 2012-06-01 at 14:10 +0200, Eric Dumazet wrote:
> On Fri, 2012-06-01 at 11:39 +0100, Steven Whitehouse wrote:
> > I've just been taking a look at the seq_read() code, since we've noticed
> > that dumping files with large numbers of records can take some
> > considerable time. This is due to seq_read() using a buffer which, at
> > most is a single page in size, and that it has to find its place again
> > on every call to seq_read(). That makes it rather inefficient.
> >
> > As an example, I created a GFS2 filesystem with 100k inodes in it, and
> > then ran ls -l to get a decent number of cached inodes. This result in
> > there being approx 400k lines in the debugfs file containing GFS2's
> > glocks. I then timed how long it takes to read this file:
> >
> > [root@chywoon mnt]# time dd if=/sys/kernel/debug/gfs2/unity\:myfs/glocks
> > of=/dev/null bs=1M
> > 0+5769 records in
> > 0+5769 records out
> > 23273958 bytes (23 MB) copied, 63.3681 s, 367 kB/s
>
> What time do you get if you do
>
> time dd if=/sys/kernel/debug/gfs2/unity\:myfs/glocks of=/dev/null bs=4k
>
Yes, you are right that it will still be slow. But I'm not sure what we
can do about that; we have to find our place again on each read call in
any case, I think.
> This patch seems the wrong way to me.
>
> seq_read(size = 1MB) should perform many copy_to_user() calls instead of a single one.
>
> Instead of doing kmalloc(m->size <<= 1, GFP_KERNEL) each time we overflow the buffer,
> we should flush its content to user space.
>
>
>
Yes, I thought of that... there is a potential problem though. Is it
valid, in general, to copy to userspace between ->seq_start() and
->seq_stop()? These might be holding locks which may be required during
a page fault caused by the copy_to_user(). That's not an issue in the
GFS2 case, but I don't know if it is generally ok to do that. I assume
that is why the results are buffered rather than being written directly
to userspace anyway.
As soon as we do a ->seq_stop, we have to go through the whole list
(assuming a list-based data source) again to find our place :(
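The cost of that re-seeking can be made concrete with a toy model (mine, not the kernel code): every read() walks from the head of the list to the current position, so the total link-follows for a full dump grow quadratically as the per-read buffer shrinks.

```c
#include <assert.h>

/* Total link-follow operations to dump 'records' entries when each
 * read() emits 'recs_per_read' records and must first re-seek from
 * the head of the list to the current position. */
static unsigned long long dump_cost(unsigned long records,
				    unsigned long recs_per_read)
{
	unsigned long long steps = 0;
	unsigned long pos;

	for (pos = 0; pos < records; pos += recs_per_read)
		steps += pos;	/* cost of seeking from head to pos */
	return steps;
}
```

With ~400k records and roughly 70 records per 4 KB page, that works out to somewhere around 10^9 link-follows, whereas a buffer large enough to stream many thousands of records per read pays almost none of that, which is consistent with the speed-ups measured above.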
Steve.
* [Cluster-devel] seq_file: Use larger buffer to reduce time traversing lists
@ 2012-06-01 12:26 ` Steven Whitehouse
From: Steven Whitehouse @ 2012-06-01 12:26 UTC (permalink / raw)
To: cluster-devel.redhat.com
Hi,
On Fri, 2012-06-01 at 14:14 +0200, Eric Dumazet wrote:
> On Fri, 2012-06-01 at 14:10 +0200, Eric Dumazet wrote:
> > On Fri, 2012-06-01 at 11:39 +0100, Steven Whitehouse wrote:
> > > I've just been taking a look at the seq_read() code, since we've noticed
> > > that dumping files with large numbers of records can take some
> > > considerable time. This is due to seq_read() using a buffer which, at
> > > most is a single page in size, and that it has to find its place again
> > > on every call to seq_read(). That makes it rather inefficient.
> > >
> > > As an example, I created a GFS2 filesystem with 100k inodes in it, and
> > > then ran ls -l to get a decent number of cached inodes. This result in
> > > there being approx 400k lines in the debugfs file containing GFS2's
> > > glocks. I then timed how long it takes to read this file:
> > >
> > > [root@chywoon mnt]# time dd if=/sys/kernel/debug/gfs2/unity\:myfs/glocks
> > > of=/dev/null bs=1M
> > > 0+5769 records in
> > > 0+5769 records out
> > > 23273958 bytes (23 MB) copied, 63.3681 s, 367 kB/s
> >
> > What time do you get if you do
> >
> > time dd if=/sys/kernel/debug/gfs2/unity\:myfs/glocks of=/dev/null bs=4k
> >
> > This patch seems the wrong way to me.
> >
> > seq_read(size = 1MB) should perform many copy_to_user() calls instead of a single one.
> >
> > Instead of doing kmalloc(m->size <<= 1, GFP_KERNEL) each time we overflow the buffer,
> > we should flush its content to user space.
> >
> >
>
> by the way, is the following command even working ?
>
> time dd if=/sys/kernel/debug/gfs2/unity\:myfs/glocks of=/dev/null bs=16M
>
> I guess not, it probably returns -ENOMEM
>
>
>
Why would it return -ENOMEM? It works for me; at worst it will fall back
to a single-page buffer unless we are really stuck for memory, and in
that case all bets are off,
Steve.
* [Cluster-devel] seq_file: Use larger buffer to reduce time traversing lists
@ 2012-06-01 13:14 ` Steven Whitehouse
From: Steven Whitehouse @ 2012-06-01 13:14 UTC (permalink / raw)
To: cluster-devel.redhat.com
Hi,
On Fri, 2012-06-01 at 14:48 +0200, Eric Dumazet wrote:
> On Fri, 2012-06-01 at 13:24 +0100, Steven Whitehouse wrote:
>
> > Yes, you are right that it will be slow, still. But I'm not sure what we
> > can do about that. We have to find our place again on each read call, in
> > any case I think.
>
> You dont really answer my question, I asked the exact timing...
>
Here it is (with the patch):
[root@chywoon mnt]# time dd if=/sys/kernel/debug/gfs2/unity\:myfs/glocks
of=/dev/null bs=4k
0+5726 records in
0+5726 records out
23107575 bytes (23 MB) copied, 82.3082 s, 281 kB/s
real 1m22.311s
user 0m0.013s
sys 1m22.231s
So that's slow, as promised :-)
> I can't reproduce this slow behavior you have, using /proc/net seq
> files.
>
> Isn't it a problem with this particular file ?
>
Well, yes and no. The problem would affect any file with lots of records
in it, but there may not be many with that number of records. Do any of
your net files have numbers of entries in the region of hundreds of
thousands or more?
> Does it want to output a single record ( m->op->show(m, p) ) much larger
> than 4KB ?
>
No. That appears to work ok, so far as I can tell, anyway. What we have
are lots of relatively short records. Here is an example of a few lines.
Each line starting G: is a new record, so this is 5 calls to ->show():
G: s:SH n:5/1da5e f:Iqob t:SH d:EX/0 a:0 v:0 r:2 m:200
H: s:SH f:EH e:0 p:6577 [(ended)] gfs2_inode_lookup+0x116/0x2d0 [gfs2]
G: s:SH n:2/a852 f:IqLob t:SH d:EX/0 a:0 v:0 r:2 m:200
I: n:9712/43090 t:8 f:0x00 d:0x00000000 s:0
G: s:SH n:2/8bcd f:IqLob t:SH d:EX/0 a:0 v:0 r:2 m:200
I: n:2584/35789 t:8 f:0x00 d:0x00000000 s:0
G: s:SH n:2/1eea7 f:IqLob t:SH d:EX/0 a:0 v:0 r:2 m:200
I: n:58968/126631 t:8 f:0x00 d:0x00000000 s:0
G: s:SH n:2/12fbd f:IqLob t:SH d:EX/0 a:0 v:0 r:2 m:200
I: n:11120/77757 t:8 f:0x00 d:0x00000000 s:0
The key here is that we have a lot of them. My example using just over
400k records is in fact a fairly modest example - it is not unusual to
see millions of records in this file. We use it for debug purposes only,
and this patch was prompted by people reporting it taking a very long
time to dump the file.
The issue is not the time taken to create each record, or to copy the
data, but the time taken to find our place again in the list of glocks
on each call (actually a hash table, but the same thing applies, since
we traverse it as a set of lists).
I don't think there is really much we can easily do in the case of
readers requesting small reads of the file. At least we can make it much
more efficient when they request larger reads though,
Steve.
* [Cluster-devel] seq_file: Use larger buffer to reduce time traversing lists
@ 2012-06-01 14:18 ` Steven Whitehouse
From: Steven Whitehouse @ 2012-06-01 14:18 UTC (permalink / raw)
To: cluster-devel.redhat.com
Hi,
On Fri, 2012-06-01 at 15:27 +0200, Eric Dumazet wrote:
> On Fri, 2012-06-01 at 14:14 +0100, Steven Whitehouse wrote:
>
> > Here it is (with the patch):
> >
> > [root@chywoon mnt]# time dd if=/sys/kernel/debug/gfs2/unity\:myfs/glocks
> > of=/dev/null bs=4k
> > 0+5726 records in
> > 0+5726 records out
> > 23107575 bytes (23 MB) copied, 82.3082 s, 281 kB/s
> >
> > real 1m22.311s
> > user 0m0.013s
> > sys 1m22.231s
> >
> > So thats slow, as promised :-)
> >
> > > I can't reproduce this slow behavior you have, using /proc/net seq
> > > files.
> > >
> > > Isn't it a problem with this particular file ?
> > >
> > Well, yes and no. The problem would affect any file with lots of records
> > in it, but there may not be many with that number of records. Do any of
> > your net files have numbers of entries in the region of hundreds of
> > thousands or more?
> >
> > > Does it want to output a single record ( m->op->show(m, p) ) much larger
> > > than 4KB ?
> > >
> > No. That appears to work ok, so far as I can tell, anyway. What we have
> > are lots of relatively short records. Here is an example of a few lines.
> > Each line starting G: is a new record, so this is 5 calls to ->show():
> >
> > G: s:SH n:5/1da5e f:Iqob t:SH d:EX/0 a:0 v:0 r:2 m:200
> > H: s:SH f:EH e:0 p:6577 [(ended)] gfs2_inode_lookup+0x116/0x2d0 [gfs2]
> > G: s:SH n:2/a852 f:IqLob t:SH d:EX/0 a:0 v:0 r:2 m:200
> > I: n:9712/43090 t:8 f:0x00 d:0x00000000 s:0
> > G: s:SH n:2/8bcd f:IqLob t:SH d:EX/0 a:0 v:0 r:2 m:200
> > I: n:2584/35789 t:8 f:0x00 d:0x00000000 s:0
> > G: s:SH n:2/1eea7 f:IqLob t:SH d:EX/0 a:0 v:0 r:2 m:200
> > I: n:58968/126631 t:8 f:0x00 d:0x00000000 s:0
> > G: s:SH n:2/12fbd f:IqLob t:SH d:EX/0 a:0 v:0 r:2 m:200
> > I: n:11120/77757 t:8 f:0x00 d:0x00000000 s:0
> >
> >
> > The key here is that we have a lot of them. My example using just over
> > 400k records is in fact a fairly modest example - it is not unusual to
> > see millions of records in this file. We use it for debug purposes only,
> > and this patch was prompted by people reporting it taking a very long
> > time to dump the file.
> >
> > The issue is not the time taken to create each record, or to copy the
> > data, but the time taken each time we have to find our place again in
> > the list of glocks (actually a hash table, but same thing applies as we
> > traverse it as a set of lists)
> >
> > I don't think there is really much we can easily do in the case of
> > readers requesting small reads of the file. At least we can make it much
> > more efficient when they request larger reads though,
>
> Issue is your seq_file provider has O(N^2) behavior
>
Well, that is a slight simplification. The provider has O(N) behaviour
with respect to the number of entries when streaming data, and O(N)
behaviour when seeking to a specific file offset. It is the combination
of repeated calls to the seeking function that results in the O(N^2).
> We used to have same issues in network land, and we fixed this some time
> ago, and we only use 4KB as seq_file buffer, not a huge one.
>
> Check commit a8b690f98baf9fb1 ( tcp: Fix slowness in
> read /proc/net/tcp ) for an example
>
So far as I can tell, you are just caching the last hash bucket, but are
still seeking down each hash chain. That will only work well when the
hash buckets are reasonably empty. We can try that in GFS2; however, I'm
not sure that is the whole story.
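The cached-cursor idea behind commit a8b690f98baf9fb1 ("tcp: Fix slowness in read /proc/net/tcp") can be sketched roughly like this (my illustration, not the kernel code; the names are made up): remember where the previous read stopped, and only walk from the head of the table when the requested position does not match the cached one.

```c
#include <assert.h>

struct iter_state {
	unsigned long cached_pos;	/* position after the previous read */
	int valid;
};

/* Number of nodes visited to position the iterator at 'pos'. */
static unsigned long start_cost(const struct iter_state *st,
				unsigned long pos)
{
	if (st->valid && st->cached_pos == pos)
		return 1;	/* sequential resume: effectively O(1) */
	return pos + 1;		/* cold start or random seek: O(pos) */
}

/* Called when a read ends, to record where it stopped. */
static void stop_at(struct iter_state *st, unsigned long pos)
{
	st->cached_pos = pos;
	st->valid = 1;
}
```

This makes a sequential dump O(N) overall regardless of buffer size, but it only helps when the cache actually matches; a long hash chain still has to be walked within a bucket, which is the caveat raised above.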
It appears that /proc/net/unix still suffers from this O(N^2) problem,
even if /proc/net/tcp does not. I did a quick test using a program to
create a bunch of tcp sockets, and with my kernel patch ended up with
these results:
[root@chywoon mnt]# time dd if=/proc/net/tcp of=/dev/null bs=4k
0+1046 records in
0+1046 records out
4236300 bytes (4.2 MB) copied, 1.20739 s, 3.5 MB/s
real 0m1.310s
user 0m0.003s
sys 0m1.220s
[root@chywoon mnt]# time dd if=/proc/net/tcp of=/dev/null bs=1M
0+5 records in
0+5 records out
4236300 bytes (4.2 MB) copied, 0.331139 s, 12.8 MB/s
real 0m0.470s
user 0m0.001s
sys 0m0.374s
So even with the current tcp scheme this appears to speed things up by
nearly 3x. Also that was with only 28000 entries in the file,
Steve.
* [Cluster-devel] seq_file: Use larger buffer to reduce time traversing lists
@ 2012-06-01 15:28 ` Steven Whitehouse
From: Steven Whitehouse @ 2012-06-01 15:28 UTC (permalink / raw)
To: cluster-devel.redhat.com
Hi,
On Fri, 2012-06-01 at 16:57 +0200, Eric Dumazet wrote:
> On Fri, 2012-06-01 at 15:18 +0100, Steven Whitehouse wrote:
> > 0m0.374s
> >
> > So even with the current tcp scheme this appears to speed things up by
> > nearly 3x. Also that was with only 28000 entries in the file,
>
> Initial speedup was 100x, not 3x.
>
According to the patch description, that 100x was with 182k entries, so
that is not comparing like with like, although I accept that it did
provide a really good improvement. I'm not suggesting that we should
have one approach or the other, but that both are worth considering.
I'll certainly have a look at the hash table based approach too.
> Of course using a 32KB buffer instead of 4KB will help.
>
> And If someones need 100.000 active unix sockets and fast /proc/net/udp
> file as well, patch is welcome. If I have time I can do it eventually.
>
> Really, kmalloc(2 MB) is not going to happen, even using __GFP_NOWARN
>
It is designed so that if this allocation fails, we just fall back to
the old slow way, so I'm not sure that this is an issue. It will not
fail to work just because the initial kmalloc fails, so we would be no
worse off than with the current code. We could also trim the top size
limit down to something a bit smaller than KMALLOC_MAX_SIZE and still
get most of the benefit; I just chose that as a convenient upper limit
to show the principle.
Also, I think it wouldn't be unreasonable to argue that if the
probability of a KMALLOC_MAX_SIZE allocation failing is so high that it
is very unlikely to ever succeed, then perhaps KMALLOC_MAX_SIZE is too
large.
So I know that I might not have convinced you :-) but I still think that
something along these lines is worth considering. I had looked at
various other possible ways of achieving a similar effect, but rejected
all of them in the end, as they fell foul of some of the subtleties of
the seq_read() code,
Steve.
* [Cluster-devel] seq_file: Use larger buffer to reduce time traversing lists
@ 2012-06-01 15:30 ` Steven Whitehouse
From: Steven Whitehouse @ 2012-06-01 15:30 UTC (permalink / raw)
To: cluster-devel.redhat.com
Hi,
On Fri, 2012-06-01 at 17:18 +0200, Eric Dumazet wrote:
> On Fri, 2012-06-01 at 17:01 +0200, Eric Dumazet wrote:
>
> > All unix sockets are linked into a single list, no hash table, so its
> > not very easy right now.
> >
>
> Oops, there is a hash table, so patch is coming ;)
>
>
The only reason that I looked at the unix sockets was that Nate Straz,
who originally reported the issue to me had sent me a test program for
unix sockets which he wrote in response to one of your earlier emails.
So I converted it to tcp just to do that quick test,
Steve.
* [Cluster-devel] seq_file: Use larger buffer to reduce time traversing lists
@ 2012-06-01 15:54 ` Joe Perches
From: Joe Perches @ 2012-06-01 15:54 UTC (permalink / raw)
To: cluster-devel.redhat.com
On Fri, 2012-06-01 at 11:39 +0100, Steven Whitehouse wrote:
> I've just been taking a look at the seq_read() code,
[]
nice improvement. trivial style comment:
> diff --git a/fs/seq_file.c b/fs/seq_file.c
[]
> @@ -169,6 +171,16 @@ ssize_t seq_read(struct file *file, char __user *buf, size_t size, loff_t *ppos)
>  	 */
>  	m->version = file->f_version;
> 
> +	if ((m->size < bsize) || !m->buf) {
> +		kfree(m->buf);
> +		m->size = 0;
> +		m->buf = kmalloc(m->size = bsize, GFP_KERNEL|__GFP_NOWARN);
You can remove the m->size = 0 line. And can you please add spaces
around the |? It's a bit dense and too easy to read as a single word.
> +	if (!m->buf)
> +		m->buf = kmalloc(m->size = PAGE_SIZE, GFP_KERNEL);
> +	if (!m->buf)
> +		goto Enomem;
Mixed case label?
> @@ -277,6 +289,11 @@ Done:
>  		m->read_pos += copied;
>  	}
>  	file->f_version = m->version;
> +	if (m->size > PAGE_SIZE) {
> +		kfree(m->buf);
> +		m->buf = NULL;
> +		m->size = 0;
> +	}
>  	mutex_unlock(&m->lock);
>  	return copied;
>  Enomem:
Oh, it wasn't you. Never mind...