* [mlmmj] read(2) syscall bloat
From: Moritz Wilhelmy @ 2011-09-05 11:56 UTC
  To: mlmmj

Hello,

mlmmj currently does a read(2) system call for every single byte it
reads from a file descriptor. This is unnecessarily inefficient and
slow. 

Strace output is similar to the following:
open("/var/spool/mlmmj/foo/control/listaddress", O_RDONLY) = 4
read(4, "f", 1)                         = 1
read(4, "o", 1)                         = 1
read(4, "o", 1)                         = 1
read(4, "@", 1)                         = 1
read(4, "l", 1)                         = 1
read(4, "i", 1)                         = 1
read(4, "s", 1)                         = 1
read(4, "t", 1)                         = 1
read(4, "s", 1)                         = 1
read(4, ".", 1)                         = 1
read(4, "e", 1)                         = 1
read(4, "x", 1)                         = 1
read(4, "a", 1)                         = 1
read(4, "m", 1)                         = 1
read(4, "p", 1)                         = 1
read(4, "l", 1)                         = 1
read(4, "e", 1)                         = 1
read(4, ".", 1)                         = 1
read(4, "c", 1)                         = 1
read(4, "o", 1)                         = 1
read(4, "m", 1)                         = 1
read(4, "\n", 1)                        = 1
close(4)                                = 0

Given that there is a getline(3) function in POSIX.1-2008, shouldn't it
be possible to retire mygetline?
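
For illustration, a minimal sketch of what reading a one-line control file through stdio and getline(3) could look like. The helper name read_first_line is hypothetical and not part of mlmmj, and the sketch assumes a libc that provides POSIX.1-2008 getline().

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not mlmmj code): return the first line of `path`
 * as a malloc'd string, including the trailing newline if there is one,
 * or NULL on error or if the file is empty. stdio fills its buffer with
 * large read(2) calls, so a short file no longer costs one syscall per
 * byte.
 */
char *read_first_line(const char *path)
{
    FILE *f = fopen(path, "r");
    char *line = NULL;
    size_t cap = 0;
    ssize_t len;

    if (f == NULL)
        return NULL;

    len = getline(&line, &cap, f);
    fclose(f);

    if (len < 0) {
        free(line);     /* getline() may have allocated even on failure */
        return NULL;
    }
    return line;        /* caller frees */
}
```

With a file like the listaddress example above, the 22 single-byte reads in the trace would be expected to collapse into a small number of larger ones; the exact count depends on the libc.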

I've previously posted this issue to the musl mailing list [1], which
has an "anti-bloat side project", but I've been putting the mail to this
list off.

I don't see where any of Rich's arguments from [2] apply. Can anyone
please explain why it was done this way in the first place?

[1] http://www.openwall.com/lists/musl/2011/08/16/8
[2] http://www.openwall.com/lists/musl/2011/08/16/11

	Moritz



* Re: [mlmmj] read(2) syscall bloat
From: Lukas Fleischer @ 2011-09-05 12:30 UTC
  To: mlmmj

On Mon, Sep 05, 2011 at 01:56:03PM +0200, Moritz Wilhelmy wrote:
> Hello,
> 
> mlmmj currently does a read(2) system call for every single byte it
> reads from a file descriptor. This is unnecessarily inefficient and
> slow. 
> 
> Strace output is similar to the following:
> open("/var/spool/mlmmj/foo/control/listaddress", O_RDONLY) = 4
> read(4, "f", 1)                         = 1
> read(4, "o", 1)                         = 1
> read(4, "o", 1)                         = 1
> read(4, "@", 1)                         = 1
> read(4, "l", 1)                         = 1
> read(4, "i", 1)                         = 1
> read(4, "s", 1)                         = 1
> read(4, "t", 1)                         = 1
> read(4, "s", 1)                         = 1
> read(4, ".", 1)                         = 1
> read(4, "e", 1)                         = 1
> read(4, "x", 1)                         = 1
> read(4, "a", 1)                         = 1
> read(4, "m", 1)                         = 1
> read(4, "p", 1)                         = 1
> read(4, "l", 1)                         = 1
> read(4, "e", 1)                         = 1
> read(4, ".", 1)                         = 1
> read(4, "c", 1)                         = 1
> read(4, "o", 1)                         = 1
> read(4, "m", 1)                         = 1
> read(4, "\n", 1)                        = 1
> close(4)                                = 0
> 
> Given that there is a getline(3) function in POSIX.1-2008, shouldn't it
> be possible to retire mygetline?
> 
> I've previously posted this issue to the musl mailing list [1], which
> has an "anti-bloat side project", but I've been putting the mail to this
> list off.
> 
> I don't see where any of Rich's arguments from [2] apply. Can anyone
> please explain why it was done this way in the first place?

Not sure why it was done like that. Anyway, this shouldn't be too hard
to fix: we already use our own getline() implementation (see
"mygetline.c"), so we could either add our own buffering to it or switch
to some stream implementation here.

I can have a look at that around next week if no one else wants to.

> 
> [1] http://www.openwall.com/lists/musl/2011/08/16/8
> [2] http://www.openwall.com/lists/musl/2011/08/16/11
> 
> 	Moritz



* Re: [mlmmj] read(2) syscall bloat
From: Ben Schmidt @ 2011-09-05 12:34 UTC
  To: mlmmj

On 5/09/11 9:56 PM, Moritz Wilhelmy wrote:
> mlmmj currently does a read(2) system call for every single byte it
> reads from a file descriptor. This is unnecessarily inefficient and
> slow.

Mmm. There've gotta be a lot of context switches happening there....

> Strace output is similar to the following:
> open("/var/spool/mlmmj/foo/control/listaddress", O_RDONLY) = 4
> read(4, "f", 1)                         = 1
> read(4, "o", 1)                         = 1
> read(4, "o", 1)                         = 1
[...]
> read(4, "\n", 1)                        = 1
> close(4)                                = 0
>
> Given that there is a getline(3) function in POSIX.1-2008, shouldn't it
> be possible to retire mygetline?

Not if getline() is new as of 2008; there are a lot of systems older
than that around, and since Mlmmj is so nice and slim, it is an ideal
candidate for running on older systems. I don't want to compromise that.

> I've previously posted this issue to the musl mailing list [1], which
> has an "anti-bloat side project", but I've been putting the mail to this
> list off.
>
> I don't see where any of Rich's arguments from [2] apply.

He's just pointing out that you can't reimplement mygetline() to read in
larger chunks without some kind of buffering. This is because reading a
larger chunk might read past end-of-line. If it does, then you have to
rewind the stream (not always possible) or buffer the extra output so
that the next call to mygetline() can use it.
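
The second option, keeping the extra bytes for the next call, can be sketched roughly as follows. This is only an illustration under stated assumptions: struct linereader, linereader_init() and linereader_next() are hypothetical names that do not exist in mlmmj, and the 4096-byte buffer size is arbitrary.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical per-descriptor reader state; mlmmj has no such type. */
struct linereader {
    int fd;
    size_t start;       /* offset of the first unconsumed byte in buf */
    size_t end;         /* offset one past the last valid byte in buf */
    char buf[4096];
};

void linereader_init(struct linereader *lr, int fd)
{
    lr->fd = fd;
    lr->start = 0;
    lr->end = 0;
}

/*
 * Return the next line (including '\n' if present) as a malloc'd,
 * NUL-terminated string, or NULL at end of file or on error.
 */
char *linereader_next(struct linereader *lr)
{
    char *line = NULL;
    size_t len = 0;

    for (;;) {
        if (lr->start == lr->end) {
            /* Buffer is empty: refill it with one large read(2). */
            ssize_t n = read(lr->fd, lr->buf, sizeof lr->buf);
            if (n < 0) {
                if (errno == EINTR)
                    continue;
                free(line);
                return NULL;
            }
            if (n == 0)
                return line;    /* EOF: unterminated last line, or NULL */
            lr->start = 0;
            lr->end = (size_t)n;
        }

        size_t avail = lr->end - lr->start;
        char *nl = memchr(lr->buf + lr->start, '\n', avail);
        size_t take = nl ? (size_t)(nl - (lr->buf + lr->start)) + 1 : avail;

        char *tmp = realloc(line, len + take + 1);
        if (tmp == NULL) {
            free(line);
            return NULL;
        }
        line = tmp;
        memcpy(line + len, lr->buf + lr->start, take);
        len += take;
        line[len] = '\0';
        lr->start += take;      /* bytes past the newline stay buffered */

        if (nl)
            return line;
    }
}
```

Because the leftover bytes live in the reader state rather than in the descriptor, this also works on pipes and sockets, where rewinding is not possible. The cost is that every caller has to carry the state around, and nothing else may read from the same descriptor in between.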

> Can anyone please explain why it was done this way in the first place?

Not me.

Maybe we should do some profiling to see if this truly is a bottleneck
or not.

Ben.



> [1] http://www.openwall.com/lists/musl/2011/08/16/8
> [2] http://www.openwall.com/lists/musl/2011/08/16/11



* Re: [mlmmj] read(2) syscall bloat
From: Lukas Fleischer @ 2011-09-05 12:46 UTC
  To: mlmmj

On Mon, Sep 05, 2011 at 10:34:19PM +1000, Ben Schmidt wrote:
> On 5/09/11 9:56 PM, Moritz Wilhelmy wrote:
> >mlmmj currently does a read(2) system call for every single byte it
> >reads from a file descriptor. This is unnecessarily inefficient and
> >slow.
> 
> Mmm. There've gotta be a lot of context switches happening there....
> 
> >Strace output is similar to the following:
> >open("/var/spool/mlmmj/foo/control/listaddress", O_RDONLY) = 4
> >read(4, "f", 1)                         = 1
> >read(4, "o", 1)                         = 1
> >read(4, "o", 1)                         = 1
> [...]
> >read(4, "\n", 1)                        = 1
> >close(4)                                = 0
> >
> >Given that there is a getline(3) function in POSIX.1-2008, shouldn't it
> >be possible to retire mygetline?
> 
> Not if getline() is new as of 2008; there are a lot of systems older
> than that around, and since Mlmmj is so nice and slim, it is an ideal
> candidate for running on older systems. I don't want to compromise that.

Well, if you really care about that, consider using fgets(), which is
even part of C89. Or just use our own buffer implementation.
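
A rough sketch of the fgets() route, assuming the caller already has a FILE*. The function name fgets_line and the initial capacity of 128 are made up for illustration; the code sticks to C89 library calls. One known limitation: a line containing an embedded NUL byte would be cut short, because the length is taken with strlen().

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical C89-only line reader: keeps growing the buffer until
 * fgets() has delivered a complete line. Returns a malloc'd string
 * (with the trailing '\n' if the line had one), or NULL at end of file
 * or on allocation failure.
 */
char *fgets_line(FILE *f)
{
    size_t cap = 128;
    size_t len = 0;
    char *line = malloc(cap);
    char *tmp;

    if (line == NULL)
        return NULL;

    while (fgets(line + len, (int)(cap - len), f) != NULL) {
        len += strlen(line + len);
        if (len > 0 && line[len - 1] == '\n')
            return line;        /* got a complete line */

        /* No newline yet: the buffer was too small, or this is the
         * last line and it lacks a newline. Grow and try again. */
        cap *= 2;
        tmp = realloc(line, cap);
        if (tmp == NULL) {
            free(line);
            return NULL;
        }
        line = tmp;
    }

    if (len == 0) {
        free(line);
        return NULL;            /* nothing read before EOF or error */
    }
    return line;                /* final line without a newline */
}
```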

> 
> >I've previously posted this issue to the musl mailing list [1], which
> >has an "anti-bloat side project", but I've been putting the mail to this
> >list off.
> >
> >I don't see where any of Rich's arguments from [2] apply.
> 
> He's just pointing out that you can't reimplement mygetline() to read in
> larger chunks without some kind of buffering. This is because reading a
> larger chunk might read past end-of-line. If it does, then you have to
> rewind the stream (not always possible) or buffer the extra output so
> that the next call to mygetline() can use it.
> 
> >Can anyone please explain why it was done this way in the first place?
> 
> Not me.
> 
> Maybe we should do some profiling to see if this truly is a bottleneck
> or not.

Agreed, some numbers would be nice. Anyway, this shouldn't be too hard
to implement, and it will certainly improve performance somewhat...

> 
> Ben.
> 
> 
> 
> >[1] http://www.openwall.com/lists/musl/2011/08/16/8
> >[2] http://www.openwall.com/lists/musl/2011/08/16/11



* Re: [mlmmj] read(2) syscall bloat
From: Moritz Wilhelmy @ 2011-09-05 12:57 UTC
  To: mlmmj

On Mon, Sep 05, 2011 at 22:34:19 +1000, Ben Schmidt wrote:
> On 5/09/11 9:56 PM, Moritz Wilhelmy wrote:
> >mlmmj currently does a read(2) system call for every single byte it
> >reads from a file descriptor. This is unnecessarily inefficient and
> >slow.
> 
> Mmm. There've gotta be a lot of context switches happening there....

That's the point :-)

> >Strace output is similar to the following:
> >open("/var/spool/mlmmj/foo/control/listaddress", O_RDONLY) = 4
> >read(4, "f", 1)                         = 1
> >read(4, "o", 1)                         = 1
> >read(4, "o", 1)                         = 1
> [...]
> >read(4, "\n", 1)                        = 1
> >close(4)                                = 0
> >
> >Given that there is a getline(3) function in POSIX.1-2008, shouldn't it
> >be possible to retire mygetline?
> 
> Not if getline() is new as of 2008; there are a lot of systems older
> than that around, and since Mlmmj is so nice and slim, it is an ideal
> candidate for running on older systems. I don't want to compromise that.

It was in glibc long before that and can be implemented in about 50
lines. You could detect whether the libc has a getline function and use
your own otherwise (you do have autotools, after all!)
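
As a sketch of that detect-and-fall-back idea: with autoconf, AC_CHECK_FUNCS([getline]) in configure.ac defines HAVE_GETLINE in config.h when the function exists. The wrapper name compat_getline below is hypothetical, and the fallback is a simplified stand-in written for this example, not the FreeBSD code and not mlmmj's mygetline.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/*
 * HAVE_GETLINE would normally come from config.h (assumption: the build
 * runs AC_CHECK_FUNCS([getline])). It is left undefined here, so the
 * fallback below is what gets compiled.
 */
#ifdef HAVE_GETLINE
#define compat_getline getline
#else
/* Simplified fallback with getline(3)-like semantics on buffered stdio. */
ssize_t compat_getline(char **lineptr, size_t *n, FILE *stream)
{
    size_t len = 0;
    int c;

    if (lineptr == NULL || n == NULL || stream == NULL)
        return -1;
    if (*lineptr == NULL)
        *n = 0;

    while ((c = fgetc(stream)) != EOF) {
        if (len + 2 > *n) {     /* room for this byte and the NUL */
            size_t newcap = *n ? *n * 2 : 128;
            char *tmp = realloc(*lineptr, newcap);
            if (tmp == NULL)
                return -1;
            *lineptr = tmp;
            *n = newcap;
        }
        (*lineptr)[len++] = (char)c;
        if (c == '\n')
            break;
    }

    if (len == 0)
        return -1;              /* EOF or error before any byte */
    (*lineptr)[len] = '\0';
    return (ssize_t)len;
}
#endif
```

Callers would then use compat_getline() everywhere and never need to know which implementation was selected at configure time.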

You could copy the FreeBSD implementation of getline/getdelim with small
changes, which is (obviously) BSD licensed. It doesn't look too specific
to BSD stdio. I've seen some kind of getline.c floating around in many
projects for many years, before it was finally put into the standard.

http://www.freebsd.org/cgi/cvsweb.cgi/src/lib/libc/stdio/getline.c?rev=1.1.2.1.6.1;content-type=text%2Fplain
http://www.freebsd.org/cgi/cvsweb.cgi/src/lib/libc/stdio/getdelim.c?rev=1.2.2.2.4.1;content-type=text%2Fplain

It would require switching to FILE*s, but I see very little reason not
to do just that for local files.
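
Code that already holds a descriptor would not necessarily have to change how it opens files; fdopen(3) can wrap an existing fd. A small sketch, where count_lines_fd is a hypothetical example function and not something from mlmmj:

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical example: count the lines readable from an already-open
 * descriptor by wrapping it in a FILE*. The function takes ownership of
 * fd, since fclose() also closes the underlying descriptor.
 * Returns -1 if the descriptor cannot be wrapped.
 */
long count_lines_fd(int fd)
{
    FILE *f = fdopen(fd, "r");
    char *line = NULL;
    size_t cap = 0;
    long count = 0;

    if (f == NULL) {
        close(fd);
        return -1;
    }

    while (getline(&line, &cap, f) != -1)
        count++;

    free(line);
    fclose(f);
    return count;
}
```

The thing to watch when converting is ownership: once a descriptor is wrapped, mixing direct read(2) calls on it with stdio reads will lose whatever stdio has already buffered.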

> He's just pointing out that you can't reimplement mygetline() to read in
> larger chunks without some kind of buffering. This is because reading a
> larger chunk might read past end-of-line. If it does, then you have to
> rewind the stream (not always possible) or buffer the extra output so
> that the next call to mygetline() can use it.

Alright, that's actually obvious.

	Moritz



* Re: [mlmmj] read(2) syscall bloat
From: Thomas Goirand @ 2011-09-09  8:24 UTC
  To: mlmmj

On 09/05/2011 08:34 PM, Ben Schmidt wrote:
> Maybe we should do some profiling to see if this truly is a bottleneck
> or not.
> 
> Ben.

Without looking, my bet is that it's NOT a bottleneck. Implementing the
buffering yourself is the kind of thing you do *not* want to do, because
it's error-prone and can very easily lead to buffer overflow issues. At
least take the implementation from something that already exists
(ideally, use a known shared lib). Working on this kind of "by hand"
optimization is, IMHO, a waste of time, considering everything that
happens when receiving a mail.

As a consequence, I vote for using getline() if it's available, and
keeping the old (slower) implementation if it's not.

Thomas

