Re: Never ending stream of bitbake exceptions when the builder runs out of disk space

From: Richard Purdie <richard.purdie@linuxfoundation.org>
To: Patrick Ohly <patrick.ohly@intel.com>,
	Martin Jansa <martin.jansa@gmail.com>
Cc: Patches and discussions about the oe-core layer
	<openembedded-core@lists.openembedded.org>
Subject: Re: Never ending stream of bitbake exceptions when the builder runs out of disk space
Date: Tue, 27 Jun 2017 10:41:26 +0100
Message-ID: <1498556486.3124.8.camel@linuxfoundation.org>
In-Reply-To: <1498550932.7464.29.camel@intel.com>

On Tue, 2017-06-27 at 10:08 +0200, Patrick Ohly wrote:
> On Thu, 2017-06-15 at 08:48 +0200, Martin Jansa wrote:
> > 
> > This issue has existed for a very long time.
> > 
> > 
> > I know that when the builder runs out of disk space there are
> > multiple things which might go wrong (I've seen bad archives on
> > premirrors and bad sstate archives caused by this), so this issue
> > isn't the main problem, but it would still be nice to fail faster.
> > 
> > 
> > In the last build, which ran for some 9 hours, it was building for
> > maybe 2 hours before it ran out of disk space, and this morning
> > there is a 50MB log just from bitbake output stored on the Jenkins
> > master, repeating the following message very quickly:
> > 
> > 
> > # grep -c "Errno 28" consoleText.txt 
> > 42986
> > 
> > 
> > ERROR: Running command [['world'], 'build']
> > Traceback (most recent call last):
> >   File "/home/jenkins/oe/world/shr-core/bitbake/lib/bb/event.py", line 211, in fire(event=<bb.event.HeartbeatEvent object at 0x7fcfed3e96a0>, d=<bb.data_smart.DataSmart object at 0x7fd00330b198>):
> >
> >     >    fire_class_handlers(event, d)
> >          if worker_fire:
> >   File "/home/jenkins/oe/world/shr-core/bitbake/lib/bb/event.py", line 134, in fire_class_handlers(event=<bb.event.HeartbeatEvent object at 0x7fcfed3e96a0>, d=<bb.data_smart.DataSmart object at 0x7fd00330b198>):
> >                          continue
> >     >            execute_handler(name, handler, event, d)
> >
> >   File "/home/jenkins/oe/world/shr-core/bitbake/lib/bb/event.py", line 106, in execute_handler(name='runqueue_stats', handler=<function runqueue_stats at 0x7fd0020c6158>, event=<bb.event.HeartbeatEvent object at 0x7fcfed3e96a0>, d=<bb.data_smart.DataSmart object at 0x7fd00330b198>):
> >          try:
> >     >        ret = handler(event)
> >          except (bb.parse.SkipRecipe, bb.BBHandledException):
> >   File "/home/jenkins/oe/world/shr-core/openembedded-core/meta/classes/buildstats.bbclass", line 212, in runqueue_stats(e=<bb.event.HeartbeatEvent object at 0x7fcfed3e96a0>):
> >              done = isinstance(e, bb.event.BuildCompleted)
> >     >        system_stats.sample(e, force=done)
> >              if done:
> >   File "/home/jenkins/oe/world/shr-core/openembedded-core/meta/lib/buildstats.py", line 148, in SystemStats.sample(event=<bb.event.HeartbeatEvent object at 0x7fcfed3e96a0>, force=False):
> >                                       data +
> >     >                                 b'\n')
> >                  self.last_proc = now
> > OSError: [Errno 28] No space left on device
> > 
> > 
> > It would be better to exit completely when something as bad as
> > Errno 28 happens.
> Do you have BB_DISKMON_DIRS active? Probably yes.
> 
> The reason why it did not trigger here might be that the build ran
> out of disk space so quickly that the disk monitoring had no chance
> to detect the problem before system stat sampling itself started
> failing with the error above.
> 
> System stat sampling and disk monitoring are hooking into the same
> event, so my theory is that once the system stat sampling fails, disk
> monitoring code no longer runs.
> 
> I'm not sure what exactly the right fix is: detect uncaught OSError
> like 28 in the bitbake event loop and abort the build, and/or catch
> the error in buildstats.py and ignore it so that the normal disk
> monitoring can happen?
> 
> I know how to do the latter, but not the former.
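
As a rough illustration of the second option (catching the error in the
sampling code so that disk monitoring can still run), here is a minimal
sketch. It is not the actual buildstats.py code: StatsSampler, its
write callback and full_disk_write are hypothetical names made up for
this example, and it assumes that only ENOSPC should be swallowed while
other I/O errors still propagate.

```python
import errno


class StatsSampler:
    """Hypothetical stand-in for a periodic stats writer (not SystemStats)."""

    def __init__(self, write):
        self._write = write      # callable that persists one sample (bytes)
        self.disabled = False    # set once the disk is known to be full

    def sample(self, data):
        """Try to persist one sample; return True if it was written."""
        if self.disabled:
            return False
        try:
            self._write(data + b"\n")
            return True
        except OSError as exc:
            if exc.errno != errno.ENOSPC:
                raise             # unrelated I/O errors still propagate
            self.disabled = True  # stop sampling so later handlers can run
            return False


def full_disk_write(_data):
    """Simulate a write that fails because the device is full."""
    raise OSError(errno.ENOSPC, "No space left on device")


sampler = StatsSampler(full_disk_write)
print(sampler.sample(b"cpu 1 2 3"), sampler.disabled)
```

With something along these lines, the first failed write turns further
sampling into a no-op instead of raising on every heartbeat, so whatever
else hooks into the same event still gets a chance to run.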


Incidentally, looking at this trace, I think bitbake should drop an
event handler that triggers exceptions in a case like this, to avoid
looping quite so badly. We should probably have a bug for that.
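
To make the idea concrete, a toy sketch of a dispatcher that drops a
handler once it raises might look like the following. This is not
bitbake's bb.event code: EventDispatcher, register, fire and
handler_names are hypothetical names for illustration only, and
catching every Exception is an assumption (the trace shows the real
execute_handler treating bb.parse.SkipRecipe and bb.BBHandledException
specially).

```python
import logging

logger = logging.getLogger("sketch")


class EventDispatcher:
    """Toy dispatcher for illustration; not bitbake's bb.event API."""

    def __init__(self):
        self._handlers = {}  # name -> callable, in registration order

    def register(self, name, handler):
        self._handlers[name] = handler

    def fire(self, event):
        # Iterate over a copy so failing handlers can be removed safely.
        for name, handler in list(self._handlers.items()):
            try:
                handler(event)
            except Exception:
                # Report the failure once, then never call this handler again.
                logger.exception("Handler %s failed; dropping it", name)
                del self._handlers[name]

    def handler_names(self):
        return list(self._handlers)


def failing_stats(_event):
    raise OSError(28, "No space left on device")


seen = []
dispatcher = EventDispatcher()
dispatcher.register("runqueue_stats", failing_stats)
dispatcher.register("disk_monitor", seen.append)
for _ in range(3):
    dispatcher.fire("heartbeat")
print(dispatcher.handler_names(), len(seen))
```

In this sketch the failing handler produces one traceback rather than
one per heartbeat, and the handler registered after it keeps receiving
events.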

Cheers,

Richard

Thread overview: 8+ messages
2017-06-15  6:48 Never ending stream of bitbake exceptions when the builder runs out of disk space Martin Jansa
2017-06-27  8:08 ` Patrick Ohly
2017-06-27  8:12   ` Martin Jansa
2017-06-27  8:25     ` Richard Purdie
2017-06-27  9:21       ` Patrick Ohly
2017-06-27  9:37         ` Richard Purdie
2017-06-27 13:00           ` Martin Jansa
2017-06-27  9:41   ` Richard Purdie [this message]
