* Re: [boot-time]
From: Marko Hoyer @ 2025-01-12 10:11 UTC (permalink / raw)
To: Rob Landley, Bird, Tim, Marko Hoyer, Shankari,
linux-embedded@vger.kernel.org
In-Reply-To: <0947e977-879b-495f-8225-534e38bc034b@landley.net>
Am 12.01.25 um 02:03 schrieb Rob Landley:
> On 1/11/25 12:57, Bird, Tim wrote:
>> Hey Rob, This is a great review of /dev, /sys and the different
>> ways that /dev gets populated.
>
> Feel free to link stuff from wikis or some such. The newest of those
> documents was written in 2007.
>
>> For a lot of embedded Linux devices, the only bus where
>> new items can show up dynamically is USB.
SDCARD readers connected via MMC are common in automtove head units as
well ...
>
> Yup, /sys/bus/usb/devices is in there too and when a driver binds to
> them, they wind up in /sys/block and such as well. (you USED to have
> to seprately mount a usbfs under /sys but they finally acknowledged
> that was silly about 5 years ago, hence
> https://askubuntu.com/questions/1218321/if-usbfs-has-been-deprecated-then-why-is-sys-bus-usb-drivers-usbfs-directory-p)
>
> When a driver DOESN'T automatically bind to them it gets a bit
> complicated, and one of the things mdev can be configured to do is act
> as a firmware loader! Which is just... Ahem, there are YEARS of poor
> design decisions the kernel guys made, where they ignored a mechanism
> they already had an implemented something more complicated. The
> mechanism whereby the kernel opens a firmware file and read it
> directly out of the filesystem instead of calling a hotplug helper
> was... I'm just going to gloss over that.
WIFI & Bluetooth devices often use this firmware mechanism. And yes I
agree, it looks a bit ** ugly** seeing the kernel loading a firmware
file from /lib/firmware searching it in the root file system w/o
knowing the state of it during boot ... For WIFI and bluetooth I do not
see a big issue here since I'd prevent putting such features on a
critical chain by system design in any way since bringing them up and
(re)connecting external devices is time consuming by nature. Nothing you
shall need to wait for ...
> Anyway, I saw patches go by ala
> https://patchwork.ozlabs.org/project/buildroot/patch/1436188175-7912-1-git-send-email-luca@lucaceresoli.net/
> which says it's from 2015. I haven't really tried to do it because I
> often know how the plumbing works and usually just implement a ten
> line hack rather than looking up how to configure the more generic
> tool. (Even the generic tool grew out of something I wrote 10 years
> earlier...)
>
> Anyway, the kernel's "request module" plumbing tends to do a "give me
> usb-vendorID-deviceID thing, and then there's alias plumbing lookup
> that figures out what module name to insmod for that, and at various
> points I've seen said alias lookup plumbing A) in the kernel, B) in
> module headers, C) in modprobe config files under /etc or /lib or
> something.
>
> *shrug* I build static kernels when given a choice, and don't source
> hardware that needs drivers WITHOUT built-in firmware, so I am WAY out
> of date on that stuff.
Compiling in modules vs. loading them later from user space is a
trade-off. The effect of putting stuff into modules is to keep the
kernel small which helps you in the "unpacking & loading kernel" phase
before the kernel is actually started. Having an 1MB unpacked kernel is
significantly a difference to a 5MB one. On the other hand, my
experience is that there is lot of overhead (CPU time and IO) loading
modules from user space. So it really only makes sense, if you have
drivers to load at a point in time during startup where you have enough
time and resources left.
> I remember enough to look it up but not the details off the top of my
> head. And I say that as someone who really SHOULD care more:
>
> http://lists.landley.net/pipermail/toybox-landley.net/2024-October/030549.html
>
>
> It's on the todo list...
>
>> I've always thought the best solution (in terms of boot time)
>> was to use static nodes in /dev during early boot (that is, just
>> mknod the /dev nodes in the rootfs manually, and have them be present
>> before the kernel even runs). No dynamic discovery or boot-time
>> population of /dev needed.
>
> The system knows what devices are available. While you can mknod a
> major:minor node the kernel doesn't have a driver for, if you open it
> you get some sort of -EWTF where the kernel goes "nope".
>
> The kernel has a CONFIG_DEVTMPFS_MOUNT that automatically mounts
> devtmpfs on /dev, but for SOME reason it doesn't apply to initramfs.
> I've been irregularly posting a patch to MAKE it apply on and off for
> most of a decade now:
>
> https://lkml.iu.edu/hypermail/linux/kernel/2005.1/09399.html
>
> And it's part of my mkroot kernel patches:
>
> https://landley.net/bin/mkroot/latest/linux-patches/0003-Wire-up-CONFIG_DEVTMPFS_MOUNT-to-initramfs.patch
>
>
> But Greg KH. Oh well...
>
>> Then, sometime later, use either
>> mdev or devtmpfs to accumulate (and remove?) other
>> runtime-plugged devices. This can be done after the time-critical
>> phase of booting.
>>
>> Does this overall approach work, or is there some in-kernel
>> connections that may be missing if the dynamic tools are
>> not used from startup?
>
> I mean it more or less works, it's just... pointless manual
> maintenance of something the kernel does for you in a very small
> amount of code? (In devtmpfs, the /dev node being there means
> something. In a static /dev, it doesn't.)
I agree. There is kind of dynamic device enumeration done by the kernel
drivers anyway once loaded. Any data structures to devices are build up
internally. Nothing you can save ...
I'm even not sure how devtmpfs can be combined w/ your static devnodes
you created in any kind of persistent partition. And if you even can get
the kernel accepting your partition to use as /dev, you need to have it
writeable for the case of dynamics you might need (usb for instance)
which does not really go well with a read only RFS ... You could ...
overlay fs ... well no, I think this goes into a wrong direction -> too
complicated ;)
>
> So: I blather on a lot about my mkroot project, which is a 400 line
> bash script that builds tiny linux system that boots to shell prompt
> (mostly under qemu) on a dozen different architectures. And the init
> script in that starts here:
>
> https://github.com/landley/toybox/blob/master/mkroot/mkroot.sh#L102
>
> And you see how the script does this setup (which is only needed when
> you don't apply my patch to the kernel):
>
> if ! mountpoint -q dev; then
> mount -t devtmpfs dev dev
> [ $$ -eq 1 ] && ! 2>/dev/null <0 && exec 0<>/dev/console 1>&0 2>&1
> for i in ,fd /0,stdin /1,stdout /2,stderr
> do ln -sf /proc/self/fd${i/,*/} dev/${i/*,/}; done
> mkdir -p dev/shm
> chmod +t /dev/shm
> fi
>
> (Don't ask me why devtmfs doesn't automatically have a "shm" directory
> with the sticky bit, it's a subclass of tmpfs so it does the right
> thing when it's there. And the middle two lines are just making
> /dev/{stdin,stdout,stderr} because toysh doesn't special case
> /dev/stdin like bash does so needs it in the filesystem if you're
> gonna use that.)
>
> But that exec redirect line hit a bug, because when you don't have
> devtmpfs automounting but ALSO don't have /dev/console in your
> statically linked initramfs image... oh here, I explained it when I
> fixed it:
>
> https://github.com/landley/toybox/commit/0b2d5c2bb3f1
>
> https://landley.net/notes-2024.html#06-08-2024
>
> tl;dr in usr/main.c the kernel tries to open /dev/console and fails if
> it's not already there, so PID 1 starts with stdin, stdout, and stderr
> closed, so if something goes wrong in early boot init can't tell you
> why it failed. This ONLY happens with static initramfs (created with
> cpio as a normal user and you can't mknod without root access),
> because of COURSE the kernel has two different codepaths for static vs
> dynamic, and the dynamic one does a manual fixup for this issue. No
> really!
>
> https://github.com/torvalds/linux/blob/master/init/noinitramfs.c#L18
>
> And of course the code doing the fixup runs for ANY system that
> doesn't have a static linked initramfs, including one where the
> bootloader points it at an EXTERNAL cpio.gz image like qemu -initrd
> blah.cpio.gz, because it checks for the external one AFTER doing the
> fixup. (Pop quiz: if you have static _and_ external initrd, and they
> include the same file, does the old one prevent the new one from
> extracting, does the new one replace the old one, or does the new one
> APPEND to the old one? At various points in history, it's done ALL
> THREE! I forget which is current, replace I think?) So external
> initramfs and built-in initramfs behave DIFFERENTLY in a subtle sharp
> edge ath I've complained at them about for YEARS...
>
> *shrug* I make this stuff work in as simple a way as I know how. Been
> doing it for an embarrassingly long time now. But you _reach_ simple
> by process of elimination, and proving a negative is a lot of work.
>
>> Thanks,
>> -- Tim
>>
>
> Rob
>
To summarize from my point of view:
* It's worth talking a bit about the effect of udev and about alternatives
* "mdev" is surely worth being named as an potential option besides
"selective triggering" and "static setup and moving triggers back in time"
* I wouldn't regard mknode as an real alternative in todays system
* In addition I can imagine is "modules loading" vs. "compiling in
drivers" something which is worth mentioning
* Once I've access to the wiki, I can try to put these ideas into an
initial structure filled up w/ info we discussed in this thread
Marko
^ permalink raw reply
* Re: [boot-time]
From: Rob Landley @ 2025-01-12 1:03 UTC (permalink / raw)
To: Bird, Tim, Marko Hoyer, Shankari, linux-embedded@vger.kernel.org
In-Reply-To: <MW5PR13MB56325452C6F217FB9C927653FD1D2@MW5PR13MB5632.namprd13.prod.outlook.com>
On 1/11/25 12:57, Bird, Tim wrote:
> Hey Rob, This is a great review of /dev, /sys and the different
> ways that /dev gets populated.
Feel free to link stuff from wikis or some such. The newest of those
documents was written in 2007.
> For a lot of embedded Linux devices, the only bus where
> new items can show up dynamically is USB.
Yup, /sys/bus/usb/devices is in there too and when a driver binds to
them, they wind up in /sys/block and such as well. (you USED to have to
seprately mount a usbfs under /sys but they finally acknowledged that
was silly about 5 years ago, hence
https://askubuntu.com/questions/1218321/if-usbfs-has-been-deprecated-then-why-is-sys-bus-usb-drivers-usbfs-directory-p)
When a driver DOESN'T automatically bind to them it gets a bit
complicated, and one of the things mdev can be configured to do is act
as a firmware loader! Which is just... Ahem, there are YEARS of poor
design decisions the kernel guys made, where they ignored a mechanism
they already had an implemented something more complicated. The
mechanism whereby the kernel opens a firmware file and read it directly
out of the filesystem instead of calling a hotplug helper was... I'm
just going to gloss over that.
Anyway, I saw patches go by ala
https://patchwork.ozlabs.org/project/buildroot/patch/1436188175-7912-1-git-send-email-luca@lucaceresoli.net/
which says it's from 2015. I haven't really tried to do it because I
often know how the plumbing works and usually just implement a ten line
hack rather than looking up how to configure the more generic tool.
(Even the generic tool grew out of something I wrote 10 years earlier...)
Anyway, the kernel's "request module" plumbing tends to do a "give me
usb-vendorID-deviceID thing, and then there's alias plumbing lookup that
figures out what module name to insmod for that, and at various points
I've seen said alias lookup plumbing A) in the kernel, B) in module
headers, C) in modprobe config files under /etc or /lib or something.
*shrug* I build static kernels when given a choice, and don't source
hardware that needs drivers WITHOUT built-in firmware, so I am WAY out
of date on that stuff. I remember enough to look it up but not the
details off the top of my head. And I say that as someone who really
SHOULD care more:
http://lists.landley.net/pipermail/toybox-landley.net/2024-October/030549.html
It's on the todo list...
> I've always thought the best solution (in terms of boot time)
> was to use static nodes in /dev during early boot (that is, just
> mknod the /dev nodes in the rootfs manually, and have them be present
> before the kernel even runs). No dynamic discovery or boot-time
> population of /dev needed.
The system knows what devices are available. While you can mknod a
major:minor node the kernel doesn't have a driver for, if you open it
you get some sort of -EWTF where the kernel goes "nope".
The kernel has a CONFIG_DEVTMPFS_MOUNT that automatically mounts
devtmpfs on /dev, but for SOME reason it doesn't apply to initramfs.
I've been irregularly posting a patch to MAKE it apply on and off for
most of a decade now:
https://lkml.iu.edu/hypermail/linux/kernel/2005.1/09399.html
And it's part of my mkroot kernel patches:
https://landley.net/bin/mkroot/latest/linux-patches/0003-Wire-up-CONFIG_DEVTMPFS_MOUNT-to-initramfs.patch
But Greg KH. Oh well...
> Then, sometime later, use either
> mdev or devtmpfs to accumulate (and remove?) other
> runtime-plugged devices. This can be done after the time-critical
> phase of booting.
>
> Does this overall approach work, or is there some in-kernel
> connections that may be missing if the dynamic tools are
> not used from startup?
I mean it more or less works, it's just... pointless manual maintenance
of something the kernel does for you in a very small amount of code? (In
devtmpfs, the /dev node being there means something. In a static /dev,
it doesn't.)
So: I blather on a lot about my mkroot project, which is a 400 line bash
script that builds tiny linux system that boots to shell prompt (mostly
under qemu) on a dozen different architectures. And the init script in
that starts here:
https://github.com/landley/toybox/blob/master/mkroot/mkroot.sh#L102
And you see how the script does this setup (which is only needed when
you don't apply my patch to the kernel):
if ! mountpoint -q dev; then
mount -t devtmpfs dev dev
[ $$ -eq 1 ] && ! 2>/dev/null <0 && exec 0<>/dev/console 1>&0 2>&1
for i in ,fd /0,stdin /1,stdout /2,stderr
do ln -sf /proc/self/fd${i/,*/} dev/${i/*,/}; done
mkdir -p dev/shm
chmod +t /dev/shm
fi
(Don't ask me why devtmfs doesn't automatically have a "shm" directory
with the sticky bit, it's a subclass of tmpfs so it does the right thing
when it's there. And the middle two lines are just making
/dev/{stdin,stdout,stderr} because toysh doesn't special case /dev/stdin
like bash does so needs it in the filesystem if you're gonna use that.)
But that exec redirect line hit a bug, because when you don't have
devtmpfs automounting but ALSO don't have /dev/console in your
statically linked initramfs image... oh here, I explained it when I
fixed it:
https://github.com/landley/toybox/commit/0b2d5c2bb3f1
https://landley.net/notes-2024.html#06-08-2024
tl;dr in usr/main.c the kernel tries to open /dev/console and fails if
it's not already there, so PID 1 starts with stdin, stdout, and stderr
closed, so if something goes wrong in early boot init can't tell you why
it failed. This ONLY happens with static initramfs (created with cpio as
a normal user and you can't mknod without root access), because of
COURSE the kernel has two different codepaths for static vs dynamic, and
the dynamic one does a manual fixup for this issue. No really!
https://github.com/torvalds/linux/blob/master/init/noinitramfs.c#L18
And of course the code doing the fixup runs for ANY system that doesn't
have a static linked initramfs, including one where the bootloader
points it at an EXTERNAL cpio.gz image like qemu -initrd blah.cpio.gz,
because it checks for the external one AFTER doing the fixup. (Pop quiz:
if you have static _and_ external initrd, and they include the same
file, does the old one prevent the new one from extracting, does the new
one replace the old one, or does the new one APPEND to the old one? At
various points in history, it's done ALL THREE! I forget which is
current, replace I think?) So external initramfs and built-in initramfs
behave DIFFERENTLY in a subtle sharp edge ath I've complained at them
about for YEARS...
*shrug* I make this stuff work in as simple a way as I know how. Been
doing it for an embarrassingly long time now. But you _reach_ simple by
process of elimination, and proving a negative is a lot of work.
> Thanks,
> -- Tim
>
Rob
^ permalink raw reply
* RE: [boot-time]
From: Bird, Tim @ 2025-01-11 18:57 UTC (permalink / raw)
To: Rob Landley, Marko Hoyer, Shankari,
linux-embedded@vger.kernel.org
In-Reply-To: <98ae21b7-43bb-4a72-bd50-0b13f9700734@landley.net>
> -----Original Message-----
> From: Rob Landley <rob@landley.net>
>
> On 1/11/25 02: 40, Marko Hoyer wrote: > > Am 11. 01. 25 um 00: 15 schrieb Rob Landley: >> On 1/10/25 16: 46, Marko Hoyer wrote: >>>
> So I think it is worth talking a bit about udev and options to deal >>> with it but
> On 1/11/25 02:40, Marko Hoyer wrote:
> >
> > Am 11.01.25 um 00:15 schrieb Rob Landley:
> >> On 1/10/25 16:46, Marko Hoyer wrote:
> >>> So I think it is worth talking a bit about udev and options to deal
> >>> with it but adapting thinks a bit to todays world. I'm currently
> >>> registering for the wiki, maybe I can setup an initial page at some
> >>> time ...
> >>
> >> busybox mdev is a lot lighter weight, and can do pretty elaborate things.
> >>
> >> https://wiki.alpinelinux.org/wiki/Mdev
> >>
> >> https://github.com/fff7d1bc/mdev-like-a-boss
> >>
> >> The theory these days is you mount devtmpfs and then use mdev to add
> >> scriptable behavior to device insertion/removal events via netlink
> >> notifications.
> >>
> >> Rob
> >
> > Hey Rob,
> >
> > thx for the hint. Sounds good!
> >
> > How is the enumeration of cold plugged devices realized in mdev? Is it
> > similar to udev triggering all devices in the complete device tree?
>
> mdev -s will scan /sys for current devices, it can be run as a hotplug
> helper, and there's netlink support somewhere but I've never used it.
>
> https://busybox.net/downloads/BusyBox.html#mdev
>
> The hotplug helper doesn't require a persistent demon like netlink does
> (or udev), at boot time you "mdev -s" to scan, then when you register a
> hotplug helper the kernel spawns a new process with environment
> variables whenever there's something new to do. The downside is if a lot
> of events come in rapidly it can spawn a lot of processes in parallel
> which makes sequencing difficult, which is why the netlink API exists as
> an alternative, but that doesn't really happen in systems I've put
> together, so...
>
> I wrote some introductory documentation about this back in 2007. It's a
> bit stale, and never REALLY got finished, but...
>
> https://landley.net/kdocs/local/hotplug2.html
>
> That's the context within which sysfs happened.
>
> /dev and /sys serve different purposes: /dev shows the device drivers'
> view of the system, full of devices that don't actually exist like
> /dev/null, or five devices for one piece of hardware (partitions),
> meanwhile a device that shows up but doesn't have a driver bound to it
> yet won't be in /dev at all. This is half the reason the old
> demon-managed "devfs" failed, it was CONCEPTUALLY wrong. (The other half
> was it used crazy solaris names for everything so people looking for
> /dev/hda1 couldn't find it and had to deal with some 9 character long
> monstrosity instead. Plus Linux isn't a microkernel so expecting a
> userspace demon to be necessary for /dev to _exist_ was just silly, and
> also led to some problems booting the system because were does that
> demon get its information from, eh?)
>
> sysfs is a hardware view of the system, where /sys/devices/pci0000:00 is
> full of what bus probing found, and /sys/block/sda/dev contains "8:0"
> (major and minor number) when a driver binds to something and goes
> "mine", but something still had to mknod that.
>
> devtmpfs is a synthetic filesystem that just DOES that, when a new "dev"
> node shows up under /sys/class or /sys/block it creates the apropriate
> char or block device under dev with that major/minor and the same name
> the directory uses (which is provided by the driver).
>
> Oh, "synthetic" filesystem is one of the four times of filesystem: block
> backed, char/pipe backed, ram backed, and synthetic. I wrote
> documentation about that a very long time ago...
>
> https://landley.net/toybox/doc/mount.html
Hey Rob, This is a great review of /dev, /sys and the different
ways that /dev gets populated.
For a lot of embedded Linux devices, the only bus where
new items can show up dynamically is USB.
I've always thought the best solution (in terms of boot time)
was to use static nodes in /dev during early boot (that is, just
mknod the /dev nodes in the rootfs manually, and have them be present
before the kernel even runs). No dynamic discovery or boot-time
population of /dev needed. Then, sometime later, use either
mdev or devtmpfs to accumulate (and remove?) other
runtime-plugged devices. This can be done after the time-critical
phase of booting.
Does this overall approach work, or is there some in-kernel
connections that may be missing if the dynamic tools are
not used from startup?
Thanks,
-- Tim
^ permalink raw reply
* Re: [boot-time]
From: Rob Landley @ 2025-01-11 17:56 UTC (permalink / raw)
To: Marko Hoyer, Bird, Tim, Shankari, linux-embedded@vger.kernel.org
In-Reply-To: <1b6e380d-000d-420a-b198-ecb68c356d34@freenet.de>
On 1/11/25 02:40, Marko Hoyer wrote:
>
> Am 11.01.25 um 00:15 schrieb Rob Landley:
>> On 1/10/25 16:46, Marko Hoyer wrote:
>>> So I think it is worth talking a bit about udev and options to deal
>>> with it but adapting thinks a bit to todays world. I'm currently
>>> registering for the wiki, maybe I can setup an initial page at some
>>> time ...
>>
>> busybox mdev is a lot lighter weight, and can do pretty elaborate things.
>>
>> https://wiki.alpinelinux.org/wiki/Mdev
>>
>> https://github.com/fff7d1bc/mdev-like-a-boss
>>
>> The theory these days is you mount devtmpfs and then use mdev to add
>> scriptable behavior to device insertion/removal events via netlink
>> notifications.
>>
>> Rob
>
> Hey Rob,
>
> thx for the hint. Sounds good!
>
> How is the enumeration of cold plugged devices realized in mdev? Is it
> similar to udev triggering all devices in the complete device tree?
mdev -s will scan /sys for current devices, it can be run as a hotplug
helper, and there's netlink support somewhere but I've never used it.
https://busybox.net/downloads/BusyBox.html#mdev
The hotplug helper doesn't require a persistent demon like netlink does
(or udev), at boot time you "mdev -s" to scan, then when you register a
hotplug helper the kernel spawns a new process with environment
variables whenever there's something new to do. The downside is if a lot
of events come in rapidly it can spawn a lot of processes in parallel
which makes sequencing difficult, which is why the netlink API exists as
an alternative, but that doesn't really happen in systems I've put
together, so...
I wrote some introductory documentation about this back in 2007. It's a
bit stale, and never REALLY got finished, but...
https://landley.net/kdocs/local/hotplug2.html
That's the context within which sysfs happened.
/dev and /sys serve different purposes: /dev shows the device drivers'
view of the system, full of devices that don't actually exist like
/dev/null, or five devices for one piece of hardware (partitions),
meanwhile a device that shows up but doesn't have a driver bound to it
yet won't be in /dev at all. This is half the reason the old
demon-managed "devfs" failed, it was CONCEPTUALLY wrong. (The other half
was it used crazy solaris names for everything so people looking for
/dev/hda1 couldn't find it and had to deal with some 9 character long
monstrosity instead. Plus Linux isn't a microkernel so expecting a
userspace demon to be necessary for /dev to _exist_ was just silly, and
also led to some problems booting the system because were does that
demon get its information from, eh?)
sysfs is a hardware view of the system, where /sys/devices/pci0000:00 is
full of what bus probing found, and /sys/block/sda/dev contains "8:0"
(major and minor number) when a driver binds to something and goes
"mine", but something still had to mknod that.
devtmpfs is a synthetic filesystem that just DOES that, when a new "dev"
node shows up under /sys/class or /sys/block it creates the apropriate
char or block device under dev with that major/minor and the same name
the directory uses (which is provided by the driver).
Oh, "synthetic" filesystem is one of the four times of filesystem: block
backed, char/pipe backed, ram backed, and synthetic. I wrote
documentation about that a very long time ago...
https://landley.net/toybox/doc/mount.html
Rob
^ permalink raw reply
* Re: [boot-time]
From: Marko Hoyer @ 2025-01-11 8:40 UTC (permalink / raw)
To: Rob Landley, Marko Hoyer, Bird, Tim, Shankari,
linux-embedded@vger.kernel.org
In-Reply-To: <c70841da-3fe6-4b58-a24f-579dc3cc5dbc@landley.net>
Am 11.01.25 um 00:15 schrieb Rob Landley:
> On 1/10/25 16:46, Marko Hoyer wrote:
>> So I think it is worth talking a bit about udev and options to deal
>> with it but adapting thinks a bit to todays world. I'm currently
>> registering for the wiki, maybe I can setup an initial page at some
>> time ...
>
> busybox mdev is a lot lighter weight, and can do pretty elaborate things.
>
> https://wiki.alpinelinux.org/wiki/Mdev
>
> https://github.com/fff7d1bc/mdev-like-a-boss
>
> The theory these days is you mount devtmpfs and then use mdev to add
> scriptable behavior to device insertion/removal events via netlink
> notifications.
>
> Rob
Hey Rob,
thx for the hint. Sounds good!
How is the enumeration of cold plugged devices realized in mdev? Is it
similar to udev triggering all devices in the complete device tree?
Marko
^ permalink raw reply
* Re: [boot-time]
From: Rob Landley @ 2025-01-10 23:15 UTC (permalink / raw)
To: Marko Hoyer, Bird, Tim, Shankari, linux-embedded@vger.kernel.org
In-Reply-To: <1c232997-79cf-4153-9b1c-6fcbc42121a9@freenet.de>
On 1/10/25 16:46, Marko Hoyer wrote:
> So I think it is worth talking a bit about udev and options to deal with
> it but adapting thinks a bit to todays world. I'm currently registering
> for the wiki, maybe I can setup an initial page at some time ...
busybox mdev is a lot lighter weight, and can do pretty elaborate things.
https://wiki.alpinelinux.org/wiki/Mdev
https://github.com/fff7d1bc/mdev-like-a-boss
The theory these days is you mount devtmpfs and then use mdev to add
scriptable behavior to device insertion/removal events via netlink
notifications.
Rob
^ permalink raw reply
* Re: [boot-time]
From: Marko Hoyer @ 2025-01-10 22:46 UTC (permalink / raw)
To: Bird, Tim, Shankari, linux-embedded@vger.kernel.org
In-Reply-To: <MW5PR13MB5632841906910ED152E7AE3CFD122@MW5PR13MB5632.namprd13.prod.outlook.com>
Hello Tim, all,
Am 08.01.25 um 19:33 schrieb Bird, Tim:
>> -----Original Message-----
>> From: Shankari <beingcap11@gmail.com>
>> Hi
>>
>> I wanted to provide an update on my recent contributions to the boot-time reduction project. I have recently started contributing and
>> am working with the beagleplay. I have been analyzing the boot time of the init process. Below is the output from the system log:
>>
>> debian@BeaglePlay:~$ dmesg | grep "init process"
>> [ 1.480490] Run /init as init process
>>
>> Moving forward, I plan to explore ways to modify the command line and further investigate the data used for SIG analysis. This will
>> help me gain a deeper understanding of the boot process and its performance characteristics.
>>
>> Please let me know if you have any suggestions or areas where I could focus my efforts.
> Hi Shankari,
>
> It sounds like you are off to a good start. I have something that needs to be done, that I think
> you can help with, and that matches where I believe you are in your status with being able
> to evaluate the kernel.
>
> In general, there's a lot of information on the elinux wiki which is stale, which needs to be
> updated or archived, or maybe even just removed.
>
> This section of the Boot Time page has a lot of material in this category:
> https://elinux.org/Boot_Time#kernel_speedups
>
> Can you validate the information on these 2 pages:
> * https://elinux.org/Disable_Console
> * https://elinux.org/Preset_LPJ
>
> This would consist of reading through the material, and testing the
> described techniques on your machine. This will involve booting the
> machine 2 ways, with a particular kernel command line option and without
> it, and then reporting back the final boot time for both. You can use
> the timestamp for the "init process" string as your final boot time, for the
> purposes of this exercise.
>
> Helping me to update the elinux wiki material on boot time would be
> an immense help, and is one of my main goals for the boot time SIG in 2025.
>
> Don't hesitate to ask questions if you have any.
>
> BTW - you can just report your findings to me and linux-embedded list, but
> alternatively (and even better) would be if you could also update the wiki
> pages themselves with your information based on recent kernels and hardware.
> To do this, you will need an elinux wiki account, which you can make online on
> elinux wiki.org by going to this page: https://elinux.org/Special:CreateAccount
>
> Anyone else reading this who wants to also participate in this project to
> update the elinux wiki boot time information, please contact me.
I'm just going through the "userspace application speed-up" parts in the
wiki.
To the udev stuff ("Avoid udev, it takes ..." and "If you still like udev"):
* both hints seem to be a bit old fashioned to me taking the kernels
devtmpfs into account
* In my experience, device nodes are created by the kernel today, I
never needed to use any kind of mknod call anymore
* udevs job today is mostly about
- setup of access rights and user and group ids of device nodes
- creating symlinks (e.g. for partitions named w/ their UUIDs, ...)
- loading kernel modules
- annotating metadata to uevents for interested userspace applications
* In any way, udev is still very expensive when used the conventional way
- Once udev is started, a trigger is send (by
systemd-udev-trigger.service or udevadm trigger, for those who don't
like systemd ;)) through the whole device tree to let all devices sent
uevents again so that udev can work on each device. This causes massive
CPU load for e few seconds you don't want to spend so early in the morning.
* There are options like selective triggering or moving the trigger back
in time and do the setup manuelly ...
So I think it is worth talking a bit about udev and options to deal with
it but adapting thinks a bit to todays world. I'm currently registering
for the wiki, maybe I can setup an initial page at some time ...
Btw: If I'm wrong with my experiene about devtmpf and mknode, please
correct me. And feel free to add other ideas to cope w/ udev in context
of startup time critical systems ...
Regards,
Marko
^ permalink raw reply
* Re: [boot-time]
From: Rob Landley @ 2025-01-09 22:31 UTC (permalink / raw)
To: Marko Hoyer, Bird, Tim, Shankari, linux-embedded@vger.kernel.org
In-Reply-To: <e772b530-234e-4b5c-94b4-085cb03689bd@freenet.de>
On 1/9/25 15:35, Marko Hoyer wrote:
> Am 09.01.25 um 22:10 schrieb Rob Landley:
>> Buffering or not in the char device is a driver choice. If your serial
>> hardware has a small FIFO buffer and the driver doesn't do its own
>> layer of output buffering (with a tasklet or something to copy the
>> data to the hardware), then the write() syscall will block waiting for
>> the data to go out. (Writes to filesystems stopped doing this back
>> around 2.0 or something, when they rewrote the vfs to be based on the
>> page cache and deentry cache, meaning ALL filesystem writes go through
>> that now unless you say O_DIRECT to _ask_ for it to block, which isn't
>> even always honored. But for some reason the TTY layer drives people
>> insane, and char devices have been given a wide berth...)
>
> Yeah looks like this is the case for RPi Zero W. I guess there is
> probably no buffer at all in the RPi serial driver / hw since every log
> line from systemd delays systemd for ~10ms (~80ms in baud9600 case).
Well there's gotta be a LITTLE fifo for input or you drop characters all
over the place.
(That's the reason Linus started writing Linux in the first place,
because minix's microkernel design couldn't keep up with serial input,
the overhead of the task switch to the userspace serial receive driver
process took too long and characters got dropped. So he wrote a terminal
program that booted from a floppy, and then taught it to read from and
write to the minix filesystem on his hard drive so he could download
stuff from usenet, then taught it to run "bash" so he didn't have to
reboot to mkdir/mv/rom, and that turned out to be 90% of the way to
getting it to run gcc...)
And serial hardware tends to be symmetrical about that: if it's got 16
chars of input buffer, it'll usually have 16 chars of output buffer. But
that's less than 1/50th of a second at 9600 baud...
(Fun detail: the input fifo often has a programmable watermark so you
can say "fill up this much before generating an interrupt, or if X timer
ticks pass by with no more input" so you don't get an interrupt every
character (and spend all your time entering and exiting the interrupt
code) BUT still have some leeway between the interrupt being generated
and the buffer filling up until it drops characters. The OUTPUT fifo can
do something similar, only from the other end (fill it all the way up,
then generate an interrupt when it drains to the watermark so you can
refill it before it empties and produces a gap in the output).
Programming serial devices can get slightly complicated...
> Btw: I can confirm the same for RPi3 w/ four cores. Difference is that
> something seems to go on in kernel in parallel to logs writing to serial
> but at a certain point the kernel is waiting again for lot of seconds
> probably for the serial device to finish transmission. Systemds delay is
> pretty much similar to the single core case.
Yeah, the point of a bottleneck is that's the part you're waiting for,
so speeding up the rest of it doesn't help so much.
Optimization is a whole thing. Spinlocks vs semaphores infuriate some
people (you're intentionally spinning wasting time?) so sometimes you
need to explain with analogies to get them to stop "helping".
You're standing at a train crossing, and a train is going past, it'll be
through in 10 minutes. If you walk towards the end of the train you'll
reach the end faster and can cross in only 7 minutes, but if you need to
come BACK HERE to where your road is you'll wind up walking 7 minutes,
crossing, walking 7 minutes back, and resume from here 14 minutes from
now instead of only 10, so being busy doing the wrong thing and then
just _undoing_ it again instead of waiting here ready to go is actually
_slower_ than the waiting.
>>> As said in another mail: I do not know a valid (production) use case
>>> in which kernel logs need to be dumped to a serial console. I regard
>>> this mechanism only as useful for development purposes (in which fast
>>> boot is probably not so relevant). Please correct me if I'm wrong,
>>> would be happy to learn about such use cases.
>>>
>>> Based on that I think option 3) is the best options for most cases.
>>
>> You can adjust the loglevel so they still go into dmesg but don't go
>> out to the console, which theoretically shouldn't be THAT slow? (At
>> least cpu limited rather than wait-for-hardware.)
>
> With quiet logs go into dmesg as well.
Which _used_ to be almost free back when it was just a ring buffer doing
a strlen() and two memcpy() at the wrap. But these days: dunno, haven't
benched it.
> But as said, i do not really see use cases to dump out these logs to a
> serial console in a boot time critical system on each production boot.
> Reading dmesg or systemd's journal after time critical things are done
> should be ok in most case.
The switch from printk(blah) to pr_loglevel(blah) was IN THEORY so you
could kconfig a minimum loglevel to retain, and all the macros below
that level would drop out of the kernel at compile time, reducing the
kernel image size significantly AND doing nice things with cache
locality and so on. (String processing is expensive, you traverse a lot
of data that goes through the memory bus and evicts cache lines from L1
and L2.)
Last I checked the kernel devs had broken it for some reason, but it
might be working again? (Or was a patch still out of tree...?) Anyway,
if you run out of ideas that's a thing to look for.
Data going across the memory bus is another one of those bottleneck
things, where it doesn't matter how fast your processor is clocked if
you're waiting for memory. An order of magnitude down from where we're
currently looking, but still a thing that comes up a lot once the real
low hanging fruit is dealt with...
Of course there's all sorts of "Loop unrolling! No, smaller L1 cache
footprint! Prefetch! No, spectre/meltdown!" pendulum nonsense I usually
treat roughly the same way as the man trying to cross the street in The
Pink Panther:
https://www.youtube.com/watch?v=nistdsACs3E
I once watched using a lookup table instead of calculating the value be
an optimization, then a pessimization, then an optimization, then a
pessimization, without even recompiling the binary (just upgrading the
hardware). Doing the simple thing is always at least excusable. (And
less to reverse engineer to understand WHY, and a good general argument
against the endless "this is not helping". Basically Chesterton's fence
in software: understanding why it's there lets you throw it out.)
Rob
^ permalink raw reply
* Re: [boot-time]
From: Marko Hoyer @ 2025-01-09 21:35 UTC (permalink / raw)
To: Rob Landley, Bird, Tim, Shankari, linux-embedded@vger.kernel.org
In-Reply-To: <a4b7f1e2-e362-42fe-b038-74bd79020221@landley.net>
Am 09.01.25 um 22:10 schrieb Rob Landley:
> On 1/9/25 07:02, Marko Hoyer wrote:
>> Am 09.01.25 um 00:26 schrieb Rob Landley:
>>> CONFIG_EARLY_PRINTK output is emitted before interrupts are enabled
>>> (last I checked they didn't kick in until RIGHT before PID 1 gets
>>> forked off), so the early output drivers spin waiting for the next
>>> character to go into the buffer (the memory mapped register ones
>>> look something like "while (MASK&*status); *output = *data++;" in a
>>> for loop) and the printk() call can't return until all the data has
>>> been queued to the serial hardware, so you spend a lot of time
>>> blocked in printk().
>>
>> Hi Rob,
>>
>> thx for the explanation, helps further!
>>
>> * This implementation would explain the observed behavior.
>>
>> * What I'm not understanding yet: logs from systemd delay systemd the
>> same way as seen in the kernel. Looks like the issue is not solved
>> even when PID 1 is started. As said, It can be something specific to
>> single core SoCs or even just to RPI Zero W. I'll check ...
>
> Buffering or not in the char device is a driver choice. If your serial
> hardware has a small FIFO buffer and the driver doesn't do its own
> layer of output buffering (with a tasklet or something to copy the
> data to the hardware), then the write() syscall will block waiting for
> the data to go out. (Writes to filesystems stopped doing this back
> around 2.0 or something, when they rewrote the vfs to be based on the
> page cache and deentry cache, meaning ALL filesystem writes go through
> that now unless you say O_DIRECT to _ask_ for it to block, which isn't
> even always honored. But for some reason the TTY layer drives people
> insane, and char devices have been given a wide berth...)
Yeah looks like this is the case for RPi Zero W. I guess there is
probably no buffer at all in the RPi serial driver / hw since every log
line from systemd delays systemd for ~10ms (~80ms in baud9600 case).
Btw: I can confirm the same for RPi3 w/ four cores. Difference is that
something seems to go on in kernel in parallel to logs writing to serial
but at a certain point the kernel is waiting again for lot of seconds
probably for the serial device to finish transmission. Systemds delay is
pretty much similar to the single core case.
>
> There's a similar issue with some xterms where "make -j16 build"
> spamming lots of output to a display terminal can run significantly
> slower than "make -j16 build | cat" because the linux pipe
> infrastructure inserts a pipe buffer (ulimit -p says 8 but I think
> that's _pages_ so 32k? Except in 2.6.11 it was 64k? Eh, not looking it
> up...) so the writes from each cc instance go into the pipe buffer and
> return immediately when it's not full, whereas writes to a terminal
> device block until the terminal has finished updating (which includes
> scrolling the screen).
>
> If I recall (many years ago), the kde terminal implementation included
> a buffer of its own (immediately returned before updating), and the
> gnome one didn't (blocked until x11 display update completed), so
> foreground builds were faster under kde.
>
> And the gnome guys' answer was to spray everything down with 3D
> acceleration so the GPU was scrolling the screen for you, because of
> course it was.
>
> Anyway, serialized latency has _always_ killed throughput, because
> it's a cost you can't get BACK. The kernel guys used to know this:
>
> https://yarchive.net/comp/linux/raid0.html
>
> Hence the old high school math problem: if you have 2 hours to go 100
> miles and you travel the first 40 miles at 20 miles per hour, how fast
> do you have to go the rest to make it on time? Answer: you'd have to
> instantaneously teleport because you spent 2 hours going 40 miles and
> your time is up with 60 miles left to go. Optimizing the wrong part
> DOESN'T HELP.
>
Absolutely correct.
>> As said in another mail: I do not know a valid (production) use case
>> in which kernel logs need to be dumped to a serial console. I regard
>> this mechanism only as useful for development purposes (in which fast
>> boot is probably not so relevant). Please correct me if I'm wrong,
>> would be happy to learn about such use cases.
>>
>> Based on that I think option 3) is the best options for most cases.
>
> You can adjust the loglevel so they still go into dmesg but don't go
> out to the console, which theoretically shouldn't be THAT slow? (At
> least cpu limited rather than wait-for-hardware.)
With quiet logs go into dmesg as well.
But as said, i do not really see use cases to dump out these logs to a
serial console in a boot time critical system on each production boot.
Reading dmesg or systemd's journal after time critical things are done
should be ok in most case.
So my recommendation to people who seek for fastboot potential:
* switch of kernel to console log during startup using quiet cmdline
option if you don't need these logs over serial essentially
* if you need them, your (Rob) options could be applied reducing the
effect as best as possible
>
>
> But keep in mind that a lot of kernel devs seem actively trying to
> sabotage embedded development, and the dmesg metadata is now multiple
> times the size of the actual payload.
>
> https://lkml.iu.edu/hypermail/linux/kernel/2412.0/00045.html
>
> So trimming the dmesg buffer size is probably ALSO a good idea on
> modern embedded systems, because WHAT THE FSCK? Also, I dunno how much
> cpu time all that metadata fapping takes. (Before systemd, this was
> just text going into a ring buffer. But systemd couldn't cope with
> that because
> https://www.theregister.com/2014/04/05/torvalds_sievers_dust_up/ and
> https://en.wikipedia.org/wiki/Systemd#History and of course
> https://www.linux.com/news/fun-photo-greg-kroah-hartman-crowned-systemd-hackfest/
> thus the data structures in the kernel had to become far more complex
> after about 30 years of NOT being like that, because systemd.)
>
> Rob
Marko
^ permalink raw reply
* Re: [boot-time]
From: Rob Landley @ 2025-01-09 21:10 UTC (permalink / raw)
To: Marko Hoyer, Bird, Tim, Marko Hoyer, Shankari,
linux-embedded@vger.kernel.org
In-Reply-To: <04614509-675d-43b4-82f8-d3b16797f559@freenet.de>
On 1/9/25 07:02, Marko Hoyer wrote:
> Am 09.01.25 um 00:26 schrieb Rob Landley:
>> CONFIG_EARLY_PRINTK output is emitted before interrupts are enabled
>> (last I checked they didn't kick in until RIGHT before PID 1 gets
>> forked off), so the early output drivers spin waiting for the next
>> character to go into the buffer (the memory mapped register ones look
>> something like "while (MASK&*status); *output = *data++;" in a for
>> loop) and the printk() call can't return until all the data has been
>> queued to the serial hardware, so you spend a lot of time blocked in
>> printk().
>
> Hi Rob,
>
> thx for the explanation, helps further!
>
> * This implementation would explain the observed behavior.
>
> * What I'm not understanding yet: logs from systemd delay systemd the
> same way as seen in the kernel. Looks like the issue is not solved even
> when PID 1 is started. As said, It can be something specific to single
> core SoCs or even just to RPI Zero W. I'll check ...
Buffering or not in the char device is a driver choice. If your serial
hardware has a small FIFO buffer and the driver doesn't do its own layer
of output buffering (with a tasklet or something to copy the data to the
hardware), then the write() syscall will block waiting for the data to
go out. (Writes to filesystems stopped doing this back around 2.0 or
something, when they rewrote the vfs to be based on the page cache and
deentry cache, meaning ALL filesystem writes go through that now unless
you say O_DIRECT to _ask_ for it to block, which isn't even always
honored. But for some reason the TTY layer drives people insane, and
char devices have been given a wide berth...)
There's a similar issue with some xterms where "make -j16 build"
spamming lots of output to a display terminal can run significantly
slower than "make -j16 build | cat" because the linux pipe
infrastructure inserts a pipe buffer (ulimit -p says 8 but I think
that's _pages_ so 32k? Except in 2.6.11 it was 64k? Eh, not looking it
up...) so the writes from each cc instance go into the pipe buffer and
return immediately when it's not full, whereas writes to a terminal
device block until the terminal has finished updating (which includes
scrolling the screen).
If I recall (many years ago), the kde terminal implementation included a
buffer of its own (immediately returned before updating), and the gnome
one didn't (blocked until x11 display update completed), so foreground
builds were faster under kde.
And the gnome guys' answer was to spray everything down with 3D
acceleration so the GPU was scrolling the screen for you, because of
course it was.
Anyway, serialized latency has _always_ killed throughput, because it's
a cost you can't get BACK. The kernel guys used to know this:
https://yarchive.net/comp/linux/raid0.html
Hence the old high school math problem: if you have 2 hours to go 100
miles and you travel the first 40 miles at 20 miles per hour, how fast
do you have to go the rest to make it on time? Answer: you'd have to
instantaneously teleport because you spent 2 hours going 40 miles and
your time is up with 60 miles left to go. Optimizing the wrong part
DOESN'T HELP.
> As said in another mail: I do not know a valid (production) use case in
> which kernel logs need to be dumped to a serial console. I regard this
> mechanism only as useful for development purposes (in which fast boot is
> probably not so relevant). Please correct me if I'm wrong, would be
> happy to learn about such use cases.
>
> Based on that I think option 3) is the best options for most cases.
You can adjust the loglevel so they still go into dmesg but don't go out
to the console, which theoretically shouldn't be THAT slow? (At least
cpu limited rather than wait-for-hardware.)
But keep in mind that a lot of kernel devs seem actively trying to
sabotage embedded development, and the dmesg metadata is now multiple
times the size of the actual payload.
https://lkml.iu.edu/hypermail/linux/kernel/2412.0/00045.html
So trimming the dmesg buffer size is probably ALSO a good idea on modern
embedded systems, because WHAT THE FSCK? Also, I dunno how much cpu time
all that metadata fapping takes. (Before systemd, this was just text
going into a ring buffer. But systemd couldn't cope with that because
https://www.theregister.com/2014/04/05/torvalds_sievers_dust_up/ and
https://en.wikipedia.org/wiki/Systemd#History and of course
https://www.linux.com/news/fun-photo-greg-kroah-hartman-crowned-systemd-hackfest/
thus the data structures in the kernel had to become far more complex
after about 30 years of NOT being like that, because systemd.)
Rob
^ permalink raw reply
* Re: boot markers ideas (was RE: [boot-time] jent_mod_init on beagleplay)
From: Francesco Valla @ 2025-01-09 16:23 UTC (permalink / raw)
To: Bird, Tim, Stephan Mueller, Shankari, Rob Landley; +Cc: Linux Embedded
In-Reply-To: <211b015d-44e5-4e64-9e44-5968d86dfac4@landley.net>
Hello Rob,
On Wednesday, 8 January 2025 at 00:40:13 Rob Landley <rob@landley.net> wrote:
> Is there a place to get this measuring script other than fishing it out
> of a webmail archive?
>
> https://www.spinics.net/lists/linux-embedded/msg04363.html
No, at the moment the mailing list is the only place I posted it. I am still
unsure if it should live on its own somewhere or it's better to try to augment
the bootchart.py script that lives inside the kernel sources [1].
> The most recent wiki mention in the archive subject lines was
> https://www.spinics.net/lists/linux-embedded/msg04336.html and trying to
> find this effort in elinux.org's top right search bar found
> https://elinux.org/Boot-up_Time_Reduction_Howto which "view history"
> says was last touched in 2010...
>
> Rob
>
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/power/pm-graph/bootgraph.py
Regards,
Francesco
^ permalink raw reply
* Re: [boot-time]
From: Geert Uytterhoeven @ 2025-01-09 13:27 UTC (permalink / raw)
To: Marko Hoyer
Cc: Bird, Tim, Marko Hoyer, Shankari, linux-embedded@vger.kernel.org
In-Reply-To: <7f8a09dd-3d99-4028-bb70-77464eb0cf77@freenet.de>
Hi Marko,
On Thu, Jan 9, 2025 at 1:49 PM Marko Hoyer <Marko.Hoyer@freenet.de> wrote:
> Take a look into the kmesg logs. Looks like there is a 8s delay at a
> certain point:
>
> [ 5.897018] input: C-Media Electronics Inc. USB Audio Device as
> /devices/platform/soc/20980000.usb/usb1/1-1/1-1:1.3/0003:0D8C:0014.0001/input/input1
> [ 6.016086] hid-generic 0003:0D8C:0014.0001: input,hidraw0: USB HID
> v1.00 Device [C-Media Electronics Inc. USB Audio Device] on
> usb-20980000.usb-1/input3[ 14.012174] printk: console [ttyS0] enabled
> [ 14.064965] bcm2835-wdt bcm2835-wdt: Broadcom BCM2835 watchdog timer
> [ 14.142795] bcm2835-power bcm2835-power: Broadcom BCM2835 power
> domains driver
> [ 14.232013] mmc-bcm2835 20300000.mmcnr: mmc_debug:0 mmc_debug2:0
Those eight seconds are the time needed for printing all
previously-collected and time-stamped kernel log lines to the serial
console.
BTW, only slightly related, but I have no better place to vent ;-)
On OrangeCrab running a 64 MHz VexRiscv softcore, I noticed another
big delay. With initcall_debug:
initcall pty_init+0x0/0x3c8 returned 0 after 185427581 usecs
Apparently this is due to:
CONFIG_LEGACY_PTYS=y
CONFIG_LEGACY_PTY_COUNT=256
So yes, almost one one second to set up one legacy pty, ugh...
Disabling CONFIG_LEGACY_PTYS fixed the issue.
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
^ permalink raw reply
* Re: [boot-time]
From: Marko Hoyer @ 2025-01-09 13:02 UTC (permalink / raw)
To: Rob Landley, Bird, Tim, Marko Hoyer, Shankari,
linux-embedded@vger.kernel.org
In-Reply-To: <ca1212ef-4d60-4947-9180-d2b3f127e0fb@landley.net>
Am 09.01.25 um 00:26 schrieb Rob Landley:
> On 1/8/25 15:19, Bird, Tim wrote:
>>> Cases:
>>>
>>> - #1 quiet: cmdline w/ quiet, no kernel or userspace output up to the
>>> serial login console
>>>
>>> - #2 normal: cmdline w/o quiet, serial console @115200 baud
>>>
>>> - #3 normal_baud9600: cmdline w/o quiet, serial console @9600 baud
>>>
>>>
>>> Main outcomes:
>>>
>>> - kernel timestamps "Run /sbin/init as init process"
>>>
>>> #1: "1.714458", #2: "3.011701", #3: "16.108101"
>>
>> Wow from 1.7 seconds to 16.1 seconds. That's a pretty huge
>> difference. I guess this particular technique is still
>> very relevant!
>
> CONFIG_EARLY_PRINTK output is emitted before interrupts are enabled
> (last I checked they didn't kick in until RIGHT before PID 1 gets
> forked off), so the early output drivers spin waiting for the next
> character to go into the buffer (the memory mapped register ones look
> something like "while (MASK&*status); *output = *data++;" in a for
> loop) and the printk() call can't return until all the data has been
> queued to the serial hardware, so you spend a lot of time blocked in
> printk().
Hi Rob,
thx for the explanation, helps further!
* This implementation would explain the observed behavior.
* What I'm not understanding yet: logs from systemd delay systemd the
same way as seen in the kernel. Looks like the issue is not solved even
when PID 1 is started. As said, It can be something specific to single
core SoCs or even just to RPI Zero W. I'll check ...
>
> With 9600 baud 8n1 output, 9600/9 = 1066 characters per second, or
> approximately a 1ms wait between each character, blocking in printk
> when the hardware FIFO buffer fills up, so 16k of output data takes 16
> seconds to write if the rest of the boot is doing NOTHING. Even a 1k
> hardware FIFO is only 1 second of output, and that's assuming all 1k
> is outgoing rather than split between in/out.
>
> Your options are:
>
> 1) disable early printk so it all goes into a malloced buffer until
> interrupts are enabled and it can be asynchronously flushed (meaning
> if something DOES go wrong in early boot you can't see it)
> 2) set your FIFO speed as fast as possible
> 3) have your default boot use the "quiet" option (similar to disabling
> EARLY_PRINTK but at least you have the option to yank quiet from your
> bootloader args without rebuilding the kernel.)
>
> Faster UART speeds mean shorter serial cables (although there's also 3
> volt vs 5 volt, wire thickness/capacitance, and some other stuff, Jeff
> Dionne walked me through the math last year but I don't have my notes
> in front of me). Modern hardware can do up to 4 megabits/second but
> outside "this serial chip immediately talks to a USB chip and then
> it's transported as USB with the funky noise-cancelling signaling over
> VERY twisted pair to actually leave the board"), I wouldn't trust that
> over any real length of cable.
>
> Alas
> https://tldp.org/HOWTO/Remote-Serial-Console-HOWTO/serial-distance.html
> is from the dawn of time and only goes up to 56k over wires made from
> recycled drainpipes.
> https://novatel.com/support/known-solutions/maximum-cable-length-vs-data-rate
> says 115200 is 2.5 meters. It LOOKS like it scales linearly with twice
> the speed being half the cable, so a megabit would be about 1 foot of
> serial cable before the bits get all mushy.
>
As said in another mail: I do not know a valid (production) use case in
which kernel logs need to be dumped to a serial console. I regard this
mechanism only as useful for development purposes (in which fast boot is
probably not so relevant). Please correct me if I'm wrong, would be
happy to learn about such use cases.
Based on that I think option 3) is the best options for most cases.
> Rob
Marko
^ permalink raw reply
* Re: [boot-time]
From: Marko Hoyer @ 2025-01-09 12:43 UTC (permalink / raw)
To: Bird, Tim, Marko Hoyer, Shankari, linux-embedded@vger.kernel.org
In-Reply-To: <MW5PR13MB5632E7FD9A5E76949625BA54FD122@MW5PR13MB5632.namprd13.prod.outlook.com>
Am 08.01.25 um 22:19 schrieb Bird, Tim:
>
>> -----Original Message-----
>> From: Marko Hoyer <mhoyer.oss-devel@freenet.de>
>> Am 08.01.25 um 19:33 schrieb Bird, Tim:
>>>> -----Original Message-----
>>>> From: Shankari <beingcap11@gmail.com>
>>>> Hi
>>>>
>>>> I wanted to provide an update on my recent contributions to the boot-time reduction project. I have recently started contributing
>> and
>>>> am working with the beagleplay. I have been analyzing the boot time of the init process. Below is the output from the system log:
>>>>
>>>> debian@BeaglePlay:~$ dmesg | grep "init process"
>>>> [ 1.480490] Run /init as init process
>>>>
>>>> Moving forward, I plan to explore ways to modify the command line and further investigate the data used for SIG analysis. This will
>>>> help me gain a deeper understanding of the boot process and its performance characteristics.
>>>>
>>>> Please let me know if you have any suggestions or areas where I could focus my efforts.
>>> Hi Shankari,
>>>
>>> It sounds like you are off to a good start. I have something that needs to be done, that I think
>>> you can help with, and that matches where I believe you are in your status with being able
>>> to evaluate the kernel.
>>>
>>> In general, there's a lot of information on the elinux wiki which is stale, which needs to be
>>> updated or archived, or maybe even just removed.
>>>
>>> This section of the Boot Time page has a lot of material in this category:
>>> https://elinux.org/Boot_Time#kernel_speedups
>>>
>>> Can you validate the information on these 2 pages:
>>> * https://elinux.org/Disable_Console
>>> * https://elinux.org/Preset_LPJ
>>>
>>> This would consist of reading through the material, and testing the
>>> described techniques on your machine. This will involve booting the
>>> machine 2 ways, with a particular kernel command line option and without
>>> it, and then reporting back the final boot time for both. You can use
>>> the timestamp for the "init process" string as your final boot time, for the
>>> purposes of this exercise.
>>>
>>> Helping me to update the elinux wiki material on boot time would be
>>> an immense help, and is one of my main goals for the boot time SIG in 2025.
>>>
>>> Don't hesitate to ask questions if you have any.
>>>
>>> BTW - you can just report your findings to me and linux-embedded list, but
>>> alternatively (and even better) would be if you could also update the wiki
>>> pages themselves with your information based on recent kernels and hardware.
>>> To do this, you will need an elinux wiki account, which you can make online on
>>> elinux wiki.org by going to this page: https://elinux.org/Special:CreateAccount
>>>
>>> Anyone else reading this who wants to also participate in this project to
>>> update the elinux wiki boot time information, please contact me.
>>> Thanks.
>>> -- Tim
>>>
>> Hi Tim, all,
>>
>> first time I'm posting here so hopefully everything is fine w/ my mail
>> format / attachment and so on ... If not, please give me some feedback
>> and guidance.
> Marko,
>
> Thanks for this great data!
>
> In general, I don't see a lot of attachments on kernel mailing lists.
> They don't bother me, and we aren't CC'ing LKML (that's a separate
> issue we should discuss - developers outside of embedded might
> want to see this data). I'll check later and see what lore does with this,
> but if no one complains, I don't see a problem with it. If someone
> does complain, I can provide file hosting either on the elinux wiki
> or the boot-time wiki, and we can link attachments like you've
> provided on this message from one of those places (to avoid
> putting attachments on kernel mailing lists).
Ok sounds good. I don't have really a place to publish data, so would
be good if you can find a way ...
>
>> To the "disable console" topic: I have some numbers in place for an RPI
>> Zero W, find dmesg dumps and systemd-analyze plots attached.
>>
>>
>> Environment:
>>
>> - RPi Zero W, kernel 5.15.24, systemd 247.3, customized debian
>>
>> - onboard UART used
>>
>>
>> Cases:
>>
>> - #1 quiet: cmdline w/ quiet, no kernel or userspace output up to the
>> serial login console
>>
>> - #2 normal: cmdline w/o quiet, serial console @115200 baud
>>
>> - #3 normal_baud9600: cmdline w/o quiet, serial console @9600 baud
>>
>>
>> Main outcomes:
>>
>> - kernel timestamps "Run /sbin/init as init process"
>>
>> #1: "1.714458", #2: "3.011701", #3: "16.108101"
> Wow from 1.7 seconds to 16.1 seconds. That's a pretty huge
> difference. I guess this particular technique is still
> very relevant!
>
>> Interpretation:
>>
>> * enabled serial console has significant impact in kernel boot time
>>
>> * reducing baud to 9600 induced some side effect, not sure what it is ...
> Did you see any other weird behavior besides the huge slowdown?
> I'll take a look at the amount of characters in your dmesg output and
> see if it can be linearly correlated to the baud rate, or if it seems something
> else is going on.
Take a look into the kmesg logs. Looks like there is a 8s delay at a
certain point:
[ 5.897018] input: C-Media Electronics Inc. USB Audio Device as
/devices/platform/soc/20980000.usb/usb1/1-1/1-1:1.3/0003:0D8C:0014.0001/input/input1
[ 6.016086] hid-generic 0003:0D8C:0014.0001: input,hidraw0: USB HID
v1.00 Device [C-Media Electronics Inc. USB Audio Device] on
usb-20980000.usb-1/input3[ 14.012174] printk: console [ttyS0] enabled
[ 14.064965] bcm2835-wdt bcm2835-wdt: Broadcom BCM2835 watchdog timer
[ 14.142795] bcm2835-power bcm2835-power: Broadcom BCM2835 power
domains driver
[ 14.232013] mmc-bcm2835 20300000.mmcnr: mmc_debug:0 mmc_debug2:0
Not sure if it really makes sense to dig further into this issue. Might
be something in the serial driver of the RPI. I don't see really a valid
use case, 9600 baud is nothing you really need and in anyway pushing
serial logs out via serial is nothing really needed in productive use
cases, just for development (correct me if I'm wrong).
>
>> - systemd startup
>>
>> * systemd drops 2 log lines per started unit to the console
>>
>> * seems serial output is not implemented asynchronously (see steps of
>> units in sd plot, ~10ms per unit w/ baud 115200, ~80ms per unit w/ baud9600
> I'm not sure what you're referring to here. Is the 'unit' you are talking about
> the graphing grid size, or are you referring to systemd units? The grid size
> seems to be 10ms per minor grid line in each plot.
unit -> systemd unit
* grid size per minor grid line is 100ms to me (1s per main grid line ->
see top of plot)
* take two units "user.slice" and "remote-fs.target"
* they should have basically no delay between each other -> you can see
it in quiet case
* w/ serial on, you see delays (10ms, 80ms) which in fact are caused
just by dropping the log lines to serial kernel log
* overall delay in user space startup sums up significantly
>
>> Side notes:
>>
>> * I remember similar behavior w/ imx.6 SoCs
>>
>> * Maybe this issues is not seen on other SoCs (maybe w/ another hw
>> implementation of the UART)
>>
>> * Maybe this issues is only seen in single core machines (I can double
>> check w/ a PI3 or orange pi zero once)
>>
>> Hope this helps.
> It helps a lot. Thanks for this data!
>
> I think the 'Disable Console' technique will continue to stay as one of the first
> things we recommend for developers working on their board's boot time.
Yes I agree. Let me check out other boards and especially multi-core
systems if the issues is happening there as well ...
>
> To others developers - I'd like to see more data like this on other systems
> as well. So please keep submitting your data to this list.
>
> Thanks,
> -- Tim
>
Regards,
Marko
^ permalink raw reply
* RE: [boot-time]
From: Bird, Tim @ 2025-01-09 2:23 UTC (permalink / raw)
To: Rob Landley, Shankari, linux-embedded@vger.kernel.org
In-Reply-To: <7c4d8fa2-9b12-465a-9c23-0503cb086365@landley.net>
> -----Original Message-----
> From: Rob Landley <rob@landley.net>
> On 1/8/25 12:33, Bird, Tim wrote:
> > In general, there's a lot of information on the elinux wiki which is stale, which needs to be
> > updated or archived, or maybe even just removed.
> >
> > This section of the Boot Time page has a lot of material in this category:
> > https://elinux.org/Boot_Time#kernel_speedups
>
> That page says "grab-boot-data.sh - see
> http://birdcloud.org/boot-time/Boot-time_Tools" but the link is 404.
Hmmm. That should be 'https', not 'http'. When I click on it
my Chrome browser takes me to the https site (silently autocorrecting it?)
I've fixed the link on the elinux site. Sorry about that.
Here is a direct link to the tool, but it's worth getting to the page to read
what kernel command line options it needs.
https://birdcloud.org/boot-time/Boot-time_Tools
https://birdcloud.org/boot-time-files/grab-boot-data.sh
>
> > Anyone else reading this who wants to also participate in this project to
> > update the elinux wiki boot time information, please contact me.
>
> Maybe? What do you need?
I am always looking for more data. If you have a system you are running
Linux on, and can edit the kernel command, I'd be very happy to have
you run grab-boot-data.sh. It will collect a bunch of boot data and system
data, and send it to the birdcloud.org wiki, where I'm doing data analysis.
You can run the tool in a "don't send" mode and look at what it's about
to send, if you are concerned about privacy.
Thanks!
-- Tim
^ permalink raw reply
* Re: [boot-time]
From: Rob Landley @ 2025-01-08 23:26 UTC (permalink / raw)
To: Bird, Tim, Marko Hoyer, Shankari, linux-embedded@vger.kernel.org
In-Reply-To: <MW5PR13MB5632E7FD9A5E76949625BA54FD122@MW5PR13MB5632.namprd13.prod.outlook.com>
On 1/8/25 15:19, Bird, Tim wrote:
>> Cases:
>>
>> - #1 quiet: cmdline w/ quiet, no kernel or userspace output up to the
>> serial login console
>>
>> - #2 normal: cmdline w/o quiet, serial console @115200 baud
>>
>> - #3 normal_baud9600: cmdline w/o quiet, serial console @9600 baud
>>
>>
>> Main outcomes:
>>
>> - kernel timestamps "Run /sbin/init as init process"
>>
>> #1: "1.714458", #2: "3.011701", #3: "16.108101"
>
> Wow from 1.7 seconds to 16.1 seconds. That's a pretty huge
> difference. I guess this particular technique is still
> very relevant!
CONFIG_EARLY_PRINTK output is emitted before interrupts are enabled
(last I checked they didn't kick in until RIGHT before PID 1 gets forked
off), so the early output drivers spin waiting for the next character to
go into the buffer (the memory mapped register ones look something like
"while (MASK&*status); *output = *data++;" in a for loop) and the
printk() call can't return until all the data has been queued to the
serial hardware, so you spend a lot of time blocked in printk().
With 9600 baud 8n1 output, 9600/9 = 1066 characters per second, or
approximately a 1ms wait between each character, blocking in printk when
the hardware FIFO buffer fills up, so 16k of output data takes 16
seconds to write if the rest of the boot is doing NOTHING. Even a 1k
hardware FIFO is only 1 second of output, and that's assuming all 1k is
outgoing rather than split between in/out.
Your options are:
1) disable early printk so it all goes into a malloced buffer until
interrupts are enabled and it can be asynchronously flushed (meaning if
something DOES go wrong in early boot you can't see it)
2) set your FIFO speed as fast as possible
3) have your default boot use the "quiet" option (similar to disabling
EARLY_PRINTK but at least you have the option to yank quiet from your
bootloader args without rebuilding the kernel.)
Faster UART speeds mean shorter serial cables (although there's also 3
volt vs 5 volt, wire thickness/capacitance, and some other stuff, Jeff
Dionne walked me through the math last year but I don't have my notes in
front of me). Modern hardware can do up to 4 megabits/second but outside
"this serial chip immediately talks to a USB chip and then it's
transported as USB with the funky noise-cancelling signaling over VERY
twisted pair to actually leave the board"), I wouldn't trust that over
any real length of cable.
Alas
https://tldp.org/HOWTO/Remote-Serial-Console-HOWTO/serial-distance.html
is from the dawn of time and only goes up to 56k over wires made from
recycled drainpipes.
https://novatel.com/support/known-solutions/maximum-cable-length-vs-data-rate
says 115200 is 2.5 meters. It LOOKS like it scales linearly with twice
the speed being half the cable, so a megabit would be about 1 foot of
serial cable before the bits get all mushy.
Rob
^ permalink raw reply
* Re: [boot-time]
From: Rob Landley @ 2025-01-08 23:00 UTC (permalink / raw)
To: Bird, Tim, Shankari, linux-embedded@vger.kernel.org
In-Reply-To: <MW5PR13MB5632841906910ED152E7AE3CFD122@MW5PR13MB5632.namprd13.prod.outlook.com>
On 1/8/25 12:33, Bird, Tim wrote:
> In general, there's a lot of information on the elinux wiki which is stale, which needs to be
> updated or archived, or maybe even just removed.
>
> This section of the Boot Time page has a lot of material in this category:
> https://elinux.org/Boot_Time#kernel_speedups
That page says "grab-boot-data.sh - see
http://birdcloud.org/boot-time/Boot-time_Tools" but the link is 404.
> Anyone else reading this who wants to also participate in this project to
> update the elinux wiki boot time information, please contact me.
Maybe? What do you need?
Rob
^ permalink raw reply
* RE: [boot-time]
From: Bird, Tim @ 2025-01-08 21:19 UTC (permalink / raw)
To: Marko Hoyer, Shankari, linux-embedded@vger.kernel.org
In-Reply-To: <b0dd83c8-eb23-494b-8f23-ea8c084405a6@freenet.de>
> -----Original Message-----
> From: Marko Hoyer <mhoyer.oss-devel@freenet.de>
> Am 08.01.25 um 19:33 schrieb Bird, Tim:
> >> -----Original Message-----
> >> From: Shankari <beingcap11@gmail.com>
> >> Hi
> >>
> >> I wanted to provide an update on my recent contributions to the boot-time reduction project. I have recently started contributing
> and
> >> am working with the beagleplay. I have been analyzing the boot time of the init process. Below is the output from the system log:
> >>
> >> debian@BeaglePlay:~$ dmesg | grep "init process"
> >> [ 1.480490] Run /init as init process
> >>
> >> Moving forward, I plan to explore ways to modify the command line and further investigate the data used for SIG analysis. This will
> >> help me gain a deeper understanding of the boot process and its performance characteristics.
> >>
> >> Please let me know if you have any suggestions or areas where I could focus my efforts.
> > Hi Shankari,
> >
> > It sounds like you are off to a good start. I have something that needs to be done, that I think
> > you can help with, and that matches where I believe you are in your status with being able
> > to evaluate the kernel.
> >
> > In general, there's a lot of information on the elinux wiki which is stale, which needs to be
> > updated or archived, or maybe even just removed.
> >
> > This section of the Boot Time page has a lot of material in this category:
> > https://elinux.org/Boot_Time#kernel_speedups
> >
> > Can you validate the information on these 2 pages:
> > * https://elinux.org/Disable_Console
> > * https://elinux.org/Preset_LPJ
> >
> > This would consist of reading through the material, and testing the
> > described techniques on your machine. This will involve booting the
> > machine 2 ways, with a particular kernel command line option and without
> > it, and then reporting back the final boot time for both. You can use
> > the timestamp for the "init process" string as your final boot time, for the
> > purposes of this exercise.
> >
> > Helping me to update the elinux wiki material on boot time would be
> > an immense help, and is one of my main goals for the boot time SIG in 2025.
> >
> > Don't hesitate to ask questions if you have any.
> >
> > BTW - you can just report your findings to me and linux-embedded list, but
> > alternatively (and even better) would be if you could also update the wiki
> > pages themselves with your information based on recent kernels and hardware.
> > To do this, you will need an elinux wiki account, which you can make online on
> > elinux wiki.org by going to this page: https://elinux.org/Special:CreateAccount
> >
> > Anyone else reading this who wants to also participate in this project to
> > update the elinux wiki boot time information, please contact me.
> > Thanks.
> > -- Tim
> >
> Hi Tim, all,
>
> first time I'm posting here so hopefully everything is fine w/ my mail
> format / attachment and so on ... If not, please give me some feedback
> and guidance.
Marko,
Thanks for this great data!
In general, I don't see a lot of attachments on kernel mailing lists.
They don't bother me, and we aren't CC'ing LKML (that's a separate
issue we should discuss - developers outside of embedded might
want to see this data). I'll check later and see what lore does with this,
but if no one complains, I don't see a problem with it. If someone
does complain, I can provide file hosting either on the elinux wiki
or the boot-time wiki, and we can link attachments like you've
provided on this message from one of those places (to avoid
putting attachments on kernel mailing lists).
>
> To the "disable console" topic: I have some numbers in place for an RPI
> Zero W, find dmesg dumps and systemd-analyze plots attached.
>
>
> Environment:
>
> - RPi Zero W, kernel 5.15.24, systemd 247.3, customized debian
>
> - onboard UART used
>
>
> Cases:
>
> - #1 quiet: cmdline w/ quiet, no kernel or userspace output up to the
> serial login console
>
> - #2 normal: cmdline w/o quiet, serial console @115200 baud
>
> - #3 normal_baud9600: cmdline w/o quiet, serial console @9600 baud
>
>
> Main outcomes:
>
> - kernel timestamps "Run /sbin/init as init process"
>
> #1: "1.714458", #2: "3.011701", #3: "16.108101"
Wow from 1.7 seconds to 16.1 seconds. That's a pretty huge
difference. I guess this particular technique is still
very relevant!
>
> Interpretation:
>
> * enabled serial console has significant impact in kernel boot time
>
> * reducing baud to 9600 induced some side effect, not sure what it is ...
Did you see any other weird behavior besides the huge slowdown?
I'll take a look at the amount of characters in your dmesg output and
see if it can be linearly correlated to the baud rate, or if it seems something
else is going on.
> - systemd startup
>
> * systemd drops 2 log lines per started unit to the console
>
> * seems serial output is not implemented asynchronously (see steps of
> units in sd plot, ~10ms per unit w/ baud 115200, ~80ms per unit w/ baud9600
I'm not sure what you're referring to here. Is the 'unit' you are talking about
the graphing grid size, or are you referring to systemd units? The grid size
seems to be 10ms per minor grid line in each plot.
>
> Side notes:
>
> * I remember similar behavior w/ imx.6 SoCs
>
> * Maybe this issues is not seen on other SoCs (maybe w/ another hw
> implementation of the UART)
>
> * Maybe this issues is only seen in single core machines (I can double
> check w/ a PI3 or orange pi zero once)
>
> Hope this helps.
It helps a lot. Thanks for this data!
I think the 'Disable Console' technique will continue to stay as one of the first
things we recommend for developers working on their board's boot time.
To others developers - I'd like to see more data like this on other systems
as well. So please keep submitting your data to this list.
Thanks,
-- Tim
^ permalink raw reply
* Re: [boot-time]
From: Marko Hoyer @ 2025-01-08 20:39 UTC (permalink / raw)
To: Bird, Tim, Shankari, linux-embedded@vger.kernel.org
In-Reply-To: <MW5PR13MB5632841906910ED152E7AE3CFD122@MW5PR13MB5632.namprd13.prod.outlook.com>
[-- Attachment #1: Type: text/plain, Size: 4087 bytes --]
Am 08.01.25 um 19:33 schrieb Bird, Tim:
>> -----Original Message-----
>> From: Shankari <beingcap11@gmail.com>
>> Hi
>>
>> I wanted to provide an update on my recent contributions to the boot-time reduction project. I have recently started contributing and
>> am working with the beagleplay. I have been analyzing the boot time of the init process. Below is the output from the system log:
>>
>> debian@BeaglePlay:~$ dmesg | grep "init process"
>> [ 1.480490] Run /init as init process
>>
>> Moving forward, I plan to explore ways to modify the command line and further investigate the data used for SIG analysis. This will
>> help me gain a deeper understanding of the boot process and its performance characteristics.
>>
>> Please let me know if you have any suggestions or areas where I could focus my efforts.
> Hi Shankari,
>
> It sounds like you are off to a good start. I have something that needs to be done, that I think
> you can help with, and that matches where I believe you are in your status with being able
> to evaluate the kernel.
>
> In general, there's a lot of information on the elinux wiki which is stale, which needs to be
> updated or archived, or maybe even just removed.
>
> This section of the Boot Time page has a lot of material in this category:
> https://elinux.org/Boot_Time#kernel_speedups
>
> Can you validate the information on these 2 pages:
> * https://elinux.org/Disable_Console
> * https://elinux.org/Preset_LPJ
>
> This would consist of reading through the material, and testing the
> described techniques on your machine. This will involve booting the
> machine 2 ways, with a particular kernel command line option and without
> it, and then reporting back the final boot time for both. You can use
> the timestamp for the "init process" string as your final boot time, for the
> purposes of this exercise.
>
> Helping me to update the elinux wiki material on boot time would be
> an immense help, and is one of my main goals for the boot time SIG in 2025.
>
> Don't hesitate to ask questions if you have any.
>
> BTW - you can just report your findings to me and linux-embedded list, but
> alternatively (and even better) would be if you could also update the wiki
> pages themselves with your information based on recent kernels and hardware.
> To do this, you will need an elinux wiki account, which you can make online on
> elinux wiki.org by going to this page: https://elinux.org/Special:CreateAccount
>
> Anyone else reading this who wants to also participate in this project to
> update the elinux wiki boot time information, please contact me.
> Thanks.
> -- Tim
>
Hi Tim, all,
first time I'm posting here so hopefully everything is fine w/ my mail
format / attachment and so on ... If not, please give me some feedback
and guidance.
To the "disable console" topic: I have some numbers in place for an RPI
Zero W, find dmesg dumps and systemd-analyze plots attached.
Environment:
- RPi Zero W, kernel 5.15.24, systemd 247.3, customized debian
- onboard UART used
Cases:
- #1 quiet: cmdline w/ quiet, no kernel or userspace output up to the
serial login console
- #2 normal: cmdline w/o quiet, serial console @115200 baud
- #3 normal_baud9600: cmdline w/o quiet, serial console @9600 baud
Main outcomes:
- kernel timestamps "Run /sbin/init as init process"
#1: "1.714458", #2: "3.011701", #3: "16.108101"
Interpretation:
* enabled serial console has significant impact in kernel boot time
* reducing baud to 9600 induced some side effect, not sure what it is ...
- systemd startup
* systemd drops 2 log lines per started unit to the console
* seems serial output is not implemented asynchronously (see steps of
units in sd plot, ~10ms per unit w/ baud 115200, ~80ms per unit w/ baud9600
Side notes:
* I remember similar behavior w/ imx.6 SoCs
* Maybe this issues is not seen on other SoCs (maybe w/ another hw
implementation of the UART)
* Maybe this issues is only seen in single core machines (I can double
check w/ a PI3 or orange pi zero once)
Hope this helps.
Regards,
Marko
[-- Attachment #2: fastboot_disable-console_rpi-zero-w.tgz --]
[-- Type: application/x-compressed-tar, Size: 25145 bytes --]
^ permalink raw reply
* RE: [boot-time]
From: Bird, Tim @ 2025-01-08 18:33 UTC (permalink / raw)
To: Shankari, linux-embedded@vger.kernel.org
In-Reply-To: <CAORPcfVRobA+u5q7aPboC=3iY8dibDUB0920Z=Z0VgpQEupKJw@mail.gmail.com>
> -----Original Message-----
> From: Shankari <beingcap11@gmail.com>
> Hi
>
> I wanted to provide an update on my recent contributions to the boot-time reduction project. I have recently started contributing and
> am working with the beagleplay. I have been analyzing the boot time of the init process. Below is the output from the system log:
>
> debian@BeaglePlay:~$ dmesg | grep "init process"
> [ 1.480490] Run /init as init process
>
> Moving forward, I plan to explore ways to modify the command line and further investigate the data used for SIG analysis. This will
> help me gain a deeper understanding of the boot process and its performance characteristics.
>
> Please let me know if you have any suggestions or areas where I could focus my efforts.
Hi Shankari,
It sounds like you are off to a good start. I have something that needs to be done, that I think
you can help with, and that matches where I believe you are in your status with being able
to evaluate the kernel.
In general, there's a lot of information on the elinux wiki which is stale, which needs to be
updated or archived, or maybe even just removed.
This section of the Boot Time page has a lot of material in this category:
https://elinux.org/Boot_Time#kernel_speedups
Can you validate the information on these 2 pages:
* https://elinux.org/Disable_Console
* https://elinux.org/Preset_LPJ
This would consist of reading through the material, and testing the
described techniques on your machine. This will involve booting the
machine 2 ways, with a particular kernel command line option and without
it, and then reporting back the final boot time for both. You can use
the timestamp for the "init process" string as your final boot time, for the
purposes of this exercise.
Helping me to update the elinux wiki material on boot time would be
an immense help, and is one of my main goals for the boot time SIG in 2025.
Don't hesitate to ask questions if you have any.
BTW - you can just report your findings to me and linux-embedded list, but
alternatively (and even better) would be if you could also update the wiki
pages themselves with your information based on recent kernels and hardware.
To do this, you will need an elinux wiki account, which you can make online on
elinux wiki.org by going to this page: https://elinux.org/Special:CreateAccount
Anyone else reading this who wants to also participate in this project to
update the elinux wiki boot time information, please contact me.
Thanks.
-- Tim
^ permalink raw reply
* Re: boot markers ideas (was RE: [boot-time] jent_mod_init on beagleplay)
From: Rob Landley @ 2025-01-07 23:40 UTC (permalink / raw)
To: Bird, Tim, Francesco Valla, Stephan Mueller, Shankari; +Cc: Linux Embedded
In-Reply-To: <MW5PR13MB5632780514DC36D18EE9AC33FD112@MW5PR13MB5632.namprd13.prod.outlook.com>
On 1/7/25 15:42, Bird, Tim wrote:
>
>
>> -----Original Message-----
>> From: Francesco Valla <francesco@valla.it>
> ...
>>> I wanted to provide an update on my recent contributions to the BeaglePlay
>>> project. I have recently started contributing and have been analyzing the
>>> boot time of the init process. Below is the output from the system log:
>>>
>>> debian@BeaglePlay:~$ dmesg | grep "init process"
>>> [ 1.480490] Run /init as init process
>>>
> ...
>
>>
>> I started the same journey around a week ago, and it's proving a valuable
>> experience.
>>
>> After some trimming, I reached a (reported) init startup time of around 0.6s:
>>
>> root@beagleplay-ti:~# dmesg|grep /sbin
>> [ 0.630046] Run /sbin/init as init process
>
>
> I'm starting to think that we should be using this particular kernel message
> as a consistent "boot marker" for kernel startup time. It's certainly not the
> end of when the kernel is doing stuff on startup, but I do believe it marks
> one of the important transition points (start of user space activity), and
> it seems to be ubiquitously reported.
>
> At Plumbers, I had proposed to introduce a set of specific instrumentation
> points in the kernel and use those as key checkpoints. But IMHO it's good
> to use existing printk messages where possible, since they're already present - no
> additional instrumentation needed.
Is there a place to get this measuring script other than fishing it out
of a webmail archive?
https://www.spinics.net/lists/linux-embedded/msg04363.html
The most recent wiki mention in the archive subject lines was
https://www.spinics.net/lists/linux-embedded/msg04336.html and trying to
find this effort in elinux.org's top right search bar found
https://elinux.org/Boot-up_Time_Reduction_Howto which "view history"
says was last touched in 2010...
Rob
P.S. Meanwhile, python 3 goes in quite the opposite direction:
https://landley.net/notes-2024.html#09-04-2024
^ permalink raw reply
* RE: [boot-time] jent_mod_init on beagleplay (was RE: [boot-time] [RFC] analyze-initcall-debug.py - a tool to analyze the initcall debug output
From: Bird, Tim @ 2025-01-07 22:40 UTC (permalink / raw)
To: Stephan Mueller, Shankari, Francesco Valla; +Cc: Linux Embedded
In-Reply-To: <2972828.e9J7NaK4W3@tauon>
> -----Original Message-----
> From: Stephan Mueller <smueller@chronox.de>
> Am Donnerstag, 2. Januar 2025, 11:33:08 CET schrieb Francesco Valla:
>
> Hi Francesco,
>
> > That would be wonderful! Whenever you have the time, please let me know what
> > analysis you need.
> >
>
> Ok, some background: the Jitter RNG technically has 2 noise sources which are
> sampled concurrently:
>
> 1. variations of the execution of CPU instructions
>
> 2. variations of memory access times
>
> For (1) the Jitter RNG has a fixed set of instructions it performs the
> execution time measurements for: the SHA-3 conditioning operation
> (specifically the Keccak sponge function). For that, it performs a given set
> of Keccak operations.
>
> For (2) the Jitter RNG allocates a fixed set of memory and simply reads /
> writes data there and measures this timing.
>
>
> For (1), the more instructions are samples, the higher the entropy gathered.
> This means more time is required to sample that entropy. I.e. when you
> increase the number of measured Keccak operations, you get more entropy.
>
> For (2), when the memory is large enough to "spill over" into the next type of
> memory (from L1 to L2 to L3 to RAM), the higher the entropy gathered.
>
>
> So, for (2), to get more entropy, is invariant from the execution time. But
> for (1), the entropy rate depends on the execution time.
>
>
> Thus, what you want is to try to reduce the time spend for (1).
>
>
> The key now is that the overall entropy rate the Jitter RNG requires for
> functioning must be such that when gathering 256 bits of data from it, it
> contains 256 bits of entropy.
>
>
> Now, there are 2 "knobs" to turn via Kconfig:
>
> - the oversampling rate (OSR): given that the individual number of rounds for
> (1) and the number of accesses for (2) are kept the same, the OSR causes the
> Jitter RNG to multiply the round counts. For example, the baseline with OSR ==
> 1 is that for gathering 256 bits of entropy, 256 times both noise sources are
> sampled. For an OSR of, say, 3, to get 256 bits of entropy, 3 * 256 = 768
> times both noise sources are sampled. This value was changed from 1 to 3 for
> 6.11 because there were reports on some CPUs that the Jitter RNG did not
> produce sufficient entropy - most CPUs, however, can perfectly live with OSR
> == 1.
>
> - the amount of memory for (2) can be increased. The default is 2kBytes which
> usually means that the L1 can fully handle the accesses.
Very interesting. Thanks.
>
>
> ======
>
>
> Now given the description, what can you do? I would apply the following steps:
>
> 1. measure whether the timer your system has is really a high-res timer (the
> higher the resolution, the higher the detected variations and thus the
> entropy)
>
> 2. assuming that test 1 shows a high res timer, reduce the OSR back to 1
> (CRYPTO_JITTERENTROPY_OSR) and measure the entropy rate -
>
> 3. if test 2 shows insufficient entropy, increase the amount of memory
> (CRYPTO_JITTERENTROPY_MEMSIZE_*) and measure the entropy rate
>
>
>
> The tool for measuring the entropy rate is given in [1] - check the README as
> you need to enable a kernel config option to get an interface into the Jitter
> RNG from user space. As you may not have the analysis tool, you may give the
> data to me and I can analyze it.
>
>
> More details on tuning the Jitter RNG is given in [2] - it discusses to the
> user space variant, but applies to kernel as well.
>
> [1] https://github.com/smuellerDD/jitterentropy-library/tree/master/tests/raw-entropy/recording_runtime_kernelspace
>
> [2] https://github.com/smuellerDD/jitterentropy-library/tree/master/tests/raw-entropy#approach-to-solve-insufficient-entropy
First - sorry for the slow response. I'm just getting back from the holidays.
And now... Wow! thanks for that detailed response. I will dive into your response a bit
deeper, soon, to try to analyze how it may affect our approach. But let me back up a bit
and describe our overall approach.
I think Francesco answered your query about what we are trying to do here, and what we
want from you very well. Let me add a bit of context, and some more thoughts.
In general, when trying to improve boot time, we look at 3 general techniques:
1. remove something
2. defer something
3. optimize something
Some of the work in optimizing Linux for embedded is finding kernel features that are not
needed at all for a particular use case, and just removing them (usually via the kernel config,
but sometimes via a command line option). Other times, a feature may be desirable, but
the critical capability that must come up quickly (like a camera) does not depend on that
item. In that case, we can defer the loading of a feature until after the critical capability is
operational and there is no harm done.
Finally, some times it may be possible to optimize the initialization of something, so it
takes less time (by using pre-calculated values, or optimizing the initialization process).
When deferring the initialization of a feature, sometimes it is possible to break it up
into parts, such that only the most critical part is initialized first, and the rest of it can
be initialized later.
In the case of jent_mod_init, I have a few ideas that I'd like to investigate, and your expertise
with the driver is very helpful to see which things make sense and which don't.
First, I've deferred loading the jitter entropy module, by telling the kernel to defer its initcall
(using an out-of-tree patch). That works to eliminate the cost of jent_mod_init during
the critical boot phase. Another method might be to just load the jitter entropy module
sometime later in the boot sequence. To be honest, this approach, while effective at booting
more quickly, makes me nervous because it means that the RNG is not properly seeded
(if that's the correct term), until possibly a few seconds after boot. If nothing uses the entropy
before then, I don't see a problem. But there may be flows of the entropy data or RNG values
data at kernel startup time that I'm not aware of. And I don't want to break security for
the kernel by being sloppy.
So my first question would be: how bad is it for the RNG if the entropy it not collected
until a few seconds after booting?
Related to this would be: Do you have an idea which things are using the RNG in early boot
that would be affected by a temporary lack of initialization of entropy? Or is it possible to monitor
the use of the RNG, to get a report of possibly affected RNG users?
Another idea I've had is this - could we save some entropy from a previous
boot, to pre-seed the RNG, to avoid having jent_mod_init do it's entropy gathering?
My idea would be something like this: check if there's existing entropy from a previous
boot, and if so seed the RNG from that (and wipe out the data from the previous boot).
If there isn't entropy from a previous boot, work as currently designed to gather the
entropy from the measurements for the current boot.
I'm working on general mechanisms to transfer data from a previous kernel runtime instantiation
to the next, but this particular transfer would involve saving some entropy to some kind
of persistent storage on machine shutdown. (Right now I'm looking at using the bootconfig
mechanism store the data and re-convey it to a newly booting kernel instance).
There are, of course, some security ramifications to this approach as well. Trusting the
saved data has its own issues.
I would be interested to hear your thoughts on this. How much data would need to be transferred?
Would loading the data, parsing it, and injecting it ultimately consume more time than
the generation of the data? We are, after all, only talking about ~50ms here.
Note that this ability to capture information from one boot, and convey it to a subsequent
boot is a feature that I plan to work on independent of this specific scenario.
Anyway, I hope that this answers some questions.
Thanks for any additional insights you can provide.
-- Tim
^ permalink raw reply
* boot markers ideas (was RE: [boot-time] jent_mod_init on beagleplay)
From: Bird, Tim @ 2025-01-07 21:42 UTC (permalink / raw)
To: Francesco Valla, Stephan Mueller, Shankari; +Cc: Linux Embedded
In-Reply-To: <22372644.4csPzL39Zc@fedora.fritz.box>
> -----Original Message-----
> From: Francesco Valla <francesco@valla.it>
...
> > I wanted to provide an update on my recent contributions to the BeaglePlay
> > project. I have recently started contributing and have been analyzing the
> > boot time of the init process. Below is the output from the system log:
> >
> > debian@BeaglePlay:~$ dmesg | grep "init process"
> > [ 1.480490] Run /init as init process
> >
...
>
> I started the same journey around a week ago, and it's proving a valuable
> experience.
>
> After some trimming, I reached a (reported) init startup time of around 0.6s:
>
> root@beagleplay-ti:~# dmesg|grep /sbin
> [ 0.630046] Run /sbin/init as init process
I'm starting to think that we should be using this particular kernel message
as a consistent "boot marker" for kernel startup time. It's certainly not the
end of when the kernel is doing stuff on startup, but I do believe it marks
one of the important transition points (start of user space activity), and
it seems to be ubiquitously reported.
At Plumbers, I had proposed to introduce a set of specific instrumentation
points in the kernel and use those as key checkpoints. But IMHO it's good
to use existing printk messages where possible, since they're already present - no
additional instrumentation needed.
I'm going to do some additional research on how consistently other printk strings
show up in dmesg boot reports, to see if there are some other strings
that can be used a boot markers, or if it would still make sense to introduce some.
I think a boot marker for the start of each initcall phase is probably useful.
Thanks for these reports of the timestamps for the "init process" strings.
-- Tim
^ permalink raw reply
* RE: [boot-time] jent_mod_init on beagleplay (was RE: [boot-time] [RFC] analyze-initcall-debug.py - a tool to analyze the initcall debug output
From: Bird, Tim @ 2025-01-07 21:27 UTC (permalink / raw)
To: Francesco Valla, Linux Embedded; +Cc: smueller@chronox.de
In-Reply-To: <27104545.1r3eYUQgxm@fedora.fritz.box>
> -----Original Message-----
> From: Francesco Valla <francesco@valla.it>
>
> On Tuesday, 3 December 2024 at 22: 24: 28 Bird, Tim <Tim. Bird@ sony. com> wrote: > > -----Original Message----- > > From: Francesco
> Valla <francesco@ valla. it>
> On Tuesday, 3 December 2024 at 22:24:28 Bird, Tim <Tim.Bird@sony.com> wrote:
> > > -----Original Message-----
> > > From: Francesco Valla <francesco@valla.it>
> > > Dear fellow boot time optimizers,
> > >
...
> > > Here is a python script which can analyze a dmesg output with initcall_debug
> > > option enabled and extract some useful information. It can for example be used
> > > to analyze the output of the grab-boot-data.sh tool that Tim presented on this
> > > list [1] just a few days ago.
> > >
> > > Usage is very simple, as the output of dmesg can be piped directly to it:
> > >
> > > dmesg | analyze-initcall-debug.py
> > >
> > > If no option is specified, it outputs a brief summary, like the following one
> > > (obtained on my Beagleplay):
> > >
> > > 1758 drivers has been initialized, of which 1758 before userspace
> > > 119 probes happened outside of the init of their driver
> > > 0 deferred probes pending
> > > ---
> > > Top 10 init/probes durations:
> > > * 30200000.dss -> 523002us
> > > * deferred_probe_initcall -> 487483us
> > > * fd00000.gpu -> 162859us
> > > * 8000f00.mdio -> 142521us
> > > * 44043000.system-controller -> 71390us
> > > * 2-004c -> 71178us
> > > * 40900000.crypto -> 59350us
> > > * 8000000.ethernet -> 58419us
> > > * 44043000.system-controller:clock-controller -> 56599us
> > > * jent_mod_init -> 52140us
> >
> > This is odd. On my beagleplay board, jent_mod_init only took 32326us.
> > (see https://birdcloud.org/boot-time-files/boot-data-timslab-bp1-241203-134136.txt)
> >
> > I'm assuming that we have similar hardware, but possibly different configs,
> > kernel cmdlines or kernel versions. I'll take a look at this and see if I can
> > figure out what the difference is between our systems, that causes the
> > difference in the duration for this function.
> >
> > This initcall (jent_mod_init) is on my list of initcalls to research to see if they
> > can be improved or deferred. I haven't seen any degradation in system
> > behavior by deferring this initcall on my system, but that could be from
> > lack of my system doing some related operation that needs the RNG data, or my
> > own ignorance of the effects of this call.
> >
> > Can someone comment on what items or operations might depend on jent_mod_init
> > in early boot? In particular, would we expect any cryptographic problems if
> > the initcall was deferred to a few seconds after booting?
> >
> > It can be configured as a module, which makes me think that maybe loading
> > it sometime in very late boot (or even later) is OK.
> >
> > jent_mod_init is defined in crypto/jitterentropy-kcapi.c, and controlled by
> > CONFIG_CRYPTO_JITTERENTROPY (See crypto/Kconfig)
> >
> > It appears to do some random number generator seeding by measuring
> > the jitter of a compressiong operation in the kernel. I assume some of the cryptography
> > configs affect the duration of the operations. The initcall duration results on
> > my beagleplay seem to be relatively consistent (in the 32ms range consistently),
> > and from bobs_lab, on machine sam1, the duration range is also consistent
> > (at between 4800 and 5200 usecs).
> >
> > I'd be interested to see if the range is consistent between runs on other machines
> >
> > Francesco - Can you submit another boot-data file (or 2 or 3) for francsecoslab-beagleplay, to
> > provide some data on this?
> >
>
> I actually lost this request and only found it this morning when reviewing past
> emails. My apologies.
>
> Anyway: I just uploaded another boot-data file (of a kernel with some
> optimizations), but can confirm that the time spent inside jent_mod_init across
> multiple runs appears consistent - or should I say fixed? - at 52ms.
Thanks. This is very useful. I appreciate you following up on this.
-- Tim
^ permalink raw reply
* Re: [boot-time] jent_mod_init on beagleplay (was RE: [boot-time] [RFC] analyze-initcall-debug.py - a tool to analyze the initcall debug output
From: Stephan Müller @ 2025-01-04 15:45 UTC (permalink / raw)
To: Shankari, Francesco Valla; +Cc: Linux Embedded, Bird, Tim
In-Reply-To: <3506889.LZWGnKmheA@fedora.fritz.box>
Am Freitag, 3. Januar 2025, 18:23:19 Mitteleuropäische Normalzeit schrieb
Francesco Valla:
Hi Francesco,
> > ======
> >
> >
> > Now given the description, what can you do? I would apply the following
> > steps:
> >
> > 1. measure whether the timer your system has is really a high-res timer
> > (the higher the resolution, the higher the detected variations and thus
> > the entropy)
>
> Resolution reported by clock_getres() is 1ns. Is this sufficient?
>
It should, but it is in relationship to the complexity of the CPU itself.
> > 2. assuming that test 1 shows a high res timer, reduce the OSR back to 1
> > (CRYPTO_JITTERENTROPY_OSR) and measure the entropy rate -
>
> Turned out my system already had the OSR set to 1, since CONFIG_CRYPTO_FIPS
> was set to N.
>
> > 3. if test 2 shows insufficient entropy, increase the amount of memory
> > (CRYPTO_JITTERENTROPY_MEMSIZE_*) and measure the entropy rate
> >
> >
> >
> > The tool for measuring the entropy rate is given in [1] - check the README
> > as you need to enable a kernel config option to get an interface into the
> > Jitter RNG from user space. As you may not have the analysis tool, you
> > may give the data to me and I can analyze it.
>
> Here are the results (with default parameters for processdata.sh:
>
...
> min(H_original, 4 X H_bitstring): 3.168741
That last value, the min-value, is the key: it must be larger than 1/OSR -
when it gets close to 1/OSR, the health tests start to flag errors once in a
while (with a probability rate of around 2**-30 time stamps). So, you have OSR
set to 1 which is already the lowest value supported by the Jitter RNG. Thus,
there is unfortunately not much more you can do to increase the performance
during boot time.
I thought OSR was set to 3 in your environment.
> min(H_original, 8 X H_bitstring): 4.473812
Ciao
Stephan
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox