* RE: [boot-time] [not found] <CAORPcfVRobA+u5q7aPboC=3iY8dibDUB0920Z=Z0VgpQEupKJw@mail.gmail.com> @ 2025-01-08 18:33 ` Bird, Tim 2025-01-08 20:39 ` [boot-time] Marko Hoyer ` (2 more replies) 0 siblings, 3 replies; 21+ messages in thread From: Bird, Tim @ 2025-01-08 18:33 UTC (permalink / raw) To: Shankari, linux-embedded@vger.kernel.org > -----Original Message----- > From: Shankari <beingcap11@gmail.com> > Hi > > I wanted to provide an update on my recent contributions to the boot-time reduction project. I have recently started contributing and > am working with the beagleplay. I have been analyzing the boot time of the init process. Below is the output from the system log: > > debian@BeaglePlay:~$ dmesg | grep "init process" > [ 1.480490] Run /init as init process > > Moving forward, I plan to explore ways to modify the command line and further investigate the data used for SIG analysis. This will > help me gain a deeper understanding of the boot process and its performance characteristics. > > Please let me know if you have any suggestions or areas where I could focus my efforts. Hi Shankari, It sounds like you are off to a good start. I have something that needs to be done, that I think you can help with, and that matches where I believe you are in your status with being able to evaluate the kernel. In general, there's a lot of information on the elinux wiki which is stale, which needs to be updated or archived, or maybe even just removed. This section of the Boot Time page has a lot of material in this category: https://elinux.org/Boot_Time#kernel_speedups Can you validate the information on these 2 pages: * https://elinux.org/Disable_Console * https://elinux.org/Preset_LPJ This would consist of reading through the material, and testing the described techniques on your machine. This will involve booting the machine 2 ways, with a particular kernel command line option and without it, and then reporting back the final boot time for both. You can use the timestamp for the "init process" string as your final boot time, for the purposes of this exercise. Helping me to update the elinux wiki material on boot time would be an immense help, and is one of my main goals for the boot time SIG in 2025. Don't hesitate to ask questions if you have any. BTW - you can just report your findings to me and linux-embedded list, but alternatively (and even better) would be if you could also update the wiki pages themselves with your information based on recent kernels and hardware. To do this, you will need an elinux wiki account, which you can make online on elinux wiki.org by going to this page: https://elinux.org/Special:CreateAccount Anyone else reading this who wants to also participate in this project to update the elinux wiki boot time information, please contact me. Thanks. -- Tim ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [boot-time] 2025-01-08 18:33 ` [boot-time] Bird, Tim @ 2025-01-08 20:39 ` Marko Hoyer 2025-01-08 21:19 ` [boot-time] Bird, Tim 2025-01-08 23:00 ` [boot-time] Rob Landley 2025-01-10 22:46 ` [boot-time] Marko Hoyer 2 siblings, 1 reply; 21+ messages in thread From: Marko Hoyer @ 2025-01-08 20:39 UTC (permalink / raw) To: Bird, Tim, Shankari, linux-embedded@vger.kernel.org [-- Attachment #1: Type: text/plain, Size: 4087 bytes --] Am 08.01.25 um 19:33 schrieb Bird, Tim: >> -----Original Message----- >> From: Shankari <beingcap11@gmail.com> >> Hi >> >> I wanted to provide an update on my recent contributions to the boot-time reduction project. I have recently started contributing and >> am working with the beagleplay. I have been analyzing the boot time of the init process. Below is the output from the system log: >> >> debian@BeaglePlay:~$ dmesg | grep "init process" >> [ 1.480490] Run /init as init process >> >> Moving forward, I plan to explore ways to modify the command line and further investigate the data used for SIG analysis. This will >> help me gain a deeper understanding of the boot process and its performance characteristics. >> >> Please let me know if you have any suggestions or areas where I could focus my efforts. > Hi Shankari, > > It sounds like you are off to a good start. I have something that needs to be done, that I think > you can help with, and that matches where I believe you are in your status with being able > to evaluate the kernel. > > In general, there's a lot of information on the elinux wiki which is stale, which needs to be > updated or archived, or maybe even just removed. > > This section of the Boot Time page has a lot of material in this category: > https://elinux.org/Boot_Time#kernel_speedups > > Can you validate the information on these 2 pages: > * https://elinux.org/Disable_Console > * https://elinux.org/Preset_LPJ > > This would consist of reading through the material, and testing the > described techniques on your machine. This will involve booting the > machine 2 ways, with a particular kernel command line option and without > it, and then reporting back the final boot time for both. You can use > the timestamp for the "init process" string as your final boot time, for the > purposes of this exercise. > > Helping me to update the elinux wiki material on boot time would be > an immense help, and is one of my main goals for the boot time SIG in 2025. > > Don't hesitate to ask questions if you have any. > > BTW - you can just report your findings to me and linux-embedded list, but > alternatively (and even better) would be if you could also update the wiki > pages themselves with your information based on recent kernels and hardware. > To do this, you will need an elinux wiki account, which you can make online on > elinux wiki.org by going to this page: https://elinux.org/Special:CreateAccount > > Anyone else reading this who wants to also participate in this project to > update the elinux wiki boot time information, please contact me. > Thanks. > -- Tim > Hi Tim, all, first time I'm posting here so hopefully everything is fine w/ my mail format / attachment and so on ... If not, please give me some feedback and guidance. To the "disable console" topic: I have some numbers in place for an RPI Zero W, find dmesg dumps and systemd-analyze plots attached. Environment: - RPi Zero W, kernel 5.15.24, systemd 247.3, customized debian - onboard UART used Cases: - #1 quiet: cmdline w/ quiet, no kernel or userspace output up to the serial login console - #2 normal: cmdline w/o quiet, serial console @115200 baud - #3 normal_baud9600: cmdline w/o quiet, serial console @9600 baud Main outcomes: - kernel timestamps "Run /sbin/init as init process" #1: "1.714458", #2: "3.011701", #3: "16.108101" Interpretation: * enabled serial console has significant impact in kernel boot time * reducing baud to 9600 induced some side effect, not sure what it is ... - systemd startup * systemd drops 2 log lines per started unit to the console * seems serial output is not implemented asynchronously (see steps of units in sd plot, ~10ms per unit w/ baud 115200, ~80ms per unit w/ baud9600 Side notes: * I remember similar behavior w/ imx.6 SoCs * Maybe this issues is not seen on other SoCs (maybe w/ another hw implementation of the UART) * Maybe this issues is only seen in single core machines (I can double check w/ a PI3 or orange pi zero once) Hope this helps. Regards, Marko [-- Attachment #2: fastboot_disable-console_rpi-zero-w.tgz --] [-- Type: application/x-compressed-tar, Size: 25145 bytes --] ^ permalink raw reply [flat|nested] 21+ messages in thread
* RE: [boot-time] 2025-01-08 20:39 ` [boot-time] Marko Hoyer @ 2025-01-08 21:19 ` Bird, Tim 2025-01-08 23:26 ` [boot-time] Rob Landley 2025-01-09 12:43 ` [boot-time] Marko Hoyer 0 siblings, 2 replies; 21+ messages in thread From: Bird, Tim @ 2025-01-08 21:19 UTC (permalink / raw) To: Marko Hoyer, Shankari, linux-embedded@vger.kernel.org > -----Original Message----- > From: Marko Hoyer <mhoyer.oss-devel@freenet.de> > Am 08.01.25 um 19:33 schrieb Bird, Tim: > >> -----Original Message----- > >> From: Shankari <beingcap11@gmail.com> > >> Hi > >> > >> I wanted to provide an update on my recent contributions to the boot-time reduction project. I have recently started contributing > and > >> am working with the beagleplay. I have been analyzing the boot time of the init process. Below is the output from the system log: > >> > >> debian@BeaglePlay:~$ dmesg | grep "init process" > >> [ 1.480490] Run /init as init process > >> > >> Moving forward, I plan to explore ways to modify the command line and further investigate the data used for SIG analysis. This will > >> help me gain a deeper understanding of the boot process and its performance characteristics. > >> > >> Please let me know if you have any suggestions or areas where I could focus my efforts. > > Hi Shankari, > > > > It sounds like you are off to a good start. I have something that needs to be done, that I think > > you can help with, and that matches where I believe you are in your status with being able > > to evaluate the kernel. > > > > In general, there's a lot of information on the elinux wiki which is stale, which needs to be > > updated or archived, or maybe even just removed. > > > > This section of the Boot Time page has a lot of material in this category: > > https://elinux.org/Boot_Time#kernel_speedups > > > > Can you validate the information on these 2 pages: > > * https://elinux.org/Disable_Console > > * https://elinux.org/Preset_LPJ > > > > This would consist of reading through the material, and testing the > > described techniques on your machine. This will involve booting the > > machine 2 ways, with a particular kernel command line option and without > > it, and then reporting back the final boot time for both. You can use > > the timestamp for the "init process" string as your final boot time, for the > > purposes of this exercise. > > > > Helping me to update the elinux wiki material on boot time would be > > an immense help, and is one of my main goals for the boot time SIG in 2025. > > > > Don't hesitate to ask questions if you have any. > > > > BTW - you can just report your findings to me and linux-embedded list, but > > alternatively (and even better) would be if you could also update the wiki > > pages themselves with your information based on recent kernels and hardware. > > To do this, you will need an elinux wiki account, which you can make online on > > elinux wiki.org by going to this page: https://elinux.org/Special:CreateAccount > > > > Anyone else reading this who wants to also participate in this project to > > update the elinux wiki boot time information, please contact me. > > Thanks. > > -- Tim > > > Hi Tim, all, > > first time I'm posting here so hopefully everything is fine w/ my mail > format / attachment and so on ... If not, please give me some feedback > and guidance. Marko, Thanks for this great data! In general, I don't see a lot of attachments on kernel mailing lists. They don't bother me, and we aren't CC'ing LKML (that's a separate issue we should discuss - developers outside of embedded might want to see this data). I'll check later and see what lore does with this, but if no one complains, I don't see a problem with it. If someone does complain, I can provide file hosting either on the elinux wiki or the boot-time wiki, and we can link attachments like you've provided on this message from one of those places (to avoid putting attachments on kernel mailing lists). > > To the "disable console" topic: I have some numbers in place for an RPI > Zero W, find dmesg dumps and systemd-analyze plots attached. > > > Environment: > > - RPi Zero W, kernel 5.15.24, systemd 247.3, customized debian > > - onboard UART used > > > Cases: > > - #1 quiet: cmdline w/ quiet, no kernel or userspace output up to the > serial login console > > - #2 normal: cmdline w/o quiet, serial console @115200 baud > > - #3 normal_baud9600: cmdline w/o quiet, serial console @9600 baud > > > Main outcomes: > > - kernel timestamps "Run /sbin/init as init process" > > #1: "1.714458", #2: "3.011701", #3: "16.108101" Wow from 1.7 seconds to 16.1 seconds. That's a pretty huge difference. I guess this particular technique is still very relevant! > > Interpretation: > > * enabled serial console has significant impact in kernel boot time > > * reducing baud to 9600 induced some side effect, not sure what it is ... Did you see any other weird behavior besides the huge slowdown? I'll take a look at the amount of characters in your dmesg output and see if it can be linearly correlated to the baud rate, or if it seems something else is going on. > - systemd startup > > * systemd drops 2 log lines per started unit to the console > > * seems serial output is not implemented asynchronously (see steps of > units in sd plot, ~10ms per unit w/ baud 115200, ~80ms per unit w/ baud9600 I'm not sure what you're referring to here. Is the 'unit' you are talking about the graphing grid size, or are you referring to systemd units? The grid size seems to be 10ms per minor grid line in each plot. > > Side notes: > > * I remember similar behavior w/ imx.6 SoCs > > * Maybe this issues is not seen on other SoCs (maybe w/ another hw > implementation of the UART) > > * Maybe this issues is only seen in single core machines (I can double > check w/ a PI3 or orange pi zero once) > > Hope this helps. It helps a lot. Thanks for this data! I think the 'Disable Console' technique will continue to stay as one of the first things we recommend for developers working on their board's boot time. To others developers - I'd like to see more data like this on other systems as well. So please keep submitting your data to this list. Thanks, -- Tim ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [boot-time] 2025-01-08 21:19 ` [boot-time] Bird, Tim @ 2025-01-08 23:26 ` Rob Landley 2025-01-09 13:02 ` [boot-time] Marko Hoyer 2025-01-09 12:43 ` [boot-time] Marko Hoyer 1 sibling, 1 reply; 21+ messages in thread From: Rob Landley @ 2025-01-08 23:26 UTC (permalink / raw) To: Bird, Tim, Marko Hoyer, Shankari, linux-embedded@vger.kernel.org On 1/8/25 15:19, Bird, Tim wrote: >> Cases: >> >> - #1 quiet: cmdline w/ quiet, no kernel or userspace output up to the >> serial login console >> >> - #2 normal: cmdline w/o quiet, serial console @115200 baud >> >> - #3 normal_baud9600: cmdline w/o quiet, serial console @9600 baud >> >> >> Main outcomes: >> >> - kernel timestamps "Run /sbin/init as init process" >> >> #1: "1.714458", #2: "3.011701", #3: "16.108101" > > Wow from 1.7 seconds to 16.1 seconds. That's a pretty huge > difference. I guess this particular technique is still > very relevant! CONFIG_EARLY_PRINTK output is emitted before interrupts are enabled (last I checked they didn't kick in until RIGHT before PID 1 gets forked off), so the early output drivers spin waiting for the next character to go into the buffer (the memory mapped register ones look something like "while (MASK&*status); *output = *data++;" in a for loop) and the printk() call can't return until all the data has been queued to the serial hardware, so you spend a lot of time blocked in printk(). With 9600 baud 8n1 output, 9600/9 = 1066 characters per second, or approximately a 1ms wait between each character, blocking in printk when the hardware FIFO buffer fills up, so 16k of output data takes 16 seconds to write if the rest of the boot is doing NOTHING. Even a 1k hardware FIFO is only 1 second of output, and that's assuming all 1k is outgoing rather than split between in/out. Your options are: 1) disable early printk so it all goes into a malloced buffer until interrupts are enabled and it can be asynchronously flushed (meaning if something DOES go wrong in early boot you can't see it) 2) set your FIFO speed as fast as possible 3) have your default boot use the "quiet" option (similar to disabling EARLY_PRINTK but at least you have the option to yank quiet from your bootloader args without rebuilding the kernel.) Faster UART speeds mean shorter serial cables (although there's also 3 volt vs 5 volt, wire thickness/capacitance, and some other stuff, Jeff Dionne walked me through the math last year but I don't have my notes in front of me). Modern hardware can do up to 4 megabits/second but outside "this serial chip immediately talks to a USB chip and then it's transported as USB with the funky noise-cancelling signaling over VERY twisted pair to actually leave the board"), I wouldn't trust that over any real length of cable. Alas https://tldp.org/HOWTO/Remote-Serial-Console-HOWTO/serial-distance.html is from the dawn of time and only goes up to 56k over wires made from recycled drainpipes. https://novatel.com/support/known-solutions/maximum-cable-length-vs-data-rate says 115200 is 2.5 meters. It LOOKS like it scales linearly with twice the speed being half the cable, so a megabit would be about 1 foot of serial cable before the bits get all mushy. Rob ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [boot-time] 2025-01-08 23:26 ` [boot-time] Rob Landley @ 2025-01-09 13:02 ` Marko Hoyer 2025-01-09 21:10 ` [boot-time] Rob Landley 0 siblings, 1 reply; 21+ messages in thread From: Marko Hoyer @ 2025-01-09 13:02 UTC (permalink / raw) To: Rob Landley, Bird, Tim, Marko Hoyer, Shankari, linux-embedded@vger.kernel.org Am 09.01.25 um 00:26 schrieb Rob Landley: > On 1/8/25 15:19, Bird, Tim wrote: >>> Cases: >>> >>> - #1 quiet: cmdline w/ quiet, no kernel or userspace output up to the >>> serial login console >>> >>> - #2 normal: cmdline w/o quiet, serial console @115200 baud >>> >>> - #3 normal_baud9600: cmdline w/o quiet, serial console @9600 baud >>> >>> >>> Main outcomes: >>> >>> - kernel timestamps "Run /sbin/init as init process" >>> >>> #1: "1.714458", #2: "3.011701", #3: "16.108101" >> >> Wow from 1.7 seconds to 16.1 seconds. That's a pretty huge >> difference. I guess this particular technique is still >> very relevant! > > CONFIG_EARLY_PRINTK output is emitted before interrupts are enabled > (last I checked they didn't kick in until RIGHT before PID 1 gets > forked off), so the early output drivers spin waiting for the next > character to go into the buffer (the memory mapped register ones look > something like "while (MASK&*status); *output = *data++;" in a for > loop) and the printk() call can't return until all the data has been > queued to the serial hardware, so you spend a lot of time blocked in > printk(). Hi Rob, thx for the explanation, helps further! * This implementation would explain the observed behavior. * What I'm not understanding yet: logs from systemd delay systemd the same way as seen in the kernel. Looks like the issue is not solved even when PID 1 is started. As said, It can be something specific to single core SoCs or even just to RPI Zero W. I'll check ... > > With 9600 baud 8n1 output, 9600/9 = 1066 characters per second, or > approximately a 1ms wait between each character, blocking in printk > when the hardware FIFO buffer fills up, so 16k of output data takes 16 > seconds to write if the rest of the boot is doing NOTHING. Even a 1k > hardware FIFO is only 1 second of output, and that's assuming all 1k > is outgoing rather than split between in/out. > > Your options are: > > 1) disable early printk so it all goes into a malloced buffer until > interrupts are enabled and it can be asynchronously flushed (meaning > if something DOES go wrong in early boot you can't see it) > 2) set your FIFO speed as fast as possible > 3) have your default boot use the "quiet" option (similar to disabling > EARLY_PRINTK but at least you have the option to yank quiet from your > bootloader args without rebuilding the kernel.) > > Faster UART speeds mean shorter serial cables (although there's also 3 > volt vs 5 volt, wire thickness/capacitance, and some other stuff, Jeff > Dionne walked me through the math last year but I don't have my notes > in front of me). Modern hardware can do up to 4 megabits/second but > outside "this serial chip immediately talks to a USB chip and then > it's transported as USB with the funky noise-cancelling signaling over > VERY twisted pair to actually leave the board"), I wouldn't trust that > over any real length of cable. > > Alas > https://tldp.org/HOWTO/Remote-Serial-Console-HOWTO/serial-distance.html > is from the dawn of time and only goes up to 56k over wires made from > recycled drainpipes. > https://novatel.com/support/known-solutions/maximum-cable-length-vs-data-rate > says 115200 is 2.5 meters. It LOOKS like it scales linearly with twice > the speed being half the cable, so a megabit would be about 1 foot of > serial cable before the bits get all mushy. > As said in another mail: I do not know a valid (production) use case in which kernel logs need to be dumped to a serial console. I regard this mechanism only as useful for development purposes (in which fast boot is probably not so relevant). Please correct me if I'm wrong, would be happy to learn about such use cases. Based on that I think option 3) is the best options for most cases. > Rob Marko ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [boot-time] 2025-01-09 13:02 ` [boot-time] Marko Hoyer @ 2025-01-09 21:10 ` Rob Landley 2025-01-09 21:35 ` [boot-time] Marko Hoyer 0 siblings, 1 reply; 21+ messages in thread From: Rob Landley @ 2025-01-09 21:10 UTC (permalink / raw) To: Marko Hoyer, Bird, Tim, Marko Hoyer, Shankari, linux-embedded@vger.kernel.org On 1/9/25 07:02, Marko Hoyer wrote: > Am 09.01.25 um 00:26 schrieb Rob Landley: >> CONFIG_EARLY_PRINTK output is emitted before interrupts are enabled >> (last I checked they didn't kick in until RIGHT before PID 1 gets >> forked off), so the early output drivers spin waiting for the next >> character to go into the buffer (the memory mapped register ones look >> something like "while (MASK&*status); *output = *data++;" in a for >> loop) and the printk() call can't return until all the data has been >> queued to the serial hardware, so you spend a lot of time blocked in >> printk(). > > Hi Rob, > > thx for the explanation, helps further! > > * This implementation would explain the observed behavior. > > * What I'm not understanding yet: logs from systemd delay systemd the > same way as seen in the kernel. Looks like the issue is not solved even > when PID 1 is started. As said, It can be something specific to single > core SoCs or even just to RPI Zero W. I'll check ... Buffering or not in the char device is a driver choice. If your serial hardware has a small FIFO buffer and the driver doesn't do its own layer of output buffering (with a tasklet or something to copy the data to the hardware), then the write() syscall will block waiting for the data to go out. (Writes to filesystems stopped doing this back around 2.0 or something, when they rewrote the vfs to be based on the page cache and deentry cache, meaning ALL filesystem writes go through that now unless you say O_DIRECT to _ask_ for it to block, which isn't even always honored. But for some reason the TTY layer drives people insane, and char devices have been given a wide berth...) There's a similar issue with some xterms where "make -j16 build" spamming lots of output to a display terminal can run significantly slower than "make -j16 build | cat" because the linux pipe infrastructure inserts a pipe buffer (ulimit -p says 8 but I think that's _pages_ so 32k? Except in 2.6.11 it was 64k? Eh, not looking it up...) so the writes from each cc instance go into the pipe buffer and return immediately when it's not full, whereas writes to a terminal device block until the terminal has finished updating (which includes scrolling the screen). If I recall (many years ago), the kde terminal implementation included a buffer of its own (immediately returned before updating), and the gnome one didn't (blocked until x11 display update completed), so foreground builds were faster under kde. And the gnome guys' answer was to spray everything down with 3D acceleration so the GPU was scrolling the screen for you, because of course it was. Anyway, serialized latency has _always_ killed throughput, because it's a cost you can't get BACK. The kernel guys used to know this: https://yarchive.net/comp/linux/raid0.html Hence the old high school math problem: if you have 2 hours to go 100 miles and you travel the first 40 miles at 20 miles per hour, how fast do you have to go the rest to make it on time? Answer: you'd have to instantaneously teleport because you spent 2 hours going 40 miles and your time is up with 60 miles left to go. Optimizing the wrong part DOESN'T HELP. > As said in another mail: I do not know a valid (production) use case in > which kernel logs need to be dumped to a serial console. I regard this > mechanism only as useful for development purposes (in which fast boot is > probably not so relevant). Please correct me if I'm wrong, would be > happy to learn about such use cases. > > Based on that I think option 3) is the best options for most cases. You can adjust the loglevel so they still go into dmesg but don't go out to the console, which theoretically shouldn't be THAT slow? (At least cpu limited rather than wait-for-hardware.) But keep in mind that a lot of kernel devs seem actively trying to sabotage embedded development, and the dmesg metadata is now multiple times the size of the actual payload. https://lkml.iu.edu/hypermail/linux/kernel/2412.0/00045.html So trimming the dmesg buffer size is probably ALSO a good idea on modern embedded systems, because WHAT THE FSCK? Also, I dunno how much cpu time all that metadata fapping takes. (Before systemd, this was just text going into a ring buffer. But systemd couldn't cope with that because https://www.theregister.com/2014/04/05/torvalds_sievers_dust_up/ and https://en.wikipedia.org/wiki/Systemd#History and of course https://www.linux.com/news/fun-photo-greg-kroah-hartman-crowned-systemd-hackfest/ thus the data structures in the kernel had to become far more complex after about 30 years of NOT being like that, because systemd.) Rob ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [boot-time] 2025-01-09 21:10 ` [boot-time] Rob Landley @ 2025-01-09 21:35 ` Marko Hoyer 2025-01-09 22:31 ` [boot-time] Rob Landley 0 siblings, 1 reply; 21+ messages in thread From: Marko Hoyer @ 2025-01-09 21:35 UTC (permalink / raw) To: Rob Landley, Bird, Tim, Shankari, linux-embedded@vger.kernel.org Am 09.01.25 um 22:10 schrieb Rob Landley: > On 1/9/25 07:02, Marko Hoyer wrote: >> Am 09.01.25 um 00:26 schrieb Rob Landley: >>> CONFIG_EARLY_PRINTK output is emitted before interrupts are enabled >>> (last I checked they didn't kick in until RIGHT before PID 1 gets >>> forked off), so the early output drivers spin waiting for the next >>> character to go into the buffer (the memory mapped register ones >>> look something like "while (MASK&*status); *output = *data++;" in a >>> for loop) and the printk() call can't return until all the data has >>> been queued to the serial hardware, so you spend a lot of time >>> blocked in printk(). >> >> Hi Rob, >> >> thx for the explanation, helps further! >> >> * This implementation would explain the observed behavior. >> >> * What I'm not understanding yet: logs from systemd delay systemd the >> same way as seen in the kernel. Looks like the issue is not solved >> even when PID 1 is started. As said, It can be something specific to >> single core SoCs or even just to RPI Zero W. I'll check ... > > Buffering or not in the char device is a driver choice. If your serial > hardware has a small FIFO buffer and the driver doesn't do its own > layer of output buffering (with a tasklet or something to copy the > data to the hardware), then the write() syscall will block waiting for > the data to go out. (Writes to filesystems stopped doing this back > around 2.0 or something, when they rewrote the vfs to be based on the > page cache and deentry cache, meaning ALL filesystem writes go through > that now unless you say O_DIRECT to _ask_ for it to block, which isn't > even always honored. But for some reason the TTY layer drives people > insane, and char devices have been given a wide berth...) Yeah looks like this is the case for RPi Zero W. I guess there is probably no buffer at all in the RPi serial driver / hw since every log line from systemd delays systemd for ~10ms (~80ms in baud9600 case). Btw: I can confirm the same for RPi3 w/ four cores. Difference is that something seems to go on in kernel in parallel to logs writing to serial but at a certain point the kernel is waiting again for lot of seconds probably for the serial device to finish transmission. Systemds delay is pretty much similar to the single core case. > > There's a similar issue with some xterms where "make -j16 build" > spamming lots of output to a display terminal can run significantly > slower than "make -j16 build | cat" because the linux pipe > infrastructure inserts a pipe buffer (ulimit -p says 8 but I think > that's _pages_ so 32k? Except in 2.6.11 it was 64k? Eh, not looking it > up...) so the writes from each cc instance go into the pipe buffer and > return immediately when it's not full, whereas writes to a terminal > device block until the terminal has finished updating (which includes > scrolling the screen). > > If I recall (many years ago), the kde terminal implementation included > a buffer of its own (immediately returned before updating), and the > gnome one didn't (blocked until x11 display update completed), so > foreground builds were faster under kde. > > And the gnome guys' answer was to spray everything down with 3D > acceleration so the GPU was scrolling the screen for you, because of > course it was. > > Anyway, serialized latency has _always_ killed throughput, because > it's a cost you can't get BACK. The kernel guys used to know this: > > https://yarchive.net/comp/linux/raid0.html > > Hence the old high school math problem: if you have 2 hours to go 100 > miles and you travel the first 40 miles at 20 miles per hour, how fast > do you have to go the rest to make it on time? Answer: you'd have to > instantaneously teleport because you spent 2 hours going 40 miles and > your time is up with 60 miles left to go. Optimizing the wrong part > DOESN'T HELP. > Absolutely correct. >> As said in another mail: I do not know a valid (production) use case >> in which kernel logs need to be dumped to a serial console. I regard >> this mechanism only as useful for development purposes (in which fast >> boot is probably not so relevant). Please correct me if I'm wrong, >> would be happy to learn about such use cases. >> >> Based on that I think option 3) is the best options for most cases. > > You can adjust the loglevel so they still go into dmesg but don't go > out to the console, which theoretically shouldn't be THAT slow? (At > least cpu limited rather than wait-for-hardware.) With quiet logs go into dmesg as well. But as said, i do not really see use cases to dump out these logs to a serial console in a boot time critical system on each production boot. Reading dmesg or systemd's journal after time critical things are done should be ok in most case. So my recommendation to people who seek for fastboot potential: * switch of kernel to console log during startup using quiet cmdline option if you don't need these logs over serial essentially * if you need them, your (Rob) options could be applied reducing the effect as best as possible > > > But keep in mind that a lot of kernel devs seem actively trying to > sabotage embedded development, and the dmesg metadata is now multiple > times the size of the actual payload. > > https://lkml.iu.edu/hypermail/linux/kernel/2412.0/00045.html > > So trimming the dmesg buffer size is probably ALSO a good idea on > modern embedded systems, because WHAT THE FSCK? Also, I dunno how much > cpu time all that metadata fapping takes. (Before systemd, this was > just text going into a ring buffer. But systemd couldn't cope with > that because > https://www.theregister.com/2014/04/05/torvalds_sievers_dust_up/ and > https://en.wikipedia.org/wiki/Systemd#History and of course > https://www.linux.com/news/fun-photo-greg-kroah-hartman-crowned-systemd-hackfest/ > thus the data structures in the kernel had to become far more complex > after about 30 years of NOT being like that, because systemd.) > > Rob Marko ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [boot-time] 2025-01-09 21:35 ` [boot-time] Marko Hoyer @ 2025-01-09 22:31 ` Rob Landley 0 siblings, 0 replies; 21+ messages in thread From: Rob Landley @ 2025-01-09 22:31 UTC (permalink / raw) To: Marko Hoyer, Bird, Tim, Shankari, linux-embedded@vger.kernel.org On 1/9/25 15:35, Marko Hoyer wrote: > Am 09.01.25 um 22:10 schrieb Rob Landley: >> Buffering or not in the char device is a driver choice. If your serial >> hardware has a small FIFO buffer and the driver doesn't do its own >> layer of output buffering (with a tasklet or something to copy the >> data to the hardware), then the write() syscall will block waiting for >> the data to go out. (Writes to filesystems stopped doing this back >> around 2.0 or something, when they rewrote the vfs to be based on the >> page cache and deentry cache, meaning ALL filesystem writes go through >> that now unless you say O_DIRECT to _ask_ for it to block, which isn't >> even always honored. But for some reason the TTY layer drives people >> insane, and char devices have been given a wide berth...) > > Yeah looks like this is the case for RPi Zero W. I guess there is > probably no buffer at all in the RPi serial driver / hw since every log > line from systemd delays systemd for ~10ms (~80ms in baud9600 case). Well there's gotta be a LITTLE fifo for input or you drop characters all over the place. (That's the reason Linus started writing Linux in the first place, because minix's microkernel design couldn't keep up with serial input, the overhead of the task switch to the userspace serial receive driver process took too long and characters got dropped. So he wrote a terminal program that booted from a floppy, and then taught it to read from and write to the minix filesystem on his hard drive so he could download stuff from usenet, then taught it to run "bash" so he didn't have to reboot to mkdir/mv/rom, and that turned out to be 90% of the way to getting it to run gcc...) And serial hardware tends to be symmetrical about that: if it's got 16 chars of input buffer, it'll usually have 16 chars of output buffer. But that's less than 1/50th of a second at 9600 baud... (Fun detail: the input fifo often has a programmable watermark so you can say "fill up this much before generating an interrupt, or if X timer ticks pass by with no more input" so you don't get an interrupt every character (and spend all your time entering and exiting the interrupt code) BUT still have some leeway between the interrupt being generated and the buffer filling up until it drops characters. The OUTPUT fifo can do something similar, only from the other end (fill it all the way up, then generate an interrupt when it drains to the watermark so you can refill it before it empties and produces a gap in the output). Programming serial devices can get slightly complicated... > Btw: I can confirm the same for RPi3 w/ four cores. Difference is that > something seems to go on in kernel in parallel to logs writing to serial > but at a certain point the kernel is waiting again for lot of seconds > probably for the serial device to finish transmission. Systemds delay is > pretty much similar to the single core case. Yeah, the point of a bottleneck is that's the part you're waiting for, so speeding up the rest of it doesn't help so much. Optimization is a whole thing. Spinlocks vs semaphores infuriate some people (you're intentionally spinning wasting time?) so sometimes you need to explain with analogies to get them to stop "helping". You're standing at a train crossing, and a train is going past, it'll be through in 10 minutes. If you walk towards the end of the train you'll reach the end faster and can cross in only 7 minutes, but if you need to come BACK HERE to where your road is you'll wind up walking 7 minutes, crossing, walking 7 minutes back, and resume from here 14 minutes from now instead of only 10, so being busy doing the wrong thing and then just _undoing_ it again instead of waiting here ready to go is actually _slower_ than the waiting. >>> As said in another mail: I do not know a valid (production) use case >>> in which kernel logs need to be dumped to a serial console. I regard >>> this mechanism only as useful for development purposes (in which fast >>> boot is probably not so relevant). Please correct me if I'm wrong, >>> would be happy to learn about such use cases. >>> >>> Based on that I think option 3) is the best options for most cases. >> >> You can adjust the loglevel so they still go into dmesg but don't go >> out to the console, which theoretically shouldn't be THAT slow? (At >> least cpu limited rather than wait-for-hardware.) > > With quiet logs go into dmesg as well. Which _used_ to be almost free back when it was just a ring buffer doing a strlen() and two memcpy() at the wrap. But these days: dunno, haven't benched it. > But as said, i do not really see use cases to dump out these logs to a > serial console in a boot time critical system on each production boot. > Reading dmesg or systemd's journal after time critical things are done > should be ok in most case. The switch from printk(blah) to pr_loglevel(blah) was IN THEORY so you could kconfig a minimum loglevel to retain, and all the macros below that level would drop out of the kernel at compile time, reducing the kernel image size significantly AND doing nice things with cache locality and so on. (String processing is expensive, you traverse a lot of data that goes through the memory bus and evicts cache lines from L1 and L2.) Last I checked the kernel devs had broken it for some reason, but it might be working again? (Or was a patch still out of tree...?) Anyway, if you run out of ideas that's a thing to look for. Data going across the memory bus is another one of those bottleneck things, where it doesn't matter how fast your processor is clocked if you're waiting for memory. An order of magnitude down from where we're currently looking, but still a thing that comes up a lot once the real low hanging fruit is dealt with... Of course there's all sorts of "Loop unrolling! No, smaller L1 cache footprint! Prefetch! No, spectre/meltdown!" pendulum nonsense I usually treat roughly the same way as the man trying to cross the street in The Pink Panther: https://www.youtube.com/watch?v=nistdsACs3E I once watched using a lookup table instead of calculating the value be an optimization, then a pessimization, then an optimization, then a pessimization, without even recompiling the binary (just upgrading the hardware). Doing the simple thing is always at least excusable. (And less to reverse engineer to understand WHY, and a good general argument against the endless "this is not helping". Basically Chesterton's fence in software: understanding why it's there lets you throw it out.) Rob ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [boot-time] 2025-01-08 21:19 ` [boot-time] Bird, Tim 2025-01-08 23:26 ` [boot-time] Rob Landley @ 2025-01-09 12:43 ` Marko Hoyer 2025-01-09 13:27 ` [boot-time] Geert Uytterhoeven 1 sibling, 1 reply; 21+ messages in thread From: Marko Hoyer @ 2025-01-09 12:43 UTC (permalink / raw) To: Bird, Tim, Marko Hoyer, Shankari, linux-embedded@vger.kernel.org Am 08.01.25 um 22:19 schrieb Bird, Tim: > >> -----Original Message----- >> From: Marko Hoyer <mhoyer.oss-devel@freenet.de> >> Am 08.01.25 um 19:33 schrieb Bird, Tim: >>>> -----Original Message----- >>>> From: Shankari <beingcap11@gmail.com> >>>> Hi >>>> >>>> I wanted to provide an update on my recent contributions to the boot-time reduction project. I have recently started contributing >> and >>>> am working with the beagleplay. I have been analyzing the boot time of the init process. Below is the output from the system log: >>>> >>>> debian@BeaglePlay:~$ dmesg | grep "init process" >>>> [ 1.480490] Run /init as init process >>>> >>>> Moving forward, I plan to explore ways to modify the command line and further investigate the data used for SIG analysis. This will >>>> help me gain a deeper understanding of the boot process and its performance characteristics. >>>> >>>> Please let me know if you have any suggestions or areas where I could focus my efforts. >>> Hi Shankari, >>> >>> It sounds like you are off to a good start. I have something that needs to be done, that I think >>> you can help with, and that matches where I believe you are in your status with being able >>> to evaluate the kernel. >>> >>> In general, there's a lot of information on the elinux wiki which is stale, which needs to be >>> updated or archived, or maybe even just removed. >>> >>> This section of the Boot Time page has a lot of material in this category: >>> https://elinux.org/Boot_Time#kernel_speedups >>> >>> Can you validate the information on these 2 pages: >>> * https://elinux.org/Disable_Console >>> * https://elinux.org/Preset_LPJ >>> >>> This would consist of reading through the material, and testing the >>> described techniques on your machine. This will involve booting the >>> machine 2 ways, with a particular kernel command line option and without >>> it, and then reporting back the final boot time for both. You can use >>> the timestamp for the "init process" string as your final boot time, for the >>> purposes of this exercise. >>> >>> Helping me to update the elinux wiki material on boot time would be >>> an immense help, and is one of my main goals for the boot time SIG in 2025. >>> >>> Don't hesitate to ask questions if you have any. >>> >>> BTW - you can just report your findings to me and linux-embedded list, but >>> alternatively (and even better) would be if you could also update the wiki >>> pages themselves with your information based on recent kernels and hardware. >>> To do this, you will need an elinux wiki account, which you can make online on >>> elinux wiki.org by going to this page: https://elinux.org/Special:CreateAccount >>> >>> Anyone else reading this who wants to also participate in this project to >>> update the elinux wiki boot time information, please contact me. >>> Thanks. >>> -- Tim >>> >> Hi Tim, all, >> >> first time I'm posting here so hopefully everything is fine w/ my mail >> format / attachment and so on ... If not, please give me some feedback >> and guidance. > Marko, > > Thanks for this great data! > > In general, I don't see a lot of attachments on kernel mailing lists. > They don't bother me, and we aren't CC'ing LKML (that's a separate > issue we should discuss - developers outside of embedded might > want to see this data). I'll check later and see what lore does with this, > but if no one complains, I don't see a problem with it. If someone > does complain, I can provide file hosting either on the elinux wiki > or the boot-time wiki, and we can link attachments like you've > provided on this message from one of those places (to avoid > putting attachments on kernel mailing lists). Ok sounds good. I don't have really a place to publish data, so would be good if you can find a way ... > >> To the "disable console" topic: I have some numbers in place for an RPI >> Zero W, find dmesg dumps and systemd-analyze plots attached. >> >> >> Environment: >> >> - RPi Zero W, kernel 5.15.24, systemd 247.3, customized debian >> >> - onboard UART used >> >> >> Cases: >> >> - #1 quiet: cmdline w/ quiet, no kernel or userspace output up to the >> serial login console >> >> - #2 normal: cmdline w/o quiet, serial console @115200 baud >> >> - #3 normal_baud9600: cmdline w/o quiet, serial console @9600 baud >> >> >> Main outcomes: >> >> - kernel timestamps "Run /sbin/init as init process" >> >> #1: "1.714458", #2: "3.011701", #3: "16.108101" > Wow from 1.7 seconds to 16.1 seconds. That's a pretty huge > difference. I guess this particular technique is still > very relevant! > >> Interpretation: >> >> * enabled serial console has significant impact in kernel boot time >> >> * reducing baud to 9600 induced some side effect, not sure what it is ... > Did you see any other weird behavior besides the huge slowdown? > I'll take a look at the amount of characters in your dmesg output and > see if it can be linearly correlated to the baud rate, or if it seems something > else is going on. Take a look into the kmesg logs. Looks like there is a 8s delay at a certain point: [ 5.897018] input: C-Media Electronics Inc. USB Audio Device as /devices/platform/soc/20980000.usb/usb1/1-1/1-1:1.3/0003:0D8C:0014.0001/input/input1 [ 6.016086] hid-generic 0003:0D8C:0014.0001: input,hidraw0: USB HID v1.00 Device [C-Media Electronics Inc. USB Audio Device] on usb-20980000.usb-1/input3[ 14.012174] printk: console [ttyS0] enabled [ 14.064965] bcm2835-wdt bcm2835-wdt: Broadcom BCM2835 watchdog timer [ 14.142795] bcm2835-power bcm2835-power: Broadcom BCM2835 power domains driver [ 14.232013] mmc-bcm2835 20300000.mmcnr: mmc_debug:0 mmc_debug2:0 Not sure if it really makes sense to dig further into this issue. Might be something in the serial driver of the RPI. I don't see really a valid use case, 9600 baud is nothing you really need and in anyway pushing serial logs out via serial is nothing really needed in productive use cases, just for development (correct me if I'm wrong). > >> - systemd startup >> >> * systemd drops 2 log lines per started unit to the console >> >> * seems serial output is not implemented asynchronously (see steps of >> units in sd plot, ~10ms per unit w/ baud 115200, ~80ms per unit w/ baud9600 > I'm not sure what you're referring to here. Is the 'unit' you are talking about > the graphing grid size, or are you referring to systemd units? The grid size > seems to be 10ms per minor grid line in each plot. unit -> systemd unit * grid size per minor grid line is 100ms to me (1s per main grid line -> see top of plot) * take two units "user.slice" and "remote-fs.target" * they should have basically no delay between each other -> you can see it in quiet case * w/ serial on, you see delays (10ms, 80ms) which in fact are caused just by dropping the log lines to serial kernel log * overall delay in user space startup sums up significantly > >> Side notes: >> >> * I remember similar behavior w/ imx.6 SoCs >> >> * Maybe this issues is not seen on other SoCs (maybe w/ another hw >> implementation of the UART) >> >> * Maybe this issues is only seen in single core machines (I can double >> check w/ a PI3 or orange pi zero once) >> >> Hope this helps. > It helps a lot. Thanks for this data! > > I think the 'Disable Console' technique will continue to stay as one of the first > things we recommend for developers working on their board's boot time. Yes I agree. Let me check out other boards and especially multi-core systems if the issues is happening there as well ... > > To others developers - I'd like to see more data like this on other systems > as well. So please keep submitting your data to this list. > > Thanks, > -- Tim > Regards, Marko ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [boot-time] 2025-01-09 12:43 ` [boot-time] Marko Hoyer @ 2025-01-09 13:27 ` Geert Uytterhoeven 0 siblings, 0 replies; 21+ messages in thread From: Geert Uytterhoeven @ 2025-01-09 13:27 UTC (permalink / raw) To: Marko Hoyer Cc: Bird, Tim, Marko Hoyer, Shankari, linux-embedded@vger.kernel.org Hi Marko, On Thu, Jan 9, 2025 at 1:49 PM Marko Hoyer <Marko.Hoyer@freenet.de> wrote: > Take a look into the kmesg logs. Looks like there is a 8s delay at a > certain point: > > [ 5.897018] input: C-Media Electronics Inc. USB Audio Device as > /devices/platform/soc/20980000.usb/usb1/1-1/1-1:1.3/0003:0D8C:0014.0001/input/input1 > [ 6.016086] hid-generic 0003:0D8C:0014.0001: input,hidraw0: USB HID > v1.00 Device [C-Media Electronics Inc. USB Audio Device] on > usb-20980000.usb-1/input3[ 14.012174] printk: console [ttyS0] enabled > [ 14.064965] bcm2835-wdt bcm2835-wdt: Broadcom BCM2835 watchdog timer > [ 14.142795] bcm2835-power bcm2835-power: Broadcom BCM2835 power > domains driver > [ 14.232013] mmc-bcm2835 20300000.mmcnr: mmc_debug:0 mmc_debug2:0 Those eight seconds are the time needed for printing all previously-collected and time-stamped kernel log lines to the serial console. BTW, only slightly related, but I have no better place to vent ;-) On OrangeCrab running a 64 MHz VexRiscv softcore, I noticed another big delay. With initcall_debug: initcall pty_init+0x0/0x3c8 returned 0 after 185427581 usecs Apparently this is due to: CONFIG_LEGACY_PTYS=y CONFIG_LEGACY_PTY_COUNT=256 So yes, almost one one second to set up one legacy pty, ugh... Disabling CONFIG_LEGACY_PTYS fixed the issue. Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [boot-time] 2025-01-08 18:33 ` [boot-time] Bird, Tim 2025-01-08 20:39 ` [boot-time] Marko Hoyer @ 2025-01-08 23:00 ` Rob Landley 2025-01-09 2:23 ` [boot-time] Bird, Tim 2025-01-10 22:46 ` [boot-time] Marko Hoyer 2 siblings, 1 reply; 21+ messages in thread From: Rob Landley @ 2025-01-08 23:00 UTC (permalink / raw) To: Bird, Tim, Shankari, linux-embedded@vger.kernel.org On 1/8/25 12:33, Bird, Tim wrote: > In general, there's a lot of information on the elinux wiki which is stale, which needs to be > updated or archived, or maybe even just removed. > > This section of the Boot Time page has a lot of material in this category: > https://elinux.org/Boot_Time#kernel_speedups That page says "grab-boot-data.sh - see http://birdcloud.org/boot-time/Boot-time_Tools" but the link is 404. > Anyone else reading this who wants to also participate in this project to > update the elinux wiki boot time information, please contact me. Maybe? What do you need? Rob ^ permalink raw reply [flat|nested] 21+ messages in thread
* RE: [boot-time] 2025-01-08 23:00 ` [boot-time] Rob Landley @ 2025-01-09 2:23 ` Bird, Tim 0 siblings, 0 replies; 21+ messages in thread From: Bird, Tim @ 2025-01-09 2:23 UTC (permalink / raw) To: Rob Landley, Shankari, linux-embedded@vger.kernel.org > -----Original Message----- > From: Rob Landley <rob@landley.net> > On 1/8/25 12:33, Bird, Tim wrote: > > In general, there's a lot of information on the elinux wiki which is stale, which needs to be > > updated or archived, or maybe even just removed. > > > > This section of the Boot Time page has a lot of material in this category: > > https://elinux.org/Boot_Time#kernel_speedups > > That page says "grab-boot-data.sh - see > http://birdcloud.org/boot-time/Boot-time_Tools" but the link is 404. Hmmm. That should be 'https', not 'http'. When I click on it my Chrome browser takes me to the https site (silently autocorrecting it?) I've fixed the link on the elinux site. Sorry about that. Here is a direct link to the tool, but it's worth getting to the page to read what kernel command line options it needs. https://birdcloud.org/boot-time/Boot-time_Tools https://birdcloud.org/boot-time-files/grab-boot-data.sh > > > Anyone else reading this who wants to also participate in this project to > > update the elinux wiki boot time information, please contact me. > > Maybe? What do you need? I am always looking for more data. If you have a system you are running Linux on, and can edit the kernel command, I'd be very happy to have you run grab-boot-data.sh. It will collect a bunch of boot data and system data, and send it to the birdcloud.org wiki, where I'm doing data analysis. You can run the tool in a "don't send" mode and look at what it's about to send, if you are concerned about privacy. Thanks! -- Tim ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [boot-time] 2025-01-08 18:33 ` [boot-time] Bird, Tim 2025-01-08 20:39 ` [boot-time] Marko Hoyer 2025-01-08 23:00 ` [boot-time] Rob Landley @ 2025-01-10 22:46 ` Marko Hoyer 2025-01-10 23:15 ` [boot-time] Rob Landley 2 siblings, 1 reply; 21+ messages in thread From: Marko Hoyer @ 2025-01-10 22:46 UTC (permalink / raw) To: Bird, Tim, Shankari, linux-embedded@vger.kernel.org Hello Tim, all, Am 08.01.25 um 19:33 schrieb Bird, Tim: >> -----Original Message----- >> From: Shankari <beingcap11@gmail.com> >> Hi >> >> I wanted to provide an update on my recent contributions to the boot-time reduction project. I have recently started contributing and >> am working with the beagleplay. I have been analyzing the boot time of the init process. Below is the output from the system log: >> >> debian@BeaglePlay:~$ dmesg | grep "init process" >> [ 1.480490] Run /init as init process >> >> Moving forward, I plan to explore ways to modify the command line and further investigate the data used for SIG analysis. This will >> help me gain a deeper understanding of the boot process and its performance characteristics. >> >> Please let me know if you have any suggestions or areas where I could focus my efforts. > Hi Shankari, > > It sounds like you are off to a good start. I have something that needs to be done, that I think > you can help with, and that matches where I believe you are in your status with being able > to evaluate the kernel. > > In general, there's a lot of information on the elinux wiki which is stale, which needs to be > updated or archived, or maybe even just removed. > > This section of the Boot Time page has a lot of material in this category: > https://elinux.org/Boot_Time#kernel_speedups > > Can you validate the information on these 2 pages: > * https://elinux.org/Disable_Console > * https://elinux.org/Preset_LPJ > > This would consist of reading through the material, and testing the > described techniques on your machine. This will involve booting the > machine 2 ways, with a particular kernel command line option and without > it, and then reporting back the final boot time for both. You can use > the timestamp for the "init process" string as your final boot time, for the > purposes of this exercise. > > Helping me to update the elinux wiki material on boot time would be > an immense help, and is one of my main goals for the boot time SIG in 2025. > > Don't hesitate to ask questions if you have any. > > BTW - you can just report your findings to me and linux-embedded list, but > alternatively (and even better) would be if you could also update the wiki > pages themselves with your information based on recent kernels and hardware. > To do this, you will need an elinux wiki account, which you can make online on > elinux wiki.org by going to this page: https://elinux.org/Special:CreateAccount > > Anyone else reading this who wants to also participate in this project to > update the elinux wiki boot time information, please contact me. I'm just going through the "userspace application speed-up" parts in the wiki. To the udev stuff ("Avoid udev, it takes ..." and "If you still like udev"): * both hints seem to be a bit old fashioned to me taking the kernels devtmpfs into account * In my experience, device nodes are created by the kernel today, I never needed to use any kind of mknod call anymore * udevs job today is mostly about - setup of access rights and user and group ids of device nodes - creating symlinks (e.g. for partitions named w/ their UUIDs, ...) - loading kernel modules - annotating metadata to uevents for interested userspace applications * In any way, udev is still very expensive when used the conventional way - Once udev is started, a trigger is send (by systemd-udev-trigger.service or udevadm trigger, for those who don't like systemd ;)) through the whole device tree to let all devices sent uevents again so that udev can work on each device. This causes massive CPU load for e few seconds you don't want to spend so early in the morning. * There are options like selective triggering or moving the trigger back in time and do the setup manuelly ... So I think it is worth talking a bit about udev and options to deal with it but adapting thinks a bit to todays world. I'm currently registering for the wiki, maybe I can setup an initial page at some time ... Btw: If I'm wrong with my experiene about devtmpf and mknode, please correct me. And feel free to add other ideas to cope w/ udev in context of startup time critical systems ... Regards, Marko ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [boot-time] 2025-01-10 22:46 ` [boot-time] Marko Hoyer @ 2025-01-10 23:15 ` Rob Landley 2025-01-11 8:40 ` [boot-time] Marko Hoyer 0 siblings, 1 reply; 21+ messages in thread From: Rob Landley @ 2025-01-10 23:15 UTC (permalink / raw) To: Marko Hoyer, Bird, Tim, Shankari, linux-embedded@vger.kernel.org On 1/10/25 16:46, Marko Hoyer wrote: > So I think it is worth talking a bit about udev and options to deal with > it but adapting thinks a bit to todays world. I'm currently registering > for the wiki, maybe I can setup an initial page at some time ... busybox mdev is a lot lighter weight, and can do pretty elaborate things. https://wiki.alpinelinux.org/wiki/Mdev https://github.com/fff7d1bc/mdev-like-a-boss The theory these days is you mount devtmpfs and then use mdev to add scriptable behavior to device insertion/removal events via netlink notifications. Rob ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [boot-time] 2025-01-10 23:15 ` [boot-time] Rob Landley @ 2025-01-11 8:40 ` Marko Hoyer 2025-01-11 17:56 ` [boot-time] Rob Landley 0 siblings, 1 reply; 21+ messages in thread From: Marko Hoyer @ 2025-01-11 8:40 UTC (permalink / raw) To: Rob Landley, Marko Hoyer, Bird, Tim, Shankari, linux-embedded@vger.kernel.org Am 11.01.25 um 00:15 schrieb Rob Landley: > On 1/10/25 16:46, Marko Hoyer wrote: >> So I think it is worth talking a bit about udev and options to deal >> with it but adapting thinks a bit to todays world. I'm currently >> registering for the wiki, maybe I can setup an initial page at some >> time ... > > busybox mdev is a lot lighter weight, and can do pretty elaborate things. > > https://wiki.alpinelinux.org/wiki/Mdev > > https://github.com/fff7d1bc/mdev-like-a-boss > > The theory these days is you mount devtmpfs and then use mdev to add > scriptable behavior to device insertion/removal events via netlink > notifications. > > Rob Hey Rob, thx for the hint. Sounds good! How is the enumeration of cold plugged devices realized in mdev? Is it similar to udev triggering all devices in the complete device tree? Marko ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [boot-time] 2025-01-11 8:40 ` [boot-time] Marko Hoyer @ 2025-01-11 17:56 ` Rob Landley 2025-01-11 18:57 ` [boot-time] Bird, Tim 0 siblings, 1 reply; 21+ messages in thread From: Rob Landley @ 2025-01-11 17:56 UTC (permalink / raw) To: Marko Hoyer, Bird, Tim, Shankari, linux-embedded@vger.kernel.org On 1/11/25 02:40, Marko Hoyer wrote: > > Am 11.01.25 um 00:15 schrieb Rob Landley: >> On 1/10/25 16:46, Marko Hoyer wrote: >>> So I think it is worth talking a bit about udev and options to deal >>> with it but adapting thinks a bit to todays world. I'm currently >>> registering for the wiki, maybe I can setup an initial page at some >>> time ... >> >> busybox mdev is a lot lighter weight, and can do pretty elaborate things. >> >> https://wiki.alpinelinux.org/wiki/Mdev >> >> https://github.com/fff7d1bc/mdev-like-a-boss >> >> The theory these days is you mount devtmpfs and then use mdev to add >> scriptable behavior to device insertion/removal events via netlink >> notifications. >> >> Rob > > Hey Rob, > > thx for the hint. Sounds good! > > How is the enumeration of cold plugged devices realized in mdev? Is it > similar to udev triggering all devices in the complete device tree? mdev -s will scan /sys for current devices, it can be run as a hotplug helper, and there's netlink support somewhere but I've never used it. https://busybox.net/downloads/BusyBox.html#mdev The hotplug helper doesn't require a persistent demon like netlink does (or udev), at boot time you "mdev -s" to scan, then when you register a hotplug helper the kernel spawns a new process with environment variables whenever there's something new to do. The downside is if a lot of events come in rapidly it can spawn a lot of processes in parallel which makes sequencing difficult, which is why the netlink API exists as an alternative, but that doesn't really happen in systems I've put together, so... I wrote some introductory documentation about this back in 2007. It's a bit stale, and never REALLY got finished, but... https://landley.net/kdocs/local/hotplug2.html That's the context within which sysfs happened. /dev and /sys serve different purposes: /dev shows the device drivers' view of the system, full of devices that don't actually exist like /dev/null, or five devices for one piece of hardware (partitions), meanwhile a device that shows up but doesn't have a driver bound to it yet won't be in /dev at all. This is half the reason the old demon-managed "devfs" failed, it was CONCEPTUALLY wrong. (The other half was it used crazy solaris names for everything so people looking for /dev/hda1 couldn't find it and had to deal with some 9 character long monstrosity instead. Plus Linux isn't a microkernel so expecting a userspace demon to be necessary for /dev to _exist_ was just silly, and also led to some problems booting the system because were does that demon get its information from, eh?) sysfs is a hardware view of the system, where /sys/devices/pci0000:00 is full of what bus probing found, and /sys/block/sda/dev contains "8:0" (major and minor number) when a driver binds to something and goes "mine", but something still had to mknod that. devtmpfs is a synthetic filesystem that just DOES that, when a new "dev" node shows up under /sys/class or /sys/block it creates the apropriate char or block device under dev with that major/minor and the same name the directory uses (which is provided by the driver). Oh, "synthetic" filesystem is one of the four times of filesystem: block backed, char/pipe backed, ram backed, and synthetic. I wrote documentation about that a very long time ago... https://landley.net/toybox/doc/mount.html Rob ^ permalink raw reply [flat|nested] 21+ messages in thread
* RE: [boot-time] 2025-01-11 17:56 ` [boot-time] Rob Landley @ 2025-01-11 18:57 ` Bird, Tim 2025-01-12 1:03 ` [boot-time] Rob Landley 0 siblings, 1 reply; 21+ messages in thread From: Bird, Tim @ 2025-01-11 18:57 UTC (permalink / raw) To: Rob Landley, Marko Hoyer, Shankari, linux-embedded@vger.kernel.org > -----Original Message----- > From: Rob Landley <rob@landley.net> > > On 1/11/25 02: 40, Marko Hoyer wrote: > > Am 11. 01. 25 um 00: 15 schrieb Rob Landley: >> On 1/10/25 16: 46, Marko Hoyer wrote: >>> > So I think it is worth talking a bit about udev and options to deal >>> with it but > On 1/11/25 02:40, Marko Hoyer wrote: > > > > Am 11.01.25 um 00:15 schrieb Rob Landley: > >> On 1/10/25 16:46, Marko Hoyer wrote: > >>> So I think it is worth talking a bit about udev and options to deal > >>> with it but adapting thinks a bit to todays world. I'm currently > >>> registering for the wiki, maybe I can setup an initial page at some > >>> time ... > >> > >> busybox mdev is a lot lighter weight, and can do pretty elaborate things. > >> > >> https://wiki.alpinelinux.org/wiki/Mdev > >> > >> https://github.com/fff7d1bc/mdev-like-a-boss > >> > >> The theory these days is you mount devtmpfs and then use mdev to add > >> scriptable behavior to device insertion/removal events via netlink > >> notifications. > >> > >> Rob > > > > Hey Rob, > > > > thx for the hint. Sounds good! > > > > How is the enumeration of cold plugged devices realized in mdev? Is it > > similar to udev triggering all devices in the complete device tree? > > mdev -s will scan /sys for current devices, it can be run as a hotplug > helper, and there's netlink support somewhere but I've never used it. > > https://busybox.net/downloads/BusyBox.html#mdev > > The hotplug helper doesn't require a persistent demon like netlink does > (or udev), at boot time you "mdev -s" to scan, then when you register a > hotplug helper the kernel spawns a new process with environment > variables whenever there's something new to do. The downside is if a lot > of events come in rapidly it can spawn a lot of processes in parallel > which makes sequencing difficult, which is why the netlink API exists as > an alternative, but that doesn't really happen in systems I've put > together, so... > > I wrote some introductory documentation about this back in 2007. It's a > bit stale, and never REALLY got finished, but... > > https://landley.net/kdocs/local/hotplug2.html > > That's the context within which sysfs happened. > > /dev and /sys serve different purposes: /dev shows the device drivers' > view of the system, full of devices that don't actually exist like > /dev/null, or five devices for one piece of hardware (partitions), > meanwhile a device that shows up but doesn't have a driver bound to it > yet won't be in /dev at all. This is half the reason the old > demon-managed "devfs" failed, it was CONCEPTUALLY wrong. (The other half > was it used crazy solaris names for everything so people looking for > /dev/hda1 couldn't find it and had to deal with some 9 character long > monstrosity instead. Plus Linux isn't a microkernel so expecting a > userspace demon to be necessary for /dev to _exist_ was just silly, and > also led to some problems booting the system because were does that > demon get its information from, eh?) > > sysfs is a hardware view of the system, where /sys/devices/pci0000:00 is > full of what bus probing found, and /sys/block/sda/dev contains "8:0" > (major and minor number) when a driver binds to something and goes > "mine", but something still had to mknod that. > > devtmpfs is a synthetic filesystem that just DOES that, when a new "dev" > node shows up under /sys/class or /sys/block it creates the apropriate > char or block device under dev with that major/minor and the same name > the directory uses (which is provided by the driver). > > Oh, "synthetic" filesystem is one of the four times of filesystem: block > backed, char/pipe backed, ram backed, and synthetic. I wrote > documentation about that a very long time ago... > > https://landley.net/toybox/doc/mount.html Hey Rob, This is a great review of /dev, /sys and the different ways that /dev gets populated. For a lot of embedded Linux devices, the only bus where new items can show up dynamically is USB. I've always thought the best solution (in terms of boot time) was to use static nodes in /dev during early boot (that is, just mknod the /dev nodes in the rootfs manually, and have them be present before the kernel even runs). No dynamic discovery or boot-time population of /dev needed. Then, sometime later, use either mdev or devtmpfs to accumulate (and remove?) other runtime-plugged devices. This can be done after the time-critical phase of booting. Does this overall approach work, or is there some in-kernel connections that may be missing if the dynamic tools are not used from startup? Thanks, -- Tim ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [boot-time] 2025-01-11 18:57 ` [boot-time] Bird, Tim @ 2025-01-12 1:03 ` Rob Landley 2025-01-12 10:11 ` [boot-time] Marko Hoyer 0 siblings, 1 reply; 21+ messages in thread From: Rob Landley @ 2025-01-12 1:03 UTC (permalink / raw) To: Bird, Tim, Marko Hoyer, Shankari, linux-embedded@vger.kernel.org On 1/11/25 12:57, Bird, Tim wrote: > Hey Rob, This is a great review of /dev, /sys and the different > ways that /dev gets populated. Feel free to link stuff from wikis or some such. The newest of those documents was written in 2007. > For a lot of embedded Linux devices, the only bus where > new items can show up dynamically is USB. Yup, /sys/bus/usb/devices is in there too and when a driver binds to them, they wind up in /sys/block and such as well. (you USED to have to seprately mount a usbfs under /sys but they finally acknowledged that was silly about 5 years ago, hence https://askubuntu.com/questions/1218321/if-usbfs-has-been-deprecated-then-why-is-sys-bus-usb-drivers-usbfs-directory-p) When a driver DOESN'T automatically bind to them it gets a bit complicated, and one of the things mdev can be configured to do is act as a firmware loader! Which is just... Ahem, there are YEARS of poor design decisions the kernel guys made, where they ignored a mechanism they already had an implemented something more complicated. The mechanism whereby the kernel opens a firmware file and read it directly out of the filesystem instead of calling a hotplug helper was... I'm just going to gloss over that. Anyway, I saw patches go by ala https://patchwork.ozlabs.org/project/buildroot/patch/1436188175-7912-1-git-send-email-luca@lucaceresoli.net/ which says it's from 2015. I haven't really tried to do it because I often know how the plumbing works and usually just implement a ten line hack rather than looking up how to configure the more generic tool. (Even the generic tool grew out of something I wrote 10 years earlier...) Anyway, the kernel's "request module" plumbing tends to do a "give me usb-vendorID-deviceID thing, and then there's alias plumbing lookup that figures out what module name to insmod for that, and at various points I've seen said alias lookup plumbing A) in the kernel, B) in module headers, C) in modprobe config files under /etc or /lib or something. *shrug* I build static kernels when given a choice, and don't source hardware that needs drivers WITHOUT built-in firmware, so I am WAY out of date on that stuff. I remember enough to look it up but not the details off the top of my head. And I say that as someone who really SHOULD care more: http://lists.landley.net/pipermail/toybox-landley.net/2024-October/030549.html It's on the todo list... > I've always thought the best solution (in terms of boot time) > was to use static nodes in /dev during early boot (that is, just > mknod the /dev nodes in the rootfs manually, and have them be present > before the kernel even runs). No dynamic discovery or boot-time > population of /dev needed. The system knows what devices are available. While you can mknod a major:minor node the kernel doesn't have a driver for, if you open it you get some sort of -EWTF where the kernel goes "nope". The kernel has a CONFIG_DEVTMPFS_MOUNT that automatically mounts devtmpfs on /dev, but for SOME reason it doesn't apply to initramfs. I've been irregularly posting a patch to MAKE it apply on and off for most of a decade now: https://lkml.iu.edu/hypermail/linux/kernel/2005.1/09399.html And it's part of my mkroot kernel patches: https://landley.net/bin/mkroot/latest/linux-patches/0003-Wire-up-CONFIG_DEVTMPFS_MOUNT-to-initramfs.patch But Greg KH. Oh well... > Then, sometime later, use either > mdev or devtmpfs to accumulate (and remove?) other > runtime-plugged devices. This can be done after the time-critical > phase of booting. > > Does this overall approach work, or is there some in-kernel > connections that may be missing if the dynamic tools are > not used from startup? I mean it more or less works, it's just... pointless manual maintenance of something the kernel does for you in a very small amount of code? (In devtmpfs, the /dev node being there means something. In a static /dev, it doesn't.) So: I blather on a lot about my mkroot project, which is a 400 line bash script that builds tiny linux system that boots to shell prompt (mostly under qemu) on a dozen different architectures. And the init script in that starts here: https://github.com/landley/toybox/blob/master/mkroot/mkroot.sh#L102 And you see how the script does this setup (which is only needed when you don't apply my patch to the kernel): if ! mountpoint -q dev; then mount -t devtmpfs dev dev [ $$ -eq 1 ] && ! 2>/dev/null <0 && exec 0<>/dev/console 1>&0 2>&1 for i in ,fd /0,stdin /1,stdout /2,stderr do ln -sf /proc/self/fd${i/,*/} dev/${i/*,/}; done mkdir -p dev/shm chmod +t /dev/shm fi (Don't ask me why devtmfs doesn't automatically have a "shm" directory with the sticky bit, it's a subclass of tmpfs so it does the right thing when it's there. And the middle two lines are just making /dev/{stdin,stdout,stderr} because toysh doesn't special case /dev/stdin like bash does so needs it in the filesystem if you're gonna use that.) But that exec redirect line hit a bug, because when you don't have devtmpfs automounting but ALSO don't have /dev/console in your statically linked initramfs image... oh here, I explained it when I fixed it: https://github.com/landley/toybox/commit/0b2d5c2bb3f1 https://landley.net/notes-2024.html#06-08-2024 tl;dr in usr/main.c the kernel tries to open /dev/console and fails if it's not already there, so PID 1 starts with stdin, stdout, and stderr closed, so if something goes wrong in early boot init can't tell you why it failed. This ONLY happens with static initramfs (created with cpio as a normal user and you can't mknod without root access), because of COURSE the kernel has two different codepaths for static vs dynamic, and the dynamic one does a manual fixup for this issue. No really! https://github.com/torvalds/linux/blob/master/init/noinitramfs.c#L18 And of course the code doing the fixup runs for ANY system that doesn't have a static linked initramfs, including one where the bootloader points it at an EXTERNAL cpio.gz image like qemu -initrd blah.cpio.gz, because it checks for the external one AFTER doing the fixup. (Pop quiz: if you have static _and_ external initrd, and they include the same file, does the old one prevent the new one from extracting, does the new one replace the old one, or does the new one APPEND to the old one? At various points in history, it's done ALL THREE! I forget which is current, replace I think?) So external initramfs and built-in initramfs behave DIFFERENTLY in a subtle sharp edge ath I've complained at them about for YEARS... *shrug* I make this stuff work in as simple a way as I know how. Been doing it for an embarrassingly long time now. But you _reach_ simple by process of elimination, and proving a negative is a lot of work. > Thanks, > -- Tim > Rob ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [boot-time] 2025-01-12 1:03 ` [boot-time] Rob Landley @ 2025-01-12 10:11 ` Marko Hoyer 2025-01-12 13:39 ` [boot-time] Francesco Valla 2025-01-12 18:35 ` [boot-time] Rob Landley 0 siblings, 2 replies; 21+ messages in thread From: Marko Hoyer @ 2025-01-12 10:11 UTC (permalink / raw) To: Rob Landley, Bird, Tim, Marko Hoyer, Shankari, linux-embedded@vger.kernel.org Am 12.01.25 um 02:03 schrieb Rob Landley: > On 1/11/25 12:57, Bird, Tim wrote: >> Hey Rob, This is a great review of /dev, /sys and the different >> ways that /dev gets populated. > > Feel free to link stuff from wikis or some such. The newest of those > documents was written in 2007. > >> For a lot of embedded Linux devices, the only bus where >> new items can show up dynamically is USB. SDCARD readers connected via MMC are common in automtove head units as well ... > > Yup, /sys/bus/usb/devices is in there too and when a driver binds to > them, they wind up in /sys/block and such as well. (you USED to have > to seprately mount a usbfs under /sys but they finally acknowledged > that was silly about 5 years ago, hence > https://askubuntu.com/questions/1218321/if-usbfs-has-been-deprecated-then-why-is-sys-bus-usb-drivers-usbfs-directory-p) > > When a driver DOESN'T automatically bind to them it gets a bit > complicated, and one of the things mdev can be configured to do is act > as a firmware loader! Which is just... Ahem, there are YEARS of poor > design decisions the kernel guys made, where they ignored a mechanism > they already had an implemented something more complicated. The > mechanism whereby the kernel opens a firmware file and read it > directly out of the filesystem instead of calling a hotplug helper > was... I'm just going to gloss over that. WIFI & Bluetooth devices often use this firmware mechanism. And yes I agree, it looks a bit ** ugly** seeing the kernel loading a firmware file from /lib/firmware searching it in the root file system w/o knowing the state of it during boot ... For WIFI and bluetooth I do not see a big issue here since I'd prevent putting such features on a critical chain by system design in any way since bringing them up and (re)connecting external devices is time consuming by nature. Nothing you shall need to wait for ... > Anyway, I saw patches go by ala > https://patchwork.ozlabs.org/project/buildroot/patch/1436188175-7912-1-git-send-email-luca@lucaceresoli.net/ > which says it's from 2015. I haven't really tried to do it because I > often know how the plumbing works and usually just implement a ten > line hack rather than looking up how to configure the more generic > tool. (Even the generic tool grew out of something I wrote 10 years > earlier...) > > Anyway, the kernel's "request module" plumbing tends to do a "give me > usb-vendorID-deviceID thing, and then there's alias plumbing lookup > that figures out what module name to insmod for that, and at various > points I've seen said alias lookup plumbing A) in the kernel, B) in > module headers, C) in modprobe config files under /etc or /lib or > something. > > *shrug* I build static kernels when given a choice, and don't source > hardware that needs drivers WITHOUT built-in firmware, so I am WAY out > of date on that stuff. Compiling in modules vs. loading them later from user space is a trade-off. The effect of putting stuff into modules is to keep the kernel small which helps you in the "unpacking & loading kernel" phase before the kernel is actually started. Having an 1MB unpacked kernel is significantly a difference to a 5MB one. On the other hand, my experience is that there is lot of overhead (CPU time and IO) loading modules from user space. So it really only makes sense, if you have drivers to load at a point in time during startup where you have enough time and resources left. > I remember enough to look it up but not the details off the top of my > head. And I say that as someone who really SHOULD care more: > > http://lists.landley.net/pipermail/toybox-landley.net/2024-October/030549.html > > > It's on the todo list... > >> I've always thought the best solution (in terms of boot time) >> was to use static nodes in /dev during early boot (that is, just >> mknod the /dev nodes in the rootfs manually, and have them be present >> before the kernel even runs). No dynamic discovery or boot-time >> population of /dev needed. > > The system knows what devices are available. While you can mknod a > major:minor node the kernel doesn't have a driver for, if you open it > you get some sort of -EWTF where the kernel goes "nope". > > The kernel has a CONFIG_DEVTMPFS_MOUNT that automatically mounts > devtmpfs on /dev, but for SOME reason it doesn't apply to initramfs. > I've been irregularly posting a patch to MAKE it apply on and off for > most of a decade now: > > https://lkml.iu.edu/hypermail/linux/kernel/2005.1/09399.html > > And it's part of my mkroot kernel patches: > > https://landley.net/bin/mkroot/latest/linux-patches/0003-Wire-up-CONFIG_DEVTMPFS_MOUNT-to-initramfs.patch > > > But Greg KH. Oh well... > >> Then, sometime later, use either >> mdev or devtmpfs to accumulate (and remove?) other >> runtime-plugged devices. This can be done after the time-critical >> phase of booting. >> >> Does this overall approach work, or is there some in-kernel >> connections that may be missing if the dynamic tools are >> not used from startup? > > I mean it more or less works, it's just... pointless manual > maintenance of something the kernel does for you in a very small > amount of code? (In devtmpfs, the /dev node being there means > something. In a static /dev, it doesn't.) I agree. There is kind of dynamic device enumeration done by the kernel drivers anyway once loaded. Any data structures to devices are build up internally. Nothing you can save ... I'm even not sure how devtmpfs can be combined w/ your static devnodes you created in any kind of persistent partition. And if you even can get the kernel accepting your partition to use as /dev, you need to have it writeable for the case of dynamics you might need (usb for instance) which does not really go well with a read only RFS ... You could ... overlay fs ... well no, I think this goes into a wrong direction -> too complicated ;) > > So: I blather on a lot about my mkroot project, which is a 400 line > bash script that builds tiny linux system that boots to shell prompt > (mostly under qemu) on a dozen different architectures. And the init > script in that starts here: > > https://github.com/landley/toybox/blob/master/mkroot/mkroot.sh#L102 > > And you see how the script does this setup (which is only needed when > you don't apply my patch to the kernel): > > if ! mountpoint -q dev; then > mount -t devtmpfs dev dev > [ $$ -eq 1 ] && ! 2>/dev/null <0 && exec 0<>/dev/console 1>&0 2>&1 > for i in ,fd /0,stdin /1,stdout /2,stderr > do ln -sf /proc/self/fd${i/,*/} dev/${i/*,/}; done > mkdir -p dev/shm > chmod +t /dev/shm > fi > > (Don't ask me why devtmfs doesn't automatically have a "shm" directory > with the sticky bit, it's a subclass of tmpfs so it does the right > thing when it's there. And the middle two lines are just making > /dev/{stdin,stdout,stderr} because toysh doesn't special case > /dev/stdin like bash does so needs it in the filesystem if you're > gonna use that.) > > But that exec redirect line hit a bug, because when you don't have > devtmpfs automounting but ALSO don't have /dev/console in your > statically linked initramfs image... oh here, I explained it when I > fixed it: > > https://github.com/landley/toybox/commit/0b2d5c2bb3f1 > > https://landley.net/notes-2024.html#06-08-2024 > > tl;dr in usr/main.c the kernel tries to open /dev/console and fails if > it's not already there, so PID 1 starts with stdin, stdout, and stderr > closed, so if something goes wrong in early boot init can't tell you > why it failed. This ONLY happens with static initramfs (created with > cpio as a normal user and you can't mknod without root access), > because of COURSE the kernel has two different codepaths for static vs > dynamic, and the dynamic one does a manual fixup for this issue. No > really! > > https://github.com/torvalds/linux/blob/master/init/noinitramfs.c#L18 > > And of course the code doing the fixup runs for ANY system that > doesn't have a static linked initramfs, including one where the > bootloader points it at an EXTERNAL cpio.gz image like qemu -initrd > blah.cpio.gz, because it checks for the external one AFTER doing the > fixup. (Pop quiz: if you have static _and_ external initrd, and they > include the same file, does the old one prevent the new one from > extracting, does the new one replace the old one, or does the new one > APPEND to the old one? At various points in history, it's done ALL > THREE! I forget which is current, replace I think?) So external > initramfs and built-in initramfs behave DIFFERENTLY in a subtle sharp > edge ath I've complained at them about for YEARS... > > *shrug* I make this stuff work in as simple a way as I know how. Been > doing it for an embarrassingly long time now. But you _reach_ simple > by process of elimination, and proving a negative is a lot of work. > >> Thanks, >> -- Tim >> > > Rob > To summarize from my point of view: * It's worth talking a bit about the effect of udev and about alternatives * "mdev" is surely worth being named as an potential option besides "selective triggering" and "static setup and moving triggers back in time" * I wouldn't regard mknode as an real alternative in todays system * In addition I can imagine is "modules loading" vs. "compiling in drivers" something which is worth mentioning * Once I've access to the wiki, I can try to put these ideas into an initial structure filled up w/ info we discussed in this thread Marko ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [boot-time] 2025-01-12 10:11 ` [boot-time] Marko Hoyer @ 2025-01-12 13:39 ` Francesco Valla 2025-01-12 18:35 ` [boot-time] Rob Landley 1 sibling, 0 replies; 21+ messages in thread From: Francesco Valla @ 2025-01-12 13:39 UTC (permalink / raw) To: Rob Landley, Bird, Tim, Marko Hoyer, Shankari, linux-embedded@vger.kernel.org On Sunday, 12 January 2025 at 11:11:44 Marko Hoyer <mhoyer.oss-devel@freenet.de> wrote: > > Am 12.01.25 um 02:03 schrieb Rob Landley: > > On 1/11/25 12:57, Bird, Tim wrote: > >> Hey Rob, This is a great review of /dev, /sys and the different > >> ways that /dev gets populated. > > > > Feel free to link stuff from wikis or some such. The newest of those > > documents was written in 2007. > > > >> For a lot of embedded Linux devices, the only bus where > >> new items can show up dynamically is USB. > > SDCARD readers connected via MMC are common in automtove head units as > well ... > > > > > > Yup, /sys/bus/usb/devices is in there too and when a driver binds to > > them, they wind up in /sys/block and such as well. (you USED to have > > to seprately mount a usbfs under /sys but they finally acknowledged > > that was silly about 5 years ago, hence > > https://askubuntu.com/questions/1218321/if-usbfs-has-been-deprecated-then-why-is-sys-bus-usb-drivers-usbfs-directory-p) > > > > When a driver DOESN'T automatically bind to them it gets a bit > > complicated, and one of the things mdev can be configured to do is act > > as a firmware loader! Which is just... Ahem, there are YEARS of poor > > design decisions the kernel guys made, where they ignored a mechanism > > they already had an implemented something more complicated. The > > mechanism whereby the kernel opens a firmware file and read it > > directly out of the filesystem instead of calling a hotplug helper > > was... I'm just going to gloss over that. > > WIFI & Bluetooth devices often use this firmware mechanism. And yes I > agree, it looks a bit ** ugly** seeing the kernel loading a firmware > file from /lib/firmware searching it in the root file system w/o > knowing the state of it during boot ... For WIFI and bluetooth I do not > see a big issue here since I'd prevent putting such features on a > critical chain by system design in any way since bringing them up and > (re)connecting external devices is time consuming by nature. Nothing you > shall need to wait for ... > The whole "try to access the rootfs during boot" domain is an area worth investigating, as it *should* be simple to track the actual init state and directly skip the accesses that aren't going to succeed. I recently stumbled for example on the Ethernet PHY core trying to load modules during init [1], but the firmware loading is another of such examples. > To summarize from my point of view: > > * It's worth talking a bit about the effect of udev and about alternatives > > * "mdev" is surely worth being named as an potential option besides > "selective triggering" and "static setup and moving triggers back in time" > > * I wouldn't regard mknode as an real alternative in todays system > Another approach that in my opinion is worth mentioning is: no udev/mdev at all. In a couple of embedded products with a very limited scope I simply decided to use devtmpfs + manual insmod + a simple bash script for USB automounting registered as hotplug handler. Very few dependencies, no boot time parsing of configuration files. It took a bit to configure the init sequence, but the result was/is very satisfying. > * In addition I can imagine is "modules loading" vs. "compiling in > drivers" something which is worth mentioning > > * Once I've access to the wiki, I can try to put these ideas into an > initial structure filled up w/ info we discussed in this thread > > Marko > > > [1] https://lore.kernel.org/netdev/SJ0PR18MB5216A8D227B2B3651DB9AC0DDB152@SJ0PR18MB5216.namprd18.prod.outlook.com/T/ --- Regards, Francesco ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [boot-time] 2025-01-12 10:11 ` [boot-time] Marko Hoyer 2025-01-12 13:39 ` [boot-time] Francesco Valla @ 2025-01-12 18:35 ` Rob Landley 1 sibling, 0 replies; 21+ messages in thread From: Rob Landley @ 2025-01-12 18:35 UTC (permalink / raw) To: Marko Hoyer, Bird, Tim, Shankari, linux-embedded@vger.kernel.org [-- Attachment #1: Type: text/plain, Size: 13643 bytes --] On 1/12/25 04:11, Marko Hoyer wrote: > Am 12.01.25 um 02:03 schrieb Rob Landley: >> On 1/11/25 12:57, Bird, Tim wrote: >>> Hey Rob, This is a great review of /dev, /sys and the different >>> ways that /dev gets populated. >> >> Feel free to link stuff from wikis or some such. The newest of those >> documents was written in 2007. >> >>> For a lot of embedded Linux devices, the only bus where >>> new items can show up dynamically is USB. > > SDCARD readers connected via MMC are common in automtove head units as > well ... But do they give an insertion/removal notification that can generate an interrupt rather than needing to be polled? (Last couple of boards I poked at didn't, but it was cheap hardware...) >> When a driver DOESN'T automatically bind to them it gets a bit >> complicated, and one of the things mdev can be configured to do is act >> as a firmware loader! Which is just... Ahem, there are YEARS of poor >> design decisions the kernel guys made, where they ignored a mechanism >> they already had an implemented something more complicated. The >> mechanism whereby the kernel opens a firmware file and read it >> directly out of the filesystem instead of calling a hotplug helper >> was... I'm just going to gloss over that. > > WIFI & Bluetooth devices often use this firmware mechanism. The wifi and bluetooth _hardware_ is always there though. Transciever link toggle is more or less a media insertion/removal event, which is a slightly different hotplug mechanism. Ogres, onions... layers. > And yes I > agree, it looks a bit ** ugly** seeing the kernel loading a firmware > file from /lib/firmware searching it in the root file system w/o > knowing the state of it during boot ... They already HAD the hotplug helper mechanism and initramfs! You could already CALL A LOADER and some of us had that working and DEPLOYED before they built a whole new mechanism for "the kernel reaches out and reads a file out of the userspace view of the filesystem from kernel space without a process context to do it in like the ELF loader has, don't ask me what this means for containers and namespaces..." (Ok, they wanted to load firmware before PID 1 launched, but they were already breaking the drivers into separate probe/init sections so you could probe before were started and init after interrupts were started and launching PID 1 is the first thing that happens after interrupts are enabled (we have a scheduler now, the idle task can fork off PID 1 and PID 0 can run pause() in a loop. Except between those two the kernel launches a zillion "kernel threads" including the tasklets and deferred device initialization and so on...) It wasn't just awkward, it was unnecessary. (And it DOES NOT SOLVE the underlying licensing issue of "this firmware is not gpl, I am bundling it into a statically linked initramfs, is this "mere aggregation", let's see what a judge has to say! Meanwhile Bradley is in court ACTIVELY ARGUING that there's no difference between GPLv2 and GPLv3 and that the complete lack of any copyright holders willing to sign on to his increasingly extreme enforcement views isn't a problem because GPLv2 is a contract despite the complete absence of things like "privity of contract"... No really: https://blog.tidelift.com/will-the-new-judicial-ruling-in-the-vizio-lawsuit-strengthen-the-gpl I got dragged into this recently to spend a day telling a camera "no, Bradley's full of it", and yes he flew in to sit at the other end of the table for some reason: https://landley.net/notes-2024.html#24-06-2024 Sigh. There's a reason I do 0BSD these days: https://landley.net/toybox/license.html > For WIFI and bluetooth I do not > see a big issue here since I'd prevent putting such features on a > critical chain by system design in any way since bringing them up and > (re)connecting external devices is time consuming by nature. Nothing you > shall need to wait for ... Except that reconnection mostly happens in software. The _hardware_ you're talking to stays connected. It's a resource acquisition/allocation problem sure, but closer to partition re-scanning. *shrug* The asynchronous notifications that something happened behind your back come in through similar mechanisms, but if that's ALL we were dealing with we wouldn't have needed most of this plumbing. (Although that was ANOTHER fun failure of the old devfs: /dev/eth0 isn't common, thanks to Bill Joy somehow not really understanding unix in 1979. And of course renaming /dev/hda to /dev/sda is a big deal from a compatibility perspective, but the <strike>devfsd2</strike> systemd guys deciding that eth0: is now potato03x1: or some such? That's just fine, who cares about compatibility with that...) > Compiling in modules vs. loading them later from user space is a trade- > off. The effect of putting stuff into modules is to keep the kernel > small which helps you in the "unpacking & loading kernel" phase before > the kernel is actually started. Having an 1MB unpacked kernel is > significantly a difference to a 5MB one. If you can avoid ever loading the module, you may come out ahead. (Modulo why are you shipping it then, still needs storage.) Last I checked the actual module unloading was still a NOP half the time (the memory stays pinned) and marks your kernel "tainted" if you ever actually do it, which is not a vote of confidence in the codepath if you ask me. But I had toybox insmod working years ago, the question is toybox _modprobe_ is still in pending because modprobe pulls fairly extensive shenanigans I am not personally familiar with and have to learn how to use before I can implement them, and they just seem like TERRIBLE IDEAS: https://github.com/landley/toybox/issues/522 > On the other hand, my > experience is that there is lot of overhead (CPU time and IO) loading > modules from user space. So it really only makes sense, if you have > drivers to load at a point in time during startup where you have enough > time and resources left. The kernel boot process is already fairly heavily asynchronous, which is why your shell prompt gets buried with "link up" notifications spamming the console after it prints the $ and so on. That's why mkroot's init script does echo 3 > /proc/sys/kernel/printk before the exec handoff to whatever inherits PID 1 from the setup script: https://github.com/landley/toybox/blob/0.8.11/mkroot/mkroot.sh#L133 Because if it's a shell, and we don't do that, you won't see the prompt under the noise. >> I mean it more or less works, it's just... pointless manual >> maintenance of something the kernel does for you in a very small >> amount of code? (In devtmpfs, the /dev node being there means >> something. In a static /dev, it doesn't.) > > I agree. There is kind of dynamic device enumeration done by the kernel > drivers anyway once loaded. Any data structures to devices are build up > internally. Nothing you can save ... I spent YEARS convincing the android guys to look at devtmpfs, initramfs, container plumbing... (Keep in mind Google bought Android Inc. in 2005 and shipped the first phone at the end of 2008, meaning their main development effort predated most of this plumbing and they had to retrofit it in much later.) No idea how much impact I had and how much they would have eventually done anyway, but the main guy I was having those conversations with WAS the android base OS maintainer, so... Most recent was probably: http://lists.landley.net/pipermail/toybox-landley.net/2022-August/029139.html You'd think the early boot stuff was fairly straightfoward, but I keep winding up being the one to manually fix crap like: https://lkml.iu.edu/hypermail/linux/kernel/1306.3/04204.html And then YEARS LATER, it's me who has to: https://lore.kernel.org/lkml/8244c75f-445e-b15b-9dbf-266e7ca666e2@landley.net/ And then it had to be rewritten to remove my taint: https://lkml.iu.edu/hypermail/linux/kernel/2311.1/01821.html https://lkml.iu.edu/hypermail/linux/kernel/2311.2/02938.html Let alone obvious polishing nonsense like: https://lkml.iu.edu/hypermail/linux/kernel/1705.0/02640.html (Which only went in because Andrew Morton picked it up despite Greg KH doing his usual stonewalling of literally anything from me. Oh well.) Anyway, there's a reason I'm not really a kernel developer. When I try to engage with them myself, "crickets chirp" is pretty much the GOOD outcome... https://lkml.iu.edu/hypermail/linux/kernel/1707.2/01797.html Ahem. I'll stop now. > I'm even not sure how devtmpfs can be combined w/ your static devnodes > you created in any kind of persistent partition. You could mount your own /tmp and do mdev -s into it. That's what we used to do back around 2005: https://lkml.iu.edu/hypermail/linux/kernel/0512.0/1326.html (Also, when devtmpfs first went in, if you modified a node (touch, chattr, etc) then it wouldn't delete it and your management tool would have to delete it via hotplug removal event handling. So you could PIN nodes, I was just never clear on why you'd want to. It probably still does that?) > And if you even can get > the kernel accepting your partition to use as /dev, Kernel doesn't care. > you need to have it > writeable for the case of dynamics you might need (usb for instance) > which does not really go well with a read only RFS ... You could ... > overlay fs ... well no, I think this goes into a wrong direction -> too > complicated ;) If you just have a /tmp dir in initramfs with some starting nodes initialized via the cpio extractor, and then have something like mdev add things on top of that as they're hotplugged, initramfs is inherently writeable thus the /tmp dir would be. There's a race condition where "I booted a device with USB already plugged into it before powerup, when is the hotplug event delivered and is it before the hotplug handler is registered", which I cared deeply about in 2005 and no longer remember the details of. I could try to dig them up out of my blog and the busybox/kernel mailing lists if you care? > To summarize from my point of view: > > * It's worth talking a bit about the effect of udev and about alternatives I am not a fan of udev, for reasons that are part technical and part "oh those assholes" rant path I'm trying to avoid going down. > * "mdev" is surely worth being named as an potential option besides > "selective triggering" and "static setup and moving triggers back in time" > > * I wouldn't regard mknode as an real alternative in todays system It still comes up from time to time, usually when initializing containers. (Because devtmpfs in containers does NOT give a proper container-local view of its namespace.) Once upon a time, you could use the linux kernel's built in initramfs generation plumbing to create a cpio with arbitrary contents by providing simple text snippets to supplement their scanner, including a /dev/console entry created as a normal user (without running as root!). But of COURSE the kernel developers removed the ability, and I patched it back in (attached), and then went "no, not fighting that fight"... > * In addition I can imagine is "modules loading" vs. "compiling in > drivers" something which is worth mentioning There's buckets of domain expertise there and I have like 1/3 of what I'd need to be confident there. (I know where to look it up, but have never considered it a good thing. Half the point of modules was to load/unload drivers for testing without reboots, and I just boot cycle a system under qemu or KVM when I can, and boot cycle a physical board when I can't because fiddling with modules really doesn't HELP my workflow. YMMV...) The main other reason modules persist is out-of-tree drivers, usually not under GPL, which have been under systematic attack for well over a decade and the people still doing it have large teams writing shim code. Most "let's use modules" decisions _since_ then boil down to either 1) "this is a generic PC hardware distro and I have no idea what hardware will be on there, and building every possible module into the kernel wastes a couple dozen megabytes of RAM on a system" 2) This mechanism exists, there must be a reason, therefore I should definitely use it because it's there. (They built _mechanisms_ to prevent you from upgrading modules without upgrading the kernel they plug into. Note that the description of CONFIG_MODVERSIONS says that WITHOUT it you can't have even slight version skew. That's without MODULE_SIG and MODULE_SRCVERSION_ALL and so on.) By the way, you can provide "module arguments" on the kernel command line, write to things like /sys/module/psmouse/parameters/rate after the driver's up... > * Once I've access to the wiki, I can try to put these ideas into an > initial structure filled up w/ info we discussed in this thread > > Marko Good luck. You know what we REALLY need a new version of? A rewrite of: https://landley.net/kdocs/mirror/lki-single.html With sections for each architecture. (And if you tried to write one, you'd hate Raspberry Pi as much as I do! Although https://forums.raspberrypi.com/viewtopic.php?t=357536 is extremely promising, and a far sight better than https://github.com/christinaa/rpi-open-firmware ever got to. Although I haven't really dug into the details of what's still proprietary black box spyware subtly bugging your board with "system management mode" hijacks, and what they actually managed to work around despite not having hardware documentation for broadcom chips...) Rob [-- Attachment #2: 0011-gen_init_cpio-regression.patch --] [-- Type: text/x-patch, Size: 1824 bytes --] From: Rob Landley <rob@landley.net> Date: Fri, 06 Oct 2023 02:56:19 -0500 Subject: [PATCH] Add gen_initramfs.sh -O Add a -O option to output the list instead of the archive. (You can specify -o after -O to produce both.) For 15 years gen_initramfs_list.sh produced a text output format that other things consumed and modified and fed back to the kernel, then the script changed to consume the list internally and produce the cpio archive directly. (Why they didn't just change gen_init_cpio.c to traverse directories itself if they were going to take away the ability to filter the list is an open question. Maybe it could handle filenames with spaces in them if they'd done that? And why "squash" in-band signalling instead of the -1 I submitted, which doesn't conflict with existing users because integers aren't valid usernames...) Signed-off-by: Rob Landley <rob@landley.net> --- usr/gen_initramfs.sh | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/usr/gen_initramfs.sh b/usr/gen_initramfs.sh index 14b5782f961a..8f75988a5799 100755 --- a/usr/gen_initramfs.sh +++ b/usr/gen_initramfs.sh @@ -15,6 +15,7 @@ usage() { cat << EOF Usage: $0 [-o <file>] [-l <dep_list>] [-u <uid>] [-g <gid>] {-d | <cpio_source>} ... + -O <file> Output annotated file list instead of archive -o <file> Create initramfs file named <file> by using gen_init_cpio -l <dep_list> Create dependency list named <dep_list> -u <uid> User ID to map to user ID 0 (root). @@ -206,6 +207,15 @@ while [ $# -gt 0 ]; do echo "deps_initramfs := \\" > $dep_list shift ;; + "-O") # Output annotated file list + unset output + trap - EXIT + [ "$1" = "-" ] && + cpio_list="/dev/stdout" || + cpio_list="$1" + shift + ;; + "-o") # generate cpio image named $1 output="$1" shift ^ permalink raw reply related [flat|nested] 21+ messages in thread
end of thread, other threads:[~2025-01-12 18:36 UTC | newest]
Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <CAORPcfVRobA+u5q7aPboC=3iY8dibDUB0920Z=Z0VgpQEupKJw@mail.gmail.com>
2025-01-08 18:33 ` [boot-time] Bird, Tim
2025-01-08 20:39 ` [boot-time] Marko Hoyer
2025-01-08 21:19 ` [boot-time] Bird, Tim
2025-01-08 23:26 ` [boot-time] Rob Landley
2025-01-09 13:02 ` [boot-time] Marko Hoyer
2025-01-09 21:10 ` [boot-time] Rob Landley
2025-01-09 21:35 ` [boot-time] Marko Hoyer
2025-01-09 22:31 ` [boot-time] Rob Landley
2025-01-09 12:43 ` [boot-time] Marko Hoyer
2025-01-09 13:27 ` [boot-time] Geert Uytterhoeven
2025-01-08 23:00 ` [boot-time] Rob Landley
2025-01-09 2:23 ` [boot-time] Bird, Tim
2025-01-10 22:46 ` [boot-time] Marko Hoyer
2025-01-10 23:15 ` [boot-time] Rob Landley
2025-01-11 8:40 ` [boot-time] Marko Hoyer
2025-01-11 17:56 ` [boot-time] Rob Landley
2025-01-11 18:57 ` [boot-time] Bird, Tim
2025-01-12 1:03 ` [boot-time] Rob Landley
2025-01-12 10:11 ` [boot-time] Marko Hoyer
2025-01-12 13:39 ` [boot-time] Francesco Valla
2025-01-12 18:35 ` [boot-time] Rob Landley
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).