Smallest Rust Hello World and other size coding shenanigans

September 16, 2023 · 1436 words · 7 minutes

Editor’s note

A quick note about this post. This is the first post I ever wrote, and while I am not exactly ashamed of it, I am also not proud of it. A rewrite has been in limbo for a couple of months now; however, I am too lazy to delete the thing and start over. So I’ve only edited it a bit to make it more readable.

I’ve been size coding some Rust, trying to make the smallest possible “Hello World!” program. After a few days, I managed to squeeze it down to a mere 149-byte Hello World!, the smallest 64-bit one out there in Rust¹. I got bored, so I also made a 540-byte brainfuck interpreter.

There are two main parts to this post: one you can use to make your own projects output smaller binaries, and a retelling of what I did to make the 149 B Hello World!.

To get started, I will create a new empty project, and compare the binary size with each subsequent step.

cargo init --vcs=none shl

In the main.rs we are simply going to print out Hello World! and quit.

fn main() {
  println!("Hello, World!");
}

cargo b
cargo b --release
ls -lh target/*/shl

Building the project in debug mode yields a 3.4 MB binary, which is inexcusably big. Thankfully, release mode reduces the size to 390 KB. But still, 390,000 bytes to print a 13-byte string?
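To compare sizes in exact bytes rather than the rounded output of ls -lh, the numbers can also be read from file metadata. This is a minimal sketch and not part of the original project: format_size is a hypothetical helper that uses decimal units (1 KB = 1000 bytes), and the paths assume the shl project built in both profiles.

```rust
use std::fs;

/// Formats a byte count using decimal units (1 KB = 1000 bytes).
/// `format_size` is a hypothetical helper written for this sketch.
fn format_size(bytes: u64) -> String {
    if bytes >= 1_000_000 {
        format!("{:.1} MB", bytes as f64 / 1_000_000.0)
    } else if bytes >= 1_000 {
        format!("{:.1} KB", bytes as f64 / 1_000.0)
    } else {
        format!("{} B", bytes)
    }
}

fn main() {
    // Assumed paths: the project is named `shl` and was built with
    // both `cargo b` and `cargo b --release`.
    for path in ["target/debug/shl", "target/release/shl"] {
        match fs::metadata(path) {
            Ok(meta) => println!("{path}: {} ({} bytes)", format_size(meta.len()), meta.len()),
            Err(err) => eprintln!("{path}: {err}"),
        }
    }
}
```

Note that ls -lh rounds using powers of 1024, so its output can differ slightly from the decimal figures printed here.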

The only other file automatically generated for us is the Cargo.toml manifest. This is where the release profile settings below go.

For more information about these flags, check out the official Cargo Book.

[profile.release]
# Optimize for size
opt-level = "z"
# OR Optimize for performance
# opt-level = "3"

This sets the optimization level to optimize for size. Valid values are 0 to 3, s and z. Level 3 is the default in release mode, so there’s no point in defining it explicitly.

lto = "fat"
# lto = "thin" # Or this, one may be faster/smaller

The optimization level above works on individual translation units from the compiler, which are normally compiled separately and then linked into one binary. With Link Time Optimization, the units are instead merged into one module before final code generation, so the optimizer sees all of the code at once and can find optimizations that are invisible from within a single unit.

Image illustrating the LLVM codegen process, retrieved from https://github.com/python/cpython/issues/96761#issuecomment-1250345271

“Fat” LTO refers to LLVM’s older way of doing LTO, while “thin” LTO is newer, quicker to build, more parallel, and sometimes even produces faster code. I tried both, but “fat” LTO gave the smaller binary.

strip = true

Removes debug information and symbols, including variable, function and type names, source file names and line references. Useful to turn on, unless you need to debug crashes in production.

codegen-units = 1

Do not split compilation into multiple smaller parts. This increases the effectiveness of the other options at the cost of compilation speed.

panic = "abort"

Instead of unwinding the stack on panic, simply abort the process. This lets the compiler drop most of the unwinding code.

Thus, here is the entire Cargo.toml profile for your projects:

[profile.release]
# Optimize for size
opt-level = "z"
# OR Optimize for performance
# opt-level = "3"
lto = "fat"
# lto = "thin" # Or this, one may be faster/smaller
strip = true
codegen-units = 1
panic = "abort"

With these, the size reduces further to… 287 KB, down 3 KB. You would see a bigger reduction if the application did something, and it would also be faster.

NOTE

If you’re trying to make something really high performance, check out Profile-guided Optimization.

NOTE

If you want LTO as an optional compilation feature (e.g. takes too long on CI/CD servers), you can make profiles that inherit other profiles. Composition over Inheritance anyone?
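As a sketch of what such an inherited profile could look like in Cargo.toml: the profile name release-lto is made up for this example, and it assumes the lto line has been removed from the base release profile so that only the derived profile pays for it.

```toml
# Hypothetical profile name; any name that is not a built-in profile works.
[profile.release-lto]
inherits = "release"   # start from the settings in [profile.release]
lto = "fat"            # then turn on the slow option only here
```

Build it with cargo build --profile release-lto. The output lands in target/release-lto/ instead of target/release/, so CI can keep using the plain release profile.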

The STDs and STDon’ts

The Rust Standard Library (or std for short) is very useful if you don’t want to reinvent the wheel, and is plenty fast, but at the cost of size. For those who don’t know, the entire standard library is statically linked when compiling, even if you only need a single println. When this becomes a problem, you have two options:

Build it

The reasonable thing you might want to try is to rebuild the standard library, although you will need the nightly toolchain (which might also make your files smaller and code faster) and lots of patience.

cargo +nightly build --release -Z build-std=panic_abort,std -Z build-std-features="optimize_for_size" --target x86_64-unknown-linux-gnu
ls -lh target/x86_64-unknown-linux-gnu/release/shl

This works out to a nice 43 KB. That is small enough that you might be able to make a static web server for an RP2354.

Ditching it

Simply adding

#![no_std]

to the top of your main.rs will completely remove the standard library. You still have access to functions like printf from libc, and there are many crates that will work without std. Without the standard library, you will also need to define your own panic handler.

As there is now no println macro to use, I will import libc and use printf.

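#![no_main]
#![no_std]

// Nightly-only: lets us link the compiler's private copy of the libc crate
// instead of adding it as a dependency.
#![feature(rustc_private)]
extern crate libc;

// With no_main, we export the C entry point ourselves (C's int is i32 here).
#[no_mangle]
pub extern "C" fn main(_argc: i32, _argv: *const *const u8) -> i32 {
    // printf expects a NUL-terminated C string.
    const HELLO: &str = "Hello, World!\n\0";

    unsafe {
        libc::printf(HELLO.as_ptr() as *const _);
    }

    0
}

// Required without std; this one just spins forever.
#[panic_handler]
fn my_panic(_info: &core::panic::PanicInfo) -> ! {
    loop {}
}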

Building with the nightly toolchain yields a 14,224-byte (about 14 KB) file. Not bad, but not great either, as it still relies on libc.

Cutting Down to the Bone (and beyond)

We have reached the furthest point where everyday projects can go, but there’s still plenty of fat we can shave off.

At this stage, the ELF file has 25 sections, even though you only need a few. The easiest way to get rid of most of them is to define some Rust compiler flags.

export RUSTFLAGS="-Ctarget-cpu=native -Clink-args=-nostartfiles -Crelocation-model=static -Clink-args=-Wl,-n,-N,--no-dynamic-linker,--no-pie,--build-id=none,--no-eh-frame-hdr"

We tell rustc to target the native CPU (enabling extensions such as SSE3), to skip the C startup files, and to use a static relocation model. The linker flags tell ld not to page-align sections (-n, -N), not to request a dynamic linker, not to produce a position-independent executable, and to omit the build ID and the .eh_frame_hdr section.
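If you would rather not export an environment variable in every shell, the same flags can live in a Cargo configuration file. This is a sketch, assuming the project builds for x86_64-unknown-linux-gnu. Keep in mind that the RUSTFLAGS environment variable takes precedence when it is set, so the two are not merged.

```toml
# .cargo/config.toml in the project root
[target.x86_64-unknown-linux-gnu]
rustflags = [
    "-Ctarget-cpu=native",
    "-Clink-args=-nostartfiles",
    "-Crelocation-model=static",
    "-Clink-args=-Wl,-n,-N,--no-dynamic-linker,--no-pie,--build-id=none,--no-eh-frame-hdr",
]
```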

Since we have removed libc, we need to use inline assembly and Linux syscalls to write to the standard output.

#![no_std]
#![no_main]

const MSG: &'static str = "Hello, World!\n";

use core::arch::asm;

#[no_mangle]
pub extern "C" fn _start(_argc: isize, _argv: *const *const u8) {
    write_to_std_out(MSG.as_ptr(), MSG.len());

    exit(0);
}

fn write_to_std_out(string_pointer: *const u8, string_length: usize) {
    unsafe {
        asm!(
            "syscall",
            in("rax") 1, // write syscall number
            in("rdi") 1, // stdout file descriptor, 2 is stderr
            in("rsi") string_pointer,
            in("rdx") string_length,
            out("rcx") _, // clobbered by syscalls
            out("r11") _, // clobbered by syscalls
            lateout("rax") _, // clobbered by syscalls, 
            // if you can't print more than once, you are missing this
        );
    }
}

fn exit(code: i32) {
    unsafe {
        asm!(
            "syscall",
            in("rax") 60,
            in("rdi") code,
            options(noreturn)
        );
    }
}

#[panic_handler]
fn panic(_: &core::panic::PanicInfo) -> ! {
    loop {}
}

With this, we get an even smaller file with only 624 bytes and 7 sections, which is still 7 too many.
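To check the section and program header counts without reaching for readelf, the ELF64 header can be read directly: it is 64 bytes long and stores the counts at fixed offsets. The following is a minimal sketch, not a tool from the original repo. The names ElfSummary and summarize_elf64 are made up for this example, it only handles little-endian 64-bit files, and the default path assumes the shl project.

```rust
use std::{env, fs};

/// The few ELF64 header fields that matter for size coding.
#[derive(Debug, PartialEq)]
struct ElfSummary {
    program_headers: u16,      // e_phnum, offset 56
    section_headers: u16,      // e_shnum, offset 60
    section_table_offset: u64, // e_shoff, offset 40
}

fn read_u16(bytes: &[u8], off: usize) -> u16 {
    u16::from_le_bytes([bytes[off], bytes[off + 1]])
}

fn read_u64(bytes: &[u8], off: usize) -> u64 {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(&bytes[off..off + 8]);
    u64::from_le_bytes(buf)
}

/// Parses the header at the start of `bytes`.
/// Returns None unless the input is a little-endian ELF64 file.
fn summarize_elf64(bytes: &[u8]) -> Option<ElfSummary> {
    // The ELF64 header is 64 bytes and starts with the magic 0x7f "ELF".
    if bytes.len() < 64 || !bytes.starts_with(b"\x7fELF") {
        return None;
    }
    // e_ident[4] == 2 means 64-bit, e_ident[5] == 1 means little-endian.
    if bytes[4] != 2 || bytes[5] != 1 {
        return None;
    }
    Some(ElfSummary {
        program_headers: read_u16(bytes, 56),
        section_headers: read_u16(bytes, 60),
        section_table_offset: read_u64(bytes, 40),
    })
}

fn main() {
    // Assumed default path: the release binary of the `shl` project.
    let path = env::args()
        .nth(1)
        .unwrap_or_else(|| "target/release/shl".to_string());
    match fs::read(&path) {
        Ok(bytes) => match summarize_elf64(&bytes) {
            Some(summary) => println!("{path}: {} bytes, {summary:?}", bytes.len()),
            None => eprintln!("{path}: not a little-endian ELF64 file"),
        },
        Err(err) => eprintln!("{path}: {err}"),
    }
}
```

The section count in the header includes the mandatory null section, so it matches the number of section headers that readelf reports. Once fields in the header are reused to store data, the values printed here stop being meaningful.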

We can remove the comment section with a simple command.

objcopy -R .comment target/release/shl target/release/shl

Now we are even closer, just 496 bytes. Looking at the file in a hex editor such as ImHex, you may notice quite a lot of padding after the actual code.

We can remove this by using the sstrip utility from the ELFkickers collection.

sstrip target/release/shl

That’s better, we are now at 215 bytes. The only thing left to do now is to transplant the ELF header. Not only can we remove one program header, but also store data in some unused fields. I wrote a simple Python script that does exactly that, and is available in the repo.

Looking at it now, we have made a 149 byte binary from scratch, and not only that, all of the jankier modifications are programmatic so they can be repeated at any time, without manual modifications.

If you want to play around with the binary, it’s on GitHub under the WTFPL.

Ideas

Put the whole message into one buffer, one less syscall
Remove ud2 instructions
Listen to some good music
Make a Rust 4k chess engine?
???