Short story long: Hello world, C++

8 minute read

Published:

Every person who writes code some day wrote a hello world programm. Usually it is copying the text in any editor, save, compile and run. In this post we will have a deeper look in compilation stages and undertsnd what heppens, once you press enter button.

A simple hello world in c++ looks like this:

#include <iostream>;

int main()
{
    std::cout << "Hello world!\n";
    return 0;
}

And to compile it, using clang++, you need to run:

clang++ main.cpp -o main.out

But what exactly does compiler to generate the exacutable? We can see phases by running command

clang++ -ccc-print-phases main.cpp

And the result of the command is:

            +- 0: input, "main.cpp", c++
         +- 1: preprocessor, {0}, c++-cpp-output
      +- 2: compiler, {1}, ir
   +- 3: backend, {2}, assembler
+- 4: assembler, {3}, object
5: linker, {4}, image

So let’s dive into it und better unserstand each step. Let’s go!

Step 1: Preprocessor

Preprocessor takes the souce code, in our case it is a main.cpp and executes preprocessor directives.

In main.cpp we have 1 preprocessor directive include. This directive means taking the file iostream and just insert the content of the iostream file into the main.cpp in the line immidiately after the directive.

The function runs recursively until all includes are covered.

C++ support variaty of preprocessor directives.

The outcome of the preprocessing in our case is one big text file, that has all code we need to run the programm.

The cool thing is that you can actually get the file, just run:

clang++ -E main.cpp
gedit main.prep

In my case, I have 28424 lines of code in the file. So as you can see, even very small code of the hello world could create a giant file.

Step 2: Compiler

The compiler takes the source code file and generates the assembly code. It is splitted into 4 steps.

Step 2.1 Frontend

The responsibility of the frontned is to parse the souce code. Compilers usually support multiple languages, such as C, C++ and Java. So for each language it has own frontend.

Step 2.1.1 Lexer

As the first step, compiler generates the sequence of tockens from the source file. Tocken is a valid C++ symbol.

The lexer will generate in total 120852 lines of code.

You can generate it by running:

clang++ -fsyntax-only -Xclang -dump-tokens main.cpp &> main.lex
gedit main.lex

The main function will look like this:

int              'int'              [StartOfLine]                Loc=<main.cpp:3:1>
identifier       'main'             [LeadingSpace]               Loc=<main.cpp:3:5>
l_paren          '('                                             Loc=<main.cpp:3:9>
r_paren          ')'                                             Loc=<main.cpp:3:10>
l_brace          '{'                [StartOfLine]                Loc=<main.cpp:4:1>
identifier       'std'              [StartOfLine] [LeadingSpace] Loc=<main.cpp:5:2>
coloncolon       '::'                                            Loc=<main.cpp:5:5>
identifier       'cout'                                          Loc=<main.cpp:5:7>
lessless         '<<'               [LeadingSpace]               Loc=<main.cpp:5:12>
string_literal   '"Hello world!\n"' [LeadingSpace]               Loc=<main.cpp:5:15>
semi             ';'                                             Loc=<main.cpp:5:31>
return           'return'           [StartOfLine] [LeadingSpace] Loc=<main.cpp:6:2>
numeric_constant '0'                [LeadingSpace]               Loc=<main.cpp:6:9>
semi             ';'                                             Loc=<main.cpp:6:10>
r_brace          '}'                [StartOfLine]                Loc=<main.cpp:7:1>
Step 2.1.2 Parser

Compiler transforms the sequence of tokens into the abstract syntax tree, also known as AST.

Step 2.1.3 Semantic analisys

At this step, the program checkes and translates such symbols like auto and checks if the syntax is correct. It takes the AST and returns annoteted AST. By the concept it is the same AST but with all data types and where everything is defined.

AST is unique for each language. AST of the main function will look like this:

`-FunctionDecl 0x200a988 <main.cpp:3:1, line:7:1> line:3:5 main 'int ()'
  `-CompoundStmt 0x2017228 <line:4:1, line:7:1>
    |-CXXOperatorCallExpr 0x20171c0 <line:5:2, col:15> 'basic_ostream<char, std::char_traits<char> >':'std::basic_ostream<char>' lvalue adl
    | |-ImplicitCastExpr 0x20171a8 <col:12> 'basic_ostream<char, std::char_traits<char> > &(*)(basic_ostream<char, std::char_traits<char> > &, const char *)' <FunctionToPointerDecay>
    | | `-DeclRefExpr 0x2017128 <col:12> 'basic_ostream<char, std::char_traits<char> > &(basic_ostream<char, std::char_traits<char> > &, const char *)' lvalue Function 0x1f82d18 'operator<<' 'basic_ostream<char, std::char_traits<char> > &(basic_ostream<char, std::char_traits<char> > &, const char *)'
    | |-DeclRefExpr 0x200aa90 <col:2, col:7> 'std::ostream':'std::basic_ostream<char>' lvalue Var 0x200a508 'cout' 'std::ostream':'std::basic_ostream<char>'
    | `-ImplicitCastExpr 0x2017110 <col:15> 'const char *' <ArrayToPointerDecay>
    |   `-StringLiteral 0x200aac0 <col:15> 'const char [14]' lvalue "Hello world!\n"
    `-ReturnStmt 0x2017218 <line:6:2, col:9>
      `-IntegerLiteral 0x20171f8 <col:9> 'int' 0

You can generate the AST by running:

clang++ -Xclang -ast-dump -fsyntax-only main.cpp
Step 2.1.4 Code generation

At this step we take annoteted AST and translate it to the generic (language independent) representation of the code. This representation is also known as intermidiet representation (IR). The step is required to support multiple languahes and multiple backends. The approach allows easily extend language and architecture support.

The output is already a language independent representation of our code.

Example of the optimization can be removing unused parts of code.

As the result we get small output file. You can generate it by running:

clang++ -S -emit-llvm -o - main.cpp

Step 2.2 Optimizer

In the middle end of the compilier it makes the optimization of the intermidiet representation.

As we know, compilers are very smart and they are capable to optimize alot of code. Have you ever heard about RVO? It happens at this stage together with many other optimizations.

The optimization happens in 2 steps:

  1. Analysis
  2. Optimization

As the outcome of the optimizer step we have optimized IR.

Step 3 Backend

The goal of the backend is to take a text file (optimized IR) and create the assembly code.

Based on the optimized IR, backend selects the needed assembler instruction. The instruction should be scheduled and register allocated, so we can store data. Based on the target, backend can optimize instruction. The last step is output of the assembly code.

Interesting that scheduling is basically optimization of the order. Another interesting fact is that register allocation can be solver with a coloring graph problem. We want to minimiza the number of register to run the program.

To get the assembly code, you can run the following command:

clang++ main.cpp -S main.cpp

You can also investigate the assemblz on the Compiler Explorer page.

Step 4: Assembler

You already might think, assember? But we have already some assembler code! What should happen here?

On this step we transfowm assembly into the object file. So the outcome of the step is an object file.

Let’t check what is an object file. According to Wiki:

An object file is a computer file containing object code, that is, machine code output of an assembler or compiler. The object code is usually relocatable, and not usually directly executable.

The object file creates the symbol table. It will tell the executable what we need, for example that we need std::cout. Some symbols will be unresolved. What it means? At the compilation time we do not know where some library that we use will be loaded into the memory.

Undefined references usually some pointers to the memory. The reason they are undefined is that we do not know the memeroy address, where for example libc will be loaded.

To see symbols table, you can run the following command:

clang++ main.cpp -o main.o
nm main.o

The object file might depend on other object files. The dependency problem will be resolved in the next and last step.

Step 5: Linker

So now we reached the point where we build the executable!

Linker will find unresolved symbols and try to find them in memory. If something is missing you will get e very nasty error message.

But in our case we just depend on libstdc++, libm, libgcc and libc they all are installed and availabel, we get the executable.

To investigate the executable we can run:

clang++ main.cpp -o main.o
objdump -x main.o

That is it! we are done and can run the executable!

clang++ main.cpp -o main
./main

Hope you enjoyed with me the journey!

What to take from this article?

  • Compiler has 5 steps: preprocessor, compiler, backend, assembly and linking.
  • Compiler does a lot of optimizations for you!
  • All compilers follow the same path migh minor differences.

Used sources

Talks:

Articles: