C Program Compilation Process - Source to Binary
You might know how to write a C program, but do you know how that program is compiled and converted to binary? In this blog, we walk through the C program compilation process step by step, from source code to the final executable.
What is a Compiler?
In computing, a compiler is a computer program that translates computer code written in one programming language (the source language) into another language (the target language). The name “compiler” is primarily used for programs that translate source code from a high-level programming language to a lower level language (e.g., assembly language, object code, or machine code) to create an executable program.
Types of C Compilers
Before jumping directly to the processes involved in compilation, it’s worth knowing which C compilers are available and which one is suitable for your development environment.
GCC
GCC stands for GNU Compiler Collection and is produced by the GNU Project. It’s a free compiler released under the GPL (General Public License). In this blog, we are going to use this compiler to compile our C program.
Clang
Clang is a compiler front end that uses LLVM as its back end. It compiles not only C but also C++, Objective-C, and Objective-C++.
Clang is mostly used by macOS users because:
GCC had caused some problems for developers at Apple, as the source code is large and difficult to use. So, they had come up with Clang.
Source – EDUCBA
To learn about the other C compilers that exist, visit the list of compilers page on Wikipedia.
Let’s now look at what the C compiler pipeline means.
C Program Compilation Pipeline
After we finish writing the code, the first thing we do is compile it. It normally takes the compiler only a few seconds to translate the code into machine language, but during this time the code goes through a series of steps before it becomes an executable file. These compilation phases run in sequence, so they’re often called a pipeline.
Before we proceed further, there are two rules that we should know:
C Program Compilation Rule
- Only Source files are compiled
- Each Source File is Compiled Separately
The Components of the C Program Compilation Pipeline are:
- Preprocessor
- Compilation
- Assembly
- Linker
As a pipeline, the output of one component becomes the input of the next component, and this process continues until the final product comes out of the pipeline. This will become clearer as you read the blog further.
The final product also differs depending on the system you are compiling on.
On Windows, the final product is an executable with the .exe extension. On Linux, the executable usually has no extension; if no output name is given, GCC names it a.out.
One thing to note about the compilation pipeline is that it produces output only if the source file passes through every component successfully.
Even a small failure in any of the components leads to a compilation or linkage failure and an error message.
Below is an image showing the steps for the compilation of the C program. The image describes each step and the file it creates. You may take a look at this picture for a quick reference to the entire compilation process.
You can use a single command to get all the intermediate files that are generated by the C program Compilation Pipeline.
$ gcc -Wall -save-temps cprogram.c -o cprogram
This will dump all the intermediate files in the same directory, along with the executable file.
The -Wall option enables most compiler warnings, so potential problems are reported during the process. The -save-temps option in the GCC compiler driver keeps all the intermediate files in the directory instead of deleting them.
All intermediate files that will be saved are:
- cprogram.i – Produced by the Preprocessor
- cprogram.s – Produced by the Compiler
- cprogram.o – Produced by the Assembler
- cprogram (Linux) / cprogram.exe (Windows) – Executable file (final product)
In each segment of the C program compilation process, we will learn more about these intermediate files.
The command uses the cprogram.c file (used later in the blog) as the C source file. You should substitute the name of your own source file.
The -o option in the GCC compiler driver sets the name of the output file.
In this example, each intermediate file is named cprogram, but the extensions are different.
Step 1 – Preprocessing
Pre-processing is the first step in the C Program Compilation Pipeline.
What is Preprocessing?
When writing a C program, we include header files, define macros, and sometimes use conditional compilation. All of these are done with preprocessor directives.
During the preprocessing step in the C program compilation pipeline, the preprocessor directives are replaced with the text they stand for.
Let’s make that clear!
Header Files
A C source file usually includes a number of header files, and it is the preprocessor’s task to pull their contents into the source file.
Example:
If the program contains #include <stdio.h>, this line will be replaced by the contents of the stdio.h header file when the source file is pre-processed.
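To make the idea of "copying the header in" concrete, here is a minimal sketch that does by hand what the include would do for one function. Instead of including stdio.h, it writes out the single declaration of puts that the file needs. The function name greet is made up for this illustration.

```c
/* Hand-written stand-in for what #include <stdio.h> would paste in:
   only the one declaration this file actually uses. */
int puts(const char *s);

/* greet() is a hypothetical helper. It compiles because the compiler has
   already seen a declaration of puts(), just as it would have after the
   header text had been copied into the file. */
int greet(void)
{
    return puts("Hello World");
}
```

In real code you would still include the header, since it keeps the declarations correct for your platform. The sketch only shows that, after preprocessing, the compiler sees plain declarations and nothing else.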
Macros
Macros are defined with the #define syntax in the C programming language. During the preprocessing stage, the macros are replaced by their values.
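As a small sketch of both kinds of macro, the snippet below defines an object-like macro (Max, the same name used later in this blog) and a function-like macro. SQUARE and the two helper functions are names invented for this example.

```c
#define Max 10                    /* object-like macro: a plain value */
#define SQUARE(x) ((x) * (x))     /* function-like macro: takes an argument */

/* After preprocessing, the body reads: return ((10) * (10)); */
int max_squared(void)
{
    return SQUARE(Max);
}

/* After preprocessing, the body reads: return ((a + b) * (a + b)); */
int square_of_sum(int a, int b)
{
    return SQUARE(a + b);
}
```

Calling max_squared() gives 100 and square_of_sum(2, 3) gives 25. By the time the compiler proper runs, the names Max and SQUARE no longer exist in the code.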
Conditional Compilation
Often we want our compiled code to be minimal, so we can use conditional compilation. The preprocessor evaluates each condition and keeps only those lines of code whose condition is fulfilled; the rest never reach the compiler.
Some preprocessor directives used for conditional compilation are:
- #undef
- #ifdef
- #ifndef
- #if
- #else
- #elif
- #endif
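Here is a minimal sketch of these directives in action. FEATURE_LEVEL and build_mode are hypothetical names chosen for this example; only one of the three function definitions survives preprocessing.

```c
/* FEATURE_LEVEL is a hypothetical setting. It could also be supplied on the
   command line with the -D option, e.g. gcc -DFEATURE_LEVEL=1 ... */
#ifndef FEATURE_LEVEL
#define FEATURE_LEVEL 2
#endif

#if FEATURE_LEVEL >= 2
const char *build_mode(void) { return "full"; }
#elif FEATURE_LEVEL == 1
const char *build_mode(void) { return "basic"; }
#else
const char *build_mode(void) { return "minimal"; }
#endif
```

With the default value of 2, build_mode() returns "full". The other two definitions are dropped before compilation, so they add nothing to the final binary.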
Comments
We use comments in our code to improve readability and to leave notes for other developers. Comments carry no information the computer needs, so they are removed in the pre-processing stage.
The pre-processed code is often called the Translation Unit (or Compilation Unit).
You can see the pre-processed code of the C program in the section below.
Extract preprocessed code from GCC
We can also take a look at the file from each part of the compilation pipeline. Let’s ask the C compiler driver to dump the translation unit without going any further.
To extract the translation unit from the source code, we can use the -E option in the GCC.
Example:
// Header File
#include <stdio.h>
#define Max 10
int main()
{
    printf("Hello World"); // Print Hello World
    printf("%d", Max);
    return 0;
}
$ gcc -E cprogram.c
# 1 "cprogram.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "cprogram.c"
.......
.......
.......
typedef __time64_t time_t;
# 435 "C:/msys64/mingw64/x86_64-w64-mingw32/include/corecrt.h" 3
typedef struct localeinfo_struct {
pthreadlocinfo locinfo;
pthreadmbcinfo mbcinfo;
} _locale_tstruct,*_locale_t;
typedef struct tagLC_ID {
unsigned short wLanguage;
unsigned short wCountry;
unsigned short wCodePage;
} LC_ID,*LPLC_ID;
.......
.......
.......
# 1582 "C:/msys64/mingw64/x86_64-w64-mingw32/include/stdio.h" 2 3
# 3 "cprogram.c" 2
# 5 "cprogram.c"
int main()
{
printf("Hello World");
printf("%d", 10);
return 0;
}
The translation unit above is abbreviated, because the full output is very large: it includes the whole stdio.h header file (1038 lines in total). To see the whole output, run this command on your development machine.
This example illustrates how the preprocessor works. It only performs basic text operations, such as inclusion (copying the contents of a file) and macro expansion (text substitution). Notice that Max has become 10 and that the comments are gone.
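Because macro expansion is plain text substitution, operator precedence can produce surprises. The sketch below contrasts a macro written without parentheses with a safer version. Both macro names and both helper functions are invented for this illustration.

```c
#define DOUBLE_NAIVE(x) x + x          /* pasted in exactly as written */
#define DOUBLE_SAFE(x)  ((x) + (x))    /* parentheses keep it one unit */

/* Expands to: 3 + 3 * 2, which is 9 because * binds tighter than + */
int naive_times_two(void)
{
    return DOUBLE_NAIVE(3) * 2;
}

/* Expands to: ((3) + (3)) * 2, which is 12 */
int safe_times_two(void)
{
    return DOUBLE_SAFE(3) * 2;
}
```

The preprocessor does not understand C expressions, so it cannot add the missing parentheses for us. Running gcc -E on a file like this shows the expanded text directly.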
There is also a standalone tool called cpp, which stands for C Pre-Processor, that can be used to pre-process a C file.
The C preprocessor or cpp is the macro preprocessor for the C, Objective-C, and C++ computer programming languages. The preprocessor provides the ability for the inclusion of header files, macro expansions, conditional compilation, and line control.
Source – C Preprocessor Wikipedia
This tool is part of the C development tools shipped with most Unix-like systems.
$ cpp cprogram.c
This command prints the preprocessed code of the C program.
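The description quoted above mentions line control. As a rough sketch of what that means, the #line directive changes the line number and file name that the preprocessor reports from that point on. The file name renamed.c and the function names are made up for this example.

```c
/* Before the directive, __FILE__ is the real name of this source file. */
const char *real_file(void) { return __FILE__; }

/* From here on, the next line counts as line 100 of "renamed.c". */
#line 100 "renamed.c"
int reported_line(void) { return __LINE__; }
const char *reported_file(void) { return __FILE__; }
```

Here reported_line() returns 100 and reported_file() returns "renamed.c". The lines starting with # in the gcc -E output serve a similar purpose: they tell the compiler which file and line each piece of the translation unit originally came from.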
The preprocessed file has the .i extension, and if you pass this file to the C compiler driver, the preprocessor stage is bypassed. This happens because a file with the .i extension is assumed to have already been preprocessed, so it is sent directly to the compilation stage.
Step 2 – Compilation
In the previous section, we had our Translation Unit, and now we can move on to our next step, i.e. the compilation of the Translation Unit Code.
The input for the compilation component is the translation unit, obtained from the previous component. The output from the compilation component is the assembly code (Still Human-readable code, but closer to the hardware).
Extract Assembly Code from GCC
As in the previous stage, we dumped the translation unit code using the -E option. Here, we can use the -S option for the GCC to obtain the assembly code. This will create a file with the .s extension, and we will see the contents of the file using the cat command.
$ gcc -S cprogram.c
$ cat cprogram.s
.file "cprogram.c"
.text
.def printf; .scl 3; .type 32; .endef
.seh_proc printf
printf:
pushq %rbp
.seh_pushreg %rbp
pushq %rbx
.seh_pushreg %rbx
subq $56, %rsp
.seh_stackalloc 56
leaq 128(%rsp), %rbp
.seh_setframe %rbp, 128
.seh_endprologue
movq %rcx, -48(%rbp)
movq %rdx, -40(%rbp)
movq %r8, -32(%rbp)
movq %r9, -24(%rbp)
leaq -40(%rbp), %rax
movq %rax, -96(%rbp)
movq -96(%rbp), %rbx
movl $1, %ecx
movq __imp___acrt_iob_func(%rip), %rax
call *%rax
movq %rbx, %r8
movq -48(%rbp), %rdx
movq %rax, %rcx
call __mingw_vfprintf
movl %eax, -84(%rbp)
movl -84(%rbp), %eax
addq $56, %rsp
popq %rbx
popq %rbp
ret
.seh_endproc
.def __main; .scl 2; .type 32; .endef
.section .rdata,"dr"
.LC0:
.ascii "Hello World\0"
.LC1:
.ascii "%d\0"
.text
.globl main
.def main; .scl 2; .type 32; .endef
.seh_proc main
main:
pushq %rbp
.seh_pushreg %rbp
movq %rsp, %rbp
.seh_setframe %rbp, 0
subq $32, %rsp
.seh_stackalloc 32
.seh_endprologue
call __main
leaq .LC0(%rip), %rcx
call printf
movl $10, %edx
leaq .LC1(%rip), %rcx
call printf
movl $0, %eax
addq $32, %rsp
popq %rbp
ret
.seh_endproc
.ident "GCC: (Rev3, Built by MSYS2 project) 10.2.0"
.def __mingw_vfprintf; .scl 2; .type 32; .endef
The Compilation phase gives us assembly code that is specific to the target architecture.
Even with the same C compiler and the same C program, two machines with different processors will get different assembly code.
Generating the assembly code from the C code is one of the most critical stages in the C Program Compilation Pipeline. Assembly is a low-level language that maps closely to machine instructions, so an assembler can translate it directly into an object file.
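Since the compiler always knows which architecture it is generating code for, it exposes that knowledge through predefined macros. The sketch below uses a few macros that GCC and Clang define for common targets (the _M_ variants are the MSVC equivalents). The function name target_arch is hypothetical, and the list is far from complete.

```c
/* Reports the architecture this file was compiled for.
   Only a handful of common targets are covered in this sketch. */
const char *target_arch(void)
{
#if defined(__x86_64__) || defined(_M_X64)
    return "x86-64";
#elif defined(__i386__) || defined(_M_IX86)
    return "x86 (32-bit)";
#elif defined(__aarch64__)
    return "AArch64";
#elif defined(__arm__)
    return "ARM (32-bit)";
#else
    return "unknown";
#endif
}
```

Compiling this same source for two different processors yields two different strings, for the same reason that gcc -S yields two different assembly files.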
Step 3 – Assembly
The Compilation stage gives us the assembly code that is the input for the next pipeline component, i.e. assembly.
In this stage, the actual instructions on the machine level are generated from the assembly code. Each architecture has its own assembler, which converts its own assembly code into its own machine code.
The assembler generates a relocatable object file from the assembly code.
Create Object file from the Assembly file
We can use the assembler tool called as to translate the assembly file to an object file.
It takes the assembly file and produces a relocatable object file.
$ as cprogram.s -o cprogram.o
This assembler tool(as) gives us a new file with a .o extension (.obj in Microsoft Windows), which is the relocatable object file.
If you want to translate a C program directly to an object file, you can use the -c option in the GCC compiler driver.
Using -c with the GCC compiler driver runs the first three steps of the compilation pipeline in one go, i.e. pre-processing, compilation, and assembly.
$ gcc -c cprogram.c
This is really helpful when you only need the object files, since repeating all of the above steps by hand for a number of files can be quite tedious.
An object file contains low-level machine code and is therefore not readable by humans. Later in this blog, we will look at a tool that lets us see the contents of an object file.
Now that we know how to build object files from both the assembly file and the C program, it’s time to learn about the Linking Stage in the C program compilation process.
Step 4 – Linking
This is one of the most critical steps in the C compilation pipeline where the generated relocatable object files are combined/linked to create another object file that is executable in nature.
Let’s take a look at the situation.
Suppose we have a custom header file htd.h, which contains the prototype of a printHTD function, and a source file htd.c, which contains the definition of that function.
// htd.c
#include <stdio.h>
#include "htd.h"
void printHTD(void)
{
    printf("Hack The Developer");
}
// htd.h
void printHTD(void);
The htd.h header file is included in the cprogram.c source file.
#include "htd.h"
int main()
{
printHTD();
return 0;
}
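The split between a prototype and a definition can be sketched in a single file. The compiler only needs the declaration to compile a call; where the definition lives is settled later. The names htd_length and caller are invented for this illustration, and in a real project the definition at the bottom would sit in a separate .c file.

```c
#include <string.h>

/* Declaration only, like the prototype that a header such as htd.h supplies. */
size_t htd_length(void);

/* This call compiles with nothing more than the prototype above. */
size_t caller(void)
{
    return htd_length();
}

/* The definition. In a multi-file project this would live in its own source
   file, and the linker would connect the call above to it. */
size_t htd_length(void)
{
    return strlen("Hack The Developer");
}
```

Here caller() returns 18, the length of the string. If the definition were removed, the file would still compile to an object file, and the missing function would only be reported at link time.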
Since there are two source files, we have to generate a separate object file for each one. The linker will later combine them into an executable object file.
$ gcc -c cprogram.c
$ gcc -c htd.c
The above commands create two separate relocatable object files. Now let’s link the object files.
We can use the ld tool, which is the default linker in Unix-like systems, to link the relocatable object files.
But running ld directly gives us undefined reference errors, because we have not told it about the C runtime and standard library files that the program depends on.
$ ld htd.o cprogram.o -o cprogram.exe
C:\msys64\mingw64\bin\ld.exe: cprogram.o:cprogram.c:(.text+0x32): undefined reference to `__imp___acrt_iob_func'
C:\msys64\mingw64\bin\ld.exe: cprogram.o:cprogram.c:(.text+0x43): undefined reference to `__mingw_vfprintf'
C:\msys64\mingw64\bin\ld.exe: cprogram.o:cprogram.c:(.text+0x78): undefined reference to `__main'
So we are going to use gcc for the linking step instead. The gcc driver invokes the linker for us and automatically adds the required runtime and standard library files.
$ gcc htd.o cprogram.o -o cprogram.exe
$ ./cprogram.exe
Hack The Developer
The linking was successful as we got the required output.
In each section, we examined an intermediate file, but not an object file. Let’s look at it in the next section.
Analysis of Object Files
It was said earlier in the Assembly section that we will take a look at the tool that lets us see the contents of the object file.
The nm tool is used to display symbols that can be found in an object file.
$ nm htd.o
0000000000000000 b .bss
0000000000000000 d .data
0000000000000000 r .rdata$zzz
0000000000000000 t .text
$ nm cprogram.o
0000000000000000 b .bss
0000000000000000 d .data
0000000000000000 p .pdata
0000000000000000 r .rdata
0000000000000000 r .rdata$zzz
0000000000000000 t .text
0000000000000000 r .xdata
U __imp___acrt_iob_func
U __main
U __mingw_vfprintf
000000000000006f T main
0000000000000000 t printf
0000000000000054 T printHTD
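The letters in the nm output describe each symbol: T is a global function in the text section, t is a local one, and U is undefined, meaning it must be supplied by another object file or library at link time. The sketch below shows source code that would typically produce each letter on a Linux toolchain when compiled without optimization. All three function names are hypothetical.

```c
#include <stdio.h>

/* static gives internal linkage: nm typically lists this as 't'. */
static int helper(int x)
{
    return x * 2;
}

/* External linkage: nm lists this as 'T', visible to other object files. */
int api_entry(int x)
{
    return helper(x) + 1;
}

/* printf is declared but not defined in this file, so the object file
   records it as 'U' until the linker resolves it. */
int report(int x)
{
    return printf("%d\n", api_entry(x));
}
```

This also explains the ld errors seen earlier: every U symbol has to be resolved by some other file handed to the linker. With optimization enabled, a static function such as helper may be inlined and disappear from the symbol table altogether.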
The object module provides all the details required to relocate and link the application to another program.
The below picture defines the Structure of the relocatable object file.
On Linux, object files use the ELF format (the MinGW object files shown above use the Windows COFF format instead). The ELF header at the top of this layout identifies the file as ELF. The file also contains a section header table with details on all the sections in the ELF file.
Let’s understand the four important sections in the ELF file.
- .text – The compiled machine code of our C program is stored in this section. It is readable and executable, but not writable.
- .data – All initialized global and static variables are stored in the .data section. (Read and Write Permission)
- .rodata – Constants and literals are stored in the .rodata section. (Read Permission)
- .bss – This section contains uninitialized global and static variables. (Read and Write Permission)
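As a rough sketch, the snippet below shows one piece of C code for each of the four sections listed above. The placements are what a typical GCC build on Linux does; the variable and function names are made up for this example.

```c
int initialized_global = 42;                 /* typically placed in .data   */
static int call_count;                       /* typically placed in .bss    */
const char banner[] = "Hack The Developer";  /* typically placed in .rodata */

/* The machine code of this function goes into .text. */
int bump(void)
{
    call_count++;                            /* .bss data starts at zero */
    return initialized_global + call_count;
}
```

The first call to bump() returns 43 and the second returns 44. Variables in .bss take no space in the file itself, since the loader simply hands the program zero-filled memory for them.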
To see the contents of the generated executable, use the readelf tool. The output below comes from a Linux (ELF) build.
$ readelf -a cprogram.exe
ELF Header:
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
Class: ELF64
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: DYN (Shared object file)
Machine: Advanced Micro Devices X86-64
Version: 0x1
Entry point address: 0x1060
Start of program headers: 64 (bytes into file)
Start of section headers: 14776 (bytes into file)
Flags: 0x0
Size of this header: 64 (bytes)
Size of program headers: 56 (bytes)
Number of program headers: 13
Size of section headers: 64 (bytes)
Number of section headers: 31
Section header string table index: 30
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
PHDR 0x0000000000000040 0x0000000000000040 0x0000000000000040
0x00000000000002d8 0x00000000000002d8 R 0x8
INTERP 0x0000000000000318 0x0000000000000318 0x0000000000000318
0x000000000000001c 0x000000000000001c R 0x1
[Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
LOAD 0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000600 0x0000000000000600 R 0x1000
LOAD 0x0000000000001000 0x0000000000001000 0x0000000000001000
0x0000000000000205 0x0000000000000205 R E 0x1000
LOAD 0x0000000000002000 0x0000000000002000 0x0000000000002000
0x0000000000000190 0x0000000000000190 R 0x1000
LOAD 0x0000000000002db8 0x0000000000003db8 0x0000000000003db8
0x0000000000000258 0x0000000000000260 RW 0x1000
DYNAMIC 0x0000000000002dc8 0x0000000000003dc8 0x0000000000003dc8
0x00000000000001f0 0x00000000000001f0 RW 0x8
NOTE 0x0000000000000338 0x0000000000000338 0x0000000000000338
0x0000000000000020 0x0000000000000020 R 0x8
NOTE 0x0000000000000358 0x0000000000000358 0x0000000000000358
0x0000000000000044 0x0000000000000044 R 0x4
GNU_PROPERTY 0x0000000000000338 0x0000000000000338 0x0000000000000338
0x0000000000000020 0x0000000000000020 R 0x8
GNU_EH_FRAME 0x0000000000002018 0x0000000000002018 0x0000000000002018
0x000000000000004c 0x000000000000004c R 0x4
GNU_STACK 0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 RW 0x10
GNU_RELRO 0x0000000000002db8 0x0000000000003db8 0x0000000000003db8
0x0000000000000248 0x0000000000000248 R 0x1
Symbol table '.symtab' contains 67 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
...........
36: 0000000000000000 0 FILE LOCAL DEFAULT ABS htd.c
37: 0000000000000000 0 FILE LOCAL DEFAULT ABS cprogram.c
.........
52: 0000000000001149 28 FUNC GLOBAL DEFAULT 16 printHTD
.........
63: 0000000000001165 25 FUNC GLOBAL DEFAULT 16 main
64: 0000000000004010 0 OBJECT GLOBAL HIDDEN 25 __TMC_END__
The executable file also has an ELF header at the top, followed by the program headers.
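The first bytes of the ELF header can be checked by hand. The sketch below is a minimal illustration, not a full ELF parser. It tests for the magic number, then reads the class byte and the data-encoding byte. The sample array copies the Magic line from the readelf output above; the function names are invented for this example.

```c
#include <stddef.h>

/* The 16 identification bytes from the "Magic:" line of the readelf output. */
static const unsigned char sample_ident[16] = {
    0x7f, 0x45, 0x4c, 0x46, 0x02, 0x01, 0x01, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
};

/* Returns 1 if the buffer starts with the ELF magic: 0x7f 'E' 'L' 'F'. */
int is_elf(const unsigned char *buf, size_t len)
{
    return len >= 4 && buf[0] == 0x7f &&
           buf[1] == 'E' && buf[2] == 'L' && buf[3] == 'F';
}

/* Byte 4 is the class: 1 means 32-bit, 2 means 64-bit. Returns 0 otherwise. */
int elf_class_bits(const unsigned char *buf, size_t len)
{
    if (len < 5 || !is_elf(buf, len))
        return 0;
    if (buf[4] == 1)
        return 32;
    if (buf[4] == 2)
        return 64;
    return 0;
}

/* Byte 5 is the data encoding: 1 means little endian, 2 means big endian. */
int elf_is_little_endian(const unsigned char *buf, size_t len)
{
    return len >= 6 && is_elf(buf, len) && buf[5] == 1;
}
```

For the sample bytes, elf_class_bits returns 64 and elf_is_little_endian returns 1, which matches the Class and Data lines that readelf printed.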
Hope You Like It!
An Abstract Syntax Tree (AST) is a tree-like data structure that represents the structure of the source code. It does not depend on the target architecture, because it is generated by the front end of the compiler, which parses the source language and knows nothing about the target hardware. ASTs can be generated for any programming language.
In the case of GCC, the -E (case-sensitive) option is used to dump the translation unit of the source code.
$ gcc -E cprogram.c
Most Unix-like operating systems have a built-in assembler tool called as, which is used to produce a relocatable object file from an assembly file.
$ as cprogram.s -o cprogram.o
To convert a C program directly to an object file, use the -c option in GCC.
$ gcc -c cprogram.c
To dump GCC's internal tree representation of the program, run the command below. It writes many files into the same directory. Use GraphViz to view a well-formatted graph of the AST.
$ gcc -fdump-tree-all-graph cprogram.c -o cprogram
The actual compilers behind gcc and g++ are cc1 and cc1plus, respectively; gcc and g++ themselves are compiler drivers. The diagram from https://nerdyelectronics.com/ has an error.