C Program Compilation Process - Source to Binary

You might know how to write a C program, but do you know how it is compiled and converted to binary? In this blog, we go through the steps of the C program compilation process that turn source code into a binary.

What is a Compiler?

In computing, a compiler is a computer program that translates computer code written in one programming language (the source language) into another language (the target language). The name “compiler” is primarily used for programs that translate source code from a high-level programming language to a lower level language (e.g., assembly language, object code, or machine code) to create an executable program.

Types of C Compilers

Before jumping into the stages a compiler goes through, it's worth knowing which C compilers are available and which one suits your development environment.

GCC

GCC stands for GNU Compiler Collection and is produced by the GNU Project. It is a free compiler released under the GPL (GNU General Public License). In this blog, we are going to use this compiler to compile our C program.

Clang

Clang is a compiler front end that uses the LLVM backend to compile not only C but also C++, Objective-C, and Objective-C++.

Clang is mostly used by macOS users because:

GCC had caused some problems for developers at Apple, as the source code is large and difficult to use. So, they had come up with Clang.

Source – EDUCBA

To learn about other C compilers, visit the list of compilers page on Wikipedia.

Let’s now look at what the C compiler pipeline means.

C Program Compilation Pipeline

After we finish writing the code, the first thing we do is compile it. It normally takes the compiler a few seconds to translate the code into machine language, but during this time the code goes through a series of steps that turn it into an executable file. These compilation phases run in sequence, so they are often called a pipeline.

Before we proceed further, there are two rules that we should know:

C Program Compilation Rule

  • Only Source files are compiled
  • Each Source File is Compiled Separately

The stages of the C Program Compilation Pipeline are:

  1. Preprocessing
  2. Compilation
  3. Assembly
  4. Linking

As a pipeline, the output of one component becomes the input of the next component, and this whole phase continues until the last product is collected from the pipeline. This will become clearer as you read the blog further.

The final product also depends on the system you are compiling on.

On Windows, an executable with the .exe extension is generated. On Linux, the executable needs no particular extension: it takes the name given with the -o option, or a.out if no name is given.

One thing to note about the compilation pipeline is that it produces an executable only if the source file passes through all of its components successfully.

Even a small failure in any of the components will lead to a compilation or linkage failure and give an error message.
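As a small illustration of a stage failing on purpose, the sketch below uses the standard #error directive to stop the build at the preprocessing stage and C11's _Static_assert to stop it at the compilation stage. The function name bits_per_byte and the conditions being checked are assumptions chosen for the example; they are not part of the program used in this blog.

```c
#include <limits.h>

/* Preprocessing stage: #error aborts the build before the compiler proper runs. */
#if CHAR_BIT != 8
#error "this sketch assumes 8-bit bytes"
#endif

/* Compilation stage: a failed _Static_assert (C11) is a compile-time error,
   so no assembly, object file, or executable would be produced. */
_Static_assert(sizeof(int) >= 2, "int must be at least 16 bits wide");

/* Hypothetical helper so the translation unit has something to call. */
int bits_per_byte(void)
{
    return CHAR_BIT;
}
```

On a typical desktop system both checks pass and the file compiles normally. Changing either condition so that it fails should stop the pipeline at that stage with an error message.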

Below is an image showing the steps for the compilation of the C program. The image describes each step and the file it creates. You can use it as a quick reference to the entire compilation process.

You can use a single command to get all the intermediate files that are generated by the C program Compilation Pipeline.

$ gcc -Wall -save-temps cprogram.c -o cprogram

This will dump all the intermediate files in the same directory, along with the executable file.

The -Wall option enables most of the compiler's warning messages, so potential problems are reported during the process. The -save-temps option in the GCC compiler driver is used to save all the intermediate files in the directory.

The files that will be saved are:

  • cprogram.i – Product of the Preprocessor step
  • cprogram.s – Product of the Compiler
  • cprogram.o – Product of the Assembler
  • cprogram (Linux) / cprogram.exe (Windows) – Executable file (Last Product)

In each segment of the C program compilation process, we will learn more about these intermediate files.

The command uses the cprogram.c file (used later in the blog) as the C source file. You should substitute the name of your own source file.

The -o option in the GCC compiler driver is used to assign the name to the output file.

In this example, each intermediate file is named cprogram, but the extensions are different.

Step 1 – Preprocessing

Pre-processing is the first step in the C Program Compilation Pipeline.

What is Preprocessing?

When writing a C program, we include header files, define macros, and sometimes use conditional compilation. All of these are expressed with preprocessor directives.

During the preprocessing step in the C program compilation pipeline, the preprocessor directives are processed and replaced with the text they stand for.

Let’s make that clear!

Header Files

A C source file usually includes a number of header files, and it is the preprocessor's task to pull in their contents.

Example:

If the program contains #include <stdio.h>, this line will be replaced by the contents of the stdio.h header file when the source file is preprocessed.

Macros

Macros are defined with the #define directive in the C programming language. During the preprocessing stage, each use of a macro is replaced by its value.
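Because the replacement is purely textual, a function-like macro needs parentheses to behave like a function. The following is a minimal sketch; the macro names MAX, DOUBLE_UNSAFE, and DOUBLE_SAFE and the three small functions are made up for illustration.

```c
/* Object-like macro: every later use of MAX is replaced by the token 10. */
#define MAX 10

/* Function-like macros: the argument is pasted in as text, not evaluated first. */
#define DOUBLE_UNSAFE(x) x + x
#define DOUBLE_SAFE(x)   ((x) + (x))

int max_value(void)     { return MAX; }
int unsafe_result(void) { return DOUBLE_UNSAFE(3) * 2; } /* expands to 3 + 3 * 2 */
int safe_result(void)   { return DOUBLE_SAFE(3) * 2; }   /* expands to ((3) + (3)) * 2 */
```

Here unsafe_result() evaluates to 9 while safe_result() evaluates to 12, which is why macro bodies and their parameters are usually wrapped in parentheses.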

Conditional Compilation

Often we want our compiled code to be minimal, so we can use conditional compilation. The preprocessor evaluates each condition and keeps only the lines whose condition is fulfilled.

Some preprocessor directives used with conditional compilation are:

  • #undef
  • #ifdef
  • #ifndef
  • #if
  • #else
  • #elif
  • #endif
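A short sketch of how several of these directives combine. LOG_LEVEL and BUILD_MODE are hypothetical names invented for this example; a value such as LOG_LEVEL could equally be supplied on the command line with GCC's -D option.

```c
/* LOG_LEVEL is a hypothetical setting; give it a default if nothing defined it. */
#ifndef LOG_LEVEL
#define LOG_LEVEL 1
#endif

/* Exactly one of these branches survives preprocessing. */
#if LOG_LEVEL >= 2
#define BUILD_MODE "verbose"
#elif LOG_LEVEL == 1
#define BUILD_MODE "normal"
#else
#define BUILD_MODE "quiet"
#endif

/* After preprocessing, this function simply returns the selected literal. */
const char *build_mode(void)
{
    return BUILD_MODE;
}
```

With the default above, build_mode() returns "normal". The other two string literals never reach the compiler, so they take up no space in the final binary.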

Comments

We use comments in our code for readability and reference. However, comments carry no information the compiler needs, so they are removed in the preprocessing stage.
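Strictly speaking, the C standard has each comment replaced by a single space rather than deleted outright. A minimal sketch of the consequence (the function name answer is made up for illustration):

```c
/* The comment between "int" and "answer" becomes one space, so the two
   tokens stay separate and the declaration is still valid. */
int/* return type */answer(void)
{
    return 42; // line comments are stripped in the same way
}
```

If comments were simply deleted, the first line would read intanswer(void) and fail to compile.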

The preprocessed code is often called the Translation Unit (or Compilation Unit).

You can see the pre-processed code of the C program in the section below.

Extract preprocessed code from GCC

We can also take a look at the output of each part of the compilation pipeline. Let's ask the C compiler driver to dump the translation unit without going any further.

To extract the translation unit from the source code, we can use GCC's -E option.

Example:

// Header File
#include <stdio.h>

#define Max 10

int main()
{
    printf("Hello World"); // Print Hello World
    printf("%d", Max);
    return 0;
}

$ gcc -E cprogram.c


# 1 "cprogram.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "cprogram.c"
.......
.......
.......
typedef __time64_t time_t;
# 435 "C:/msys64/mingw64/x86_64-w64-mingw32/include/corecrt.h" 3

typedef struct localeinfo_struct {
  pthreadlocinfo locinfo;
  pthreadmbcinfo mbcinfo;
} _locale_tstruct,*_locale_t;

typedef struct tagLC_ID {
  unsigned short wLanguage;
  unsigned short wCountry;
  unsigned short wCodePage;
} LC_ID,*LPLC_ID;

.......
.......
.......
# 1582 "C:/msys64/mingw64/x86_64-w64-mingw32/include/stdio.h" 2 3
# 3 "cprogram.c" 2


# 5 "cprogram.c"
int main()
{
    printf("Hello World");
    printf("%d", 10);
    return 0;
}

The translation unit above is abbreviated, because the full output is very large: it includes the contents of the stdio.h header file (1038 lines of translation unit in total). To see the whole output, run this command on your development machine.

This example illustrates how the preprocessor works. It performs only simple operations, such as inclusion, by copying the contents of a file, or macro expansion, by text substitution.

There is also a tool called cpp, which stands for C Pre-Processor and is used to preprocess a C file.

The C preprocessor or cpp is the macro preprocessor for the C, Objective-C, and C++ computer programming languages. The preprocessor provides the ability for the inclusion of header files, macro expansions, conditional compilation, and line control.

Source – C Preprocessor Wikipedia
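Line control is what markers such as # 5 "cprogram.c" in the preprocessed output are for: they tell the later stages which source line the following text came from. A program can also set this itself with the standard #line directive. In the sketch below, the file name virtual.c and the function names are hypothetical.

```c
/* __LINE__ expands to the current source line number. */
int line_before(void) { return __LINE__; }

/* #line renumbers the line that follows it and, optionally, renames the file. */
#line 100 "virtual.c"
int line_after(void) { return __LINE__; }          /* this line is now line 100 */

const char *file_after(void) { return __FILE__; }  /* now reports "virtual.c" */
```

Code generators tend to use this so that compiler diagnostics point back at the file the user actually wrote.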

This tool is typically shipped as part of the C development toolchain on Unix-like systems.

$ cpp cprogram.c

This command prints the preprocessed code of the C program.

A preprocessed file has the .i extension, and if you pass such a file to the C compiler driver, the preprocessing stage is bypassed. This happens because a file with the .i extension is assumed to have already been preprocessed and is sent directly to the compilation stage.

Step 2 – Compilation

In the previous section, we had our Translation Unit, and now we can move on to our next step, i.e. the compilation of the Translation Unit Code.

The input for the compilation component is the translation unit, obtained from the previous component. The output from the compilation component is the assembly code (Still Human-readable code, but closer to the hardware).

Extract Assembly Code from GCC

In the previous stage, we dumped the translation unit using the -E option. Here, we can use GCC's -S option to obtain the assembly code. This will create a file with the .s extension, and we will view the contents of the file using the cat command.

$ gcc -S cprogram.c

$ cat cprogram.s

	.file	"cprogram.c"
	.text
	.def	printf;	.scl	3;	.type	32;	.endef
	.seh_proc	printf
printf:
	pushq	%rbp
	.seh_pushreg	%rbp
	pushq	%rbx
	.seh_pushreg	%rbx
	subq	$56, %rsp
	.seh_stackalloc	56
	leaq	128(%rsp), %rbp
	.seh_setframe	%rbp, 128
	.seh_endprologue
	movq	%rcx, -48(%rbp)
	movq	%rdx, -40(%rbp)
	movq	%r8, -32(%rbp)
	movq	%r9, -24(%rbp)
	leaq	-40(%rbp), %rax
	movq	%rax, -96(%rbp)
	movq	-96(%rbp), %rbx
	movl	$1, %ecx
	movq	__imp___acrt_iob_func(%rip), %rax
	call	*%rax
	movq	%rbx, %r8
	movq	-48(%rbp), %rdx
	movq	%rax, %rcx
	call	__mingw_vfprintf
	movl	%eax, -84(%rbp)
	movl	-84(%rbp), %eax
	addq	$56, %rsp
	popq	%rbx
	popq	%rbp
	ret
	.seh_endproc
	.def	__main;	.scl	2;	.type	32;	.endef
	.section .rdata,"dr"
.LC0:
	.ascii "Hello World\0"
.LC1:
	.ascii "%d\0"
	.text
	.globl	main
	.def	main;	.scl	2;	.type	32;	.endef
	.seh_proc	main
main:
	pushq	%rbp
	.seh_pushreg	%rbp
	movq	%rsp, %rbp
	.seh_setframe	%rbp, 0
	subq	$32, %rsp
	.seh_stackalloc	32
	.seh_endprologue
	call	__main
	leaq	.LC0(%rip), %rcx
	call	printf
	movl	$10, %edx
	leaq	.LC1(%rip), %rcx
	call	printf
	movl	$0, %eax
	addq	$32, %rsp
	popq	%rbp
	ret
	.seh_endproc
	.ident	"GCC: (Rev3, Built by MSYS2 project) 10.2.0"
	.def	__mingw_vfprintf;	.scl	2;	.type	32;	.endef

The Compilation phase gives us the Assembly code that is unique to the target architecture.

Even with the same C compiler and the same C program, two machines with different processor architectures will get different assembly code.
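One way to observe the target from inside a program is through the macros a compiler predefines for it. The sketch below assumes the macro names that GCC and Clang commonly predefine (__x86_64__, __i386__, __aarch64__, __arm__); other compilers may use different names, and the function name target_arch is made up for the example.

```c
/* Returns a label for the architecture this translation unit was compiled for.
   The macros tested here are predefined by the compiler, not by any header. */
const char *target_arch(void)
{
#if defined(__x86_64__)
    return "x86-64";
#elif defined(__i386__)
    return "x86 (32-bit)";
#elif defined(__aarch64__)
    return "AArch64";
#elif defined(__arm__)
    return "ARM (32-bit)";
#else
    return "unknown";
#endif
}
```

The same source file therefore yields a different translation unit, and different assembly, depending on the target it is built for.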

Generating the assembly code from the C code is one of the most critical stages in the C Program Compilation Pipeline, since assembly is the low-level language that an assembler then translates into an object file.

Step 3 – Assembly

The Compilation stage gives us the assembly code that is the input for the next pipeline component, i.e. assembly.

In this stage, the actual instructions on the machine level are generated from the assembly code. Each architecture has its own assembler, which converts its own assembly code into its own machine code.

The assembler generates a relocatable object file from the assembly code.

Create Object file from the Assembly file

We can use the assembler tool, called as, to translate the assembly file to an object file.

The assembler tool takes the assembly file and produces a relocatable object file.

$ as cprogram.s -o cprogram.o

This assembler tool (as) gives us a new file with a .o extension (.obj with Microsoft's toolchain on Windows), which is the relocatable object file.

If you want to translate a C program directly to an object file, you can use the -c option in the GCC compiler driver.

Using -c with the GCC compiler driver runs the first three stages of the compilation pipeline, i.e. preprocessing, compilation, and assembly, in one go.

$ gcc -c cprogram.c

This is really helpful when you work with object files, since repeating all of the above steps by hand for a number of files quickly becomes tedious.

An object file contains low-level machine code and is thus not readable by humans. Later, we will learn about a tool that allows us to see the contents of an object file.

Now that we know how to build object files from both the assembly file and the C program, it's time to learn about the linking stage of the C program compilation process.

Step 4 – Linking

This is one of the most critical steps in the C compilation pipeline: the generated relocatable object files are combined (linked) to create another object file, one that is executable.

Let’s take a look at the situation.

Suppose we have a custom header file htd.h, which contains the prototype of a printHTD function, and a source file htd.c, which contains the definition of that function.

htd.c:

#include <stdio.h>
#include "htd.h"

void printHTD()
{
    printf("Hack The Developer");
}

htd.h:

void printHTD();

The htd.h header file is included in the cprogram.c source file.

#include "htd.h"
int main()
{
    printHTD();
    return 0;
}

As we know there are two source files, and we’ll have to generate separate object files, which will be linked by the linker later to provide us an executable object file.

$ gcc -c cprogram.c

$ gcc -c htd.c

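The above commands create two separate relocatable object files. Now let's link the object files.

We can try the ld tool, which is the default linker on Unix-like systems, to link the relocatable object files.

But invoked directly like this, ld gives us undefined reference errors, because it is not told where to find the C runtime startup code and the standard library that the object files depend on.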

$ ld htd.o cprogram.o -o cprogram.exe

C:\msys64\mingw64\bin\ld.exe: cprogram.o:cprogram.c:(.text+0x32): undefined reference to `__imp___acrt_iob_func'
C:\msys64\mingw64\bin\ld.exe: cprogram.o:cprogram.c:(.text+0x43): undefined reference to `__mingw_vfprintf'
C:\msys64\mingw64\bin\ld.exe: cprogram.o:cprogram.c:(.text+0x78): undefined reference to `__main'

So we are going to use gcc for the linking step. The gcc driver calls the linker for us and passes it the C runtime startup files and the standard libraries, so the undefined references are resolved.

$ gcc htd.o cprogram.o -o cprogram.exe

$ ./cprogram.exe

Hack The Developer

The linking was successful as we got the required output.
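What the linker resolved here is the gap between a declaration and a definition. The sketch below squeezes both roles into one file so it can be compiled on its own; the names htd_message and htd_message_length are hypothetical stand-ins for the printHTD example, and in a real multi-file build the definition at the bottom would live in a separate source file.

```c
#include <stddef.h>
#include <string.h>

/* Declaration only: this is the role htd.h plays. */
const char *htd_message(void);

/* The caller compiles with just the declaration in scope. If the definition
   lived in another file, the object file would record htd_message as an
   undefined symbol for the linker to resolve. */
size_t htd_message_length(void)
{
    return strlen(htd_message());
}

/* Definition: this is the role htd.c plays. */
const char *htd_message(void)
{
    return "Hack The Developer";
}
```

The compiler is satisfied by the declaration alone. A missing definition would only be reported at link time, as an undefined reference.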

In each section, we examined the intermediate file it produced, except for the object file. Let's look at it in the next section.

Analysis of Object Files

It was said earlier in the Assembly section that we will take a look at the tool that lets us see the contents of the object file.

The nm tool is used to display symbols that can be found in an object file.

$ nm htd.o

0000000000000000 b .bss
0000000000000000 d .data
0000000000000000 r .rdata$zzz
0000000000000000 t .text

$ nm cprogram.o

0000000000000000 b .bss
0000000000000000 d .data
0000000000000000 p .pdata
0000000000000000 r .rdata
0000000000000000 r .rdata$zzz
0000000000000000 t .text
0000000000000000 r .xdata
                 U __imp___acrt_iob_func
                 U __main
                 U __mingw_vfprintf
000000000000006f T main
0000000000000000 t printf
0000000000000054 T printHTD

An object file provides all the details the linker needs to relocate its code and data and combine it with other object files into a program.
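In nm output, an uppercase T marks a global symbol in the text section, a lowercase t marks a local one, and U marks a symbol that is undefined in this object file. Whether a function ends up global or local follows from its linkage in the C source. A minimal sketch, with made-up function names:

```c
/* static gives internal linkage: nm would list this as a local 't' symbol
   (or not at all if the optimizer inlines it). */
static int square(int x)
{
    return x * x;
}

/* No static: external linkage, listed by nm as a global 'T' symbol that
   other object files can link against. */
int sum_of_squares(int a, int b)
{
    return square(a) + square(b);
}
```

Marking helpers static keeps them out of the set of names other object files can refer to, which avoids accidental clashes at link time.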

The picture below shows the structure of a relocatable object file.

Object File Structure

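ELF (Executable and Linkable Format) is the object file format used on Linux. The MinGW toolchain that produced the Windows output earlier in this blog generates COFF/PE files instead, which are organized along similar lines. The ELF header at the top of the structure identifies the file as ELF and describes its overall layout. A section header table holds details on every section in the ELF file.

Let's understand four important sections in the ELF file.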

  • .text – The machine code compiled from our C program is stored in this section. It is mapped with read and execute permission, but not write permission.
  • .data – All initialized global and static variables are stored in the .data section. (Read and Write Permission)
  • .rodata – Constants and literals are stored in the .rodata section. (Read Permission)
  • .bss – This section contains uninitialized global and static variables. (Read and Write Permission)
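To connect these sections back to source code, here is a rough sketch of where typical declarations tend to end up. The placement in the comments reflects what GCC commonly does on ELF targets and should be read as an assumption, since compilers and options can change it; all the names are invented for the example.

```c
int initialized_counter = 42;     /* initialized global: typically .data */
int uninitialized_counter;        /* uninitialized global: typically .bss, starts at 0 */
static int call_count;            /* uninitialized static: typically .bss */
const int max_items = 10;         /* constant: typically .rodata */

/* The function's machine code goes to .text; the string literal it
   returns usually lives in .rodata. */
const char *greeting(void)
{
    call_count++;
    return "Hello World";
}

/* Reads back the .bss-resident counter. */
int greeting_calls(void)
{
    return call_count;
}
```

Variables in .bss take no space in the file beyond a recorded size, because they are known to start as zero. That is why uninitialized_counter and call_count both read as 0 before anything writes to them.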

To see the contents of the generated executable, use the readelf tool (the output below is from a Linux build):

$ readelf -a cprogram.exe

ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              DYN (Shared object file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x1060
  Start of program headers:          64 (bytes into file)
  Start of section headers:          14776 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         13
  Size of section headers:           64 (bytes)
  Number of section headers:         31
  Section header string table index: 30


Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  PHDR           0x0000000000000040 0x0000000000000040 0x0000000000000040
                 0x00000000000002d8 0x00000000000002d8  R      0x8
  INTERP         0x0000000000000318 0x0000000000000318 0x0000000000000318
                 0x000000000000001c 0x000000000000001c  R      0x1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
  LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000600 0x0000000000000600  R      0x1000
  LOAD           0x0000000000001000 0x0000000000001000 0x0000000000001000
                 0x0000000000000205 0x0000000000000205  R E    0x1000
  LOAD           0x0000000000002000 0x0000000000002000 0x0000000000002000
                 0x0000000000000190 0x0000000000000190  R      0x1000
  LOAD           0x0000000000002db8 0x0000000000003db8 0x0000000000003db8
                 0x0000000000000258 0x0000000000000260  RW     0x1000
  DYNAMIC        0x0000000000002dc8 0x0000000000003dc8 0x0000000000003dc8
                 0x00000000000001f0 0x00000000000001f0  RW     0x8
  NOTE           0x0000000000000338 0x0000000000000338 0x0000000000000338
                 0x0000000000000020 0x0000000000000020  R      0x8
  NOTE           0x0000000000000358 0x0000000000000358 0x0000000000000358
                 0x0000000000000044 0x0000000000000044  R      0x4
  GNU_PROPERTY   0x0000000000000338 0x0000000000000338 0x0000000000000338
                 0x0000000000000020 0x0000000000000020  R      0x8
  GNU_EH_FRAME   0x0000000000002018 0x0000000000002018 0x0000000000002018
                 0x000000000000004c 0x000000000000004c  R      0x4
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RW     0x10
  GNU_RELRO      0x0000000000002db8 0x0000000000003db8 0x0000000000003db8
                 0x0000000000000248 0x0000000000000248  R      0x1

Symbol table '.symtab' contains 67 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND 
    ...........    
    36: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS htd.c
    37: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS cprogram.c
    .........
    52: 0000000000001149    28 FUNC    GLOBAL DEFAULT   16 printHTD
    .........
    63: 0000000000001165    25 FUNC    GLOBAL DEFAULT   16 main
    64: 0000000000004010     0 OBJECT  GLOBAL HIDDEN    25 __TMC_END__


The executable file also has an ELF header at the top, followed by the program headers, which describe the segments that the loader maps into memory.

Executable File Structure

Hope You Like It!
