Decoding C Compilation Process: From Source Code to Binary

Have you ever wondered what happens behind the scenes when you write a C program? How does your code transform from lines of text into a fully functional binary executable? If you’ve been curious about the intricacies of the C program compilation process, you’ve come to the right place.

In this blog, we’ll take you on a journey through the various steps involved in converting your C source code into binary form. We’ll unravel the mysteries of the compilation process and shed light on the essential stages that bring your code to life. By the end of this article, you’ll have a clear understanding of how the C program compilation process works.

What is a Compilation?

Compilation is a fundamental process in computer programming that takes your human-readable code and transforms it into something a computer can understand and execute. Think of it as a translation of sorts, where your code is converted into a language that the computer can comprehend and act upon.

During compilation, your code goes through several stages, each serving a specific purpose. These stages work together to ensure that your code is transformed into a format that the computer can execute efficiently.

In the following sections, we will delve deeper into each of these stages, exploring their roles and significance in the compilation process.

Types of C/C++ Compilers

Before diving into the stages of the C program compilation process, it’s essential to familiarize ourselves with the different types of C compilers available. Understanding these compiler types can help you choose the right one for your specific needs. Let’s explore them briefly:

  1. Integrated Development Environment (IDE) Toolchains: IDEs like Visual Studio, Code::Blocks, and Eclipse either bundle a compiler (Visual Studio ships with MSVC, and Code::Blocks offers an installer that includes MinGW GCC) or drive one that is already installed, while providing a user-friendly interface and integrated tools for code editing, debugging, and project management.
  2. GNU Compiler Collection (GCC): GCC is a widely used and stable compiler suite supporting multiple programming languages, including C. It offers excellent performance and adherence to language standards.
  3. LLVM/Clang: LLVM is an open-source compiler infrastructure project, with Clang as its C/C++ compiler component. Clang provides fast compilation times, strict adherence to standards, and helpful error messages.
  4. Intel C++ Compiler (ICC): ICC is optimized for generating code for Intel processors, leveraging advanced optimization techniques like auto-vectorization and parallelization.
  5. TinyCC (TCC): TCC is a lightweight and fast compiler suitable for resource-constrained environments, often used for rapid prototyping, scripting, and embedded systems.

To learn about other C compilers, see the list of compilers page on Wikipedia.

Note: For the purpose of this blog, we will be using the GCC compiler for all the examples provided. While the concepts discussed in this blog are applicable to other compilers as well, the code examples and demonstrations will specifically use the GCC compiler.

C Program Compilation Pipeline

After writing your code, the next step is to compile it. The compiler takes a few seconds to translate your code into machine language. But behind the scenes, the code goes through a series of steps in what we call the C Program Compilation Pipeline. These steps are sequential, forming a pipeline of processes.

The compilation pipeline consists of the following stages:

  1. Preprocessing
  2. Compilation
  3. Assembly
  4. Linking

It’s important to note that the output of each stage becomes the input for the next stage in the pipeline. This sequential process continues until the final product is obtained. The name of that final product depends on the system you compile on: on Windows, GCC produces an executable with a .exe extension (a.exe by default), while on Linux the default output file is named a.out unless you choose another name with the -o option.

It’s crucial to understand that successful compilation depends on the code passing through each component of the pipeline without any errors. Even a small failure in any of the stages can lead to compilation or linkage errors, resulting in error messages.

To give you a visual representation of the entire compilation process, refer to the image below:

Compilation Pipeline Intermediate Files

To access all the intermediate files generated during the compilation process, you can use the following command:

$ gcc -Wall -save-temps cprogram.c -o program

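Here, the -Wall option enables a broad set of commonly useful compiler warnings (errors are always reported, with or without it). The -save-temps option tells the GCC compiler driver to keep the intermediate files instead of deleting them, by default in the current directory.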

The intermediate files that will be saved are as follows:

  • cprogram.i: Generated by the Preprocessor stage
  • cprogram.s: Generated by the Compiler stage
  • cprogram.o: Generated by the Assembler stage
  • program (Linux) / program.exe (Windows): Executable file (Final Product), named by the -o option

In the upcoming segments of this guide, we will delve deeper into each segment of the C program compilation process, providing more information about these intermediate files.

Also, learn about the Memory layout of a C Program.

C Program Compilation Rules

Now, let’s take a closer look at two important rules that govern the compilation of C programs. Understanding these rules will give you a clearer picture of how the compilation process works.

1. Only Source Files are Compiled

In the world of C programming, it’s essential to remember that only source files are compiled. Source files typically have the “.c” extension and contain the actual code you’ve written. These files are the heart of your program, where you define functions, declare variables, and implement the desired functionality.

On the other hand, header files, denoted by the “.h” extension, are not directly compiled. They serve as a way to share declarations and function prototypes among multiple source files. Header files are typically included in your source files using the #include directive. The preprocessor takes care of including the content of header files during the preprocessing stage, ensuring that the necessary information is available for compilation.

By separating the code into source and header files, you can promote modularity, reusability, and maintainability in your C programs. This division allows you to make changes to specific parts of your program without affecting the entire codebase.
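
As a minimal sketch of this split, the listing below shows a hypothetical header and its source file side by side. The names area.h, area.c, and rect_area are assumptions made for illustration only. The include guard (#ifndef / #define / #endif) is a common convention that keeps a header from being processed twice in the same translation unit.

```c
/* ---- area.h (hypothetical header): declarations only ---- */
#ifndef AREA_H
#define AREA_H

/* Prototype: tells the compiler the function's type, but provides no body. */
int rect_area(int width, int height);

#endif /* AREA_H */

/* ---- area.c (hypothetical source): the actual definition ---- */
/* In a real project this file would begin with: #include "area.h" */
int rect_area(int width, int height)
{
    return width * height;
}
```

Only area.c would be handed to the compiler; any other source file that calls rect_area would simply include area.h.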

2. Each Source File is Compiled Separately

When you compile a C program that consists of multiple source files, each source file is compiled independently. This means that the compilation process treats each file as a separate entity.

During the compilation stage, the compiler analyzes the syntax, checks for errors, and generates object code for each source file. This object code contains machine instructions specific to the target architecture, but it’s not yet in the final executable form.

Once each source file has been compiled individually, the object files are generated. These object files contain the compiled code for each source file, ready to be combined in the subsequent linking stage.

The separation and independent compilation of source files allow for efficient code organization, as well as the ability to reuse code across different projects. It also enables faster compilation times, especially when only a portion of the code needs to be recompiled after making changes.

Step 1: Preprocessing

The preprocessing stage is the first step in the C program compilation pipeline. It plays a crucial role in preparing the source code for further compilation. Let’s explore what preprocessing entails and its various aspects.

What is preprocessing?

Preprocessing involves handling directives and modifications made to the original source code before actual compilation. These directives instruct the preprocessor on how to manipulate the code. The preprocessor performs tasks like including header files, expanding macros, handling conditional compilation, and removing comments.

Types of Preprocessor Directives

  1. Header Files: Header files contain declarations and definitions that are shared among multiple source files. They provide essential functions, constants, and structures required by the program. By using the #include directive, we can include header files in our program and make their contents available for use.
  2. Macros: Macros are names, defined with the #define directive, that stand for constants or small snippets of code. They can help simplify repetitive tasks and improve code readability. Whenever the preprocessor encounters a macro name, it expands it, replacing the name with its corresponding value or code snippet.
  3. Conditional Compilation: Conditional compilation enables us to include or exclude specific sections of code based on certain conditions. We can use directives like #ifdef, #ifndef, #if, #elif, #else, and #endif to conditionally compile code blocks. This feature allows us to adapt our code to different platforms or configurations.
  4. Comments: Comments in code serve as annotations and do not have any impact on the program’s execution. The preprocessor removes comments from the code during the preprocessing stage. Comments are essential for documenting our code, making it more readable and understandable for both ourselves and other developers.
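
The directive types above can be sketched together in one small file. This is an illustrative example rather than code from a real project: BUFFER_SIZE, SQUARE, LOG, VERBOSE, and scaled_square are all hypothetical names, and VERBOSE is assumed to be switched on from the command line with GCC's -D option (for example, gcc -DVERBOSE).

```c
#include <stdio.h>                  /* header inclusion */

#define BUFFER_SIZE 64              /* object-like macro: a named constant */
#define SQUARE(x) ((x) * (x))       /* function-like macro, fully parenthesized */

/* Conditional compilation: VERBOSE is a hypothetical flag (gcc -DVERBOSE). */
#ifdef VERBOSE
#define LOG(msg) puts(msg)
#else
#define LOG(msg) ((void)0)          /* expands to a no-op when VERBOSE is unset */
#endif

/* After preprocessing, the return statement reads:
   return ((n + 1) * (n + 1)) + 64; */
int scaled_square(int n)
{
    LOG("scaled_square called");
    return SQUARE(n + 1) + BUFFER_SIZE;
}
```

Running gcc -E on a file like this shows the macros already replaced and the comments gone, which is a quick way to check what the compiler proper will actually see.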

The pre-processed code is often called the Translation Unit (Compilation Unit).

You can see the pre-processed code of the C program in the section below.

Extract Preprocessed Code from GCC

To extract the preprocessed code, we can use the -E option with the GCC compiler. This option tells the compiler driver to stop the compilation process after preprocessing and dump the translation unit.

For example, let’s consider the following C program:

// Header File
#include <stdio.h>
#define Max 10

int main() {
    printf("Hello World");
    printf("%d", Max);
    return 0;
}

To extract the preprocessed code using GCC, we can run the following command:

$ gcc -E cprogram.c

Output:

# 1 "cprogram.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "cprogram.c"
.......
.......
.......
typedef __time64_t time_t;
# 435 "C:/msys64/mingw64/x86_64-w64-mingw32/include/corecrt.h" 3
typedef struct localeinfo_struct {
  pthreadlocinfo locinfo;
  pthreadmbcinfo mbcinfo;
} _locale_tstruct,*_locale_t;
typedef struct tagLC_ID {
  unsigned short wLanguage;
  unsigned short wCountry;
  unsigned short wCodePage;
} LC_ID,*LPLC_ID;
.......
.......
.......
# 1582 "C:/msys64/mingw64/x86_64-w64-mingw32/include/stdio.h" 2 3
# 3 "cprogram.c" 2
# 5 "cprogram.c"
int main()
{
    printf("Hello World");
    printf("%d", 10);
    return 0;
}

The output will display the preprocessed code, including the content of the included header files, with macros expanded and comments removed. However, the output may be quite extensive, especially when large header files are involved.

To see the full output, you can run the command on your development machine. It provides insights into how the preprocessor functions by copying file contents during inclusion and expanding macros through text substitution.
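
Because macro expansion is plain text substitution, operator precedence can produce surprises. The sketch below uses two hypothetical macros, BAD_DOUBLE and GOOD_DOUBLE, that differ only in their parentheses; the function names are likewise made up for illustration.

```c
/* Both macros are hypothetical; they differ only in parenthesization. */
#define BAD_DOUBLE(x)  x + x
#define GOOD_DOUBLE(x) ((x) + (x))

/* Expands to: 3 * 2 + 2, which evaluates to 8. */
int triple_of_bad_double(void)
{
    return 3 * BAD_DOUBLE(2);
}

/* Expands to: 3 * ((2) + (2)), which evaluates to 12. */
int triple_of_good_double(void)
{
    return 3 * GOOD_DOUBLE(2);
}
```

Inspecting the gcc -E output for these two functions makes the difference visible before the code is ever compiled.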

Additionally, the CPP tool, which stands for C Pre-Processor, can be utilized to preprocess a C file. It is part of the C development kit shipped with various UNIX flavors. Running the command below provides the preprocessed code for the C program:

$ cpp cprogram.c

By convention, a preprocessed C file has the .i extension (cpp itself writes to standard output, so redirect it to a file to keep the result). If you pass a .i file to the C compiler driver, the preprocessor stage will be bypassed. This is because a file with the .i extension is assumed to have already undergone preprocessing and can be sent directly to the compilation stage.

Step 2: Compilation

Once the preprocessor has completed its job of handling directives and macros, we move on to the next stage of the C compilation process: compilation itself. This stage plays a vital role in transforming the preprocessed code into assembly language.

During compilation, the compiler analyzes the preprocessed code, checks it for syntax and type errors, and generates assembly code for the target architecture. It’s like a translator that understands the C language and converts it into a lower-level language that is one step away from what the processor executes.

Extract Assembly Code from GCC

To extract the assembly code from our C program, we can utilize the GCC compiler and its helpful options. In the previous stage, we used the -E option to dump the translation unit code. Here, we will use the -S option to obtain the assembly code. When we execute the following command:

$ gcc -S cprogram.c

The GCC compiler will generate a file with the .s extension, containing the assembly code corresponding to our C program. To view the contents of this file, we can use the cat command:

$ cat cprogram.s

Output:

.file	"cprogram.c"
.text
.def	printf;	.scl	3;	.type	32;	.endef
.seh_proc	printf
printf:
	pushq	%rbp
	.seh_pushreg	%rbp
	pushq	%rbx
	.seh_pushreg	%rbx
	subq	$56, %rsp
	.seh_stackalloc	56
	leaq	128(%rsp), %rbp
	.seh_setframe	%rbp, 128
	.seh_endprologue
	movq	%rcx, -48(%rbp)
	movq	%rdx, -40(%rbp)
	movq	%r8, -32(%rbp)
	movq	%r9, -24(%rbp)
	leaq	-40(%rbp), %rax
	movq	%rax, -96(%rbp)
	movq	-96(%rbp), %rbx
	movl	$1, %ecx
	movq	__imp___acrt_iob_func(%rip), %rax
	call	*%rax
	movq	%rbx, %r8
	movq	-48(%rbp), %rdx
	movq	%rax, %rcx
	call	__mingw_vfprintf
	movl	%eax, -84(%rbp)
	movl	-84(%rbp), %eax
	addq	$56, %rsp
	popq	%rbx
	popq	%rbp
	ret
	.seh_endproc
	.def	__main;	.scl	2;	.type	32;	.endef
	.section .rdata,"dr"
.LC0:
	.ascii "Hello World"
.LC1:
	.ascii "%d"
	.text
	.globl	main
	.def	main;	.scl	2;	.type	32;	.endef
	.seh_proc	main
main:
	pushq	%rbp
	.seh_pushreg	%rbp
	movq	%rsp, %rbp
	.seh_setframe	%rbp, 0
	subq	$32, %rsp
	.seh_stackalloc	32
	.seh_endprologue
	call	__main
	leaq	.LC0(%rip), %rcx
	call	printf
	movl	$10, %edx
	leaq	.LC1(%rip), %rcx
	call	printf
	movl	$0, %eax
	addq	$32, %rsp
	popq	%rbp
	ret
	.seh_endproc
	.ident	"GCC: (Rev3, Built by MSYS2 project) 10.2.0"
	.def	__mingw_vfprintf;	.scl	2;	.type	32;	.endef

The assembly code obtained from the compilation phase is unique to the target architecture. Even if we use the same C language compiler for two different machines with the same C program but different processors and hardware, the generated assembly codes would be different.
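
One way to see that the compiler always works for one specific target is through the macros it predefines for that target. The sketch below is an illustration, not part of the original example: target_arch is a hypothetical helper, and the macros it tests are the ones GCC, Clang, and MSVC commonly predefine for these architectures.

```c
/* Returns a short name for the architecture this file was compiled for.
   The macro names are compiler-predefined; the function is hypothetical. */
const char *target_arch(void)
{
#if defined(__x86_64__) || defined(_M_X64)
    return "x86-64";
#elif defined(__aarch64__) || defined(_M_ARM64)
    return "arm64";
#elif defined(__i386__) || defined(_M_IX86)
    return "x86";
#else
    return "unknown";
#endif
}
```

The same source file compiled for two different processors keeps only one of these branches after preprocessing, and the assembly generated from it uses that processor's instruction set.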

This stage is crucial in the C program compilation pipeline because the assembly code acts as a bridge between the high-level C code and the low-level machine code. It provides a human-readable representation of the program’s operations and instructions, making it easier for programmers to understand the inner workings of their code.

The assembly code generated in this phase can be further translated into an object file using an assembler. This object file contains machine code specific to the target processor architecture and serves as an intermediary step toward creating the final executable file.

Understanding the assembly code is beneficial for optimizing the performance of our programs, as we can analyze and fine-tune the instructions that the computer will execute. It allows us to dive deeper into the underlying hardware and exploit its capabilities effectively.
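
A small function is a convenient starting point for reading compiler output. The file name sum_to.c and the function sum_to below are assumptions for illustration. Compiling it once with gcc -S -O0 sum_to.c and once with gcc -S -O2 sum_to.c and comparing the two .s files shows how much the optimization level can change the generated instructions; the exact output depends on your compiler version and target.

```c
/* sum_to.c (hypothetical file name) */

/* Adds the integers 1..n with a simple loop. */
unsigned sum_to(unsigned n)
{
    unsigned total = 0;
    for (unsigned i = 1; i <= n; ++i) {
        total += i;
    }
    return total;
}
```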

Step 3: Assembly in C Compilation

In the third stage of the C program compilation pipeline, we encounter the assembly process. This phase plays a crucial role in converting the assembly code generated in the compilation stage into machine code specific to the target processor architecture.

Each architecture has its own assembler, a specialized tool that takes the assembly code as input and produces a relocatable object file. This file serves as an intermediate representation of the code and contains machine-level instructions.

Create an Object file from the Assembly file

To create an object file from the assembly file, we can use the GNU assembler, as, which is part of GNU Binutils and is the assembler that GCC itself invokes. By executing the following command, we can translate the assembly file into a relocatable object file:

$ as cprogram.s -o cprogram.o

The resulting file will have a .o extension (Microsoft’s own toolchain uses .obj instead), indicating its nature as a relocatable object file. This object file contains machine code, which is not easily readable by humans. However, we will analyze the generated object file later in the blog.

Alternatively, if you wish to directly translate a C program into an object file, you can employ the -c option with the GCC compiler driver. This option consolidates the first three stages of the compilation pipeline, namely preprocessing, compilation, and assembling. By using the following command, you can generate the object file:

$ gcc -c cprogram.c

This feature proves particularly useful when dealing with multiple files, as it saves the effort of repeating the previous steps for each individual file.

In the previous sections, we looked at the intermediate file that each stage generates. For this stage, we will analyze the object file after the linking section, where its contents make more sense.

Step 4: Linking in C Compilation Process

Now, let’s focus on the last stage of the C program compilation pipeline: linking. Linking plays a crucial role in transforming the generated relocatable object files into executable object files that can be run on your system.

What does the Linker do in C Compilation?

Let’s explore what the linker does in this last stage.

  1. Combining Object Files: One of the primary tasks of the linker is to combine the object files produced during the assembly stage. These object files contain machine code specific to different functions or modules of your program.
  2. Resolving External References: When you use functions or variables defined in other source files or libraries, your program relies on external references. The linker connects the function calls and variable references in one object file with the corresponding definitions in other object files or libraries.
  3. Symbol Table Management: The linker maintains a symbol table, which serves as a map connecting symbols (such as function and variable names) to their memory locations in the object files. The symbol table enables the linker to resolve references and establish the correct associations between the symbols used in different parts of the program.
  4. Handling Library Dependencies: Many programs make use of external libraries that contain precompiled code for common functions or specialized tasks. The linker handles the inclusion of these libraries by linking them with the object files.
  5. Generating the Executable File: Once the linker has combined the object files, resolved external references, and handled library dependencies, it generates the final executable file. This file contains the machine code of your program, ready to be executed by the computer’s processor.
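
The difference between a declaration and a definition is what gives the linker its work. As a minimal single-file sketch (shared_counter and bump_counter are hypothetical names), the declarations below are what a header would expose, and the definitions are what exactly one object file must provide.

```c
/* Declarations: what a header would contain. An object file that uses
   these names without defining them records them as undefined symbols. */
extern int shared_counter;
int bump_counter(void);

/* Definitions: storage and code. The linker matches every undefined
   reference in other object files against definitions like these. */
int shared_counter = 0;

int bump_counter(void)
{
    return ++shared_counter;
}
```

If the definitions were missing from every object file and library passed to the linker, the build would stop with an undefined reference error; if two object files both defined shared_counter, the linker would normally report a duplicate definition.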

Link Multiple C files Using Linker

To understand the concept of linking, let’s consider a scenario with a custom header file called htd.h. This header file contains a function prototype for printHTD.

void printHTD();

Additionally, we have a source file named htd.c, which provides the function definition for the header file.

#include <stdio.h>
#include "htd.h"

void printHTD()
{
    printf("Hack The Developer");
}

In our main source file, cprogram.c, we include the htd.h header file:

#include "htd.h"

int main()
{
    printHTD();
    return 0;
}

Since we have two source files, we need to generate separate object files, which will be linked by the linker to create a final executable object file. We can achieve this by executing the following commands:

$ gcc -c cprogram.c
$ gcc -c htd.c

The above commands create two distinct relocatable object files. Now, let’s proceed with linking these object files.

Initially, we might consider calling ld, the GNU linker and the default linker on many Unix-like systems, directly. However, passing only our two object files to ld results in undefined reference errors, because ld on its own does not add the C runtime startup code and the standard libraries that the program depends on:

$ ld htd.o cprogram.o -o cprogram.exe

Output:

C:\msys64\mingw64\bin\ld.exe: cprogram.o:cprogram.c:(.text+0x32): undefined reference to `__imp___acrt_iob_func'
C:\msys64\mingw64\bin\ld.exe: cprogram.o:cprogram.c:(.text+0x43): undefined reference to `__mingw_vfprintf'
C:\msys64\mingw64\bin\ld.exe: cprogram.o:cprogram.c:(.text+0x78): undefined reference to `__main'

Consequently, we’ll use the gcc command for the linking step instead. The gcc driver does not contain a linker of its own; it invokes ld for us and automatically passes the C runtime startup files and standard libraries that the program needs. Here’s how we can proceed:

$ gcc htd.o cprogram.o -o cprogram.exe
$ ./cprogram.exe

Output:

Hack The Developer

As we can see, the linking process was successful, and we obtained the desired output.

The linking process brings together all the necessary components of your program, resolving any references to functions or variables defined in different object files. It ensures that the program has access to the required code and data during runtime, allowing it to function as intended.

Analyze Object Files

Now, let’s take a closer look at the object files generated during the assembly process. These object files contain important information that allows us to relocate and link the application to other programs.

To analyze the contents of an object file, we can use the nm tool. By running the command nm <object-file>, we can display the symbols present in the object file.

For example, let’s analyze the object files generated in the linking section:

$ nm htd.o

0000000000000000 b .bss
0000000000000000 d .data
0000000000000000 r .rdata$zzz
0000000000000000 t .text

$ nm cprogram.o

0000000000000000 b .bss
0000000000000000 d .data
0000000000000000 p .pdata
0000000000000000 r .rdata
0000000000000000 r .rdata$zzz
0000000000000000 t .text
0000000000000000 r .xdata
                 U __imp___acrt_iob_func
                 U __main
                 U __mingw_vfprintf
000000000000006f T main
0000000000000000 t printf
0000000000000054 T printHTD

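The output of the nm command lists the object file’s sections and symbols. The letter before each name gives the symbol type: T marks a global symbol defined in the text (code) section, t a local one, and U an undefined symbol that the linker must resolve from another object file or library. In the cprogram.o listing, T printHTD appears alongside t printf, even though main never calls printf directly. The printf symbol does not come from htd.h, which contains only a prototype. It comes from <stdio.h>: the MinGW headers used here define printf as a small inline wrapper around __mingw_vfprintf, so an object file that includes <stdio.h> and calls printf carries its own local copy, while __mingw_vfprintf itself stays undefined until link time.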

The nm tool reveals different sections within the object file and the symbols associated with them. Each symbol represents a function, variable, or section present in the object file.
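
To connect source code with the symbol letters nm prints, consider the sketch below. The function names are hypothetical, and the letters in the comments describe what GCC typically produces when compiling without optimization; with optimization enabled, a static function may be inlined and vanish from the symbol table.

```c
#include <stdio.h>

/* static: internal linkage, so nm typically lists it as a local
   text symbol (t). Other object files cannot refer to it. */
static int clamp_to_ten(int value)
{
    return value > 10 ? 10 : value;
}

/* External linkage: nm typically lists it as a global text symbol (T). */
int report_clamped(int value)
{
    int clamped = clamp_to_ten(value);
    /* printf lives in the C library. Depending on the library headers it
       shows up as undefined (U) or, as with MinGW, as a local wrapper. */
    printf("%d\n", clamped);
    return clamped;
}
```

Compiling this with gcc -c and running nm on the resulting object file lets you compare the listing against the comments above.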

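Moving on, let’s discuss the structure of the relocatable object file. On Linux and most other Unix-like systems, object files use the ELF format. The MinGW toolchain used for the Windows examples above produces PE/COFF files instead, but the main sections play the same roles.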

Executable and Linkable Format (ELF)

ELF (Executable and Linkable Format) is widely used in Unix-like operating systems. An ELF file starts with the ELF header, typically followed by a program header table (used when loading executables), the sections themselves, and a section header table that describes each section.

Figure: Object File Structure

In an ELF object file, there are four important sections that we should pay attention to:

  1. .text: This section contains the compiled machine code of the program. It has read and execute permissions, meaning it can be read and executed by the system, but not modified.
  2. .data: All initialized global and static variables are stored in this section. It has read and write permissions, allowing the program to both read from and modify the data.
  3. .rodata: Constants and string literals are stored in this section. It has read permission only.
  4. .bss: This section holds global and static variables that have no explicit initializer (or are initialized to zero). It takes up no space in the file itself; the loader allocates the memory and fills it with zeros when the program starts. It has read and write permissions.
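
The mapping from C declarations to these sections can be sketched as follows. All names are hypothetical, and the placements in the comments are what GCC typically chooses on Linux; other compilers, targets, or options may place things differently.

```c
int initialized_total = 42;        /* initialized global: typically .data */
const int table_limit = 100;       /* read-only constant: typically .rodata */
int zeroed_buffer[256];            /* no initializer: typically .bss, zero at startup */
static int call_count;             /* static without initializer: also typically .bss */

/* The function's machine code goes in .text; the string literal
   typically lands in .rodata (named .rdata in MinGW output). */
const char *greeting(void)
{
    ++call_count;
    return "hello";
}

/* Exposes the file-local counter so callers can read it. */
int greeting_calls(void)
{
    return call_count;
}
```

After gcc -c, running objdump -h on the object file lists the sections and their sizes, and nm shows which symbol ended up where.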

In order to fully understand the ELF file structure, you need to understand the Memory Layout of the C Program.

Analyze Executable File

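To examine the content of the generated executable file, we can use the readelf tool. Running the command readelf -a <executable-file> provides detailed information about the ELF file. Because readelf understands only ELF files, the example below comes from a Linux build of the program.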

Figure: Executable File Structure

For example:

$ readelf -a cprogram.exe

ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              DYN (Shared object file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x1060
  Start of program headers:          64 (bytes into file)
  Start of section headers:          14776 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         13
  Size of section headers:           64 (bytes)
  Number of section headers:         31
  Section header string table index: 30
Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  PHDR           0x0000000000000040 0x0000000000000040 0x0000000000000040
....
....
....
  GNU_RELRO      0x0000000000002db8 0x0000000000003db8 0x0000000000003db8
                 0x0000000000000248 0x0000000000000248  R      0x1
Symbol table '.symtab' contains 67 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND 
    ...........    
    36: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS htd.c
    37: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS cprogram.c
    .........
    52: 0000000000001149    28 FUNC    GLOBAL DEFAULT   16 printHTD
    .........
    63: 0000000000001165    25 FUNC    GLOBAL DEFAULT   16 main
    64: 0000000000004010     0 OBJECT  GLOBAL HIDDEN    25 __TMC_END__

The output of readelf displays the ELF header, program headers, section headers, and other relevant information. It gives us insights into the structure of the executable file and helps us understand its different components and properties.
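
The first bytes of the header are easy to check by hand. The sketch below, with hypothetical helper names, tests a buffer for the four magic bytes shown in the Magic line above (7f 45 4c 46, that is 0x7f followed by "ELF") and reads the class byte that readelf reports as ELF32 or ELF64. It assumes the caller has already read the start of the file into memory.

```c
#include <stddef.h>

/* Returns 1 if buf starts with the ELF magic bytes 7f 45 4c 46, else 0. */
int has_elf_magic(const unsigned char *buf, size_t len)
{
    return len >= 4 &&
           buf[0] == 0x7f && buf[1] == 'E' && buf[2] == 'L' && buf[3] == 'F';
}

/* Returns 32 or 64 according to the class byte (offset 4 of the header),
   or 0 if the buffer is not ELF or the class is not recognized. */
int elf_class_bits(const unsigned char *buf, size_t len)
{
    if (len < 5 || !has_elf_magic(buf, len)) {
        return 0;
    }
    if (buf[4] == 1) {
        return 32;   /* ELFCLASS32 */
    }
    if (buf[4] == 2) {
        return 64;   /* ELFCLASS64 */
    }
    return 0;
}
```

Fed the first bytes from the readelf output above (7f 45 4c 46 02 ...), elf_class_bits reports 64, matching the Class: ELF64 line.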

Wrapping Up

In conclusion, understanding the C compilation process is essential for every aspiring programmer. By grasping the series of steps involved in converting human-readable code into machine-executable programs, you can enhance your ability to write efficient and error-free C programs.

We hope this blog has provided you with a clear understanding of the C compilation process, breaking down complex concepts into simple, digestible explanations. If you have any further questions or require additional clarification, please don’t hesitate to ask. Your journey as a C programmer is just beginning, and we’re here to support you every step of the way.

Frequently Asked Questions (FAQs)

What is the C compilation process?

The C compilation process is a series of steps that converts human-readable C source code into machine-executable code. It involves stages such as preprocessing, compilation, assembly, and linking.

Is the compilation process the same for all programming languages?

No, the compilation process can vary across different programming languages. While some languages follow a similar preprocessing, compilation, assembly, and linking process like C, others may have different stages or use alternative approaches, such as interpretation or just-in-time (JIT) compilation.

Can I customize the compilation process by writing my own build scripts?

Absolutely! Build systems like Make or build automation tools like CMake allow you to define custom build scripts that orchestrate the compilation process according to your specific requirements. These scripts can handle dependencies, specify compiler flags, and streamline the build workflow.

Are there any tools available to analyze the output of the compilation process?

Yes. The GNU Binutils tools nm, objdump, and readelf inspect object files and executables; GDB (GNU Debugger) lets you step through the compiled program; Valgrind detects memory errors; and profilers such as gprof and perf measure where time is spent. The compilers themselves (GCC and Clang) also help through their diagnostics and warning options.
