What’s C Thinking?
Most people are taught object-oriented programming from the start. As a good friend pointed out, the mental models used in those languages, how we think about our programs, are very different from the mental model needed to be a great C programmer. In this post I hope to shed a bit of light on how to think like C.
What is the computer?
C is called a low-level language, meaning that it is semantically close to the processor. Unlike in object-oriented programming, you have to think about the processor and how it functions within the larger computer. But what is a computer?
Basically it is:
- a central processing unit (CPU)
- internal storage (memory)
- some IO devices (for our purposes, we can ignore the IO devices)
CPU
The CPU is that square thing that needs huge fins and a fan because it generates so much heat. It contains two components we’re interested in: the arithmetic logic unit (ALU) and the register set.
The ALU performs operations on data, and it’s not really that bright. It can add, subtract, multiply, and divide. It can shift binary numbers left or right, the equivalent of multiplying or dividing by powers of two, and it can perform the basic logic operations such as AND, OR, NOT, and XOR (exclusive or).
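As a rough illustration, each of these ALU operations has a direct C operator. The function names below are made up for this sketch, and unsigned 32-bit values are assumed so the shifts are well defined.

```c
#include <stdint.h>

/* Shifting left by n multiplies by 2 to the n; shifting right divides. */
uint32_t times_eight(uint32_t x) { return x << 3; }
uint32_t div_four(uint32_t x)    { return x >> 2; }

/* The basic logic operations, one C operator each. */
uint32_t low_nibble(uint32_t x)             { return x & 0xF; }       /* AND */
uint32_t set_bit(uint32_t x, unsigned n)    { return x | (1u << n); } /* OR  */
uint32_t toggle(uint32_t x, uint32_t mask)  { return x ^ mask; }      /* XOR */
uint32_t invert(uint32_t x)                 { return ~x; }            /* NOT */
```

For example, times_eight(5) gives 40 and low_nibble(0xAB) gives 0xB. A compiler will usually turn each of these bodies into a single ALU instruction.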
The ALU can also compare two numbers (and as you will see, everything in the CPU is a number). It does this by simply subtracting and tossing the answer. A group of bits called the “flags” is set based on the result of the subtraction. For example, one bit is turned on if the answer is all zeros, another if the answer is a negative number, and still another if there is an overflow.
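The compare can be modeled in C itself. This is only a sketch of the idea, assuming 32-bit operands; the struct and function names are invented here and are not how any real CPU exposes its flags.

```c
#include <stdint.h>

struct flags {
    int zero;      /* the result was all zeros */
    int negative;  /* the sign bit of the result is set */
    int overflow;  /* the signed result did not fit in 32 bits */
};

/* Subtract b from a, throw the answer away, keep only the flags. */
struct flags compare(int32_t a, int32_t b)
{
    struct flags f;
    int64_t wide = (int64_t)a - (int64_t)b;  /* the exact difference */
    uint32_t r = (uint32_t)a - (uint32_t)b;  /* what a 32-bit ALU keeps */

    f.zero = (r == 0);
    f.negative = (r >> 31) & 1;
    f.overflow = (wide > INT32_MAX || wide < INT32_MIN);
    return f;
}

/* Signed "a < b" holds exactly when negative and overflow disagree. */
int less_than(int32_t a, int32_t b)
{
    struct flags f = compare(a, b);
    return f.negative != f.overflow;
}
```

Comparing 5 with 5 sets only the zero flag, and comparing 3 with 7 sets only the negative flag. The overflow flag matters at the extremes, which is why less_than consults both bits rather than the sign alone.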
The CPU also contains the registers. In the case of the Intel processor, these are sixteen 64-bit places to put numbers. You can think of the register set as a very fast scratch pad for the CPU. Almost every operation involves a register.
Remember the Memory
C programmers think about the CPU, but they also need to keep memory in mind. Today memory is a vast ocean of individually addressed 8-bit bytes. The Plan 9 machine I’m typing on has 256 billion bytes worth, each with its own unique address.
In memory C stores its instructions and data. All your variables are in there. All your functions are in there, in the form of machine instructions generated by the C compiler as it parses your C source code.
A program can move data from the memory into registers and back again in chunks as small as a byte and as large as eight bytes (actually larger, but we’re talking about C programs).
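To make the byte-addressed view concrete, here is a small sketch in standard C. The helper names are hypothetical. It copies a four-byte value into an array of bytes so each byte can be read individually, and loads a four-byte chunk back from any byte address.

```c
#include <stdint.h>
#include <string.h>

/* Return the byte stored i bytes past the start of v. */
unsigned char byte_at(uint32_t v, unsigned i)
{
    unsigned char bytes[sizeof v];

    memcpy(bytes, &v, sizeof v);  /* same bytes, now individually addressable */
    return bytes[i];
}

/* Load one four-byte chunk starting at address p. */
uint32_t load4(const unsigned char *p)
{
    uint32_t v;

    memcpy(&v, p, sizeof v);
    return v;
}
```

Which byte of 0x11223344 sits at the lowest address depends on the machine: on Intel processors it is 0x44, the least significant one.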
Little Man Behind the Curtain
So how does this all work? You can think of the CPU as housing a very tiny man who reads in a machine instruction from memory. The location of the instruction is held in the Instruction Pointer. The little man increases the value in the Instruction Pointer to point to the next instruction, looks at the new instruction, and does what it says.
The CPU just follows the instruction’s orders. Which, by the way, is what machine instructions were first called: “orders.”
It’s that simple. Read an instruction, do what it says, and repeat. Forever.
The instruction might be to load a value into a register. The next instruction could be to load another value into a second register. The third instruction may be to add the first register to the second, and the instruction after that can store the result into another memory location. In C this would be:
a = b + c;
These variables, of course, all live in memory.
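The little man’s routine can be sketched as a toy machine in C. Everything here is invented for illustration: the four-instruction set, the instruction layout, and the HALT instruction, which exists only so the sketch can stop (a real CPU keeps going).

```c
enum op { LOAD, ADD, STORE, HALT };

struct insn {
    enum op op;
    int reg;   /* register number */
    int arg;   /* a memory address, or the second register for ADD */
};

/* Run prog against mem until HALT. Returns the number of instructions executed. */
int run(const struct insn *prog, int *mem)
{
    int reg[4] = {0};
    int ip = 0;      /* the Instruction Pointer */
    int steps = 0;

    for (;;) {
        struct insn in = prog[ip];  /* read the instruction */
        ip++;                       /* point at the next one */
        steps++;
        switch (in.op) {            /* do what it says */
        case LOAD:  reg[in.reg] = mem[in.arg];  break;
        case ADD:   reg[in.reg] += reg[in.arg]; break;
        case STORE: mem[in.arg] = reg[in.reg];  break;
        case HALT:  return steps;
        }
    }
}

/* a = b + c, with a at address 0, b at address 1, c at address 2 */
const struct insn a_equals_b_plus_c[] = {
    { LOAD,  0, 1 },  /* load b into register 0 */
    { LOAD,  1, 2 },  /* load c into register 1 */
    { ADD,   0, 1 },  /* add register 1 to register 0 */
    { STORE, 0, 0 },  /* store register 0 into a */
    { HALT,  0, 0 },
};
```

Running a_equals_b_plus_c against a memory of {0, 2, 3} leaves 5 in the first cell.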
Ready for your Transfer
Even in an ocean of memory, it wouldn’t take long to run off the end if the CPU only fetched the very next instruction. So the CPU has additional instructions that change the value in the Instruction Pointer.
When we load a constant into the Instruction Pointer we call it a “jump” or a “branch.” The control “jumps” to a new location in the program because the very next thing the CPU does is fetch the instruction from that location. If we jump to an earlier location, we are “looping.”
The following C code executes the code to print out a string and then jumps (or branches) back to the “for” statement.
for (;;)
    print("wheee...\n");
Obviously there must be other kinds of instructions that load values into the Instruction Pointer because we don’t want to jump every time. Sometimes we want to jump when some condition is true or false. To do that there are a number of instructions that load a value into the Instruction Pointer based on the setting of bits in the condition flag mentioned earlier.
For example, if we say in C
if (a < b)
    print("too bad 'a'\n");
We would generate the following
MOVL a, AX / move a into AX
CMPL AX, b / subtracts b from a
JGE around / check flag bits and jump to "around"
CALL print
around:
These are Plan 9 mnemonics for 64-bit Intel instructions. First “a” is loaded into a register (“AX”), then “CMPL” subtracts “b” from AX. CMPL is a subtract that tosses the result, but sets bits in the condition flags as a side effect.
Finally “JGE” is the “jump if greater than or equal” instruction. It jumps past the CALL when a >= b, so our string is printed only if a < b.
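The same shape can be written back in C with an inverted test and a goto, which is roughly how the compiler sees the “if.” This is a sketch, not compiler output. It uses standard C’s printf where Plan 9 would use print, and the function name and return value are additions that make the behavior easy to check.

```c
#include <stdio.h>

/* Returns 1 if the message was printed, 0 if the jump skipped it. */
int too_bad(int a, int b)
{
    if (a >= b)          /* CMPL then JGE: the test is inverted */
        goto around;
    printf("too bad 'a'\n");
    return 1;
around:
    return 0;
}
```

Calling too_bad(1, 2) prints the string, while too_bad(2, 1) and too_bad(2, 2) both jump around it.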
No matter how complex or how many “&&” and “||” are used in a conditional expression, it is all just decomposed into a network of conditional jumps. We sometimes write
if (a == 3)
    if (b == 4)
        ...
instead of
if (a == 3 && b == 4)
    ...
just because we know what the C compiler is thinking.
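One way to picture that network of jumps is to spell it out with gotos. The sketch below, with invented function names, shows an “&&” as two tests that both jump to the failure label, and an “||” as two tests that both jump to the success label.

```c
/* a == 3 && b == 4: the second test is never reached if the first fails. */
int both(int a, int b)
{
    if (a != 3)
        goto no;
    if (b != 4)
        goto no;
    return 1;
no:
    return 0;
}

/* a == 3 || b == 4: the second test is never reached if the first succeeds. */
int either(int a, int b)
{
    if (a == 3)
        goto yes;
    if (b == 4)
        goto yes;
    return 0;
yes:
    return 1;
}
```

Written this way, the nested “if” and the “&&” form are visibly the same pair of conditional jumps.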
Types, Pointers, and Structures
Ken and Dennis programmed the earliest versions of the Unix operating system in assembly language, dealing directly with machine instructions similar to the ones above. The original Unix machines had only one kind of memory, 18-bit words. You could use a word as a number, as an address of another word (a pointer), or to hold a character, and all the instructions fit into the same 18-bit words.
When Bell Labs got their PDP-11 computer, memory looked different. Influenced by the IBM 360, the PDP-11 viewed memory as both 8-bit bytes and 16-bit words. The bytes were addressed at locations 0, 1, 2, 3, and so on. The words were at locations 0, 2, 4, 6, and so on.
Unlike with the previous Unix machine, when you said “a = 2;” how would you know if “a” was an 8-bit byte or a 16-bit word?
Types in C were invented to answer the question. You had to declare the type of a variable before using it. On the PDP-11 “char” was an 8-bit byte and “int” was a 16-bit word.
Today we have four sizes of integers: “char,” “short,” “int,” and “long long,” which are 8-bit, 16-bit, 32-bit and 64-bit respectively. By using types to declare variables we tell the C compiler what kind of instructions to generate. The programmer can even get the size of a variable by using the “sizeof” operator. It yields the number of bytes in a variable or expression.
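A quick way to see these sizes on your own machine is to ask “sizeof” directly. The struct and function below are hypothetical names for this sketch. On a typical 64-bit machine the four values come out as 1, 2, 4, and 8, though the C standard itself only guarantees minimum sizes and their ordering.

```c
#include <stddef.h>

struct sizes {
    size_t c, s, i, ll;   /* sizes in bytes */
};

/* Collect the sizes of the four integer types named in the text. */
struct sizes integer_sizes(void)
{
    struct sizes z;

    z.c  = sizeof(char);       /* always 1 by definition */
    z.s  = sizeof(short);
    z.i  = sizeof(int);
    z.ll = sizeof(long long);
    return z;
}

/* sizeof also works on a variable or an expression, without evaluating it. */
size_t size_of_sum(int x, long long y)
{
    return sizeof (x + y);     /* the sum has type long long */
}
```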
Modify Me Not
Using CPU registers to hold a variable’s address in memory is a common occurrence in assembly programs. It’s actually why there is more than one register. Very early computers had only a single register called an accumulator, in which all operations were performed. To do things like loop over an array of numbers, programmers of these early machines would write code that would modify the instructions! In the loop there would be an instruction that loaded a value from the array into a register. Part of that instruction was the address of the array. On each pass, after loading the value into the register, the code would add one to the memory location holding the load instruction, causing it to now point to the next element in the array. The code modified itself!
What could possibly go wrong with that, right?
So another kind of register appeared, originally called the B-line register. Instead of the instruction holding the address in memory, the register was used as an index into the array and the instruction was to load whatever the register pointed to.
Today any of the 16 registers can be used as an accumulator or an index register.
When we say
a[i] = 0;
we load a register with the value of i, scale it by the size of one element, add the address of the array a to that register, and then use that register to store a zero into the array location.
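The address arithmetic hidden inside a[i] can be written out by hand. This is a sketch of what the indexing means, with made-up function names, not a claim about the exact instructions any compiler emits.

```c
#include <stddef.h>

/* a[i] = 0, spelled out as byte-address arithmetic. */
void store_zero(int *a, size_t i)
{
    char *addr = (char *)a + i * sizeof a[0];  /* base address + scaled index */

    *(int *)addr = 0;
}

/* The same store using C's pointer arithmetic, which does the scaling itself. */
void store_zero_ptr(int *a, size_t i)
{
    *(a + i) = 0;
}
```

Both functions change exactly one element and leave its neighbors alone.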
Since it was meant to replace assembly programming but not limit what you could do, C introduced a simple “pointer to” type. “int *ip;” for example, declares a pointer to an integer. Since a pointer to any type has to be large enough to address all of memory, on this machine a pointer is eight bytes. The integer it's pointing to is only four bytes. So “sizeof ip” is 8 and “sizeof *ip” is 4.
C structures are like arrays except the members are heterogeneous and one uses the member name to reference them, but the machine code the C compiler generates looks a lot like the array code.
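Here is a sketch of that similarity, using a hypothetical struct. Where the array code adds a scaled index to the base address, the structure code adds a fixed offset that the compiler knows for each member. The standard offsetof macro exposes that offset.

```c
#include <stddef.h>

struct point {
    char tag;   /* members may have different types and sizes */
    int  x;
    int  y;
};

/* p->y = v, spelled out as base address plus the member's fixed offset. */
void set_y(struct point *p, int v)
{
    char *addr = (char *)p + offsetof(struct point, y);

    *(int *)addr = v;
}
```

After set_y(&p, 9), p.y is 9 and the other members are untouched, exactly as if we had written p.y = 9.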
Using these types you can do everything that can be done in assembly (almost).
This isn’t all the story, but it’s most of it. There’s not enough space in one of my posts to finish it all, but you can easily see how we can progress from here.
When programmers write C, we think in terms of registers, memory, and ALU operations. We think as if we were the little man behind the curtain in the machine, manipulating the bits and bytes and registers. That’s how we can go so fast.
Learning assembler and C is a must for anyone serious about computing. In this era of JavaScript frameworks and a big pile of abstractions, remembering the basics ties you to the real world. I can remember my classes at the Electrical Engineering College in Vigo, where we started with assembler and then went down to microcode. Really enlightenment.
Founder/CEO at Coraid
7yHey Javier. I may be mistaken, but I think that's what I tried to say. Sizeof ip, which is the pointer, and is 8 bytes, and sizeof *ip, which is size of what the pointer points to, is 4 bytes. But I was unclear before that. "int *ip;" is the declaration of ip as a pointer to an int. So, let me say it this way: void func(void) { int *ip; print("%d %d\n", sizeof ip, sizeof *ip); } Prints "8 4." (Plan 9 uses print instead of printf.) Thanks for the comment.
Ingeniero en Software y Comunicaciones
7yAt the end a[i] = 0, you are talking about intel x64 then an array pointer is 64 bits and storage is 32 bit like pointer to int thanks for the report:)
Chief Engineer at SVD
9ygreat article! C was my first language and I still "think" in C
Owner and President at Maryland Estate Treasures, Inc.
9yThey don't teach programming like this anymore. I see this when I have to do security code reviews. The garbage that passes for code is amazing! You should write this as part of a text on programming.