Straight Stuff on Structures in the C Programming Language
What is a “struct”? How does it work? How should a C programmer think about them?
All but the simplest C programs are filled with structures. C was written to be a replacement for Unix’s original assembly language. Since its inventors decided that the semantics of the C language should reflect the actual machine and nothing more, the struct is merely a way to describe a block of variables, just like assembly language programmers did with “control blocks.” But since most of us haven’t done much assembly language programming, I’ll explain structs by starting with something more relatable—the array.
To understand structs and arrays is to understand the unifying idea that all of the computer’s memory, where variables and instructions live, is one big array of bytes. I’ll refer to the memory array as the “mem.”
So, what is a variable? A variable is an area of memory we use to hold a value. In the machine I’m typing this post on (a 64-bit Plan 9 system running on an Intel Xeon processor), each integer is a concatenation of four bytes. When I want to set that variable to zero, as in
int a;
a = 0;
the CPU stores four bytes of zeros into the memory locations allocated by the C compiler to hold my variable “a.” We say that a lives at address so-and-so.
All Memory is an Array
I could also have an array of integers, say 100 of them. I would declare this array with the following statement:
int ia[100];
I can then use a number to select which integer in the array I want. I could set one of them to zero like so:
ia[42] = 0;
A single integer in an array is called an element. The 42 in this example is the index. We could, of course, use an integer variable instead of the constant as a selector.
The idea of an array has many analogies in the real world. One of the best is post office boxes. Each box is numbered and identical to the other boxes. The selection of which “box” to use is simply the number of the element.
Actually in C the number is the offset to the box. The first element of an array in C is at index zero, so the first element in “ia” is “ia[0],” the second is “ia[1],” and so on. The indexes are offsets, counted in elements, from the beginning of the array.
The idea of an index as being an offset to an element is based on how computer memory works. Our computer programs run in main memory that lives ethereally on the DDR sticks plugged into your machine’s motherboard. The first byte in all of memory has an address (computerese for index) of 0, the next byte is at address 1, and so forth. The address is an index offset.
So, I can think of all of memory on my machine as
char mem[274877906944];
(Don’t try that at home. You can’t declare an array that large.)
So our arrays are just segments of the larger mem array, and our indexes are offsets from the first mem element where the compiler put our array.
Bear with me on this. It’s a bit detailed, but if you can get through it, you’ll better understand what’s going on in C. Hopefully the effort will be worth your while.
The C compiler gives my array “ia” a location in memory, let’s say 10,000. When I type
ia[42] = 0;
the compiler takes the index 42, multiplies it by the size of one element (4 bytes for our “int”), and adds the result to 10,000 (the address of the first element in the array), producing an address of 10,168. There we deposit four bytes of zeros since our “int” is 32 bits long.
Structures are Funny Arrays
Arrays are aggregates of homogeneous variables, and are referenced by a numerical offset to a given variable. Structures, on the other hand, can be heterogeneous: their members need not share a type. As a result, a numerical index is not a very useful way to talk about a member. So Dennis decided to use a name for each variable in the structure.
As an example, here is a structure used to keep track of variables in a C compiler itself:
struct Symbol
{
    char name[64];    /* the name of our variable */
    int type;         /* what kind of symbol it is */
    long long addr;   /* where the symbol is */
};
This doesn’t declare a variable but defines a new variable type, a structure with a tag of “Symbol.” A structure tag is a kind of type, but only for structures. We can declare a single variable of type “struct Symbol” as follows.
struct Symbol symb;
We now have a variable (“symb”) that is made up of three variables (name, type, addr). Each of these variables is referred to as a field. A field is the element of a structure.
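In this example the fields are an array of 64 characters, a 4-byte integer, and an 8-byte long long. The fields add up to 76 bytes. (Many compilers insert padding in front of the long long to keep it aligned, in which case sizeof reports 80 rather than 76, but I’ll use 76 to keep the arithmetic simple.)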
To select a particular variable we have to use a different notation than we used to select an array element. We can’t use an integer, so instead we use the name of the field we are interested in.
To set the field with the name “type” to 9, for example, we would say,
symb.type = 9;
The “dot” notation is used to tell the compiler that we are talking about the integer variable named “type.”
Interestingly, what’s going on in the background is very much like what went on with our array. The structure variable “symb” is placed at some location in memory, let’s say 9,000. Each member of a structure is at an offset from the beginning of the structure. That offset is a function of the sizes of the variables that come before it. Our 32-bit integer variable is at offset 64 because the 64-byte character array appears just before it. To get to our integer, we add 64 to the address at which our structure variable begins, resulting in an address of 9,064.
Arrays of Structs
We can also have an array of structures, declared as follows:
struct Symbol symtab[1000];
We can then use both the array notation and the structure notation. To set the type of entry 123 to 42 we say
symtab[123].type = 42;
I’ll let you do the math. (Assuming the array begins at location 20,000 and each structure occupies 76 bytes, the location of our integer is 20,000 + 123 * 76 + 64. If your compiler pads the structure, use the padded size in place of 76.)
Pointing out Pointers
In reality we most often get at structure members through a pointer, using the “points to” notation. If we declare a pointer variable
struct Symbol *sp;
we can use it to loop over all the structures in the table with the following loop.
for (sp = symtab; sp < &symtab[1000]; sp++)
    if (sp->type == 42)
        print("life, universe and everything\n");
Here we see the “address of” operator (the ampersand) used to find the address just past the last element, which is how we check for the end of the array. We also see the “points to” operator (the “->” symbols) because we are using a pointer instead of a variable. The variable “sp” holds the address in main memory of the structure, and the ->type adds the offset of type to that value, giving the address in memory of our variable.
Cleaner Structs using Typedef
As an aside, when using C structures in Plan 9 we use a feature of C to make the structure type a real type. The “typedef” keyword allows us to add a type name to the compiler. It was invented to parameterize the first Unix port to a non-PDP-11 machine. But it was defined so well that it works for structures too. I would really write the above as
typedef struct Symbol Symbol;
which defines a new type “Symbol” to be a structure with tag “Symbol.” I would then declare variables using just “Symbol” and avoid cluttering things up with the “struct” keyword everywhere. As in
Symbol symtab[1000];
Whew! That was a lot, but hopefully you have the idea. A block of variables we want to treat as a unit can be organized by collecting them into a structure definition. The memory model gives us a way to reduce the complexity to a simple unifying way to think about both structures and arrays.
For more information on structures, see Brian and Dennis’ book The C Programming Language. They explain this a whole bunch better than I just did. In my defense, no one explains technology as well as Brian.