Data structure

In computer programming, a data structure is a predefined format for efficiently storing, accessing, and processing data in a computer program. Some data structures are provided as a built-in component of a programming language, and others may require the inclusion of a library or module before the structure can be used. One of the most important steps in designing and writing a computer program.

Bad programmers worry about the code. Good programmers worry about data structures and their relationships.

Linus Torvalds

Content:

Primitive data structures

The following are primitive data structures, also known as data types. Each of the following primitives can hold a single value, such as 1, 3.14, or the lowercase letter a.

NameAbbreviationDescriptionExample values
Boolean bool A boolean data type can represent one of only two possible values: true or false, stored internally as “1” or “0.” It is named after mathematician George Boole, whose system of logical operations is the foundation of all binary computers. True, False, 1, 0
Character char A character represents a single visual or alphanumeric symbol. Internally, it is represented as a number. Depending on the character encoding type, a character may be stored in a single byte (eight ) or multiple bytes (for example, 16 bits for UTF-16 Unicode). When multiple characters are connected in sequence, such as the word “apple” or the sentence “I would like an apple,” they form a string. A, a, 5, ?, å­ �, ð � � �
Integer int A whole number that falls within certain numerical limits. The limits are dependent on the programming language, and sometimes on the architecture of the CPU. Integers may be signed(positive or negative), or unsigned (restricted to positive values only). The maximum value of a 32-bit signed integer is 2147483647 (2 x 1031 – 1), and the minimum value is -2147483648 (-2 x 1031). 1, 123, -2147483648
Floating-point float A floating-point number can represent numbers with a fractional component — numbers with digits after a decimal point. The decimal point can “float,” meaning it can appear at any position in the number. Internally, a float is similar to scientific notation (such as 1618033 x 10-6 = 1.618033), with a stored exponent (in this case, -6) representing the position of the decimal point. Floating-point numbers are signed — they can store both positive and negative values. 1.567001, 0.356000, -5000.01
Double-precision floating-point double A double-precision floating-point uses twice the number of bytes of a standard float, providing approximately double the precision. 3.00000012384632, -351691.581045892
Pointer, also called a reference, handle, or descriptor A pointer is a special value that refers (“points”) to a location in memory where a data object, such as a data structure or a function, is stored. Pointers are a powerful feature of some low-level languages, but they are dangerous if implemented incorrectly, and may cause severe bugs or software vulnerabilities. Some languages do not support pointers. Other languages do, but require that the pointer be “typed” — the programmer must explicitly declare the type of data it points to. Accessing a value by its pointer is called dereferencing. 0xffe2ba6c, 0xffe2ac5b
Fixed-point fixed A numeric value with a whole number and fractional component, similar to a floating-point number, but with a preset level of precision. For example, monetary values can usually be represented in a fixed-point format that always has two digits after the decimal point, e.g. 5.00 or 123.45. Most programming languages do not natively support a fixed-point data type, because floating-point is more flexible. Languages with fixed-point data types include Ada and COBOL. 1.2, 400.00, -9.874

Compound data structures

Compound data structures can contain multiple elements. These elements are usually values of a primitive data type, such as an integer or character. Some compound structures are restricted to a single data type, while others permit a mixture of data types. Some can contain data structures as elements.

Linear data structures

A linear data structure is a compound data structure whose elements are arranged in a logical sequence. In these structures, every element is followed by exactly one other element, unless it is the last one.

NameDescription
ArrayAn array is a sequence of elements, usually of a single data type. Each element is indexed with a number representing its position in the array.
Usually, array indexing is zero-based: if the array contains n elements, the first element has index 0, the second has index 1, and the last has index n-1. For example, consider an array named primes containing the values [2, 3, 5, 7, 11]. The first element, primes[0], is the value 2. The last element, primes[4], is the value 11. The length of this array (the number of elements it contains) is 5.
A static array cannot change in size once it has been declared and its memory allocated. A dynamic array can increase or decrease in length as needed.
A LUT (lookup table) is an array containing pre-calculated values to speed up computation. For example, a LUT may contain frequently-used square roots of numbers. When the square root of a number is needed in a calculation, the value is “looked up” in the array, rather than computed, conserving CPU cycles.
A multidimensional array contains fixed-length arrays as elements. For example, [[1, 2], [3, 4], [5, 6]] is a two-dimensional array: it contains three elements, each of which is an array that contains two elements. The values can be accessed by providing two index numbers: the index of the inner array, and the index within that inner array. In this example, the value at index [0][0] is 1. At index [0][1] is the value 2. At index [1][0] is the value 3. The last value, at index [2][1], is 6.
A matrix is a mathematical object that can be implemented as a multidimensional array.
Associative arrayAn associative array, also called a dictionary, map, or symbol table is a data structure containing pairs of keys and values. Every key must be unique (it should not appear more than once in the structure). Each key is associated with a value, which does not need to be unique — multiple keys may have identical values.
High-level languages often have built-in data types for working with associative arrays, such as the dict class in Python. An associative array may be implemented as a standard single-dimensional or two-dimensional array if the programmer devises the proper scheme to properly index and associate keys and values. It may also be implemented as a hash table, a binary search tree, etc.
JSON is a widely-used data structure for associating keys and values, designed by Douglas Crockford for JavaScript applications. JSON is essentially an associative array in which the values may also be associative arrays.
ListA list is a sequence of elements. In its simplest form, a list can be implemented as an array. However, the term “list” usually refers to a linked list, in which each element of the list (called a “node”) contains pointers to one or more other items in the list. Unlike an array, in a linked list, sequential elements are not arranged sequentially in physical memory.
In a singly-linked list, each element contains a pointer to the next element in the list.
In a doubly-linked list, each element has two pointers: one pointer to the next element, and one pointer to the previous.
In a multiply-linked list, more than two-pointers may be included in a node. These pointers can each have a special purpose, defined by the programmer, for traversing elements in an arbitrary sequence. For example, a node may have one pointer for the next node by alphabetical order, another pointer for the next node by chronological order, the next node by hierarchical priority, etc. Although a multiply-linked list may be traversed linearly, its multiple linkages may permit nonlinear traversal of the data structure.
In a circular linked list, the last node points to the first, allowing for infinite traversal of the list. In other words, the next node after the last is the first.
In general, linked lists are more flexible than arrays. For example, inserting an element into an array may require that all or part of the array be re-written in memory, beginning at the location of the insertion. In contrast, inserting a node in a linked list requires only that a single node be created, and the pointers of adjacent nodes are updated with the new node’s location. The tradeoff is that linked lists require more system resources. For example, linked lists require either manual or automatic GC (garbage collection), to “clean up” (make available) memory that is no longer in use after nodes are removed from the list.
QueueA queue is a sequential collection in which elements are added to the end of the queue, and removed from the beginning. It is a FIFO data structure (“first-in, first-out”). It is most efficiently implemented with a doubly-linked list. A queue can also be implemented as a singly-linked list or a dynamic array, if two additional pointers are maintained, pointing to the first and last elements. Adding an element to a queue is called queueing, and removing an element from a queue is called dequeuing.
StackA stack is a sequential collection of elements in which only the last element (at the “top” of the stack) can be modified. A new element can be “pushed” onto the stack, in which case, it becomes the last element of the stack, and the stack’s length increases by 1. The last element can be “popped” from the stack, in which case, it is returned as a value and removed from the stack, decreasing the stack’s length by 1. Sometimes, a “peek” operation is also provided, allowing the last element to be read without being removed. A stack is a LIFO (“last-in, first-out”) data structure: the last element added is always the next element removed. A stack can usually be implemented as an array or a linked list, with a separate pointer to the top element of the stack.
StringA string contains a sequence of characters. In some languages, a string is an array of characters, terminated with a null character. For example, the word “hope” could be represented by the array [‘h’, ‘o’, ‘p’, ‘e’, ‘\0’]. In other languages, such as Python, strings are implemented as a class, with associated methods for processing, accessing, and manipulating the string’s characters.
TupleA tuple is a sequential list of elements. It is usually implemented as an immutable object that allows mixed data types. To identify its size, it may be referred to as an “n-tuple,” i.e., “three-tuple,” “six-tuple,” etc., where n is the number of elements it contains.

Tree structures

A tree is a compound structure whose elements are arranged in a parent-child hierarchy, similar to branches on a tree. The elements are referred to as nodes. A tree contains a single root node, which has no parents. Traversal usually occurs only in one direction, from the node to one of its children (called “descendents”). Then, traversal continues recursively, to one of the children of that node, until the desired node is reached. If a node has no children, it is called a leaf node.

If a tree is balanced, it is organized so that all branches are of the same “depth” — all leaf nodes have the same distance from the root node. If a tree is unbalanced, there is no restriction on how long a branch can be in relation to the others. Unbalanced trees are simpler to implement, but maybe less efficient in worst-case scenarios.

NameDescription
Abstract syntax treeAn AST (abstract syntax tree) structures the elements of a computer program in a tree hierarchy. The structure of the tree naturally represents some elements of the syntax of a computer language, such as lexical scope and conditional statements. An AST has many uses in computer science, such as automatically converting a program from one language to another.
Binary treeIn a binary tree, every node has zero, one, or two children. By some definitions, binary tree nodes must have exactly two children or none. An example of binary tree usage is numerical comparison, where traversal of the tree is decided based on comparing the two children and choosing the greater or lesser value.
B-treeB-trees are tree structures in which nodes may have more than two children, and are arranged according to their sorted values. B-trees are self-balancing, organized according to a given set of rules ensuring that fundamental operations occur in logarithmic time. In other words, the number of steps required to complete an operation, in the worst case, is never greater than the logarithm of the total number of elements in the tree. This property makes B-trees ideal for quick access of massive data storage, such as file systems and databases.
BSP treeBSP tree contains data used in binary space partitioning, in which a geometric space is recursively, logically divided. BSP trees are frequently used in 3D graphics to optimize the rendering of objects in the space. For example, a space can be partitioned so objects partially obscured by other objects are not fully rendered, or not rendered at all, increasing performance.
Heapheap is a tree that satisfies the “heap property.” In a max heap, for any node C, if P is a parent node of C, then the value (key) of P is greater than or equal to the value of C. In a min heap, the value of P is less than or equal to the value of C.

Heaps are commonly used as a priority queue, in which the highest or lowest-priority item is always stored at the root of the tree.

Binary heaps (with a maximum of two nodes for any parent) are the fundamental data structure of the heap sort algorithm.
Merkle treeMerkle tree, also called a hash tree, is a tree structure in which every parent node contains a cryptographic hash of its children, and leaf nodes contain a cryptographic hash of a block of data. They are useful for verifying the authenticity of data. For example, they are used in the IPFS, Btrfs, and ZFS file systems to verify that data has not been corrupted on disk. They are also used in the Bitcoin and Ethereum cryptocurrency protocols to protect against malicious or incidental inconsistencies in the blockchain. Other systems that use Merkle trees include the Git and Mercurial version control systems, and the NoSQL database system Apache Cassandra. They are named after their inventor, Ralph Merkle.
R-treeR-trees are tree structures optimized for certain spatial operations on the data. For example, they can store geographic locations, and efficiently find locations within a certain radius, based on the distance of their connecting roads. The “R” stands for rectangle, in reference to the fact that elements are grouped according to their minimum bounding rectangles within a geometric space.

Hash structures

In computer science, a hash function is a function that produces a fixed-length number, often represented in hexadecimal, that can be calculated from a given set of input data. Regardless of the size of the input, as long as the input has not changed, it will always produce the same fixed-length hexadecimal number. These types of functions have important uses in computer science, because they can be used to validate or index data, even at a massive scale.

The following are examples of data structures designed to store the results of hash functions, many of which are special cases of structures described in the sections above.

NameDescription
Bloom filterBloom filter is a probabilistic data structure first proposed by scientist Burton Howard Bloom at MIT in 1970. The mathematical properties of the structure allow a program to quickly determine if an element possibly or certainly does not belong to a larger set of elements. Bloom devised the structure to address problems in the growing fields of AI, ML, and big data. Using traditional structures such as hash tables, deterministic queries on large sets of data would not be feasible, exceeding a practical amount of disk space and core memory. Using a Bloom filter, however, results can be obtained with a high degree of probability with a fraction of memory and disk access. The structure can be tuned according to available resources, providing massive efficiency gains for a negligible loss of accuracy.
Hash tablehash table, also called a hash map, is a type of associative array in which the keys are the computed hashes of their values. For example, the contents of files can be hashed to create a fixed-length key, e.g., a hexadecimal number with a constant number of digits. The algorithm assures that this key is almost certainly unique. Usually, there is a small probability of a “collision” in the table — different inputs have a small chance to produce identical keys. The chance of collision is usually negligible, and considered an acceptable tradeoff for high performance.

Hash tables are a good way to index large data sets, providing search times that are independent of the table size.
Rolling hashrolling hash is a hash table whose hash function “rolls through” an input set, progressively computing hashes for a moving window of the input data. It is useful for processing data that is input in a steady stream, such as data transferred over a network. For example, rsync uses a rolling hash to protect against data corruption during network file transfers. The rolling hash is also useful in stream processing. In-stream processing, computations are performed progressively on data sets too big to store all at once on a single device.
Trietrie (alternatively pronounced as “tree” or “try”) is an ordered tree structure that stores associative arrays. It has special properties that make it space-efficient for certain search operations. In a trie, the key of a node is not stored literally. Instead, the key is defined by the node’s position in the tree. The keys of all descendants of a given node share a common prefix, such as the first characters of a string, associated with the parent node. Tries are ideal for certain search functions, such as those used in NLP (natural language processing).

The data structure was first described by computer scientist René de la Briandais in 1959, and the term “trie” was coined by Edward Fredkin in 1961. Fredkin derived the term from the word “retrieval,” and pronounced it “tree.”

Graph structures

A graph is a data structure that organizes data according to the relationships of its elements in a geometric space. The elements are usually vertices (points in the graph) and edges (the connections between vertices).

If the vertices in a graph can be traversed in only one direction (A→B, but not B→A), it is called a Directed graph. If vertices can be traversed in either direction, it is called an Undirected graph. If it’s possible to traverse edges and return at the starting vertex, it is called a Cyclic graph. If such traversal is not possible, it is called an Acyclic graph.

The following are examples of data structures used to efficiently store and process graphs.
NameDescription
Adjacency listAn adjacency list is a collection of unordered lists, each of which describes the neighbors of a vertex in a graph. It may be implemented using various other data structures, such as a hash table in which the values are arrays of adjacent vertices. In other implementations, the vertices and edges are each stored as objects connected by pointers, similar to a multiply-linked list. An adjacency list can be used to compute all neighbors of a vertex in constant time, proportional to the degree of the vertex (the dimensionality of the graph). However, it is inefficient for determining if an edge exists between two given vertices.
Adjacency matrixIn an adjacency matrix, a multidimensional array stores boolean values (requiring only one bit) corresponding to the connectedness of vertices, which are indexed to the rows and columns of the matrix. In this structure, listing the neighbors of a vertex takes much longer to compute, proportional to the total number of vertices. In addition, the structure requires more storage, growing exponentially with the total number of vertices. However, determining if an edge exists between two vertices is significantly faster than with an adjacency list.
Graph-structured stackgraph-structured stack is a data structure whose operations correlate to a traditional stack (a LIFO whose values can be pushed or popped), but the connectedness of the items is defined as an acyclic graph. It has important uses in NLP, specifically parsing ambiguous or non-deterministic grammars, such as those used by humans.
HypergraphIn a hypergraph, an edge can join any number of vertices. These special edges, called hyperedges, usually share cardinality (they are all connected to the same number of nodes). In a k-uniform hypergraph, all hyperedges are connected to k nodes. For example, a “2-uniform hypergraph” is the same as a traditional graph; in a 6-uniform hypergraph, each hyperedge is connected to 6 nodes. Hypergraphs have a special definition of acyclicity, with special rules for what constitutes traversal of nodes in a cycle.

The mathematical properties of hypergraphs have gained attention with the advent of ML (machine learning), in applications such as recommendation systems and semi-supervised learning.
Scene graphscene graph is a structure used in some vector graphics editors, 3D modeling software, and computer games. It can efficiently model a complex relationship of objects that are spatially related, representing a hierarchy of sequential modifications. For example, when multiple geometry effects are applied to an object in a 3D modeling program, each sequential modification is based on the results of the previous changes. A sphere may be created as a NURBS object, modified by an abounding lattice, then converted to wireframe geometry so its individual vertices can be transformed. These modifications could be stored in a scene graph, so the final result of all transformations can be efficiently animated or rendered as single objects. Modifying any single step in the modification chain would not destroy the transformations occurring later in the sequence.