Inside a program, data is represented in a variety of ways, and the same is true outside of a program. Inside a program, data can be represented as an array, a linked list, or a heterogeneous set of elements in custom classes, among many other possibilities. Dozens of books on data structures cover the many ways you can represent and work with data in a program. The range of representations outside a program is just as wide and includes file formats, relational database table schemas, and network application protocols. The choices you have inside and outside a program are numerous, but each choice affects RAM, storage, CPU, and network transmission, all of which determine user experience, performance, security, reliability, and overall quality.
Data Organization Starting Point
Files, spreadsheets, database tables, and program data structures are all useful ways to organize data. When considering how data should be represented, a necessary starting point is categorization. That is, you need to categorize data by identifying the broad concepts to which entire sets or parts of the data belong. The data can then be aligned more effectively to function. Connecting data to function is not the initial goal, but a good categorization effort will often improve your later discernment of the discrete functions that should apply to the data. One of the questions you must answer is: what pieces of data belong where?
Closely related to organization here is the concept of taxonomy. Taxonomy is a discipline in its own right and is connected to other disciplines such as library science and information architecture. The main goal is to establish a reasonable and usable scheme for the data in the program. When you are clear about what things mean, you have a much better chance of being clear about how they may be used. For a given program, you may start with files or spreadsheets, but the taxonomic definitions will exist in other forms as well.
A practical strategy is to consider how you might treat different types and categories of data. This does not mean you are writing functions and creating an implementation for this data. Rather, you are attempting to focus on how you address different pieces of data as whole units, in part, or in combination. For example, you may use, access, and update files a certain way. Even the generic category of files is a concept worth considering. You may decide to approach databases differently from networked files, or you may decide to approach all remote data in a common, generic way. Your interpretation of a category can affect how you plan to use the data later in the implementation.
Next, you would set up an approach to specific formats such as XML, JSON, and HTML. After that comes the categorization of formats derived from these, such as RSS. Although RSS, as an example format, is typically used by news websites to deliver information as an alternative to a visual web page, it is actually a plain-text data format in which the contents are further formatted as XML. As this example shows, even a single group of data known to be encoded in a single format may, in fact, consist of a hierarchy of formats. You have to identify and be aware of these format divisions in order to categorize them in the context of your program.
A Simplified Example – Categorizing RSS Data
First, we are aware that in most cases we will retrieve RSS data from a website. The RSS data may or may not be a file on the website itself. Instead, most often, a web page will be responsible for presenting the RSS data to interested parties. That web page may present the RSS data as a file or as a network stream of bits. What we decided to do is categorize the source of RSS data as an RSS feed, synonymous with the concept of a network stream. The starting point in categorizing RSS data, therefore, is to define it first as a network stream. Network streams have certain characteristics that will be useful in determining the actual methods we use to retrieve an RSS feed.
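To make the stream idea concrete, here is a minimal sketch of what treating a feed as a stream might look like. It assumes the standard std::istream as a stand-in for whatever network layer is eventually chosen. The name read_feed and the max_bytes cap are hypothetical and are not taken from the Gautier RSS program. The sketch reflects two stream characteristics: data arrives in pieces, and the total size is not known in advance.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>   // only needed for the in-memory usage described below
#include <string>

// Hypothetical helper: reads a feed from any stream-like source in chunks.
// The caller decides whether the stream wraps a file, a socket, or a buffer.
// max_bytes guards against a source that never stops sending data.
std::string read_feed(std::istream& source, std::size_t max_bytes) {
    std::string text;
    char buffer[4096];
    while (source && text.size() < max_bytes) {
        const std::size_t wanted =
            std::min(sizeof buffer, max_bytes - text.size());
        source.read(buffer, static_cast<std::streamsize>(wanted));
        // gcount() reports how many characters the last read delivered,
        // which may be fewer than requested at the end of the stream.
        text.append(buffer, static_cast<std::size_t>(source.gcount()));
    }
    assert(text.size() <= max_bytes);
    return text;
}
```

Because the function only depends on the stream concept, you can exercise it without a network by passing a std::istringstream that holds sample RSS text.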
Second, we understand that RSS data is usually in a plain-text format further encoded as XML. The RSS XML representation is a standardized structure documented in the RSS standard. We can use this standard when writing software that distinguishes parts of an RSS document. Some document formats are defined loosely enough that parties who implement them can diverge a little from the exact letter of the standard, and RSS is no stranger to this. However, observing the RSS standard gets us 80 to 90 percent of the way in nearly all cases. Further, the fact that most RSS data will be in plain-text XML determines the specific ways we can read and evaluate RSS data within a program. What comes next is the attempt to further classify parts of the RSS document beyond the RSS standard in a way that is meaningful to the program.
The third and final part of the categorization exercise is to regroup parts of the RSS document. An RSS document will often refer to its members as channels, items, and summaries. Remember that RSS is a generic format and not actually specific to news websites. Anyone can use RSS. However, the Gautier RSS program deals primarily in news, so it makes sense to use concepts such as feed source, headline, article, and news site. Further, each of those concepts may have an ideal representation in the context of the news software. As a result, what RSS describes as an item becomes a headline in the program. Later, it is your task to make the translated concept work by defining the appropriate conversion process from the more generic to the more specific. You can replace the term RSS with LibreOffice file, JSON stream, or IBM DB2 database and the process is the same.
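The conversion from generic to specific can be sketched in a few lines. This is an illustration only. The RssItem fields follow the element names in the RSS 2.0 specification, while Headline, to_headline, and to_headlines are hypothetical names that are not drawn from the Gautier RSS source.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Generic concept: fields named after RSS 2.0 <item> child elements.
struct RssItem {
    std::string title;
    std::string link;
    std::string description;
    std::string pub_date;
};

// Program-specific concept (hypothetical): what the news reader works with.
struct Headline {
    std::string text;
    std::string article_url;
    std::string summary;
};

// Translates one generic item into the program's vocabulary.
// RSS 2.0 requires a title or a description, so fall back to the latter.
Headline to_headline(const RssItem& item) {
    Headline headline;
    headline.text = item.title.empty() ? item.description : item.title;
    headline.article_url = item.link;
    headline.summary = item.description;
    return headline;
}

// Applies the same translation to every item from a feed.
std::vector<Headline> to_headlines(const std::vector<RssItem>& items) {
    std::vector<Headline> headlines;
    headlines.reserve(items.size());
    for (const RssItem& item : items) {
        headlines.push_back(to_headline(item));
    }
    assert(headlines.size() == items.size());
    return headlines;
}
```

Keeping the translation in one place means the rest of the program only ever sees headlines, so a change in the source format affects a single function.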
Data Types and Data Structures
Many data types, certainly many primitive data types, can be encoded as an abstract String data type. Likewise, many data types can be “implemented” with an array data structure. While that can be an expedient way to deal with data, it is rarely useful in the long term. The design of the program is often enhanced by working in earnest to specify the proper data types and data structures for a program. The proper application of data types improves overall data encoding, sequencing, and verification. Appropriate data structures, in turn, identify the right concepts that apply to a group of homogeneous or heterogeneous data, as well as the operations most suited to that data. The choice of both has a real and precise impact on RAM, CPU, storage, and network transmission. This topic can be understood in more depth by examining chapter 4 of the book Computer Science Distilled, by Wladston Viana Ferreira Filho.
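As a small illustration of the difference between string-encoded data and properly typed data, consider a feed's refresh interval. The sketch below assumes C++17. All names in it (FeedSettingsText, FeedSettings, parse_settings, refresh_seconds) are hypothetical. The text version accepts any value at all, while the typed version verifies the value once and then lets the compiler keep track of the unit.

```cpp
#include <cassert>
#include <charconv>
#include <chrono>
#include <optional>
#include <string>
#include <system_error>

// Everything as text: expedient, but "soon" is as acceptable as "30".
struct FeedSettingsText {
    std::string name;
    std::string refresh_minutes;
};

// Typed version: the interval carries its unit and cannot hold "soon".
struct FeedSettings {
    std::string name;
    std::chrono::minutes refresh{0};
};

// Verifies the text once, at the boundary, and yields a typed value.
// Returns std::nullopt for non-numeric, partially numeric, or
// non-positive input.
std::optional<FeedSettings> parse_settings(const FeedSettingsText& raw) {
    int minutes = 0;
    const char* first = raw.refresh_minutes.data();
    const char* last = first + raw.refresh_minutes.size();
    const auto [end, error] = std::from_chars(first, last, minutes);
    if (error != std::errc{} || end != last || minutes <= 0) {
        return std::nullopt;
    }
    return FeedSettings{raw.name, std::chrono::minutes{minutes}};
}

// The unit conversion is done by the chrono types, not by hand.
std::chrono::seconds refresh_seconds(const FeedSettings& settings) {
    assert(settings.refresh.count() > 0);
    return settings.refresh;
}
```

After parsing, code that receives a FeedSettings does not need to check the interval again, which is one way data types support verification.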
Coordinating Data Structures and Implementation
Decades ago, computers were slow enough that you had less practical flexibility in selecting and defining data structures of the kind people may use in desktop and server programs today. RAM was smaller, CPU cycles took longer, and storage devices ranged from a few thousand to a few million bytes. There was a period in which CPU speeds climbed dramatically but, in hindsight, it lasted only a short while. Although CPUs no longer see such sharp increases in speed every few years, they are nonetheless sped up by other means in each new release. RAM capacities today in desktop and server environments seem sufficient for many applications, and we now have a choice between immense storage capacity and very fast storage that still ranges from modest to sizable depending on cost. Soon, wireless network speeds and reliability may increase enough to justify expanded use of network communications. Even with the largess we see in computer speed and capacity, there remains a cost associated with different data types, data structures, and algorithms.
Kurt Guntheroth describes the performance situation well in his book, Optimized C++: Proven Techniques for Heightened Performance. The preface to the book summarizes the conditions that affect performance today. The book has excellent tips in general that you will find useful when attempting to resolve performance problems. That includes problems involving data structures, as there can be surprises in C++ data structures that contrast with common knowledge about the benefits of a given representation. Moreover, issues with data structure choices often appear in scenarios where a data structure retains too much data or is otherwise unsuited to the operations applied to it, given the actual data involved. C++ in particular, along with C, is well suited to addressing many optimization issues while preserving the overall structure of a program. The most common advice today is to avoid optimizing the program’s implementation too early and instead focus on the design and organization first. I agree with that most of the time, but I would add that you should avoid unnecessary and obvious excesses in the design and implementation.
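One obvious excess that is easy to avoid at design time is a container that keeps everything it has ever received. The following sketch is hypothetical. The class name RecentHeadlines and its interface are assumptions made for illustration and do not come from Guntheroth's book or from the Gautier RSS program. It bounds memory use by keeping only the most recent entries.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>

// Hypothetical bounded history: holds at most `capacity` headlines and
// discards the oldest one when a new headline arrives at full capacity.
class RecentHeadlines {
public:
    explicit RecentHeadlines(std::size_t capacity) : capacity_(capacity) {}

    void add(std::string headline) {
        if (capacity_ == 0) {
            return;  // nothing can be retained
        }
        if (items_.size() == capacity_) {
            items_.pop_front();  // drop the oldest entry
        }
        items_.push_back(std::move(headline));
        assert(items_.size() <= capacity_);
    }

    const std::deque<std::string>& items() const { return items_; }

private:
    std::size_t capacity_;
    std::deque<std::string> items_;
};
```

A std::deque is used here because it supports removal at the front and insertion at the back. Whether it is the best choice for a given workload is something to measure rather than assume.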
A Design Mindset
This article and the four that precede it share the theme of striving for the right foundation of thought in order to successfully build a program. Television and movies glorify the hacker mindset, and there is a place and “moment” for that, but overall, a mindset that focuses less on tricks than on process is how most successful programs originate. You may notice that there is little in this article about C++ and UI programs specifically. Indeed, the information conveyed applies to all programming languages and program types. However, C++ has a reputation for being difficult and complex to use. Observing the right mindset actually makes the application of C++ more straightforward, resulting in a software program that is more likely to benefit from what C++ has to offer. The articles that follow this one will be more specific to implementation and to C++. Hopefully, you will have read this article before diving into them.