Understanding data structures and algorithms

Algorithms and data structures are the most fundamental concepts in computing. They are the building blocks from which complex software is built. Having an understanding of these foundation concepts is hugely important in software design and this involves the following three characteristics:

  • How algorithms manipulate information contained within data structures
  • How data is arranged in memory
  • What the performance characteristics of particular data structures are

In this book, we will examine this topic from several perspectives. Firstly, we will look at the fundamentals of the Python programming language from the perspective of data structures and algorithms. Secondly, it is important that we have the correct mathematical tools. We need to understand some fundamental concepts of computer science and for this we need mathematics. By taking a heuristics approach, developing some guiding principles means that, in general, we do not need any more than high school mathematics to understand the principles of these key ideas.

Another important aspect is evaluation. Measuring an algorithms performance involves understanding how each increase in data size affects operations on that data. When we are working on large datasets or real-time applications, it is essential that our algorithms and structures are as efficient as they can be.

Finally, we need a sound experimental design strategy. Being able to conceptually translate a real-world problem into the algorithms and data structures of a programming language involves being able to understand the important elements of a problem and a methodology for mapping these elements to programming structures.

To give us some insight into algorithmic thinking, let's consider a real-world example. Imagine we are at an unfamiliar market and we are given the task of purchasing a list of items. We assume that the market is laid out randomly, and each vendor sells a random subset of items, some of which may be on our list. Our aim is to minimize the price we pay for each item as well as minimize the time spent at the market. One way to approach this is to write an algorithm like the following:

Repeat for each vendor:

  1. Does the vendor have items on my list and is the cost less than a predicted cost for the item?
  2. If yes, buy and remove from list; if no, move on to the next vendor.
  3. If no more vendors, end.

This is a simple iterator, with a decision and an action. If we were to implement this, we would need data structures to define both the list of items we want to buy as well as the list of items of each vendor. We would need to determine the best way of matching items in each list and we need some sort of logic to decide whether to purchase or not.

There are several observations that we can make regarding this algorithm. Firstly, since the cost calculation is based on a prediction, we don't know what the real average cost is; if we underpredict the cost of an item, we come to the end of the market with items remaining on our list. Therefore, we need an efficient way to backtrack to the vendor with the lowest cost.

Also, we need to understand what happens to the time it takes to compare items on our shopping list with items sold by each vendor as the number of items on our shopping list, or the number of items sold by each vendor, increases. The order in which we search through items and the shape of the data structures can make a big difference to the time it takes to do a search. Clearly, we would like to arrange our list, as well as the order we visit each vendor, in such a way that we minimize search time.

Also, consider what happens when we change the buy condition to purchase at the cheapest price, not just the below average price. This changes the problem entirely. Instead of sequentially going from one vendor to the next, we need to traverse the market once and, with this knowledge, we can order our shopping list with regards to the vendors we want to visit.

Obviously, there are many more subtleties involved in translating a real-world problem into an abstract construct such as a programming language. For example, as we progress through the market, our knowledge of the cost of a product improves, so our predicted average price variable becomes more accurate until, by the last stall, our knowledge of the market is perfect. Assuming any kind of backtracking algorithm incurs a cost, we can see cause to review our entire strategy. Conditions such as high price variability, the size and shape of our data structures, and the cost of backtracking all determine the most appropriate solution.