Issue 3 - spaghetti birds

Introduction

“Complexity has and will maintain a strong fascination for many people. It is true that we live in a complex world and strive to solve inherently complex problems, which often do require complex mechanisms. However, this should not diminish our desire for elegant solutions, which convince by their clarity and effectiveness. Simple, elegant solutions are more effective, but they are harder to find than complex ones, and they require more time, which we too often believe to be unaffordable.” – Niklaus Wirth

In this issue, we will continue building on topics that should already be becoming familiar: improving and maintaining code quality using automated tools, and achieving appropriate levels of abstraction to structure code in readable and maintainable ways. We will also look at ways to improve performance when working with DataFrame libraries like Pandas.

Don't miss the weekly challenge at the end! I try extra hard to come up with challenges that are both educational (in terms of Python coding skills) and at least somewhat relevant to common problems you might encounter in industry.


Ruff - one linter to replace them all

To ensure a reasonable baseline of Python code quality when working on larger projects - especially as part of a team - we have at our disposal a number of automated "static analysis" tools which can help. The reason we call them "static analysis" tools is that they analyse the Python code in our codebase without actually running it (so, "statically"). Some of these tools can point out common mistakes, issues with formatting, or things that are not quite as "Pythonic" as they could be. Some can even auto-format our code to match commonly accepted guidelines, or auto-fix some of the common issues they identify.

In the last issue, we talked about one such tool - `black`. It is an auto-formatter which can re-format all of the Python code in a codebase to follow PEP8 coding style guidelines and be consistent.

Another type of tool is the "linter" - linters point out common issues in your code and suggest improvements. Historically, some of the most commonly used Python linters have been tools like "pylint", "flake8", "pyflakes", etc., some of which also come with a variety of plugins to enhance their functionality.

More recently, a new leader has emerged in the race for better Python quality tools. It is called "ruff". Interestingly, it isn't actually written in Python - it's written in the Rust programming language. In practice, this doesn't make much of a difference, because you can just `pip install ruff` like you would any other Python module or tool. And you can run it on your codebase, without the need to install any additional Rust infrastructure. Because it is written in Rust, however, it runs very fast - it can run a wide range of checks on very large codebases in the blink of an eye.

I like to think of `ruff` more as a collection of linters and auto-fixers than as a single tool. Out of the box, it replaces most (if not all) of the checks that more traditional tools like `flake8`, `pyflakes` and `pylint` can do. It also completely replaces `isort`, which is a tool for automatically sorting and grouping the import statements at the top of Python files based on accepted conventions. In addition, it implements some unique rules that none of the other tools do. The full list of rules is quite impressive, as you can see here: https://beta.ruff.rs/docs/rules/

The one tool it doesn't replace is `black`, so I recommend first running `black`, then `ruff` as part of your build process. The combination of just these two tools should already bring your baseline code quality way above the "industry average".

If your starting point is an existing codebase, you will need to take some time to work through the initial list of issues. Depending on the quality of the existing code, this may seem a bit intimidating as `ruff` could output thousands of individual suggestions. It helps to work through these in groups - i.e., focus on one type of issue at a time and fix all occurrences of it in the codebase. Also, it's up to you / the team to decide whether you want to get to "Inbox Zero" before you proceed further with anything else, or if you prefer to adopt a more gradual approach to reduce the total number of outstanding issues over a period of several sprints.

Note that, unlike `black`, which requires almost no configuration and can be used right out of the box, with `ruff` you do have to place some initial configuration in the `pyproject.toml` file. Most notably, you'll need to whitelist the various types of checks you want it to run, as per the documentation. Feel free to start with a rather short list and gradually expand for a more comprehensive array of checks and suggestions. For some types of checks, `ruff` will be able to automatically fix some of the issues for you, which is absolutely great. Others you will have to address manually.
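
As an illustration, a minimal starting configuration might look something like the sketch below. The rule codes and option names here are just a suggested starting point - check the `ruff` documentation for the version you install, as the exact options may change between releases:

```
# Minimal example ruff configuration in pyproject.toml (illustrative only).
[tool.ruff]
# Start with a short list of rule groups and expand over time:
# E/W = pycodestyle, F = pyflakes, I = import sorting (isort-style).
select = ["E", "W", "F", "I"]
ignore = []
# Match black's default line length so the two tools don't disagree.
line-length = 88
```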

Once you do reach the coveted "Inbox Zero", don't just rest on your laurels. Set these tools up to run as part of your build process every time before you commit new code to the project, to ensure you maintain the baseline code quality level throughout the entire project life-cycle.


Book excerpt - what is in an abstract name

The following is an excerpt from my book "Re-introduction to Python for Software Engineering":


You can think of a computer program as a series of transformations. The data we have at the beginning is called "input". The data we get at the end is called "output".

To make it easier to handle the various pieces of data, we can label them with names. Like in mathematics, we can use the name “x” for example to label a numeric value. We call these “variables” because their values can change. In Python we assign a value to a variable like this:

x = 25

Here, “x” is the name of the variable and its value is whatever number we decide to assign to it - in this case 25. If we change our mind later, or if we wish to re-use the name “x” for a different variable, we can simply do another similar assignment and the old value will be forgotten. In a typical program, we would use many different variables and would assign different pieces of data to them, depending on the task at hand. The names themselves are meaningless to the interpreter, but it is good practice to use names which hint at the reason we needed to use a variable in the first place. For example, if we were describing someone’s age, it would be better to call our variable “age” than “x”.

Variables may also have different lifespans. Sometimes you may need a variable throughout the entirety of a certain sequence of operations. Other times, you may just need a temporary variable. Modern languages, such as Python, largely handle the lifespan of variables behind the scenes and make it easy for you to declare your intentions. (E.g. variables that are no longer needed are automatically cleaned up by a process known as “garbage collection”). 

Back to our example. Because the value of our variable is an integer number, we say that “the type of x is integer” (or, in Python, just “int”). Now we can apply to “x” any of the operations which can be applied to integers - regardless of its actual value. E.g. we can add 10 to it.

We call this technique “abstraction”. It gives us the ability to talk about an “abstract” integer “x”, without caring about its specific value. This is one of the most powerful tools for reasoning (in mathematics and in logic) as well as for manipulating data in computer programming.

Variables don’t have to contain numbers - they can contain values of any data type supported by our programming language. We can even combine them into composite data types. In Python, we can have a “list” of items, like `[1, 2, 3, 4, 5]` or `[x, y, z]`. Each item in the list can be of any data type, and we can mix and match to our heart’s content. Because the list is its own (composite) data type, we can apply certain operations to it. E.g., we can ask for the “length” of the list (how many items are in it?), or we can join two lists together into one. We can, of course, also give our list a name, by assigning it to a variable:


my_lucky_numbers = [10, 15, x]


If we haven’t changed `x` and its value is still 25, the list above would be equivalent to the list `[10, 15, 25]`.
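
For example, the "length" and "join" operations mentioned above look like this in Python:

```
x = 25
my_lucky_numbers = [10, 15, x]               # equivalent to [10, 15, 25]

print(len(my_lucky_numbers))                 # length of the list: 3

more_numbers = [7, 42]
combined = my_lucky_numbers + more_numbers   # joining two lists into one
print(combined)                              # [10, 15, 25, 7, 42]
```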


Faster DataFrames - Pandas 2.0, Polars and Modin


The Pandas 2.0 release comes with a new Apache Arrow backend, and here's why that matters. As we've said repeatedly in this newsletter, using the right data types and the most appropriate representation of data in the computer memory can make a big difference for both performance and code quality.

Pandas was initially built mostly on top of the older NumPy library, which it used as the main way to represent arrays and perform fast operations on them. While NumPy has helped make Pandas the popular library it is, it was not originally built as a backend for DataFrame libraries. Because of this, it has some limitations - such as poor support for strings and lack of support for missing values.

Pandas historically stored string columns as generic Python objects, which was pretty inefficient. The new `string[pyarrow]` column type is around 3.5 times more efficient. Pandas has now added Apache Arrow support for all data types. There are several advantages that Arrow provides, even for simple types, but especially for arrays. These include better interoperability, better handling of missing values, and most noticeably - significant speed-ups. For example, reading in a Parquet file is 1.6 times faster with the Arrow backend. Calculating the mean of a column of floats is 2.1 times faster. Performing an `.endswith()` operation on strings is up to 31.6 times faster!

While you do get some of these benefits right away by upgrading to the latest Pandas and using the Arrow backend, it's still very important to be mindful of the data types for your DataFrame columns. Especially when loading data from external sources, like CSV files, which don't come with explicit data type information, you will get much better memory and speed performance if you explicitly review and specify the most appropriate types for each column.
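
As a rough sketch (the file and column names below are made up purely for illustration), loading a CSV with explicit, Arrow-backed column types in Pandas 2.0 might look like this:

```
import pandas as pd

# Hypothetical CSV; the column names are assumptions for illustration.
df = pd.read_csv(
    "flights.csv",
    dtype={
        "flight_no": "string[pyarrow]",    # Arrow-backed strings instead of object
        "passengers": "int64[pyarrow]",    # handles missing values natively
        "distance_km": "float64[pyarrow]",
    },
)

# Or ask Pandas 2.0 to infer Arrow-backed types across the board:
df = pd.read_csv("flights.csv", dtype_backend="pyarrow")
```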

Other new libraries are also emerging which rely on the Apache Arrow backend and / or alternative compute engines for better performance. Polars and Modin are two notable examples.

Polars is written in Rust and built on top of the Arrow2 implementation. This allows it to take advantage of the better concurrency model that Rust offers, compared to Python, as well as the superior performance of Arrow2 over more traditional Python libraries. In benchmark tests, Polars has been shown to be up to 8 times faster than Pandas at loading in data, up to 15 times faster at selecting data based on certain criteria from a DataFrame, and twice as fast as Pandas at aggregating certain datasets.

While the main advantage of Polars over Pandas is speed, the Polars syntax can be a bit... polarising. Some people find it more "Pythonic", others find it too different from that of Pandas, because it doesn't override built-in Python constructs as much and is usually longer / more verbose. If you have a large existing codebase that uses standard Pandas, you may need to put in quite a lot of work to re-write the relevant parts of the code in Polars syntax - and sometimes there are no obvious, readily available Polars alternatives for some Pandas constructs.
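
To give a feel for the difference, here is a small, hypothetical comparison (the column names are invented for the example) of a simple filter-and-select in both libraries:

```
import polars as pl

# Hypothetical flights data, for illustration only.
df = pl.read_csv("flights.csv")

# In Pandas you might write something like:
#   df[df["duration_min"] > 300][["flight_no", "duration_min"]]
# The roughly equivalent Polars version is more explicit (and more verbose):
long_flights = (
    df.filter(pl.col("duration_min") > 300)
      .select(["flight_no", "duration_min"])
)
print(long_flights)
```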

Finally, the Modin library has been hitting the headlines lately. It is advertised as a "drop-in" replacement for Pandas, providing an instant concurrency boost just by changing your imports from "import pandas as pd" to "import modin.pandas as pd". One important thing about Modin is that it comes with support for a number of different "compute engines", and you'll have to do some research as to which one is the best fit for your needs. Note that if you just install Modin with support for all the engines, it will install a whole lot of additional Python libraries, which you most likely do not need. But by choosing only the right engine (start with Ray), you can indeed see some very impressive immediate performance boosts with very minimal effort even for large existing codebases.
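
As a minimal sketch, assuming Modin has been installed with the Ray engine (e.g. `pip install "modin[ray]"`), the switch really is just the import:

```
# Assumes Modin was installed with the Ray engine: pip install "modin[ray]"
# (If multiple engines are installed, the MODIN_ENGINE environment variable
#  can be used to pick one explicitly.)
import modin.pandas as pd   # instead of: import pandas as pd

# The rest of the code stays the same as with plain Pandas.
df = pd.read_csv("flights.csv")
print(df.groupby("flight_no").size())
```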


Weekly Python challenge

You are provided with a raw CSV containing flight information for different airlines. Write a Python script which ranks the airlines according to their environmental impact.

The input CSV contains the following columns:

* flight number (e.g. BA680)

* the local time the flight leaves the origin, in ISO format (e.g. 2023-03-27T18:30:00+00:00)

* the local time the flight arrives at the destination, in ISO format (e.g. 2023-03-28T00:35:00+03:00)

Here is some sample input for several flights:


```
flight_no,origin_time,dest_time
BA680,2023-03-27T18:30:00+00:00,2023-03-28T00:35:00+03:00
EK75,2023-03-31T14:40:00+04:00,2023-03-31T20:00:00+01:00
EK161,2023-03-27T07:15:00+04:00,2023-03-27T12:10:00+00:00
```


Your task is to write a Python script which reads the input CSV, calculates the environmental impact of each flight (for simplicity, this will be just the total flight time in minutes) and ranks the airlines (using the letter codes at the beginning of flight numbers - e.g. BA, EK, QF) with the top cumulative environmental impact on top. In this case:


```
airline_code,env_impact
EK,1035
BA,185
```


BA only has one flight for a total of 185 minutes. EK has two flights - 500 minutes and 535 minutes, for a total of 1035 minutes, so it comes out on top.
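
If you're not sure where to start, here is a small hint (not the full solution): Python's `datetime.fromisoformat` understands the timezone offsets in the input, so subtracting two parsed timestamps gives the true elapsed flight time:

```
from datetime import datetime

# BA680 from the sample input: departs at 18:30 UTC, arrives at 00:35 local time (UTC+3).
dep = datetime.fromisoformat("2023-03-27T18:30:00+00:00")
arr = datetime.fromisoformat("2023-03-28T00:35:00+03:00")

minutes = int((arr - dep).total_seconds() // 60)
print(minutes)   # 185
```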
