Automation with Python - Chapter 1

An R User’s Note on Learning Python

Mena WANG

1 Introduction

1.1 Why Study Python?

I love using R! R is brilliant for statistical methods, data wrangling and visualization. Since many R developers are statisticians or mathematicians, we get to try out their new research findings through R. For example, earlier this week I attended a presentation on anomaly detection (AD) in which applied mathematician Sevvandi Kandanaarachchi introduced her research on applying Item Response Theory to construct unsupervised AD ensembles (a preprint of the research article is available). The algorithm is also published as an R package, outlierensembles, for any R user to test and apply. Moreover, RMarkdown is a cool reporting tool; an RMarkdown version of this article is available, with a floating TOC and highlighted code chunks. :)

However, Python is a general-purpose language and is hence much more versatile. Recently, I started to think about how to deploy statistical or ML models in production, say on a website or in a mobile App. After some research, I reached the understanding that in order to deploy a statistical/ML model in production, some knowledge of Python would be very helpful, especially when you need to work with other developers and engineers on various cloud platforms. For more discussion on R vs Python, this is a good read.

So here I am, an R user learning Python.

1.2 Learning Materials Used

I have a very basic understanding of Python, but only in terms of data visualization. In 2020, when Covid-19 case numbers started to surge in Australia, I started making some DataViz to help interpret the numbers. Because a) R is not very supportive of dual-axis graphs (for good reason, as they should be used with caution) and b) I was curious about how Python works, I used Python (especially the Matplotlib and Seaborn packages) to make some of the Covid-19 visuals (please see an example of a dual-axis graph I made with Python).

Beyond DataViz, my knowledge of Python is very limited. So the first material I am using is a beginner-friendly course, Using Python for Automation, on LinkedIn Learning by Madecraft and Sam Pettus. I know very little about the topic, automation, and am curious to learn more. After finishing the first chapter of this course, I will move on to more data-science-specific topics.

1.3 The Content of This Document

The above-mentioned course has four chapters. This document records my notes for Chapter 1, Automate File, Folder and Terminal Interactions. The notes are really simple, mainly for my own records and for some reflection from the perspective of an R user. Please refer to the original course on LinkedIn Learning for a fuller understanding of the topic.

Where relevant, I will provide some R code to compare with the Python code. Hopefully this will be of interest to R users learning Python, and vice versa.

1.4 IDE

The IDE I use is RStudio, which allows you to run Python code through the reticulate package.

2 Read a txt File

The first task in the course is to read a txt file. The txt file has some hypothetical data with three fields: name, age and P/F (denoting whether the person passed or failed a test). The values are separated by spaces. Below is a subset of the data for demonstration.

Mary 25 P
John 32 P
Dylan 19 F
Julia 23 F
Chad 17 F
...
        

2.1 Python Approach

The following code is offered in the Python course mentioned above.

# "r"ead the file
f=open("Exercise Files/inputFile.txt","r")
print(f.read())

# close the file after the task
f.close()        
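
A common alternative, which is not taken from the course, is to open the file inside a with block so that Python closes it automatically, even if an error occurs while reading. The sketch below first writes a small temporary sample file (the sample contents and the temporary path are assumptions for illustration) so that it can run on its own.

```python
import tempfile
from pathlib import Path

# Hypothetical sample data in the same layout as inputFile.txt
sample = "Mary 25 P\nJohn 32 P\nDylan 19 F\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "inputFile.txt"
    sample_path.write_text(sample)

    # The with block closes the file as soon as the block ends
    with open(sample_path, "r") as f:
        content = f.read()

    print(content)
    print(f.closed)  # True: no explicit f.close() was needed
```

With this pattern there is no separate close() call to forget.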

2.2 R Approach

Using the read_delim() function from the readr package, we can read the text file as follows. The three columns in inputFile.txt are recognized by setting delim to a space.


# read_delim() comes from readr; %>% and filter() come from dplyr
library(readr)
library(dplyr)

f<-read_delim("Exercise Files/inputFile.txt",delim=" ",col_names=FALSE)

f %>% 
  # show the first five lines of the data
  head(n=5) 

# remove the data frame from the environment
rm(f)        

2.3 Python vs R: Interesting Difference

Even with such a simple task, a very interesting difference between the R and Python approaches already emerges: the relationship between “objects” and “methods/functions”.

  • In the Python approach, methods are associated with the object (e.g., read() and close()), so when you want to read the file object f, you call f.read().
  • In R, by contrast, functions are independent of the object, so we read the file by passing its path to the read_delim() function and storing the result in f.
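
To make the contrast concrete, here is a small Python sketch (the sample string is made up). Python itself mixes both styles: split() is a method attached to the string object, while len() is a standalone function that receives the object as an argument, which is closer to how R functions are called.

```python
record = "Mary 25 P"  # hypothetical sample line

# Method style: the behaviour is attached to the object
fields = record.split()
print(fields)    # ['Mary', '25', 'P']

# Function style: a standalone function receives the object
n_fields = len(fields)
print(n_fields)  # 3
```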

This leads to a very interesting discussion about functional versus object-oriented programming architecture for data science. Here is a good read on the topic. Maybe I will write more about it after I have gone deeper into my Python learning journey.

BTW, you can do object-oriented programming in R too. But OOP is a bit more challenging in R than in other languages.

3 Print Part of the txt File

Here we would like to print only part of the file: people who passed the test.

3.1 Python Approach

The approach introduced in the course is to take the third element of each line and check whether it is P. It is important to note that Python counts from 0, so [2] indicates the third element.


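f=open("Exercise Files/inputFile.txt","r")

for line in f:
    # split each line by whitespace
    line_split=line.split()
    # check whether the 3rd element is P
    if line_split[2]=="P":
        # each line already ends with a newline, so suppress print's own
        print(line, end="")

# close the file after the task
f.close()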
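
One thing to watch for with this loop: a blank line in the file makes split() return an empty list, so indexing with [2] would raise an IndexError. The sketch below wraps the filter in a small helper that skips such lines. The name passed_lines and the sample data are my own, not from the course.

```python
def passed_lines(lines):
    """Return the lines whose third field is 'P', skipping malformed ones."""
    result = []
    for line in lines:
        fields = line.split()
        # Skip blank or incomplete lines instead of raising IndexError
        if len(fields) >= 3 and fields[2] == "P":
            result.append(line.rstrip("\n"))
    return result


# Hypothetical sample, including one blank line
sample = ["Mary 25 P\n", "John 32 P\n", "\n", "Dylan 19 F\n"]
for line in passed_lines(sample):
    print(line)
```

Because the helper accepts any iterable of lines, an open file object can be passed to it directly.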
          

3.2 R Approach

Since read_delim() automatically recognizes the three columns in the file and, with col_names=FALSE, names them X1, X2 and X3, we can refer to the third column by name.


f<-read_delim("Exercise Files/inputFile.txt",delim=" ",col_names=FALSE)

f %>% 
  filter(X3=="P")
        

4 Separate and Save Files

Now let’s try to

  • separate the txt file into two subsets by Pass/Fail, and then
  • save them in two different files

4.1 Python Approach

Note that we need to create the passFile and failFile objects first, and then operate on them through open(), write() and close().


f = open("Exercise Files/inputFile.txt","r")
# Create the pass and fail files respectively, opened for "w"riting
passFile = open("Exercise Files/passFile.txt","w")
failFile = open("Exercise Files/failFile.txt","w")

for line in f:
    line_split=line.split()
    # if P, write the line to passFile; otherwise write it to failFile
    if line_split[2]=="P":
        passFile.write(line)
    else:
        failFile.write(line)

f.close()
passFile.close()
failFile.close()
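
As a variation that is not from the course, a single with statement can manage all three files and close them together, so the three close() calls are no longer needed. The sketch below runs in a temporary folder with made-up sample data; the variable names and the check for incomplete lines are my own additions.

```python
import tempfile
from pathlib import Path

# Hypothetical sample data in the same layout as inputFile.txt
sample = "Mary 25 P\nJohn 32 P\nDylan 19 F\nJulia 23 F\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    folder = Path(tmp_dir)
    (folder / "inputFile.txt").write_text(sample)

    # One with statement opens all three files and closes them at the end
    with open(folder / "inputFile.txt", "r") as f, \
         open(folder / "passFile.txt", "w") as pass_file, \
         open(folder / "failFile.txt", "w") as fail_file:
        for line in f:
            fields = line.split()
            if len(fields) < 3:
                continue  # skip blank or incomplete lines
            target = pass_file if fields[2] == "P" else fail_file
            target.write(line)

    # Read the results back before the temporary folder is deleted
    passed = (folder / "passFile.txt").read_text()
    failed = (folder / "failFile.txt").read_text()

print(passed)
print(failed)
```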

4.2 R Approach

We don’t need to create passFile or failFile objects in advance; we just filter and write to disk directly. The functions filter() and write_csv() are independent of the objects/data files.


f <- read_delim("Exercise Files/inputFile.txt",delim=" ",col_names=FALSE)

# save the R output as .csv so it won't overwrite the files created by Python

f %>% 
  filter(X3=="P") %>% 
  write_csv("Exercise Files/passFile.csv")

f %>% 
  filter(X3=="F") %>% 
  write_csv("Exercise Files/failFile.csv")        

5 Executing Terminal Commands

I couldn’t execute the following Python code in RStudio. The error message was:

CalledProcessError: Command '['python3', 'example_chapter1.py']' returned non-zero exit status 9009

(I received the same error message when running the code in Jupyter Notebook. I will move forward with the course, but bear this in mind and come back to it later.)


import subprocess

for i in range(0,5):
  subprocess.check_call(["python3","example_chapter1.py"])

# example_chapter1.py contains a simple print() command that is supposed to be repeated five times        
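
A possible cause, offered as a guess rather than a confirmed diagnosis: on Windows, exit status 9009 generally indicates that the command could not be found or run, which would happen if python3 is not a usable command on the system PATH. One way to sidestep the command name is to launch the script with sys.executable, the path of the interpreter that is currently running. The sketch below assumes sys.executable points to a working Python interpreter (which may not hold in every embedded setup), and it creates a stand-in script in a temporary folder, since the course's example_chapter1.py is not reproduced here.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    # Stand-in for example_chapter1.py (hypothetical content)
    script = Path(tmp_dir) / "example_chapter1.py"
    script.write_text('print("Hello from the script")\n')

    outputs = []
    for i in range(5):
        # sys.executable avoids relying on a "python3" command being on PATH
        result = subprocess.run(
            [sys.executable, str(script)],
            capture_output=True,
            text=True,
            check=True,  # raise CalledProcessError on a non-zero exit status
        )
        outputs.append(result.stdout.strip())

print(outputs)
```

If the script still fails, printing result.stderr (or catching CalledProcessError and inspecting its stderr attribute) should show the underlying message.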

6 Organizing Directories

6.1 A Function to Identify File Type

First, let’s create a function to identify the category of a file based on its suffix.


SUBDIRECTORIES = {
    "DOCUMENTS": ['.pdf','.rtf','.txt'],
    "AUDIO":['.m4a','.m4b','.mp3'],
    "VIDEOS": ['.mov','.avi','.mp4'],
    "IMAGES": ['.jpg','.jpeg','.png']
}

def pickDirectory(value):
    for category, suffixes in SUBDIRECTORIES.items():
        for suffix in suffixes:
            if suffix == value:
                return category
    return 'MISC' #If filetype doesn't exist in our dictionary
            

Let’s try the function pickDirectory().


print(pickDirectory('.pdf'))
## DOCUMENTS

print(pickDirectory('.png'))
## IMAGES

print(pickDirectory('.py')) 
## MISC        
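
pickDirectory() expects a bare suffix such as '.pdf'. As a possible extension (my own sketch, not from the course), the suffix can be extracted from a full file name with pathlib, and the nested loops can be replaced by a reverse lookup dictionary. The names SUFFIX_TO_CATEGORY and pick_directory_for are hypothetical.

```python
from pathlib import Path

SUBDIRECTORIES = {
    "DOCUMENTS": ['.pdf', '.rtf', '.txt'],
    "AUDIO": ['.m4a', '.m4b', '.mp3'],
    "VIDEOS": ['.mov', '.avi', '.mp4'],
    "IMAGES": ['.jpg', '.jpeg', '.png'],
}

# Invert the mapping once: suffix -> category
SUFFIX_TO_CATEGORY = {
    suffix: category
    for category, suffixes in SUBDIRECTORIES.items()
    for suffix in suffixes
}


def pick_directory_for(filename):
    """Return the category for a file name, or 'MISC' if the suffix is unknown."""
    suffix = Path(filename).suffix.lower()  # e.g. 'Report.PDF' -> '.pdf'
    return SUFFIX_TO_CATEGORY.get(suffix, "MISC")


print(pick_directory_for("Report.PDF"))   # DOCUMENTS
print(pick_directory_for("holiday.png"))  # IMAGES
print(pick_directory_for("script.py"))    # MISC
```

Lower-casing the suffix is a design choice of this sketch, so that 'Report.PDF' and 'report.pdf' land in the same category.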

In the rest of the session, Sam showed how to reorganize files into relevant folders, and the code worked from RStudio. :) Please check out the course instructions for details.
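
For readers who want a rough idea of what such a reorganization might look like, here is a minimal sketch. It is not the course's code: the shortened CATEGORIES mapping, the organize() function and the demo file names are all assumptions for illustration, and the demo runs in a temporary folder so that no real files are touched.

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical, shortened mapping from suffix to folder name
CATEGORIES = {".pdf": "DOCUMENTS", ".txt": "DOCUMENTS", ".mp3": "AUDIO", ".png": "IMAGES"}


def organize(folder):
    """Move each file in folder into a subfolder named after its category."""
    folder = Path(folder)
    # Take a snapshot of the files first, since the loop changes the folder
    files = [p for p in folder.iterdir() if p.is_file()]
    for path in files:
        category = CATEGORIES.get(path.suffix.lower(), "MISC")
        target_dir = folder / category
        target_dir.mkdir(exist_ok=True)
        shutil.move(str(path), str(target_dir / path.name))


with tempfile.TemporaryDirectory() as tmp_dir:
    # Create a few empty-ish demo files
    for name in ["notes.txt", "song.mp3", "photo.png", "script.py"]:
        (Path(tmp_dir) / name).write_text("demo")
    organize(tmp_dir)
    # Collect the new relative locations of all files
    moved = sorted(
        p.relative_to(tmp_dir).as_posix()
        for p in Path(tmp_dir).rglob("*")
        if p.is_file()
    )

print(moved)
```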

Chapter 1 was a good learning experience! Moving forward, I will continue with some data-science-focused Python courses, and maybe come back in the future for Chapter 4 on APIs.

