Biostratigraphic Data Analysis with 
                      Python (I)
                   (Code Sharing)
Python code in Jupyter3

Biostratigraphic Data Analysis with Python (I) (Code Sharing)

It has been quite a bit of a talk about Python in the last few years and praises for this scripting language continue nowadays. I took a one week class on Python couple of years ago and although I saw the value of it I did not rush to learn more or implement any routines using Python. That is because, as a formally trained software engineer, I have enough skills to write code in C, C++ and/or C#, develop relational databases, run SQL the way I want and had access to Visual Studio, vector algebra libraries, charting components (TChart) and Statistica for more complex multivariate analysis. However, I knew that Python is the way to go when I retire given the limited access to paid licenses. I am not retired yet but my access to paid licenses is limited.

In one of my past articles I showed the value of computing sample to sample similarities. The software I wrote a while ago was implemented using C#, Tchart, ExtremeOptimization (vector algebra libraries) and Visual Studio as IDE. It is my ambition to build a set of web services that will incorporate implementations of all the software applications I put together through the years, StrataPlot, PaleoBiostat, GomBat, PalyFacies, RASC-DataMaker and several smaller ones. It may take a while to complete this but I am excited to work on something for the community of biostratigrathers as well as others and deliver it free of charge. 

No alt text provided for this image

PyCharm IDE

As a start, I just completed writing Python code for sample to sample similarity computation running a section against itself and I am sharing that with you (see below). I used Python, PyCharm, Anaconda and Jupyter. I am trying an agile process by producing scripts that can run and be used in early development stages and add details as I move ahead. This requires a bit of more work on the user side in the beginning. This approach not only allows fast delivery of a product but opens the potential for feedback which could help the development of a better and more complex application.

I wrote most of the code using PyCharm IDE and switched to Jupyter after the similarity matrix was computed (see above) because the free version of PyCharm does not have charting capabilities. However, the code in PyCharm was debugged using anaconda environment and that is because the needed libraries.

No alt text provided for this image

Jupyter3 IDE

Next step, I copied and pasted the code from PyCharm to a single cell in Jupyter and added two more lines of code. See the heat-map chart for the similarity matrix below:

No alt text provided for this image

Please feel free to use the code (see below) the way you see fit. If you have questions and/or want my files I will make them available to you. Just shoot me a chat. I also appreciate comments, feedback and constructive criticism. Please know that proper indentation is critical in Python and unfortunately sometimes Jupyter3 confuses tabs with white spaces.

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
data = []
costheta = 0.0
# NOTE: Replace the path with the path to your file
# The file should be tab delimited (text file) and consist of only data no species name row,
# no depth (sample label) column
data1 = np.loadtxt(r'C:\\Users\Emil\PycharmProjects\VectorAlgebraPython\Data\U1319B_BenhticForams_dataonly.txt', delimiter= "\t")
no_of_rows = data1.shape[0]
no_of_cols = data1.shape[1]
row = 0
cnt = 0
sim_list_rng = list(range(1, no_of_rows+1))
#print(sim_list_rng) #Can take the comment symbol (#) out if you need to see this list
df = pd.DataFrame(index=sim_list_rng, columns = sim_list_rng, dtype=float)
simlist=[]
for a in data1:
    avector = np.array(a)
    for b in data1:
        bvector = np.array(b)
        asquared = avector@avector
        bsquared = bvector@bvector
        absquared = avector@bvector
        try:
            costheta = round(absquared/(np.sqrt(asquared)*np.sqrt(bsquared)),2)
        except:
            costheta = 0.0
        simlist.append(costheta)
    row += 1
    #my_series = pd.Series(data = simlist) not neede but it looks that it works with series as well as lists
    #df.loc[len(df.index)] = simlist #not needed
    df.loc[row] = simlist
    simlist = []
#print(df)
plt.subplots(figsize=(20,20))
sns.heatmap(df, cmap ='YlOrRd', linewidths = 0.30, annot = True)

Sofía Barragán-Montilla

Postdoctoral Researcher at CAU | PhD. in Paleoceanography | GeoLatinas Vice-chair | Benthic Foraminifera expert | Bilingüal Science Communication

4y

Monica Alejandra Gomez Correa

Like
Reply

To view or add a comment, sign in

More articles by Emil Platon

Insights from the community

Others also viewed

Explore topics