Biostratigraphic Data Analysis with Python (I) (Code Sharing)
It has been quite a bit of a talk about Python in the last few years and praises for this scripting language continue nowadays. I took a one week class on Python couple of years ago and although I saw the value of it I did not rush to learn more or implement any routines using Python. That is because, as a formally trained software engineer, I have enough skills to write code in C, C++ and/or C#, develop relational databases, run SQL the way I want and had access to Visual Studio, vector algebra libraries, charting components (TChart) and Statistica for more complex multivariate analysis. However, I knew that Python is the way to go when I retire given the limited access to paid licenses. I am not retired yet but my access to paid licenses is limited.
In one of my past articles I showed the value of computing sample to sample similarities. The software I wrote a while ago was implemented using C#, Tchart, ExtremeOptimization (vector algebra libraries) and Visual Studio as IDE. It is my ambition to build a set of web services that will incorporate implementations of all the software applications I put together through the years, StrataPlot, PaleoBiostat, GomBat, PalyFacies, RASC-DataMaker and several smaller ones. It may take a while to complete this but I am excited to work on something for the community of biostratigrathers as well as others and deliver it free of charge.
PyCharm IDE
As a start, I just completed writing Python code for sample to sample similarity computation running a section against itself and I am sharing that with you (see below). I used Python, PyCharm, Anaconda and Jupyter. I am trying an agile process by producing scripts that can run and be used in early development stages and add details as I move ahead. This requires a bit of more work on the user side in the beginning. This approach not only allows fast delivery of a product but opens the potential for feedback which could help the development of a better and more complex application.
I wrote most of the code using PyCharm IDE and switched to Jupyter after the similarity matrix was computed (see above) because the free version of PyCharm does not have charting capabilities. However, the code in PyCharm was debugged using anaconda environment and that is because the needed libraries.
Jupyter3 IDE
Next step, I copied and pasted the code from PyCharm to a single cell in Jupyter and added two more lines of code. See the heat-map chart for the similarity matrix below:
Please feel free to use the code (see below) the way you see fit. If you have questions and/or want my files I will make them available to you. Just shoot me a chat. I also appreciate comments, feedback and constructive criticism. Please know that proper indentation is critical in Python and unfortunately sometimes Jupyter3 confuses tabs with white spaces.
import matplotlib.pyplot as plt import seaborn as sns import numpy as np import pandas as pd data = [] costheta = 0.0 # NOTE: Replace the path with the path to your file # The file should be tab delimited (text file) and consist of only data no species name row, # no depth (sample label) column data1 = np.loadtxt(r'C:\\Users\Emil\PycharmProjects\VectorAlgebraPython\Data\U1319B_BenhticForams_dataonly.txt', delimiter= "\t") no_of_rows = data1.shape[0] no_of_cols = data1.shape[1] row = 0 cnt = 0 sim_list_rng = list(range(1, no_of_rows+1)) #print(sim_list_rng) #Can take the comment symbol (#) out if you need to see this list df = pd.DataFrame(index=sim_list_rng, columns = sim_list_rng, dtype=float) simlist=[] for a in data1: avector = np.array(a) for b in data1: bvector = np.array(b) asquared = avector@avector bsquared = bvector@bvector absquared = avector@bvector try: costheta = round(absquared/(np.sqrt(asquared)*np.sqrt(bsquared)),2) except: costheta = 0.0 simlist.append(costheta) row += 1 #my_series = pd.Series(data = simlist) not neede but it looks that it works with series as well as lists #df.loc[len(df.index)] = simlist #not needed df.loc[row] = simlist simlist = [] #print(df) plt.subplots(figsize=(20,20)) sns.heatmap(df, cmap ='YlOrRd', linewidths = 0.30, annot = True)
Postdoctoral Researcher at CAU | PhD. in Paleoceanography | GeoLatinas Vice-chair | Benthic Foraminifera expert | Bilingüal Science Communication
4yMonica Alejandra Gomez Correa