MACHINE LEARNING BASED RAINFALL PREDICTION
Abstract:
Machine learning and feature selection play a vital role in the internet and health sectors as well.
Rainfall prediction is important because heavy rainfall can lead to many disasters; the prediction
helps people to take preventive measures, and moreover the prediction should be accurate. There
are two types of prediction: short-term rainfall prediction and long-term rainfall prediction.
Short-term prediction usually gives us accurate results, so the main challenge is to build a model
for long-term rainfall prediction. Heavy precipitation prediction is a major problem for
meteorological departments because it is closely associated with the economy and with human
life. It is a cause of natural disasters like floods and droughts that are encountered by people
across the world each year. The accuracy of rainfall forecasting is of great importance for
countries like India, whose economy is largely dependent on agriculture.
Rainfall prediction is one of the important techniques for predicting the climatic conditions in any
country. This paper proposes a rainfall prediction model that applies Logistic Regression (LR)
and Random Forest (RF) to the dataset. The input data contain multiple meteorological
parameters, and the aim is to predict the rainfall more precisely. From the results, the proposed
machine learning model provides better results than the other algorithms in the literature. The
goal of this project is to develop an appropriate machine learning tool which can predict whether
it will rain or not. The algorithms used here are Logistic Regression and Random Forest.
TABLE OF CONTENTS
CHAPTER NO.    TITLE
1.
CHAPTER 1 : INTRODUCTION
1.1 GENERAL
1.1.1 THE MACHINE LEARNING SYSTEM
1.1.2 FUNDAMENTAL
1.2 JUPYTER
1.3 MACHINE LEARNING
1.4 CLASSIFICATION TECHNIQUES
1.4.1 NEURAL NETWORK AND DEEP LEARNING
1.4.2 METHODOLOGIES - GIVEN INPUT AND EXPECTED
OUTPUT
1.5 OBJECTIVE AND SCOPE OF THE PROJECT
1.6 EXISTING SYSTEM
1.6.1 DISADVANTAGES OF EXISTING SYSTEM
1.6.2 LITERATURE SURVEY
1.7 PROPOSED SYSTEM
1.7.1 PROPOSED SYSTEM ADVANTAGES
2.
CHAPTER 2 : PROJECT DESCRIPTION
2.1 INTRODUCTION
2.2 DETAILED DIAGRAM
2.2.1 FRONT END DESIGN
2.2.2 BACK END FLOW
2.3 SYSTEM SPECIFICATION
2.3.1 HARDWARE REQUIREMENTS
2.3.2 SOFTWARE REQUIREMENTS
2.4 MODULE DESCRIPTION
2.4.1 DATA COLLECTION
2.4.2 DATA AUGMENTATION
2.4.3 DATA SPLITTING
2.4.4 CLASSIFICATION
2.4.5 PERFORMANCE METRICS
2.4.6 CONFUSION MATRIX
2.5 MODULE DIAGRAM
2.5.1 SYSTEM ARCHITECTURE
2.5.2 USE CASE DIAGRAM
2.5.3 CLASS DIAGRAM
2.5.4 ACTIVITY DIAGRAM
2.5.5 SEQUENCE DIAGRAM
2.5.6 STATE FLOW DIAGRAM
2.5.7 FLOW DIAGRAM
3.
CHAPTER 3 : SOFTWARE SPECIFICATION
3.1 GENERAL
3.2 ANACONDA
3.3 PYTHON
3.3.1 SCIENTIFIC AND NUMERIC COMPUTING
3.3.2 CREATING SOFTWARE PROTOTYPES
3.3.3 GOOD LANGUAGE TO TEACH PROGRAMMING
4.
CHAPTER 4 : IMPLEMENTATION
4.1 GENERAL
4.2 IMPLEMENTATION CODING
4.3 SNAPSHOTS
5.
CHAPTER 5 : CONCLUSION & REFERENCES
5.1 CONCLUSION
5.2 REFERENCES
CHAPTER I
INTRODUCTION
1.1 GENERAL
Glossary and Key Terms
This section provides a quick reference for several algorithms that are not explicitly mentioned
in this chapter, but may be of interest to the reader. This should provide the reader with some
keywords or useful points of reference for other similar libraries to those discussed in this
chapter.
BIDMach is a GPU-accelerated machine learning library for algorithms that are not necessarily
neural network based.
Caret provides a standardised API for many of the most useful machine learning packages for
R. For readers who are more comfortable with R, Caret provides a good substitute for Python’s
SciKit-Learn.
Mathematica is a commercial symbolic mathematical computation system, developed since
1988 by Wolfram, Inc. It provides powerful machine learning techniques “out of the box” such
as image classification [4].
MATLAB is short for MATrix LABoratory, which is a commercial numerical computing
environment, and is a proprietary programming language by MathWorks. It is very popular at
universities where it is often licensed. It was originally built on the idea that most computing
applications in some way rely on storage and manipulations of one fundamental object—the
matrix, and this is still a popular approach.
R is used extensively by the statistics community. The software package Caret provides a
standardised API for many of R’s machine learning libraries.
WEKA is short for the Waikato Environment for Knowledge Analysis [6] and has been a very
popular open source tool since its inception in 1993. In 2005, Weka received the SIGKDD Data
Mining and Knowledge Discovery Service
Award: it is easy to learn and simple to use, and provides a GUI to many machine learning
algorithms.
Vowpal Wabbit is Microsoft's machine learning library. It is mature and actively developed, with
an emphasis on performance.
Requirements and Installation
The most convenient way of installing the Python requirements for this tutorial is by using the
Anaconda scientific Python distribution. Anaconda is a collection of the most commonly used
Python packages preconfigured and ready to use.
Approximately 150 scientific packages are included in the Anaconda installation.
Install the version of Anaconda for your operating system.
All Python software described here is available for Windows, Linux, and Macintosh. All code
samples presented in this tutorial were tested under Ubuntu Linux 14.04 using Python 2.7.
Some code examples may not work on Windows without slight modification (e.g. file paths in
Windows use \ and not / as in UNIX-type systems).
The main software used in a typical Python machine learning pipeline can consist of almost any
combination of the following tools:
1. NumPy, for matrix and vector manipulation
2. Pandas for time series and R-like DataFrame data structures
3. The 2D plotting library matplotlib
4. SciKit-Learn as a source for many machine learning algorithms and utilities
5. Keras for neural networks and deep learning
Managing Packages
Anaconda comes with its own built in package manager, known as Conda. Using the conda
command from the terminal, you can download, update, and delete Python packages. Conda
takes care of all dependencies and ensures that packages are preconfigured to work with all other
packages you may have installed.
Keeping your Python distribution up to date and well maintained is essential in this fast moving
field. However, Anaconda makes it particularly easy to manage and keep your scientific stack up
to date. Once Anaconda is installed you can manage your Python distribution, and all the
scientific packages installed by Anaconda using the conda application from the command line.
To list all packages currently installed, use conda list. This will output all packages and their
version numbers. Updating all Anaconda packages in your system is performed using the conda
update --all command. Conda itself can be updated using the conda update conda command,
while Python can be updated using the conda update python command. To search for packages,
use the search parameter, e.g. conda search stats where stats is the name or partial name of the
package you are searching for.
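For quick reference, the Conda commands described above can be run from a terminal as follows
(the package name stats is just an example):

conda list            # list all installed packages and their versions
conda update --all    # update every Anaconda package
conda update conda    # update Conda itself
conda update python   # update Python
conda search stats    # search for packages matching "stats"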
OBJECTIVE AND SCOPE OF THE PROJECT
 The objective of this project is to show how sentiment analysis can help improve the
user experience over a social network or system interface.
 The learning algorithm will learn what our emotions are from statistical data and then
perform sentiment analysis.
 Our main objective is also to maintain accuracy in the final result.
 The main goal of such a sentiment analysis is to discover how the audience perceives the
television show. The Twitter data that is collected will be classified into two categories:
positive or negative. An analysis will then be performed on the classified data to investigate
what percentage of the audience sample falls into each category.
 Particular emphasis is placed on evaluating different machine learning algorithms for the
task of Twitter sentiment analysis.
Jupyter
Jupyter, previously known as IPython Notebook, is a web-based, interactive development
environment. Originally developed for Python, it has since expanded to support over 40 other
programming languages including Julia and R.
Jupyter allows notebooks to be written that contain text, live code, images, and equations.
These notebooks can be shared, and can even be hosted on GitHub for free.
For each section of this tutorial, you can download a Jupyter notebook that allows you to edit and
experiment with the code and examples for each topic. Jupyter is part of the Anaconda
distribution; it can be started from the command line using the jupyter command:
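For example, to launch the Notebook interface from a terminal (assuming Anaconda's
installation directory is on your PATH):

jupyter notebook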
Machine Learning
We will now move on to the task of machine learning itself. In the following sections we will
describe how to use some basic algorithms, and perform regression, classification, and clustering
on some freely available medical datasets concerning breast cancer and diabetes, and we will
also take a look at a DNA microarray dataset.
SciKit-Learn
SciKit-Learn provides a standardised interface to many of the most commonly used machine
learning algorithms, and is the most popular and frequently used library for machine learning for
Python. As well as providing many learning algorithms, SciKit-Learn has a large number of
convenience functions for common preprocessing tasks (for example, normalisation or k-fold
cross validation).
SciKit-Learn is a very large software library.
Clustering
Clustering algorithms focus on ordering data together into groups. In general clustering
algorithms are unsupervised—they require no y response variable as input. That is to say, they
attempt to find groups or clusters within data where you do not know the label for each sample.
SciKit-Learn has many clustering algorithms, but in this section we will demonstrate
hierarchical clustering on a DNA expression microarray dataset using an algorithm from the
SciPy library.
We will plot a visualisation of the clustering using what is known as a dendrogram, also using
the SciPy library.
The goal is to cluster the data properly in logical groups, in this case into the cancer types
represented by each sample’s expression data. We do this using agglomerative hierarchical
clustering, using Ward’s linkage method:
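A minimal sketch of this step is shown below; since the microarray data itself is not bundled
with SciPy, the matrix X here is a random stand-in with one row per sample:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.rand(20, 100)     # stand-in for the expression matrix

Z = linkage(X, method="ward")   # agglomerative clustering, Ward's linkage
dendrogram(Z)                   # visualise the cluster tree
plt.show()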
Classification
Previously we analysed data that was unlabelled—we did not know to what class a sample
belonged (known as unsupervised learning). In contrast to this, a supervised problem deals with
labelled data, where we are aware of the discrete classes to which each sample belongs. When
we wish to predict which class a sample belongs to, we call this a classification problem.
SciKit-Learn has a number of algorithms for classification; in this section we will look at the
Support Vector Machine.
We will work on the Wisconsin breast cancer dataset, split it into a training set and a test set,
train a Support Vector Machine with a linear kernel, and test the trained model on an unseen
dataset. The Support Vector Machine model should be able to predict if a new sample is
malignant or benign based on the features of a new, unseen sample:
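A minimal sketch of this workflow, using the copy of the Wisconsin dataset bundled with
SciKit-Learn; the 70/30 split ratio and the random seed are illustrative choices:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

clf = SVC(kernel="linear")    # Support Vector Machine with a linear kernel
clf.fit(X_train, y_train)     # train on the training set
y_pred = clf.predict(X_test)  # predict the unseen test samples

print(classification_report(y_test, y_pred))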
You will notice that the SVM model performed very well at predicting the malignancy of new,
unseen samples from the test set—this can be quantified nicely by printing a number of metrics
using the classification report function. Here, the precision, recall, and F1 score
(F1 = 2 · precision · recall / (precision + recall)) for each class is shown. The support column is a count of the
number of samples for each class.
Support Vector Machines are a very powerful tool for classification. They work well in high
dimensional spaces, even when the number of features is higher than the number of samples.
However, their running time is quadratic in the number of samples, so large datasets can become
difficult to train. Quadratic means that if you increase a dataset in size by 10 times, it will take
100 times longer to train.
Last, you will notice that the breast cancer dataset consisted of 30 features. This makes it
difficult to visualize or plot the data. To aid in visualization of highly dimensional data, we can
apply a technique called dimensionality reduction.
Dimensionality Reduction
Another important method in machine learning, and data science in general, is dimensionality
reduction. For this example, we will look at the Wisconsin breast cancer dataset once again. The
dataset consists of over 500 samples, where each sample has 30 features. The features relate to
images of a fine needle aspirate of breast tissue, and the features describe the characteristics of
the cells present in the images. All features are real values. The target variable is a discrete value
(either malignant or benign) and is therefore a classification dataset.
You will recall from the Iris example in Sect. 7.3 that we plotted a scatter matrix of the data,
where each feature was plotted against every other feature in the dataset to look for potential
correlations (Fig. 3). By examining this plot you could probably find features which would
separate the dataset into groups. Because the dataset only had 4 features we were able to plot
each feature against each other relatively easily. However, as the numbers of features grow, this
becomes less and less feasible, especially if you consider the gene expression example in Sect.
9.4 which had over 6000 features.
One method that is used to handle data that is highly dimensional is Principal Component
Analysis, or PCA. PCA is an unsupervised algorithm for reducing the number of dimensions of a
dataset. For example, for plotting purposes you might want to reduce your data down to 2 or 3
dimensions, and PCA allows
you to do this by generating components, which are combinations of the original features, that
you can then use to plot your data.
PCA is an unsupervised algorithm. You supply it with your data, X, and you specify the number
of components you wish to reduce its dimensionality to. This is known as transforming the data:
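A minimal sketch, again using the bundled Wisconsin dataset; the choice of 2 components is
illustrative:

from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA

X, y = load_breast_cancer(return_X_y=True)

pca = PCA(n_components=2)         # reduce 30 features to 2 components
X_reduced = pca.fit_transform(X)  # transform the data

print(X_reduced.shape)            # (569, 2): one 2-D point per sample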
Again, you would not use this model for new data—in a real world scenario, you would, for
example, perform a 10-fold cross validation on the dataset, choosing the model parameters that
perform best on the cross validation. This model would be much more likely to perform well on
new data. At the very least, you would randomly select a subset, say 30% of the data, as a test set
and train the model on the remaining 70% of the dataset. You would evaluate the model based on
the score on the test set and not on the training set.
NEURAL NETWORKS AND DEEP LEARNING
While a proper description of neural networks and deep learning is far beyond the scope of this
chapter, we will however discuss an example use case of one of the most popular frameworks for
deep learning: Keras.
In this section we will use Keras to build a simple neural network to classify the Wisconsin breast
cancer dataset that was described earlier. Often, deep learning algorithms and neural networks
are used to classify images—convolutional neural networks are especially used for image related
classification. However,
they can of course be used for text or tabular-based data as well. In this section we will build a
standard feed-forward, densely connected neural network and classify a text-based cancer dataset
in order to demonstrate the framework's usage.
In this example we are once again using the Wisconsin breast cancer dataset, which consists of
30 features and 569 individual samples. To make it more challenging for the neural network, we
will use a training set consisting of only 50% of the entire dataset, and test our neural network on
the remaining 50% of the data.
Note: Keras is not installed as part of the Anaconda distribution; to install it, use pip:
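pip install keras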
Keras additionally requires either Theano or TensorFlow to be installed. In the examples in this
chapter we are using Theano as a backend, however the code will work identically for either
backend. You can install Theano using pip, but it has a number of dependencies that must be
installed first. Refer to the Theano and TensorFlow documentation for more information [12].
Keras is a modular API. It allows you to create neural networks by building a stack of modules,
from the input of the neural network, to the output of the neural network, piece by piece until you
have a complete network. Also, Keras can be configured to use your Graphics Processing Unit,
or GPU. This makes training neural networks far faster than if we were to use a CPU. We begin
by importing Keras:
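A minimal sketch of such a network, assuming the 30-feature input described above; the layer
sizes and activation choices here are illustrative, not the report's exact configuration:

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
# one hidden layer taking the 30 input features
model.add(Dense(16, activation="relu", input_shape=(30,)))
# a single sigmoid output for the binary malignant/benign decision
model.add(Dense(1, activation="sigmoid"))

model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])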
We may want to view the network's accuracy on the test set (or its loss on the training set) over time
(measured at each epoch), to get a better idea how well it is learning. An epoch is one complete
cycle through the training data.
Fortunately, this is quite easy to plot as Keras’ fit function returns a history object which we can
use to do exactly this:
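A minimal sketch, assuming the model above and a 50/50 train/test split held in X_train,
y_train, X_test and y_test; note that very old Keras versions record the metric under the key
"acc" rather than "accuracy":

history = model.fit(X_train, y_train, epochs=50,
                    validation_data=(X_test, y_test))

import matplotlib.pyplot as plt
plt.plot(history.history["accuracy"], label="training accuracy")
plt.plot(history.history["val_accuracy"], label="test accuracy")
plt.xlabel("epoch")
plt.ylabel("accuracy")
plt.legend()
plt.show()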
This will result in a plot similar to that shown. Often you will also want to plot the loss on the
test set and training set, and the accuracy on the test set and training set.
Plotting the loss and accuracy can be used to see if you are overfitting (you experience tiny loss
on the training set, but large loss on the test set) and to see when your training has plateaued.
Problem Statement:
Rainfall prediction is beneficial, but it is a challenging task. Machine learning techniques
can use computational methods and predict rainfall by retrieving and integrating the hidden
knowledge from the linear and non-linear patterns of past weather data. Various tools and
methods for predicting rain are currently available, but there is still a shortage of accurate results.
Existing methods are failing whenever massive datasets are used for rainfall prediction.
OBJECTIVE:
Predicting rainfall is an application of science and technology for predicting the amount of rain
over an area. The most important thing is to accurately determine the rainfall for the active use of
water resources, for crops, for pre-planning of water resources and for agricultural purposes.
Early rainfall information benefits farmers in better managing their crops and property against
heavy rainfall, and with efficient rainfall information farmers can better contribute to the
economic growth of the country. Prediction of precipitation is necessary to save people's lives
and property from flooding, and it helps people in coastal areas prepare for floods.
SCOPE OF THE PROJECT:
Accurate and precise rainfall prediction is still lacking, although it could assist diverse fields
like agriculture, water conservation and flood prediction. The issue is to formulate the
calculations for rainfall prediction that are based on previous findings and similarities and that
give output predictions which are reliable and appropriate. Imprecise and inaccurate predictions
are not only a waste of time but also a loss of resources, and they lead to inefficient management
of crises such as poor agriculture, poor water reserves and poor flood management. Therefore,
the need is not only to formulate a rainfall predicting system but to build one that is more
accurate and precise than the existing rainfall predictors.
EXISTING SYSTEM
Supervised learning is built to make predictions for unseen input instances. A supervised
learning algorithm takes a known set of input dataset and its known responses to the data
(output) to learn the regression/classification model. An algorithm is used to learn the dataset and
train it to generate the model for prediction of rainfall for the response to new data or test data.
Supervised learning uses classification algorithms and regression techniques to develop
predictive models.
1. NAIVE BAYES:
Naive Bayes classifiers calculate the probability that a sample belongs to a certain category,
based on prior knowledge. They use Bayes' theorem together with the naive assumption that the
effect of a certain feature of a sample is independent of the other features. That means that each
characteristic of a sample contributes independently to the probability of the classification of that
sample, and the classifier outputs the category with the highest probability for the sample. In
Bernoulli Naive Bayes the predictors are boolean variables: the parameters that we use to predict
the class variable take only the values yes or no. The basic idea of the Naive Bayes technique is
to find the probabilities of classes assigned to texts by using the joint probabilities of words and
classes.
2. LOGISTIC REGRESSION:
Logistic regression is basically a supervised classification algorithm. In a classification problem,
the target variable (or output), y, can take only discrete values for a given set of features (or
inputs), X. The logistic regression model describes the relationship between predictors that can
be continuous, binary, and categorical. Logistic regression becomes a classification technique only
when a decision threshold is brought into the picture. The setting of the threshold value is a very
important aspect of logistic regression and is dependent on the classification problem itself. It
predicts the probability that a given data entry belongs to the category numbered as “1”. Just like
Linear regression assumes that the data follows a linear function, Logistic regression models the
data using the sigmoid function.
1.6.1 DISADVANTAGES OF EXISTING SYSTEM
 Methods have performance limitations because of the wide range of variations in the data,
and the amount of data is limited.
 An issue involved in rainfall classification is choosing the required sampling interval for
observation-forecasting of rainfall, which is dependent upon the sampling interval of the input
data.
 Less accuracy.
LITERATURE SURVEY:
1. TITLE: PREDICTION OF RAINFALL USING MACHINE LEARNING
TECHNIQUES
Author: Moulana Mohammed, Roshitha Kolapalli, Niharika Golla, Siva Sai Maturi
YEAR: - 2020
Abstract:
Rainfall prediction is important as heavy rainfall can lead to many disasters. The prediction helps
people to take preventive measures and moreover the prediction should be accurate. There are
two types of prediction: short-term rainfall prediction and long-term rainfall prediction. Short-term
prediction usually gives us accurate results. The main challenge is to build a model for long-term
rainfall prediction. Heavy precipitation prediction is a major problem for meteorological
departments because it is closely associated with the economy and with human life. It is a cause
of natural disasters like floods and droughts that are encountered by people across the world each
year. Accuracy of rainfall forecasting has great importance for countries like India whose
economy is largely dependent on agriculture. Owing to the dynamic nature of the atmosphere,
statistical techniques fail to provide good accuracy for precipitation forecasting. The prediction
of precipitation using machine learning techniques may use regression. The intention of this
project is to offer non-experts easy access to the techniques and approaches utilized in the field
of precipitation prediction and to provide a comparative study among the various machine
learning techniques.
2. TITLE: RAINFALL PREDICTION USING MACHINE LEARNING
ALGORITHMS
Author: Kumar Arun, Garg Ishan, Kaur Sanmeet
YEAR: - 2019
Abstract:
This paper introduces current supervised learning models which are based on machine learning
algorithms for rainfall prediction in India. Rainfall is always a major issue across the world as it
affects all the major factors on which human beings depend. At present, accurate rainfall
prediction is a challenging task. We apply rainfall data of India to different machine learning
algorithms and compare the accuracy of classifiers such as SVM, Naive Bayes, Logistic
Regression, Random Forest and Multilayer Perceptron (MLP). Our motive is to get the optimized
result and a better rainfall prediction.
3. TITLE: A NEURAL NETWORK BASED LOCAL RAINFALL
PREDICTION
Author: Tomoaki Kashiwao, Koichi Nakayama, Shin Ando
YEAR: - 2017
Abstract:
In this study, we develop and test a local rainfall (precipitation) prediction system based on
artificial neural networks (ANNs). Our system can automatically obtain meteorological data used
for rainfall prediction from the Internet. Meteorological data from equipment installed at a local
point is also shared among users in our system. The final goal of the study was the practical use
of “big data” on the Internet as well as the sharing of data among users for accurate rainfall
prediction. We predicted local rainfall in regions of Japan using data from the Japan
Meteorological Agency (JMA). As neural network (NN) models for the system, we used a multi-
layer perceptron (MLP) with a hybrid algorithm composed of back-propagation (BP) and random
optimization (RO) methods, and radial basis function network (RBFN) with a least squares
method (LSM), and compared the prediction performance of the two models. Precipitation (total
amount of rainfall above 0.5 mm between 12:00 and 24:00 JST (Japan standard time)) at
Matsuyama, Sapporo, and Naha in 2012 was predicted by NNs using meteorological data for
each city from 2011. The volume of precipitation was also predicted (total amount above 1.0 mm
between 17:00 and 24:00 JST) at 16 points in Japan and compared with predictions by the JMA
in order to verify the universality of the proposed system. The experimental results showed that
precipitation in Japan can be predicted by the proposed method, and that the prediction
performance of the MLP model was superior to that of the RBFN model for the rainfall
prediction problem. However, the results were not better than those generated by the JMA.
Finally, heavy rainfall (above 10 mm/h) in summer (Jun.–Sep.) afternoons (12:00–24:00 JST) in
Tokyo in 2011 and 2012 was predicted using data for Tokyo between 2000 and 2010. The results
showed that the volume of precipitation could be accurately predicted and the caching rate of
heavy rainfall was high. This suggests that the proposed system can predict unexpected local
heavy rainfalls as “guerrilla rainstorms.”
4. TITLE: APPLICATION OF THE DEEP LEARNING FOR THE
PREDICTION OF RAINFALL IN SOUTHERN TAIWAN
Author: Meng-Hua Yen, Ding-Wei Liu, Yi-Chia Hsin, Chu-En Lin
YEAR: - 2018
Abstract:
Precipitation is useful information for assessing vital water resources, agriculture, ecosystems
and hydrology. Data-driven model predictions using deep learning algorithms are promising for
these purposes. Echo state network (ESN) and Deep Echo state network (DeepESN), referred to
as Reservoir Computing (RC), are effective and speedy algorithms to process a large amount of
data. In this study, we used the ESN and the DeepESN algorithms to analyze the meteorological
hourly data from 2002 to 2014 at the Tainan Observatory in southern Taiwan. The results
show that the correlation coefficient obtained by using the DeepESN was better than that
obtained by using the ESN and commercial neural network algorithms (back-propagation
network (BPN) and support vector regression (SVR); MATLAB, The MathWorks Co.), and the accuracy of predicted
rainfall by using the DeepESN can be significantly improved compared with those by using
ESN, the BPN and the SVR. In sum, the DeepESN is a trustworthy and good method to predict
rainfall; it could be applied to global climate forecasts which need high-volume data processing.
5. TITLE: RAINFALL PREDICTION USING MACHINE LEARNING AND
NEURAL NETWORK
Author: Kaushik Dutta, Gouthaman. P
YEAR: - 2020
Abstract:
Rainfall prediction models based mainly on artificial neural networks have been proposed in India
until now. This research work does a comparative study of two rainfall prediction approaches
and finds the more accurate one. The present technique to predict rainfall doesn't work well with
the complex data present. The approaches which are being used nowadays are statistical
methods and numerical methods, which don't work accurately when there is any non-linear
pattern. Existing systems fail whenever the complexity of the datasets containing past
rainfall increases. Henceforth, to find the best way to predict rainfall, a study of both machine
learning and neural networks is performed and the algorithm which gives more accuracy is
further used in prediction. Recently, rainfall has been considered the primary source of most of
the economy of our country, agriculture being the main economy-driving sector. To make a
proper investment in agriculture, a proper estimation of rainfall is needed. Along with
agriculture, rainfall prediction is needed for the people in coastal areas. People in coastal areas
are at high risk of heavy rainfall and floods, so they should be aware of the rainfall much earlier
so that they can plan their stay accordingly. Areas which have less rainfall and face water
scarcity should have rainwater harvesters, which can collect the rainwater. To establish a proper
rainwater harvester, rainfall estimation is required. Weather forecasting is the easiest and fastest
way to get a greater outreach. This research work can be used by all the weather forecasting
channels, so that the prediction news can be more accurate and can spread to all parts of the
country.
6. TITLE: STUDY OF SHORT TERM RAIN FORECASTING USING
MACHINE LEARNING BASED APPROACH
Author: M. S. Balamurugan & R. Manojkumar
YEAR: - 2019
Abstract:
Weather forecasting is still dependent on statistical and numerical analysis in most parts of
the world. Though statistical and numerical analysis provides better results, it highly depends on
stable historical relationships with the predictand and the predicted value of the predictand at a
future time. On the other hand, machine learning explores new algorithmic approaches to
prediction which are based on data-driven prediction. Climatic changes for a location are
dependent on variable factors like temperature, precipitation, atmospheric pressure, humidity,
wind speed and combinations of other such factors which are variable in nature. Since climatic
changes are location-based, statistical and numerical approaches fail at times and an alternative
method, such as a machine learning based study, is needed for understanding the weather forecast.
In this study it has been observed that the percentage departure of rainfall ranged from 46 to 91%
for the month of June 2019 as per the Indian Meteorological Department (IMD) using the
traditional forecasting methods, whereas the following study, implemented using machine
learning, was able to achieve much better rainfall prediction compared to statistical methods.
1.7 PROPOSED SYSTEM
The proposed work uses regression analysis. Regression analysis deals with the dependence of
one variable (called the dependent variable) on one or more other variables (called the
independent variables), which is useful for estimating and/or predicting the mean or average
value of the former in terms of known or fixed values of the latter. For example, the salary of a
person is based on his/her experience: here, the experience attribute is the independent variable
and salary is the dependent variable. Simple linear regression defines the relationship between a
single dependent variable and a single independent variable. The equation below is the general
form of regression:

y = β0 + β1x + ε

where β0 and β1 are parameters, and ε is a probabilistic error term. Regression
analysis is a vital tool for modeling and analyzing information. It is used for predictive analysis
that is forecasting of rainfall or weather, predicting trends in business, finance, and marketing. It
can also be used for correcting errors and also provide quantitative support. The advantages of
regression analysis are:
1. It is a powerful technique for testing the relationship between one dependent variable and
many independent variables.
2. It allows researchers to control extraneous factors.
3. Regression assesses the cumulative effect of multiple factors.
4. It also helps to attain the measure of error using the regression line as a base for estimations.
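As a minimal sketch, the equation above can be fitted with SciKit-Learn; the data here is a
random stand-in for a single meteorological predictor and the observed rainfall:

import numpy as np
from sklearn.linear_model import LinearRegression

x = np.arange(20).reshape(-1, 1)                 # single predictor column
y = 3.0 + 2.0 * x.ravel() + np.random.randn(20)  # noisy linear response

reg = LinearRegression().fit(x, y)
print(reg.intercept_, reg.coef_)                 # estimates of β0 and β1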
ARCHITECTURE FOR PROPOSED SYSTEM:
Proposed approach:
The back-propagation technique works well with less complex systems, but as the complexity of
the system increases the back-propagation method's accuracy decreases. This process deals with
four inputs and three outputs. The following four inputs are used:
1. Air temperature
2. Air humidity
3. Wind speed
4. Sunshine duration
The following three outputs are used:
1. Rainfall
2. Medium rainfall
3. High rainfall
The steps associated with the proposed system are input of data, preprocessing of data, splitting
of data, training of the algorithms, testing on the dataset, comparing both algorithms, selecting
the best algorithm, prediction with the more accurate algorithm, and the result at the end. The
main reason for not doing prediction with both algorithms is to reduce the complexity of the
whole system, so the system first finds the more accurate algorithm between machine learning
and the neural network and accordingly does prediction with the better one. The results will be
received in the form of graphs and Excel sheets. For preprocessing, all results will be received in
the form of different graphs; for machine learning and the neural network, the accuracy will be
received in the form of metrics as well as an Excel sheet, and accordingly the predicted values
will be received in the form of an Excel sheet which will contain two columns, ID and predicted
value. The IDs will be the same as those in the datasheet. To find out for which region a
prediction is being done, the IDs should be matched with the IDs present in the dataset.
PROPOSED SYSTEM ADVANTAGES
 Speed and very low complexity, which makes it very well suited to operating in real
scenarios.
 The computation load needed for image processing purposes is much reduced, combined with
very simple classifiers.
 Ability to learn and extract complex image features.
 With its simplicity and fast processing time, the proposed algorithm is suitable to be
implemented in an embedded system or mobile application that has limited processing resources.
CHAPTER 2
PROJECT DESCRIPTION
2.1 INTRODUCTION
In today's situation, rainfall is considered to be one of the factors chiefly responsible for
most of the significant things across the world. In India, agriculture is considered to be one of the
important factors in deciding the economy of the country, and agriculture is solely dependent on
rainfall. Apart from that, in coastal areas across the world, getting to know the amount of
rainfall is very necessary. In some areas which have water scarcity, prior prediction of rainfall
should be done in order to establish rainwater harvesters.
This project deals with the prediction of rainfall using machine learning & neural networks. The
project performs a comparative study of machine learning approaches and neural network
approaches and then accordingly portrays the more efficient approach for rainfall prediction.
First of all, preprocessing is performed. For machine learning, LASSO regression is used, and
for the neural network, an ANN (artificial neural network) approach is used. After calculation,
the types of errors and the accuracy of both LASSO and ANN are compared and a conclusion is
drawn accordingly. To reduce the system's complexity, the prediction has
been done with the approach that has better accuracy. The prediction has been done using the
dataset which contains rainfall data from 1901 to 2015 for different regions across the
country.
It contains month-wise data as well as annual rainfall data. Currently, rainfall prediction has
become one of the key factors for most of the water conservation systems in and across the
country. One of the biggest challenges is the complexity present in rainfall data. Most rainfall
prediction systems nowadays are unable to find the hidden layers or any non-linear patterns
present in the data. This project will assist in finding all the hidden layers as well as the non-
linear patterns, which is useful for performing precise prediction of rainfall [1].
Rainfall prediction is the task of predicting the rainfall in a given region. It can be done in
two ways. The first is to analyze the physical laws that affect rainfall, and the second is to
build a system which discovers the hidden patterns or features that affect the physical
factors and the process involved in achieving it. The second one is better because it doesn't
involve complicated mathematical calculations and can be useful for complex and non-linear data
[2]. With systems that don't find the hidden layers and non-linear patterns accurately, the
predictions turn out to be wrong most of the time, and that may lead to huge losses. So, the main
objective of this research work is to find a system that can resolve both of these issues, i.e. one
able to handle the complexity as well as find the hidden layers present, which will give proper
and accurate predictions, thereby assisting the country to develop when it comes to agriculture
and the economy.
2.2 DETAILED DIAGRAM
2.2.1 Back End Module Diagrams:
FRONT END:
2.3 SYSTEM SPECIFICATION:
2.3.1 HARDWARE REQUIREMENTS:
The hardware requirements may serve as the basis for a contract for the implementation of the
system and should therefore be a complete and consistent specification of the whole system.
They are used by software engineers as the starting point for the system design. It shows what
the system does and not how it should be implemented.
PROCESSOR : Intel I5
RAM : 4GB
HARD DISK : 40 GB
2.3.2 SOFTWARE REQUIREMENTS:
The software requirements document is the specification of the system. It should include
both a definition and a specification of requirements. It is a set of what the system should
do rather than how it should do it. The software requirements provide a basis for creating
the software requirements specification. It is useful in estimating cost, planning team
activities, performing tasks and tracking the team's progress
throughout the development activity.
PYTHON IDE : Anaconda Jupyter Notebook
PROGRAMMING LANGUAGE : Python
MODULES:
DATASET
The dataset used in this system contains the rainfall of several regions in and across the country.
It contains rainfall from 1901 to 2015 for these regions, along with annual rainfall and the
rainfall during the transition between consecutive months. There are in total 4116 rows in the
dataset. The dataset was collected from data.gov.in. Category: Rainfall in India. Released under:
NDSAP. Contributor: Ministry of Earth Sciences, IMD. Group: Rainfall. Sectors: Atmospheric
science, earth sciences, science & technology.
DATA CLEANING:
In this module the data is cleaned. After cleaning, the data is grouped as per requirement; this
grouping of data is known as data clustering. Then we check whether there are any missing
values in the dataset. If there is a missing value, it is replaced by a default value. After that, if
any data needs its format changed, this is done. This whole process before prediction is known
as data pre-processing. After that, the data is used for the prediction and forecasting step.
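A minimal sketch of these cleaning steps with pandas, assuming the data.gov.in file has been
saved locally as rainfall_india.csv (an illustrative filename):

import pandas as pd

df = pd.read_csv("rainfall_india.csv")

print(df.isnull().sum())                    # count missing values per column
df = df.fillna(df.mean(numeric_only=True))  # replace gaps with a default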
Data Prediction and forecasting:
In this step, the pre-processed data is taken for the prediction. This prediction can be done by
any of the processes mentioned above, but the Linear Regression algorithm scores higher
prediction accuracy than the other algorithms. So, in this project the linear regression method is
used for the prediction. For that, the pre-processed data is split for training and testing. Then a
predictive object is created to predict the test values, having been trained on the training values.
The object is then used to forecast data for the next few years.
DATA SPLITTING:
For each experiment, we split the entire dataset into a 70% training set and a 30% test set. We
used the training set for resampling, hyperparameter tuning, and training the model, and we used
the test set to test the performance of the trained model. While splitting the data, we specified a
random seed (any fixed number), which ensured the same data split every time the program
executed.
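A minimal sketch of this split; the feature matrix and labels here are random stand-ins, and the
seed value 42 is arbitrary:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 4)        # stand-in meteorological features
y = np.random.randint(0, 2, 100)  # stand-in rain / no-rain labels

# 70/30 split with a fixed random seed for a reproducible partition
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)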
TRAINING AND TESTING:
Algorithms learn from data. They find relationships, develop understanding, make decisions, and
evaluate their confidence from the training data they’re given. And the better the training data is,
the better the model performs.
In fact, the quality and quantity of your training data has as much to do with the success of your
data project as the algorithms themselves.
Now, even if you’ve stored a vast amount of well-structured data, it might not be labeled in a
way that actually works for training your model. For example, autonomous vehicles don’t just
need pictures of the road, they need labeled images where each car, pedestrian, street sign and
more are annotated; sentiment analysis projects require labels that help an algorithm understand
when someone’s using slang or sarcasm; chatbots need entity extraction and careful syntactic
analysis, not just raw language.
In other words, the data you want to use for training usually needs to be enriched or labeled. Or
you might just need to collect more of it to power your algorithms. But chances are, the data
you’ve stored isn’t quite ready to be used to train your classifiers.
Because if you’re trying to make a great model, you need great training data. And we know a
thing or two about that. After all, we’ve labeled over 5 billion rows of data for some of the most
innovative companies in the world. Whether it’s images, text, audio, or, really, any other kind of
data, we can help create the training set that makes your models successful.
REGRESSION:
Random Forest:
Random forest is a supervised learning algorithm. The "forest" it builds is an ensemble of
decision trees, usually trained with the "bagging" method. The general idea of the bagging
method is that a combination of learning models increases the overall result.
Put simply: random forest builds multiple decision trees and merges them together to get a
more accurate and stable prediction.
One big advantage of random forest is that it can be used for both classification and regression
problems, which form the majority of current machine learning systems. Let's look at random
forest in classification, since classification is sometimes considered the building block of
machine learning. Picture, for example, a forest built from two decision trees whose outputs are
combined into a single prediction.
Random forest has nearly the same hyperparameters as a decision tree or a bagging classifier.
Fortunately, there's no need to combine a decision tree with a bagging classifier because you
can easily use random forest's classifier class. With random forest, you can also deal with
regression tasks by using the algorithm's regressor.
Random forest adds additional randomness to the model, while growing the trees. Instead of
searching for the most important feature while splitting a node, it searches for the best feature
among a random subset of features. This results in a wide diversity that generally results in a
better model. Therefore, in random forest, only a random subset of the features is taken into
consideration by the algorithm for splitting a node. You can even make trees more random by
additionally using random thresholds for each feature rather than searching for the best possible
thresholds (like a normal decision tree does).
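A minimal sketch, reusing the X_train/X_test split from the data splitting step; the
hyperparameter values are illustrative defaults, not tuned settings:

from sklearn.ensemble import RandomForestClassifier

# 100 trees, each grown on a bootstrap sample with random feature subsets
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

print(rf.score(X_test, y_test))  # mean accuracy on the held-out set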
Logistic Regression:
It is a classification, not a regression, algorithm. It is used to estimate discrete values (binary
values like 0/1, yes/no, true/false) based on a given set of independent variables. In simple
words, it predicts the probability of occurrence of an event by fitting data to a logit function.
Hence, it is also known as logit regression. Since it predicts a probability, its output values lie
between 0 and 1 (as expected). Mathematically, the log odds of the outcome are modelled as a
linear combination of the predictor variables:
odds = p / (1 - p) = probability of the event occurring / probability of the event not occurring

logit(p) = ln(p / (1 - p)) = b0 + b1X1 + b2X2 + b3X3 + ... + bkXk
As we are classifying text on the basis of a wide feature set, with a binary output (true/false or
true article/fake article), a logistic regression (LR) model is used, since it provides an intuitive
equation to classify problems into binary or multiple classes. We performed hyperparameter
tuning to get the best result for all individual datasets, and multiple parameters were tested
before acquiring the maximum accuracies from the LR model.
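A minimal sketch of logistic regression as a rain / no-rain classifier, reusing the X_train/X_test
split from the data splitting step; 0.5 is SciKit-Learn's default decision threshold:

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)                # fit the sigmoid model

probs = clf.predict_proba(X_test)[:, 1]  # probability of class "1" (rain)
preds = (probs >= 0.5).astype(int)       # apply the decision threshold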
CONFUSION MATRIX:
It is the most commonly used evaluation metric in predictive analysis, mainly because it is very
easy to understand and it can be used to compute other essential metrics such as accuracy, recall,
precision, etc. It is an NxN matrix that describes the overall performance of a model when used
on some dataset, where N is the number of class labels in the classification problem.
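A minimal sketch, assuming y_test and the preds produced by the logistic regression sketch
above:

from sklearn.metrics import confusion_matrix, accuracy_score

cm = confusion_matrix(y_test, preds)  # N x N counts of true vs. predicted
print(cm)
print(accuracy_score(y_test, preds))  # accuracy derived from the matrix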
PERFORMANCE EVALUATION:
ACCURACY:
Though the training accuracy proved to be good and peaked at around 90%, the validation results
seem unsatisfying. The model has shown better results on the training data than on the test
sample; our model yields better results with the training set than with the test set. This particular
result occurs due to overfitting. A model built on data with no preprocessing can cause such
overfitting events to occur. Hence, in certain cases the classifier can be subject to overfitting and
perform poorly on the test data.
Model Loss
The loss function that we considered was binary cross-entropy. When we use this function, the
training set can be seen to be improving in its overall loss, but in reality the test data suggests
otherwise: the loss on the test data actually increases compared to the training sample. The
increase in loss can be attributed to overfitting. In Figure A.3, as more epochs are considered,
the loss on the training set decreases, while the test set starts with a lower loss value but, as more
epochs are considered, its loss actually increases. This illustrates the drawback of the
architecture and methodology being used. At 10 epochs, the losses of both sets are almost equal.
After 10 epochs, the loss of the training set decreases roughly linearly while the loss of the test
set gradually increases. The visual representation and the statistics both confirm the overfitting
of the datasets. However, this can all be addressed by performing normalization and
preprocessing and by adding dropout layers to the network.
2.5 SYSTEM DESIGN:
System design is the process of defining the interfaces, modules and data of a system so that it
satisfies the specified requirements. System design can be seen as the application of systems theory.
The main aim of designing a system is to develop the system architecture by providing the data and
information that are necessary for the implementation of the system.
SYSTEM ARCHITECTURE:
USE CASE DIAGRAM:
Use case diagrams are a way to capture the system's functionality and requirements in
UML diagrams. It captures the dynamic behavior of a live system. A use case diagram
consists of use cases and actors.
DETAILED ARCHITECTURE FLOW
CLASS DIAGRAM:
Class diagrams are the main building block in object-oriented modeling. They are
used to show the different objects in a system, their attributes, their operations and the
relationships among them. The different objects here are the Data owner, Cloud user and Cloud
admin; in this UML diagram, their relationships and properties include uploading the
documents, generating keys for securing the data, maintaining the cloud data, then
downloading using the key and accessing the cloud data.
STATE DIAGRAM:
A state diagram, also known as a state machine diagram or state chart diagram, is
an illustration of the states an object can attain as well as the transitions between
those states in the Unified Modeling Language. Then, all of the possible
existing states are placed in relation to the beginning and the end.
ACTIVITY DIAGRAM:
Activity Diagrams describe how activities are coordinated to provide a service which can
be at different levels of abstraction. Typically, an event needs to be achieved by some
operations, particularly where the operation is intended to achieve a number of different
things that require coordination.
SEQUENCE DIAGRAM:
A sequence diagram is a type of interaction diagram because it describes how and
in what order a group of objects works together. These diagrams are used by software
developers and business professionals to understand requirements for a new system or to
document an existing process.
DATA FLOW DIAGRAM:
Data flow diagrams are used to graphically represent the flow of data in a business
information system. DFD describes the processes that are involved in a system to transfer
data from the input to the file storage and reports generation. Data flow diagrams can be
divided into logical and physical. The logical data flow diagram describes flow of data
through a system to perform certain functionality of a business. The physical data flow
diagram describes the implementation of the logical data flow.
CHAPTER 3
SOFTWARE SPECIFICATION
3.1 GENERAL
3.2 ANACONDA
Anaconda is a free and open-source distribution of the Python and R programming languages for
scientific computing (data science, machine learning applications, large-scale data processing,
predictive analytics, etc.) that aims to simplify package management and deployment.
Anaconda distribution comes with more than 1,500 packages as well as
the Conda package and virtual environment manager. It also includes a GUI, Anaconda
Navigator, as a graphical alternative to the Command Line Interface (CLI).
The big difference between Conda and the pip package manager is in how package dependencies
are managed, which is a significant challenge for Python data science and the reason Conda
exists. Pip installs all Python package dependencies required, whether or not those conflict with
other packages you installed previously.
So your working installation of, for example, Google TensorFlow can suddenly stop working
when you pip install a different package that needs a different version of the NumPy library.
More insidiously, everything might still appear to work, but you now get different results from
your data science, or you are unable to reproduce the same results elsewhere because you didn't
pip install packages in the same order.
Conda analyzes your current environment, everything you have installed, and any version
limitations you specify (e.g. you only want tensorflow>=2.0), and figures out how to install
compatible dependencies. Or it will tell you that what you want can't be done. Pip, by contrast,
will just install the thing you wanted along with its dependencies, even if that breaks other
things. Open-source packages can be individually installed from the Anaconda repository,
Anaconda Cloud (anaconda.org), or your own private repository or mirror, using the conda
install command.
Anaconda, Inc. compiles and builds all the packages in the Anaconda repository itself, and
provides binaries for Windows (32/64-bit), Linux (64-bit) and macOS (64-bit). You can also
install anything on PyPI into a Conda environment using pip, and Conda keeps track of what it
has installed and what pip has installed. Custom packages can be made using the conda build
command, and can be shared with others by uploading them to Anaconda Cloud, PyPI or other
repositories. The default installation of Anaconda2 includes Python 2.7 and Anaconda3 includes
Python 3.7. However, you can create new environments that include any version of Python
packaged with conda.
Anaconda Navigator is a desktop Graphical User Interface (GUI) included in Anaconda
distribution that allows users to launch applications and manage conda packages, environments
and channels without using command-line commands. Navigator can search for packages on
Anaconda Cloud or in a local Anaconda Repository, install them in an environment, run the
packages and update them. It is available for Windows, macOS and Linux.
The following applications are available by default in Navigator:
 JupyterLab
 Jupyter Notebook
 QtConsole
 Spyder
 Glueviz
 Orange
 RStudio
 Visual Studio Code
Microsoft .NET is a set of Microsoft software technologies for rapidly building and integrating
XML Web services, Microsoft Windows-based applications, and Web solutions. The .NET
Framework is a language-neutral platform for writing programs that can easily and securely
interoperate. There is no language barrier with .NET: numerous languages are available to the
developer, including Managed C++, C#, Visual Basic and JScript. The .NET Framework
provides the foundation for components to interact seamlessly, whether locally or remotely on
different platforms. It standardizes common data types and communications protocols so that
components created in different languages can easily interoperate.
".NET" is also the collective name given to various software components built upon the .NET
platform, both products (Visual Studio .NET and Windows .NET Server, for instance) and
services (such as Passport and .NET My Services).
Microsoft Visual Studio is an Integrated Development Environment (IDE) used to develop
computer programs, as well as websites, web apps, web services and mobile apps.
3.3 PYTHON
Python is a powerful multi-purpose programming language created by Guido van Rossum. It has
a simple, easy-to-use syntax, making it a good first language for someone learning computer
programming. Python's main features are:
 Easy to code
 Free and Open Source
 Object-Oriented Language
 GUI Programming Support
 High-Level Language
 Extensible feature
 Python is Portable language
 Python is Integrated language
 Interpreted
 Large Standard Library
 Dynamically Typed Language
Features of Python:
1. Easy to code:
Python is a high-level programming language that is very easy to learn compared to languages
like C, C#, JavaScript or Java. It is very easy to write code in Python, and anybody can learn the
basics in a few hours or days. It is also a developer-friendly language.
2. Free and Open Source:
Python is freely available for download from its official website, python.org. Since it is open
source, the source code is also available to the public, so you can download it, use it, and share it.
3. Object-Oriented Language:
One of the key features of Python is object-oriented programming. Python supports
object-oriented concepts such as classes, objects and encapsulation.
4. GUI Programming Support:
Graphical user interfaces can be built in Python using modules such as PyQt5, PyQt4, wxPython
or Tk. PyQt5 is a popular option for creating graphical apps with Python, as sketched below.
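For illustration, a minimal PyQt5 program is shown below (assuming PyQt5 is installed, e.g. via pip install PyQt5); the label text is an arbitrary placeholder.

# Smallest possible PyQt5 application: a window containing a single label.
import sys
from PyQt5.QtWidgets import QApplication, QLabel

app = QApplication(sys.argv)
label = QLabel("Rainfall prediction GUI placeholder")
label.show()
sys.exit(app.exec_())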
5. High-Level Language:
Python is a high-level language. When we write programs in Python, we do not need to
remember the system architecture, nor do we need to manage memory ourselves.
6. Extensible:
Python is an extensible language: performance-critical parts of a program can be written in C or
C++, compiled, and then called from Python code.
7. Portable:
Python is also a portable language. For example, if we have Python code written on Windows
and we want to run it on another platform such as Linux, Unix or Mac, we do not need to change
it; the same code runs on any platform.
8. Integrated:
Python is also an integrated language, because it can easily be integrated with other languages
such as C and C++; a small example follows.
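As a small, hedged example of this integration, the standard-library ctypes module can call into a compiled C library directly; the sketch below loads the C math library on a Unix-like system (the library name differs on Windows).

# Calling a C function from Python via ctypes (Unix-like systems).
import ctypes
import ctypes.util

libm = ctypes.CDLL(ctypes.util.find_library("m"))   # the C math library
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]
print(libm.sqrt(2.0))   # 1.4142...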
9. Interpreted Language:
Python is an interpreted language: code is executed line by line, and unlike languages such as C,
C++ or Java there is no separate compilation step, which makes code easier to debug. The source
code is first translated into an intermediate form called bytecode.
10. Large Standard Library:
Python has a large standard library that provides a rich set of modules and functions, so you do
not have to write your own code for every single task. There are modules for regular
expressions, unit testing, web browsers and much more; a small example follows.
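For example, the standard-library re module handles regular expressions with no extra installation; the sample string here is an arbitrary illustration.

# Extract all numbers from a string using the standard-library re module.
import re
print(re.findall(r"\d+", "Rainfall was 12 mm on day 3"))   # ['12', '3']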
11. Dynamically Typed Language:
Python is a dynamically typed language: the type of a variable (for example int, float or str) is
decided at run time, not in advance, so we do not need to declare variable types, as the short
example below shows.
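A two-line demonstration:

x = 10
print(type(x))        # <class 'int'>
x = "rainfall"        # the same name can later hold a string
print(type(x))        # <class 'str'>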
APPLICATIONS OF PYTHON:
WEB APPLICATIONS
 You can create scalable web apps using frameworks and CMSs (Content Management
Systems) built on Python. Some popular platforms for creating web apps are Django,
Flask, Pyramid, Plone and Django CMS.
 Sites like Mozilla, Reddit, Instagram and PBS are written in Python.
SCIENTIFIC AND NUMERIC COMPUTING
 There are numerous libraries available in Python for scientific and numeric computing.
Libraries such as SciPy and NumPy are used for general-purpose computing, and there
are domain-specific libraries such as EarthPy for earth science, AstroPy for astronomy,
and so on.
 The language is also heavily used in machine learning, data mining and deep learning.
CREATING SOFTWARE PROTOTYPES
 Python is slow compared to compiled languages like C++ and Java, so it might not be a
good choice when resources are limited and efficiency is a must.
 However, Python is a great language for creating prototypes. For example, you can use
Pygame (a library for creating games) to build a game prototype first; if you like the
prototype, you can then use a language like C++ to create the actual game.
GOOD LANGUAGE TO TEACH PROGRAMMING
 Python is used by many companies to teach programming to kids and newcomers.
 It is a good language with a lot of features and capabilities. Yet, it is one of the easiest
languages to learn because of its simple, easy-to-use syntax.
CHAPTER 4
IMPLEMENTATION
4.1 GENERAL
Python itself was not designed as a numerical computing environment, but together with
libraries such as NumPy it has grown into a platform for implementing numerical algorithms,
including linear algebra routines, for a wide range of applications. NumPy's array syntax is
close to standard linear algebra notation, although a few differences (for example, elementwise
rather than matrix semantics for some operators) may cause newcomers some problems at first.
4.2 CODE IMPLEMENTATION
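The project's full implementation coding is not reproduced here; the sketch below is only an outline of the kind of LR & RF pipeline the report describes. The file name weather.csv, the target column RainTomorrow, and the assumption that all remaining columns are numeric are illustrative placeholders, not the project's actual dataset.

# Outline of a rain/no-rain pipeline with Logistic Regression and Random Forest.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("weather.csv").dropna()          # assumed file; drop incomplete rows
X = df.drop(columns=["RainTomorrow"])             # assumed numeric meteorological features
y = df["RainTomorrow"]                            # assumed rain/no-rain target column

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler()                         # normalise features before fitting
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

for name, model in [("Logistic Regression", LogisticRegression(max_iter=1000)),
                    ("Random Forest", RandomForestClassifier(n_estimators=100,
                                                             random_state=42))]:
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(name, "accuracy:", accuracy_score(y_test, pred))
    print(confusion_matrix(y_test, pred))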
4.3 SNAPSHOTS
RESULT:
CHAPTER 5
CONCLUSION AND REFERENCES
5.1 CONCLUSION
Rainfall forecasting is a daunting task for any algorithm to handle. The algorithms we focused
on were Random Forest (RF) and Logistic Regression (LR). We chose RF & LR because of their
ability to handle larger data, such as the large batch sizes used as input, and because they accept
various types of data; this was a major factor in our decision. The other reason was that they
performed better than other algorithms when handling inconsistencies in the data, such as noise
or incomplete records. Inconsistencies can throw off the accuracy of an algorithm by an
exceptional margin, but RF & LR were capable of handling such data. The final results support
our choice: RF & LR yielded an accuracy of 87%, while the other algorithms reached a
maximum accuracy of 86%. On extremely large datasets, that 1% can make quite a difference in
forecasting. Through our model, we were able to show that RF & LR are viable models for the
field of weather forecasting: they can handle large data, tolerate inconsistencies, and yield higher
accuracies. RF & LR are among the true spearheads in the domain of weather forecasting.
FUTURE WORK:
In future research, we intend to incorporate different ensemble techniques to combine the
diversity of the models and increase forecasting ability. We plan to take data from different
regions to increase the diversity of the dataset and check which model performs well on such
noisy data. The architecture of the network model will be examined further to enhance the
accuracy of predictions. We also intend to deepen our understanding of neural networks by
experimenting with models such as recurrent neural networks (LSTM) and time-delay neural
networks (TDNN). The accuracy of probabilistic models such as Naive Bayes will be examined
as well; to do so, we first need to discretize the continuous features, as sketched below.
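A minimal sketch of that discretization step, assuming scikit-learn: KBinsDiscretizer bins the continuous features before a categorical Naive Bayes model is fit. The random placeholder data and the bin count are illustrative only.

# Discretize continuous features, then fit a categorical Naive Bayes model.
import numpy as np
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import KBinsDiscretizer

X = np.random.rand(200, 4)               # placeholder continuous features
y = np.random.randint(0, 2, size=200)    # placeholder rain/no-rain labels

disc = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform")
X_binned = disc.fit_transform(X).astype(int)   # each feature mapped to bins 0..4

nb = CategoricalNB()
nb.fit(X_binned, y)
print("training accuracy:", nb.score(X_binned, y))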
5.2 REFERENCES
1. Manojit Chattopadhyay and Surajit Chattopadhyay, "Elucidating the role of topological
pattern discovery and support vector machine in generating predictive models for Indian
summer monsoon rainfall", Theoretical and Applied Climatology, pp. 1-12, July 2015.
2. Kumar Abhishek, Abhay Kumar, Rajeev Ranjan and Sarthak Kumar, "A Rainfall
Prediction Model using Artificial Neural Network", 2012 IEEE Control and System
Graduate Research Colloquium (ICSGRC 2012), pp. 82-87, 2012.
3. Minghui Qiu, Peilin Zhao, Ke Zhang, Jun Huang, Xing Shi, Xiaoguang Wang, et al., "A
Short-Term Rainfall Prediction Model using Multi-Task Convolutional Neural
Networks", IEEE International Conference on Data Mining, pp. 395-400, 2017.
4. S Aswin, P Geetha and R Vinayakumar, "Deep Learning Models for the Prediction of
Rainfall", International Conference on Communication and Signal Processing, pp. 0657-
0661, April 3–5, 2018.
5. Xianggen Gan, Lihong Chen, Dongbao Yang and Guang Liu, "The Research Of Rainfall
Prediction Models Based On Matlab Neural Network", Proceedings of IEEE CCIS 2011,
pp. 45-48.
6. Cramer Sam, Michael Kampouridis, Alex A. Freitas and Antonis Alexandridis,
"Predicting Rainfall in the Context of Rainfall Derivatives Using Genetic
Programming", 2015 IEEE Symposium Series on Computational Intelligence, pp. 711-
718.
7. Mohini P. Darji, Vipul K. Dabhi and Harshadkumar B. Prajapati, "Rainfall Forecasting
Using Neural Network: A Survey", 2015 International Conference on Advances in
Computer Engineering and Applications (ICACEA), pp. 706-713.
8. Sandeep Kumar Mohapatra, Anamika Upadhyay and Channabasava Gola, "Rainfall
Prediction based on 100 years of Meteorological Data", 2017 International Conference on
Computing and Communication Technologies for Smart Nation, pp. 162-166.
9. Sankhadeep Chatterjee, Bimal Datta, Soumya Sen and Nilanjan Dey, "Rainfall Prediction
using Hybrid Neural Network Approach", 2018 2nd International Conference on Recent
Advances in Signal Processing, Telecommunications & Computing (SigTelCom), pp. 67-
72.
10. Sunil Navadia, Pintukumar Yadav, Jobin Thomas and Shakila Shaikh, "Weather
Prediction: A Novel Approach for Measuring and Analyzing Weather Data", International
Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC 2017), pp.
414-417.
Ad

More Related Content

Similar to Predicting rainfall with data science in python (20)

employee turnover prediction document.docx
employee turnover prediction document.docxemployee turnover prediction document.docx
employee turnover prediction document.docx
rohithprabhas1
 
Internship Report
Internship ReportInternship Report
Internship Report
Ritoban Gupta
 
Malware analysis
Malware analysisMalware analysis
Malware analysis
Roberto Falconi
 
GEETHAhshansbbsbsbhshnsnsn_INTERNSHIP.pptx
GEETHAhshansbbsbsbhshnsnsn_INTERNSHIP.pptxGEETHAhshansbbsbsbhshnsnsn_INTERNSHIP.pptx
GEETHAhshansbbsbsbhshnsnsn_INTERNSHIP.pptx
Geetha982072
 
IRJET- Factoid Question and Answering System
IRJET-  	  Factoid Question and Answering SystemIRJET-  	  Factoid Question and Answering System
IRJET- Factoid Question and Answering System
IRJET Journal
 
Obj report
Obj reportObj report
Obj report
Manish Raghav
 
FACE COUNTING USING OPEN CV & PYTHON FOR ANALYZING UNUSUAL EVENTS IN CROWDS
FACE COUNTING USING OPEN CV & PYTHON FOR ANALYZING UNUSUAL EVENTS IN CROWDSFACE COUNTING USING OPEN CV & PYTHON FOR ANALYZING UNUSUAL EVENTS IN CROWDS
FACE COUNTING USING OPEN CV & PYTHON FOR ANALYZING UNUSUAL EVENTS IN CROWDS
IRJET Journal
 
Study of R Programming
Study of R ProgrammingStudy of R Programming
Study of R Programming
IRJET Journal
 
HiPEAC 2019 Tutorial - Maestro RTOS
HiPEAC 2019 Tutorial - Maestro RTOSHiPEAC 2019 Tutorial - Maestro RTOS
HiPEAC 2019 Tutorial - Maestro RTOS
Tulipp. Eu
 
Datasciencetools
DatasciencetoolsDatasciencetools
Datasciencetools
jyostnanareshit
 
Distributed Database practicals
Distributed Database practicals Distributed Database practicals
Distributed Database practicals
Vrushali Lanjewar
 
Performance Comparison between Pytorch and Mindspore
Performance Comparison between Pytorch and MindsporePerformance Comparison between Pytorch and Mindspore
Performance Comparison between Pytorch and Mindspore
IJDMS
 
Deepcoder to Self-Code with Machine Learning
Deepcoder to Self-Code with Machine LearningDeepcoder to Self-Code with Machine Learning
Deepcoder to Self-Code with Machine Learning
IRJET Journal
 
report_barc
report_barcreport_barc
report_barc
siontani
 
final_ppt on industrial 6 weeks training in Igap
final_ppt on industrial 6 weeks training in Igapfinal_ppt on industrial 6 weeks training in Igap
final_ppt on industrial 6 weeks training in Igap
namdevmisal
 
Learning Ray, 5th Early Release Max Pumperla
Learning Ray, 5th Early Release Max PumperlaLearning Ray, 5th Early Release Max Pumperla
Learning Ray, 5th Early Release Max Pumperla
gjslndtloto
 
Deep Learning Applications and Image Processing
Deep Learning Applications and Image ProcessingDeep Learning Applications and Image Processing
Deep Learning Applications and Image Processing
ijtsrd
 
Github-Source code management system SRS
Github-Source code management system SRSGithub-Source code management system SRS
Github-Source code management system SRS
Aditya Narayan Swami
 
PARKING ALLOTMENT SYSTEM PROJECT REPORT REPORT.
PARKING ALLOTMENT SYSTEM PROJECT REPORT REPORT.PARKING ALLOTMENT SYSTEM PROJECT REPORT REPORT.
PARKING ALLOTMENT SYSTEM PROJECT REPORT REPORT.
Kamal Acharya
 
IRJET - Automation in Python using Speech Recognition
IRJET -  	  Automation in Python using Speech RecognitionIRJET -  	  Automation in Python using Speech Recognition
IRJET - Automation in Python using Speech Recognition
IRJET Journal
 
employee turnover prediction document.docx
employee turnover prediction document.docxemployee turnover prediction document.docx
employee turnover prediction document.docx
rohithprabhas1
 
GEETHAhshansbbsbsbhshnsnsn_INTERNSHIP.pptx
GEETHAhshansbbsbsbhshnsnsn_INTERNSHIP.pptxGEETHAhshansbbsbsbhshnsnsn_INTERNSHIP.pptx
GEETHAhshansbbsbsbhshnsnsn_INTERNSHIP.pptx
Geetha982072
 
IRJET- Factoid Question and Answering System
IRJET-  	  Factoid Question and Answering SystemIRJET-  	  Factoid Question and Answering System
IRJET- Factoid Question and Answering System
IRJET Journal
 
FACE COUNTING USING OPEN CV & PYTHON FOR ANALYZING UNUSUAL EVENTS IN CROWDS
FACE COUNTING USING OPEN CV & PYTHON FOR ANALYZING UNUSUAL EVENTS IN CROWDSFACE COUNTING USING OPEN CV & PYTHON FOR ANALYZING UNUSUAL EVENTS IN CROWDS
FACE COUNTING USING OPEN CV & PYTHON FOR ANALYZING UNUSUAL EVENTS IN CROWDS
IRJET Journal
 
Study of R Programming
Study of R ProgrammingStudy of R Programming
Study of R Programming
IRJET Journal
 
HiPEAC 2019 Tutorial - Maestro RTOS
HiPEAC 2019 Tutorial - Maestro RTOSHiPEAC 2019 Tutorial - Maestro RTOS
HiPEAC 2019 Tutorial - Maestro RTOS
Tulipp. Eu
 
Distributed Database practicals
Distributed Database practicals Distributed Database practicals
Distributed Database practicals
Vrushali Lanjewar
 
Performance Comparison between Pytorch and Mindspore
Performance Comparison between Pytorch and MindsporePerformance Comparison between Pytorch and Mindspore
Performance Comparison between Pytorch and Mindspore
IJDMS
 
Deepcoder to Self-Code with Machine Learning
Deepcoder to Self-Code with Machine LearningDeepcoder to Self-Code with Machine Learning
Deepcoder to Self-Code with Machine Learning
IRJET Journal
 
report_barc
report_barcreport_barc
report_barc
siontani
 
final_ppt on industrial 6 weeks training in Igap
final_ppt on industrial 6 weeks training in Igapfinal_ppt on industrial 6 weeks training in Igap
final_ppt on industrial 6 weeks training in Igap
namdevmisal
 
Learning Ray, 5th Early Release Max Pumperla
Learning Ray, 5th Early Release Max PumperlaLearning Ray, 5th Early Release Max Pumperla
Learning Ray, 5th Early Release Max Pumperla
gjslndtloto
 
Deep Learning Applications and Image Processing
Deep Learning Applications and Image ProcessingDeep Learning Applications and Image Processing
Deep Learning Applications and Image Processing
ijtsrd
 
Github-Source code management system SRS
Github-Source code management system SRSGithub-Source code management system SRS
Github-Source code management system SRS
Aditya Narayan Swami
 
PARKING ALLOTMENT SYSTEM PROJECT REPORT REPORT.
PARKING ALLOTMENT SYSTEM PROJECT REPORT REPORT.PARKING ALLOTMENT SYSTEM PROJECT REPORT REPORT.
PARKING ALLOTMENT SYSTEM PROJECT REPORT REPORT.
Kamal Acharya
 
IRJET - Automation in Python using Speech Recognition
IRJET -  	  Automation in Python using Speech RecognitionIRJET -  	  Automation in Python using Speech Recognition
IRJET - Automation in Python using Speech Recognition
IRJET Journal
 

Recently uploaded (20)

What is the Philosophy of Statistics? (and how I was drawn to it)
What is the Philosophy of Statistics? (and how I was drawn to it)What is the Philosophy of Statistics? (and how I was drawn to it)
What is the Philosophy of Statistics? (and how I was drawn to it)
jemille6
 
Myasthenia gravis (Neuromuscular disorder)
Myasthenia gravis (Neuromuscular disorder)Myasthenia gravis (Neuromuscular disorder)
Myasthenia gravis (Neuromuscular disorder)
Mohamed Rizk Khodair
 
Redesigning Education as a Cognitive Ecosystem: Practical Insights into Emerg...
Redesigning Education as a Cognitive Ecosystem: Practical Insights into Emerg...Redesigning Education as a Cognitive Ecosystem: Practical Insights into Emerg...
Redesigning Education as a Cognitive Ecosystem: Practical Insights into Emerg...
Leonel Morgado
 
Kumushini_Thennakoon_CAPWIC_slides_.pptx
Kumushini_Thennakoon_CAPWIC_slides_.pptxKumushini_Thennakoon_CAPWIC_slides_.pptx
Kumushini_Thennakoon_CAPWIC_slides_.pptx
kumushiniodu
 
Drive Supporter Growth from Awareness to Advocacy with TechSoup Marketing Ser...
Drive Supporter Growth from Awareness to Advocacy with TechSoup Marketing Ser...Drive Supporter Growth from Awareness to Advocacy with TechSoup Marketing Ser...
Drive Supporter Growth from Awareness to Advocacy with TechSoup Marketing Ser...
TechSoup
 
APGAR SCORE BY sweety Tamanna Mahapatra MSc Pediatric
APGAR SCORE  BY sweety Tamanna Mahapatra MSc PediatricAPGAR SCORE  BY sweety Tamanna Mahapatra MSc Pediatric
APGAR SCORE BY sweety Tamanna Mahapatra MSc Pediatric
SweetytamannaMohapat
 
Tax evasion, Tax planning & Tax avoidance.pptx
Tax evasion, Tax  planning &  Tax avoidance.pptxTax evasion, Tax  planning &  Tax avoidance.pptx
Tax evasion, Tax planning & Tax avoidance.pptx
manishbaidya2017
 
U3 ANTITUBERCULAR DRUGS Pharmacology 3.pptx
U3 ANTITUBERCULAR DRUGS Pharmacology 3.pptxU3 ANTITUBERCULAR DRUGS Pharmacology 3.pptx
U3 ANTITUBERCULAR DRUGS Pharmacology 3.pptx
Mayuri Chavan
 
How to Configure Public Holidays & Mandatory Days in Odoo 18
How to Configure Public Holidays & Mandatory Days in Odoo 18How to Configure Public Holidays & Mandatory Days in Odoo 18
How to Configure Public Holidays & Mandatory Days in Odoo 18
Celine George
 
How to Manage Upselling in Odoo 18 Sales
How to Manage Upselling in Odoo 18 SalesHow to Manage Upselling in Odoo 18 Sales
How to Manage Upselling in Odoo 18 Sales
Celine George
 
How to Create Kanban View in Odoo 18 - Odoo Slides
How to Create Kanban View in Odoo 18 - Odoo SlidesHow to Create Kanban View in Odoo 18 - Odoo Slides
How to Create Kanban View in Odoo 18 - Odoo Slides
Celine George
 
ANTI-VIRAL DRUGS unit 3 Pharmacology 3.pptx
ANTI-VIRAL DRUGS unit 3 Pharmacology 3.pptxANTI-VIRAL DRUGS unit 3 Pharmacology 3.pptx
ANTI-VIRAL DRUGS unit 3 Pharmacology 3.pptx
Mayuri Chavan
 
How to Add Customer Note in Odoo 18 POS - Odoo Slides
How to Add Customer Note in Odoo 18 POS - Odoo SlidesHow to Add Customer Note in Odoo 18 POS - Odoo Slides
How to Add Customer Note in Odoo 18 POS - Odoo Slides
Celine George
 
CNS infections (encephalitis, meningitis & Brain abscess
CNS infections (encephalitis, meningitis & Brain abscessCNS infections (encephalitis, meningitis & Brain abscess
CNS infections (encephalitis, meningitis & Brain abscess
Mohamed Rizk Khodair
 
Ranking_Felicidade_2024_com_Educacao_Marketing Educacional_V2.pdf
Ranking_Felicidade_2024_com_Educacao_Marketing Educacional_V2.pdfRanking_Felicidade_2024_com_Educacao_Marketing Educacional_V2.pdf
Ranking_Felicidade_2024_com_Educacao_Marketing Educacional_V2.pdf
Rafael Villas B
 
PHYSIOLOGY MCQS By DR. NASIR MUSTAFA (PHYSIOLOGY)
PHYSIOLOGY MCQS By DR. NASIR MUSTAFA (PHYSIOLOGY)PHYSIOLOGY MCQS By DR. NASIR MUSTAFA (PHYSIOLOGY)
PHYSIOLOGY MCQS By DR. NASIR MUSTAFA (PHYSIOLOGY)
Dr. Nasir Mustafa
 
Chemotherapy of Malignancy -Anticancer.pptx
Chemotherapy of Malignancy -Anticancer.pptxChemotherapy of Malignancy -Anticancer.pptx
Chemotherapy of Malignancy -Anticancer.pptx
Mayuri Chavan
 
Lecture 2 CLASSIFICATION OF PHYLUM ARTHROPODA UPTO CLASSES & POSITION OF_1.pptx
Lecture 2 CLASSIFICATION OF PHYLUM ARTHROPODA UPTO CLASSES & POSITION OF_1.pptxLecture 2 CLASSIFICATION OF PHYLUM ARTHROPODA UPTO CLASSES & POSITION OF_1.pptx
Lecture 2 CLASSIFICATION OF PHYLUM ARTHROPODA UPTO CLASSES & POSITION OF_1.pptx
Arshad Shaikh
 
Ancient Stone Sculptures of India: As a Source of Indian History
Ancient Stone Sculptures of India: As a Source of Indian HistoryAncient Stone Sculptures of India: As a Source of Indian History
Ancient Stone Sculptures of India: As a Source of Indian History
Virag Sontakke
 
The History of Kashmir Karkota Dynasty NEP.pptx
The History of Kashmir Karkota Dynasty NEP.pptxThe History of Kashmir Karkota Dynasty NEP.pptx
The History of Kashmir Karkota Dynasty NEP.pptx
Arya Mahila P. G. College, Banaras Hindu University, Varanasi, India.
 
What is the Philosophy of Statistics? (and how I was drawn to it)
What is the Philosophy of Statistics? (and how I was drawn to it)What is the Philosophy of Statistics? (and how I was drawn to it)
What is the Philosophy of Statistics? (and how I was drawn to it)
jemille6
 
Myasthenia gravis (Neuromuscular disorder)
Myasthenia gravis (Neuromuscular disorder)Myasthenia gravis (Neuromuscular disorder)
Myasthenia gravis (Neuromuscular disorder)
Mohamed Rizk Khodair
 
Redesigning Education as a Cognitive Ecosystem: Practical Insights into Emerg...
Redesigning Education as a Cognitive Ecosystem: Practical Insights into Emerg...Redesigning Education as a Cognitive Ecosystem: Practical Insights into Emerg...
Redesigning Education as a Cognitive Ecosystem: Practical Insights into Emerg...
Leonel Morgado
 
Kumushini_Thennakoon_CAPWIC_slides_.pptx
Kumushini_Thennakoon_CAPWIC_slides_.pptxKumushini_Thennakoon_CAPWIC_slides_.pptx
Kumushini_Thennakoon_CAPWIC_slides_.pptx
kumushiniodu
 
Drive Supporter Growth from Awareness to Advocacy with TechSoup Marketing Ser...
Drive Supporter Growth from Awareness to Advocacy with TechSoup Marketing Ser...Drive Supporter Growth from Awareness to Advocacy with TechSoup Marketing Ser...
Drive Supporter Growth from Awareness to Advocacy with TechSoup Marketing Ser...
TechSoup
 
APGAR SCORE BY sweety Tamanna Mahapatra MSc Pediatric
APGAR SCORE  BY sweety Tamanna Mahapatra MSc PediatricAPGAR SCORE  BY sweety Tamanna Mahapatra MSc Pediatric
APGAR SCORE BY sweety Tamanna Mahapatra MSc Pediatric
SweetytamannaMohapat
 
Tax evasion, Tax planning & Tax avoidance.pptx
Tax evasion, Tax  planning &  Tax avoidance.pptxTax evasion, Tax  planning &  Tax avoidance.pptx
Tax evasion, Tax planning & Tax avoidance.pptx
manishbaidya2017
 
U3 ANTITUBERCULAR DRUGS Pharmacology 3.pptx
U3 ANTITUBERCULAR DRUGS Pharmacology 3.pptxU3 ANTITUBERCULAR DRUGS Pharmacology 3.pptx
U3 ANTITUBERCULAR DRUGS Pharmacology 3.pptx
Mayuri Chavan
 
How to Configure Public Holidays & Mandatory Days in Odoo 18
How to Configure Public Holidays & Mandatory Days in Odoo 18How to Configure Public Holidays & Mandatory Days in Odoo 18
How to Configure Public Holidays & Mandatory Days in Odoo 18
Celine George
 
How to Manage Upselling in Odoo 18 Sales
How to Manage Upselling in Odoo 18 SalesHow to Manage Upselling in Odoo 18 Sales
How to Manage Upselling in Odoo 18 Sales
Celine George
 
How to Create Kanban View in Odoo 18 - Odoo Slides
How to Create Kanban View in Odoo 18 - Odoo SlidesHow to Create Kanban View in Odoo 18 - Odoo Slides
How to Create Kanban View in Odoo 18 - Odoo Slides
Celine George
 
ANTI-VIRAL DRUGS unit 3 Pharmacology 3.pptx
ANTI-VIRAL DRUGS unit 3 Pharmacology 3.pptxANTI-VIRAL DRUGS unit 3 Pharmacology 3.pptx
ANTI-VIRAL DRUGS unit 3 Pharmacology 3.pptx
Mayuri Chavan
 
How to Add Customer Note in Odoo 18 POS - Odoo Slides
How to Add Customer Note in Odoo 18 POS - Odoo SlidesHow to Add Customer Note in Odoo 18 POS - Odoo Slides
How to Add Customer Note in Odoo 18 POS - Odoo Slides
Celine George
 
CNS infections (encephalitis, meningitis & Brain abscess
CNS infections (encephalitis, meningitis & Brain abscessCNS infections (encephalitis, meningitis & Brain abscess
CNS infections (encephalitis, meningitis & Brain abscess
Mohamed Rizk Khodair
 
Ranking_Felicidade_2024_com_Educacao_Marketing Educacional_V2.pdf
Ranking_Felicidade_2024_com_Educacao_Marketing Educacional_V2.pdfRanking_Felicidade_2024_com_Educacao_Marketing Educacional_V2.pdf
Ranking_Felicidade_2024_com_Educacao_Marketing Educacional_V2.pdf
Rafael Villas B
 
PHYSIOLOGY MCQS By DR. NASIR MUSTAFA (PHYSIOLOGY)
PHYSIOLOGY MCQS By DR. NASIR MUSTAFA (PHYSIOLOGY)PHYSIOLOGY MCQS By DR. NASIR MUSTAFA (PHYSIOLOGY)
PHYSIOLOGY MCQS By DR. NASIR MUSTAFA (PHYSIOLOGY)
Dr. Nasir Mustafa
 
Chemotherapy of Malignancy -Anticancer.pptx
Chemotherapy of Malignancy -Anticancer.pptxChemotherapy of Malignancy -Anticancer.pptx
Chemotherapy of Malignancy -Anticancer.pptx
Mayuri Chavan
 
Lecture 2 CLASSIFICATION OF PHYLUM ARTHROPODA UPTO CLASSES & POSITION OF_1.pptx
Lecture 2 CLASSIFICATION OF PHYLUM ARTHROPODA UPTO CLASSES & POSITION OF_1.pptxLecture 2 CLASSIFICATION OF PHYLUM ARTHROPODA UPTO CLASSES & POSITION OF_1.pptx
Lecture 2 CLASSIFICATION OF PHYLUM ARTHROPODA UPTO CLASSES & POSITION OF_1.pptx
Arshad Shaikh
 
Ancient Stone Sculptures of India: As a Source of Indian History
Ancient Stone Sculptures of India: As a Source of Indian HistoryAncient Stone Sculptures of India: As a Source of Indian History
Ancient Stone Sculptures of India: As a Source of Indian History
Virag Sontakke
 
Ad

Predicting rainfall with data science in python

  • 1. MACHINE LEARNING BASED RAINFALL PREDICTION Abstract: Machine learning and Feature Selection are playing a vital role in internet and health sector also. Rainfall prediction is important as heavy rainfall can lead to many disasters. The prediction helps people to take preventive measures and moreover the prediction should be accurate. There are two types of prediction short term rainfall prediction and long term rainfall. Prediction mostly short term prediction can gives us the accurate result. The main challenge is to build a model for long term rainfall prediction. Heavy precipitation prediction could be a major drawback for earth science department because it is closely associated with the economy and lifetime of human. It’s a cause for natural disasters like flood and drought that square measure encountered by individuals across the world each year. Accuracy of rainfall statement has nice importance for countries like India whose economy is basically dependent on agriculture. Rainfall prediction is the one of the important technique to predict the climatic conditions in any country. This paper proposes a rainfall prediction model using LR & RF for dataset. The input data is having multiple meteorological parameters and to predict the rainfall in more precise. From the results, the proposed machine learning model provides better results than the other algorithms in the literature. The goal of this project is to develop an appropriate machine learning tool which can predict will be rain or not. The algorithm that can be used here are Logistic Regression and Random Forest.
  • 2. TABLE OF CONTENTS CHAPTE R NO. TITLE PAGE NO. 1. CHAPTER 1 : INTRODUCTION 1.1 GENERAL 1.1.1 THE MACHINE LEARNING SYSTEM 1.1.2 FUNDAMENTAL 1.2 JUPYTER 1.3 MACHINE LEARNING 1.4 CLASSIFICATION TECHNIQUES 1.4.1 NEURAL NETWORK AND DEEP LEARNING 1.4.2 METHODOLOGIES - GIVEN INPUT AND EXPECTED OUTPUT 1.5 OBJECTIVE AND SCOPE OF THE PROJECT 1.6 EXISTING SYSTEM 1.6.1 DISADVANTAGES OF EXISTING SYSTEM 1.6.2 LITERATURE SURVEY 1.7 PROPOSED SYSTEM 1.7.1 PROPOSED SYSTEM ADVANTAGES 4 6 9 12 12 13 17 17 2. CHAPTER 2 :PROJECT DESCRIPTION 2.1 INTRODUCTION 2.2 DETAILED DIAGRAM 2.2.1 FRONT END DESIGN 2.2.2 BACK END FLOW 2.3 SOFTWARE SPECIFICATION 28 29 29 30
  • 3. 2.3.1 HARDWARE SPECIFICATION 2.3.2 SOFTWARE SPECIFICATION 2.4 MODULE DESCRIPTION 2.4.1 DATA COLLECTION 2.4.2 DATA AUGUMENTATION 2.4.3 DATA SPLITTING 2.4.4 CLASSIFICATION 2.4.5 PERFORMANCES MATRICES 2.4.6 CONFUSION MATRIX 2.5 MODULE DIAGRAM 2.5.1 SYSTEM ARCHITECTURE 2.5.2 USECASE DIAGRAM 2.5.3 CLASS DIAGRAM 2.5.4 ACTIVITY DIAGRAM 2.5.5 SEQUENCE DIAGRAM 2.5.6 STATE FLOW DIAGRAM 2.5.7 FLOW DIAGRAM 30 31 32 33 34 35 35 36 37 38 40 3. CHAPTER 3 : SOFTWARE SPECIFICATION 3.1 GENERAL 3.2 ANACONDA 3.3 PYTHON 3.2.1 SCIENTIFIC AND NUMERIC COMPUTING 3.2.2 CREATING SOFTWARE PROTOTYPES 3.2.3 GOOD LANGUAGE TO TEACH PROGRAMMING 41 42 43 44 44 44 4. CHAPTER 4 : IMPLEMENTATION 4.1 GENERAL 4.2 IMPLEMENTATION CODING 4.3 SNAPSHOTS 48 48 51 5. CHAPTER 5 : CONCLUSION & REFERENCES 5.1 CONCLUSION 5.2 REFERENCES 55 56
  • 4. CHAPTER I INTRODUCTION 1.1 GENERAL Glossary and Key Terms This section provides a quick reference for several algorithms that are not explicity mentioned in this chapter, but may be of interest to the reader. This should provide the reader with some keywords or useful points of reference for other similar libraries to those discussed in this chapter. BIDMachGPU accelerated machine learning library for algorithms that are not necessarily neural network based. Caret provides a standardised API for many of the most useful machine learning packages for R. For readers who are more comfortable with R, Caret provides a good substitute for Python’s SciKit-Learn. Mathematicais a commercial symbolic mathematical computation system, developed since 1988 by Wolfram, Inc. It provides powerful machine learning techniques “out of the box” such as image classification [4]. MATLAB is short for MATrixLABoratory, which is a commercial numerical computing environment, and is a proprietary programming language by MathWorks. It is very popular at universities where it is often licensed. It was originally built on the idea that most computing applications in some wayrely on storage and manipulations of one fundamental object—the matrix, and this is still a popular approach. -R is used extensively by the statistics community. The software package Caret provides a standardised API for many of R’s machine learning libraries.
  • 5. WEKA is short for the Waikato Environment for Knowledge Analysis [6] and has been a very popular open source tool since its inception in 1993. In 2005Weka received the SIGKDD Data Mining and Knowledge Discovery Service Award: it is easy to learn and simple to use, and provides a GUI to many machine learning algorithms. VowpalWabbitMicrosoft’s machine learning library. Mature and actively developed, with an emphasis on performance. Requirements and Installation The most convenient way of installing the Python requirements for this tutorial is by using the Anaconda scientific Python distribution. Anaconda is a collection of the most commonly used Python packages preconfigured and ready to use. Approximately 150 scientific packages are included in the Anaconda installation. Install the version of Anaconda for your operating system. All Python software described here is available for Windows, Linux, and Macintosh. All code samples presented in this tutorial were tested under Ubuntu Linux 14.04 using Python 2.7. Some code examples may not work on Windows without slight modification (e.g. file paths in Windows use and not / as in UNIX type systems). The main software used in a typical Python machine learning pipeline can consist of almost any combination of the following tools: 1. NumPy, for matrix and vector manipulation 2. Pandas for time series and R-like DataFrame data structures 3. The 2D plotting library matplotlib 4. SciKit-Learn as a source for many machine learning algorithms and utilities 5. Keras for neural networks and deep learning Managing Packages Anaconda comes with its own built in package manager, known as Conda. Using the conda command from the terminal, you can download, update, and delete Python packages. Conda
  • 6. takes care of all dependencies and ensures that packages are preconfigured to work with all other packages you may have installed. Keeping your Python distribution up to date and well maintained is essential in this fast moving field. However, Anaconda makes it particularly easy to manage and keep your scientific stack up to date. Once Anaconda is installed you can manage your Python distribution, and all the scientific packages installed by Anaconda using the conda application from the command line. To list all packages currently installed, use conda list. This will output all packages and their version numbers. Updating all Anaconda packages in your system is performed using the conda update -all command. Conda itself can be updated using the conda update conda command, while Python can be updated using the conda update python command. To search for packages, use the search parameter, e.g. conda search stats where stats is the name or partial name of the package you are searching for. OBJECTIVE AND SCOPE OF THE PROJECT  The objective of this project is to show how sentimental analysis can help improve the user experience over a social network or system interface.  The learning algorithm will learn what our emotions are from statistical data then perform sentiment analysis.  Our main objective is also maintain accuracy in the final result.  The main goal of such a sentiment analysis is to discover how the audience perceives the television show. The Twitter data that is collected will be classified into two categories; positive or negative. An analysis will then be performed on the classified data to investigate what percentage of the audience sample falls into each category.  Particular emphasis is placed on evaluating different machine learning algorithms for the task of twitter sentiment analysis.
  • 7. Jupiter Jupyter, previously known as IPython Notebook, is a web-based, interactive development environment. Originally developed for Python, it has since expanded to support over 40 other programming languages including Julia and R. Jupyter allows for notebooksto be written that contain text, live code, images, and equations. These notebooks can be shared, and can even be hosted on GitHubfor free. For each section of this tutorial, you can download a Juypter notebook that allows you to edit and experiment with the code and examples for each topic. Jupyter is part of the Anaconda distribution; it can be started from the command line using the jupyter command: Machine Learning We will now move on to the task of machine learning itself. In the following sections we will describe how to use some basic algorithms, and perform regression, classification, and clustering on some freely available medical datasets concerning breast cancer and diabetes, and we will also take a look at a DNA microarray dataset.
  • 8. SciKit-Learn SciKit-Learn provides a standardised interface to many of the most commonly used machine learning algorithms, and is the most popular and frequently used library for machine learning for Python. As well as providing many learning algorithms, SciKit-Learn has a large number of convenience functions for common preprocessing tasks (for example, normalisation or k-fold cross validation). SciKit-Learn is a very large software library. Clustering Clustering algorithms focus on ordering data together into groups. In general clustering algorithms are unsupervised—they require no y response variable as input. That is to say, they attempt to find groups or clusters within data where you do not know the label for each sample. SciKit-Learn have many clusteringalgorithms, but in this section we will demonstrate hierarchical clustering on a DNA expression microarray dataset using an algorithm from the SciPy library.
  • 9. We will plot a visualisation of the clustering using what is known as a dendrogram, also using the SciPy library. The goal is to cluster the data properly in logical groups, in this case into the cancer types represented by each sample’s expression data. We do this using agglomerative hierarchical clustering, using Ward’s linkage method: Classification weanalysed data that was unlabelled—we did not know to what class a sample belonged (known as unsupervised learning). In contrast to this, a supervised problem deals with labelled data where are aware of the discrete classes to which each sample belongs. When we wish to predict which class a sample belongs to, we call this a classification problem. SciKit-Learn has a number of algorithms for classification, in this section we will look at the Support Vector Machine. We will work on the Wisconsin breast cancer dataset, split it into a training set and a test set, train a Support Vector Machine with a linear kernel, and test the trained model on an unseen dataset. The Support Vector Machine model should be able to predict if a new sample is malignant or benign based on the features of a new, unseen sample:
  • 10. You will notice that the SVM model performed very well at predicting the malignancy of new, unseen samples from the test set—this can be quantified nicely by printing a number of metrics using the classification report function. Here, the precision, recall, and F1 score (F1 = 2· precision·recall/precision+recall) for each class is shown. The support column is a count of the number of samples for each class. Support Vector Machines are a very powerful tool for classification. They work well in high dimensional spaces, even when the number of features is higher than the number of samples. However, their running time is quadratic to the number of samples so large datasets can become difficult to train. Quadratic means that if you increase a dataset in size by 10 times, it will take 100 times longer to train. Last, you will notice that the breast cancer dataset consisted of 30 features. This makes it difficult to visualize or plot the data. To aid in visualization of highly dimensional data, we can apply a technique called dimensionality reduction. Dimensionality Reduction Another important method in machine learning, and data science in general, is dimensionality reduction. For this example, we will look at the Wisconsin breast cancer dataset once again. The dataset consists of over 500 samples, where each sample has 30 features. The features relate to
  • 11. images of a fine needle aspirate of breast tissue, and the features describe the characteristics of the cells present in the images. All features are real values. The target variable is a discrete value (either malignant or benign) and is therefore a classification dataset. You will recall from the Iris example in Sect. 7.3 that we plotted a scatter matrix of the data, where each feature was plotted against every other feature in the dataset to look for potential correlations (Fig. 3). By examining this plot you could probably find features which would separate the dataset into groups. Because the dataset only had 4 features we were able to plot each feature against each other relatively easily. However, as the numbers of features grow, this becomes less and less feasible, especially if you consider the gene expression example in Sect. 9.4 which had over 6000 features. One method that is used to handle data that is highly dimensional is Principle Component Analysis, or PCA. PCA is an unsupervised algorithm for reducing the number of dimensions of a dataset. For example, for plotting purposes you might want to reduce your data down to 2 or 3 dimensions, and PCA allows you to do this by generating components, which are combinations of the original features, that you can then use to plot your data. PCA is an unsupervised algorithm. You supply it with your data, X, and you specify the number of components you wish to reduce its dimensionality to. This is known as transforming the data:
  • 12. Again, you would not use this model for new data—in a real world scenario, you would, for example, perform a 10-fold cross validation on the dataset, choosing the model parameters that perform best on the cross validation. This model would be much more likely to perform well on new data. At the very least, you would randomly select a subset, say 30% of the data, as a test set and train the model on the remaining 70% of the dataset. You would evaluate the model based on the score on the test set and not on the training set . NEURAL NETWORKS AND DEEP LEARNING While a proper description of neural networks and deep learning is far beyond the scope of this chapter, we will however discuss an example use case of one of the most popular frameworks for deep learning: Keras4. In this section we will use Keras to build a simple neural network to classify theWisconsin breast cancer dataset that was described earlier. Often, deep learning algorithms and neural networks are used to classify images—convolutional neural networks are especially used for image related classification. However, they can of course be used for text or tabular-based data as well. In this we will build a standard feed-forward, densely connected neural network and classify a text-based cancer dataset in order to demonstrate the framework’susage. In this example we are once again using the Wisconsin breast cancer dataset, which consists of 30 features and 569 individual samples. To make it more challenging for the neural network, we
  • 13. will use a training set consisting of only 50% of the entire dataset, and test our neural network on the remaining 50% of the data. Note,Keras is not installed as part of the Anaconda distribution, to install it use pip: Keras additionally requires either Theano or TensorFlow to be installed. In the examples in this chapter we are using Theano as a backend, however the code will work identically for either backend. You can install Theano using pip, but it has a number of dependencies that must be installed first. Refer to the Theano and TensorFlow documentation for more information [12]. Keras is a modular API. It allows you to create neural networks by building a stack of modules, from the input of the neural network, to the output of the neural network, piece by piece until you have a complete network. Also, Keras can be configured to use your Graphics Processing Unit, or GPU. This makes training neural networks far faster than if we were to use a CPU. We begin by importing Keras: We may want to view the network’s accuracy on the test (or its loss on the training set) over time (measured at each epoch), to get a better idea how well it is learning. An epoch is one complete cycle through the training data. Fortunately, this is quite easy to plot as Keras’ fit function returns a history object which we can use to do exactly this: This will result in a plot similar to that shown. Often you will also want to plot the loss on the test set and training set, and the accuracy on the test set and training set. Plotting the loss and accuracy can be used to see if you are over fitting (you experience tiny loss on the training set, but large loss on the test set) and to see when your training has plateaued.
  • 14. Problem Statement: Rainfall prediction is a beneficiary one, but it is a challenging task. Machine learning techniques can use computational methods and predict rainfall by retrieving and integrating the hidden knowledge from the linear and non-linear patterns of past weather data. Various tools and methods for predicting rain are currently available, but there is still a shortage of accurate results. Existing methods are failing whenever massive datasets are used for rainfall prediction. OBJECTIVE: Predicting rainfall is an application of science and technology for predicting the amount of rain over an area. The most important thing is to accurately determine the rainfall for active use of rainfall for water resources, crops, pre-planning of water resources and for agricultural purposes. In earlier rainfall information benefits the farmers for better managing their crops and properties from heavy rainfall. The farmers better manage to increase the economic growth of the country by efficient rainfall information. Prediction of precipitation is necessary to save the life of people’s and properties from flooding. Prediction of rainfall helps people in coastal areas by preventing the floods.
  • 15. SCOPE OF THE PROJECT: The accurate and precise rainfall prediction is still lacking which could assist in diverse fields like agriculture, water reservation and flood prediction. The issue is to formulate the calculations for the rainfall prediction that would be based on the previous findings and similarities and will give the output predictions that are reliable and appropriate. The imprecise and inaccurate predictions are not only the waste of time but also the loss of resources and lead to inefficient management of crisis like poor agriculture, poor water reserves and poor management of floods. Therefore, the need is not to formulate only the rainfall predicting system but also a system that is more accurate and precise as compared to the existing rainfall predictors. EXISTING SYSTEM Supervised learning is built to make prediction, given an unforeseen input instance. A supervised learning algorithm takes a known set of input dataset and its known responses to the data (output) to learn the regression/classification model. An algorithm is used to learn the dataset and train it to generate the model for prediction of rainfall for the response to new data or test data. Supervised learning uses classification algorithms and regression techniques to develop predictive models. 1.NAIVE BAYES: Naive Bayes classifiers calculate the probability of a sample to be of a certain category, based on prior knowledge. They use the Naïve Bayes Theorem, that assumes that the effect of a certain feature of a sample is independent of the other features. That means that each character of a sample contributes independently to determine the probability of the classification of that sample, outputting the category of the highest probability of the sample. In Bernoulli Naïve Bayes the predictors are boolean variables. The parameters that we use to predict the class
  • 16. variable take up only values yes or no.The basic idea of Naive Bayes technique is to find the probabilities of classes assigned to texts by using the joint probabilities of words and classes. 2.LOGISTICREGRESSION: Logistic regression is basically a supervised classification algorithm. In a classification problem, the target variable(or output), y, can take only discrete values for given set of features(or inputs), X. The logistic regression model described relationship between predictors that can be continuous, binary, and categorical. Logistic regression becomes a classification technique only when a decision threshold is brought into the picture. The setting of the threshold value is a very important aspect of logistic regression and is dependent on the classification problem itself. It predicts the probability that a given data entry belongs to the category numbered as “1”. Just like Linear regression assumes that the data follows a linear function, Logistic regression models the data using the sigmoid function. 1.1.1 DISADVANTAGES OF EXISTING SYSTEM Methods have performance limitations because of wide range of variations in data and amount  of data is limited. Issue involved in rainfall classification is choosing the required sampling recess of  Observation-Forecasting of rainfall, which is dependent upon the sampling interval of input data. Less accuracy  LITERATURE SURVEY: 1. TITLE: PRDICTION OF RAINFALL USING MACHINE LEARNING TECHNIQUES Author: Moulana Mohammed, Roshitha Kolapalli, Niharika Golla, Siva Sai Maturi YEAR: - 2020 Abstract:
  • 17. Rainfall prediction is important as heavy rainfall can lead to many disasters. The prediction helps people to take preventive measures and moreover the prediction should be accurate. There are two types of prediction short term rainfall prediction and long term rainfall. Prediction mostly short term prediction can gives us the accurate result. The main challenge is to build a model for long term rainfall prediction. Heavy precipitation prediction could be a major drawback for earth science department because it is closely associated with the economy and lifetime of human. It’s a cause for natural disasters like flood and drought that square measure encountered by individuals across the world each year. Accuracy of rainfall statement has nice importance for countries like India whose economy is basically dependent on agriculture. The dynamic nature of atmosphere, applied mathematics techniques fail to provide sensible accuracy for precipitation statement. The prediction of precipitation using machine learning techniques may use regression. Intention of this project is to offer non-experts easy access to the techniques, approaches utilized in the sector of precipitation prediction and provide a comparative study among the various machine learning techniques. 2. TITLE: RAINFALL PRDICTION USING ACHINE LEARNING ALGORITHM Author: Kumar Arun, Garg Ishan, Kaur Sanmeet YEAR: - 2019 Abstract: This paper introduces current supervised learning models which are based on machine learning algorithm for Rainfall prediction in India. Rainfall is always a major issue across the world as it affects all the major factor on which the human being is depended. In current, Unpredictable and accurate rainfall prediction is a challenging task. We apply rainfall data of India to different machine learning algorithms and compare the accuracy of classifiers such as SVM, Navie Bayes, Logistic Regression, Random Forest and Multilayer Perceptron (MLP). Our motive if to get the optimized result and a better rainfall prediction.
  • 18. 3. TITLE: A NEURAL NETWORK BASED LOCAL RAINFALL PREDICTION Author: Tomoa kiKashiwaoa, Koichi Nakayama, ShinAndo YEAR: - 2017 Abstract: In this study, we develop and test a local rainfall (precipitation) prediction system based on artificial neural networks (ANNs). Our system can automatically obtain meteorological data used for rainfall prediction from the Internet. Meteorological data from equipment installed at a local point is also shared among users in our system. The final goal of the study was the practical use of “big data” on the Internet as well as the sharing of data among users for accurate rainfall prediction. We predicted local rainfall in regions of Japan using data from the Japan Meteorological Agency (JMA). As neural network (NN) models for the system, we used a multi- layer perceptron (MLP) with a hybrid algorithm composed of back-propagation (BP) and random optimization (RO) methods, and radial basis function network (RBFN) with a least squares method (LSM), and compared the prediction performance of the two models. Precipitation (total amount of rainfall above 0.5 mm between 12:00 and 24:00 JST (Japan standard time)) at Matsuyama, Sapporo, and Naha in 2012 was predicted by NNs using meteorological data for each city from 2011. The volume of precipitation was also predicted (total amount above 1.0 mm between 17:00 and 24:00 JST) at 16 points in Japan and compared with predictions by the JMA in order to verify the universality of the proposed system. The experimental results showed that precipitation in Japan can be predicted by the proposed method, and that the prediction performance of the MLP model was superior to that of the RBFN model for the rainfall prediction problem. However, the results were not better than those generated by the JMA. Finally, heavy rainfall (above 10 mm/h) in summer (Jun.–Sep.) afternoons (12:00–24:00 JST) in Tokyo in 2011 and 2012 was predicted using data for Tokyo between 2000 and 2010. The results showed that the volume of precipitation could be accurately predicted and the caching rate of heavy rainfall was high. This suggests that the proposed system can predict unexpected local heavy rainfalls as “guerrilla rainstorms.”
  • 19. 4. TITLE: APPLICATION OF THE DEEP LEARNING FOR THE PREDICTION OF RAINFALL IN SOUTHERN TAIWAN Author: Meng-Hua Yen, Ding-Wei Liu, Yi-Chia Hsin, Chu-En Lin YEAR: - 2018 Abstract: Precipitation is useful information for assessing vital water resources, agriculture, ecosystems and hydrology. Data-driven model predictions using deep learning algorithms are promising for these purposes. Echo state network (ESN) and Deep Echo state network (DeepESN), referred to as Reservoir Computing (RC), are effective and speedy algorithms to process a large amount of data. In this study, we used the ESN and the DeepESN algorithms to analyze the meteorological hourly data from 2002 to 2014 at the Tainan Observatory in the southern Taiwan. The results show that the correlation coefficient by using the DeepESN was better than that by using the ESN and commercial neuronal network algorithms (Back-propagation network (BPN) and support vector regression (SVR), MATLAB, The MathWorks co.), and the accuracy of predicted rainfall by using the DeepESN can be significantly improved compared with those by using ESN, the BPN and the SVR. In sum, the DeepESN is a trustworthy and good method to predict rainfall; it could be applied to global climate forecasts which need high-volume data processing. 5. TITLE: RAINFALL PREDICTION USING MACHINE LEARNING AND NEURAL NETWORK Author: Kaushik Dutta, Gouthaman. P YEAR: - 2020 Abstract:
Rainfall prediction models based mainly on artificial neural networks have been proposed in India until now. This research work makes a comparative study of two rainfall prediction approaches and finds the more accurate one. The present techniques for predicting rainfall do not work well with the complex data involved: the approaches used nowadays are statistical and numerical methods, which do not work accurately when there is a non-linear pattern, and existing systems fail whenever the complexity of the datasets containing past rainfall increases. Hence, to find the best way to predict rainfall, both machine learning and neural networks are studied, and the algorithm that gives more accuracy is then used for prediction. Rainfall is a primary driver of most of our country's economy, and agriculture is the main economic sector; to invest properly in agriculture, a proper estimate of rainfall is needed. Along with agriculture, rainfall prediction is needed for people in coastal areas, who are at high risk from heavy rainfall and floods and should be made aware of the rainfall much earlier so that they can plan their stay accordingly. Areas that have less rainfall and face water scarcity should have rainwater harvesters to collect rainwater, and establishing a proper rainwater harvester also requires a rainfall estimate. Weather forecasting is the easiest and fastest way to reach a wide audience, so this research work can be used by weather forecasting channels to make prediction news more accurate and spread it to all parts of the country.

6. TITLE: STUDY OF SHORT TERM RAIN FORECASTING USING MACHINE LEARNING BASED APPROACH
Author: M. S. Balamurugan & R. Manojkumar
YEAR: 2019
Abstract: Weather forecasting is still dependent on statistical and numerical analysis in most parts of the world. Though statistical and numerical analysis provide better results, they depend heavily on stable historical relationships with the predictand and the value of the predictand at a future time.
On the other hand, machine learning explores new algorithmic approaches to prediction based on data-driven learning. Climatic changes at a location depend on variable factors such as temperature, precipitation, atmospheric pressure, humidity, and wind speed, and on combinations of other such factors, all variable in nature. Since climatic changes are location-specific, statistical and numerical approaches fail at times, and an alternate method is needed, such as a machine-learning-based study of the weather forecast. In this study it was observed that the percentage departure of rainfall ranged from 46% to 91% for the month of June 2019, as reported by the India Meteorological Department (IMD) using traditional forecasting methods, whereas the approach implemented here using machine learning achieved much better rainfall prediction compared to the statistical methods.

1.7 PROPOSED SYSTEM
In the proposed work, regression analysis is used. Regression analysis deals with the dependence of one variable (the dependent variable) on one or more other variables (the independent variables); it is useful for estimating and/or predicting the mean or average value of the former in terms of known or fixed values of the latter. For example, a person's salary may be modeled on his/her experience: here, experience is the independent variable and salary is the dependent variable. Simple linear regression defines the relationship between a single dependent variable and a single independent variable. The equation below is the general form of regression:

y = β0 + β1x + ε

where β0 and β1 are parameters and ε is a probabilistic error term. Regression analysis is a vital tool for modeling and analyzing information. It is used for predictive analysis such as forecasting rainfall or weather and predicting trends in business, finance, and marketing. It can also be used for correcting errors and providing quantitative support. The advantages of regression analysis are:
1. It is a powerful technique for testing the relationship between one dependent variable and many independent variables.
2. It allows researchers to control extraneous factors.
3. Regression assesses the cumulative effect of multiple factors.
4. It helps to attain a measure of error, using the regression line as a base for estimation.
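As an illustration of the regression equation above, the following is a minimal sketch (not the project's actual code) of fitting y = β0 + β1x with scikit-learn; the experience/salary numbers are hypothetical.

```python
# Minimal sketch of simple linear regression (y = b0 + b1*x), illustrating
# the equation above with the hypothetical experience/salary example; this
# is not the project's actual implementation.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: years of experience (independent) vs. salary (dependent).
x = np.array([[1], [2], [3], [5], [7], [10]])              # years of experience
y = np.array([30000, 35000, 41000, 52000, 63000, 80000])   # salary

model = LinearRegression().fit(x, y)
print("b0 (intercept):", model.intercept_)
print("b1 (slope):", model.coef_[0])
print("Predicted salary for 6 years:", model.predict([[6]])[0])
```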
ARCHITECTURE FOR PROPOSED SYSTEM:

Proposed approach: The back-propagation technique works well with less complex systems, but as the complexity of the system increases, the back-propagation method's accuracy decreases. The proposed network deals with four inputs and three output classes. The four input parameters used are:
1. Air temperature
2. Air humidity
3. Wind speed
4. Sunshine duration
The three output classes used are:
1. Rainfall
2. Medium rainfall
3. High rainfall
The steps associated with the proposed system are: input of data, preprocessing of data, splitting of data, training of the algorithms, testing on the dataset, comparing the algorithms, selecting the best one, prediction with the more accurate algorithm, and the final result (a minimal sketch of such a network is given after the advantages list below). The main reason for not predicting with both algorithms is to reduce the complexity of the whole system: the system first finds the more accurate algorithm between machine learning and the neural network, then predicts with the better one. The results are produced as graphs and Excel sheets. For preprocessing, all results are produced as different graphs; for machine learning and the neural network, the accuracy is reported as metrics as well as in an Excel sheet, and the predicted values are written to an Excel sheet containing two columns, ID and predicted value. The IDs are the same as those in the datasheet; to determine which region a prediction is for, the IDs should be matched with the IDs present in the dataset.
PROPOSED SYSTEM ADVANTAGES
 Speed and very low complexity, which makes it well suited to operate in real scenarios.
 The computation load needed for processing is much reduced, combined with very simple classifiers.
 Ability to learn and extract complex features.
 With its simplicity and fast processing time, the proposed algorithm is suitable for implementation in an embedded system or mobile application with limited processing resources.
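As referenced above, here is a minimal sketch, under stated assumptions, of a network with the four inputs and three rainfall classes listed; it uses scikit-learn's MLPClassifier (which trains by back-propagation), and the training rows are hypothetical, not the report's real dataset.

```python
# Minimal sketch of the 4-input / 3-class network described above, using
# scikit-learn's MLPClassifier (trained by back-propagation). The data below
# is hypothetical; the report's real pipeline uses a meteorological dataset.
import numpy as np
from sklearn.neural_network import MLPClassifier

# Features: [air temperature (C), air humidity (%), wind speed (m/s), sunshine (h)]
X = np.array([
    [31.0, 85.0, 3.2, 2.0],
    [29.5, 90.0, 4.1, 1.0],
    [33.0, 60.0, 2.0, 8.5],
    [30.0, 75.0, 2.8, 5.0],
])
# Labels: 0 = rainfall, 1 = medium rainfall, 2 = high rainfall
y = np.array([1, 2, 0, 0])

clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=42)
clf.fit(X, y)
print(clf.predict([[30.5, 88.0, 3.5, 1.5]]))  # predicted rainfall class
```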
CHAPTER 2
PROJECT DESCRIPTION
2.1 INTRODUCTION
In today's situation, rainfall is one of the factors responsible for most of the significant activities across the world. In India, agriculture is one of the important factors deciding the economy of the country, and agriculture is heavily dependent on rainfall. Apart from that, in coastal areas across the world, knowing the amount of rainfall is very necessary, and in areas that face water scarcity, prior prediction of rainfall is needed to establish rainwater harvesters. This project deals with the prediction of rainfall using machine learning and neural networks. The project performs a comparative study of machine learning approaches and neural network approaches and accordingly identifies the more efficient approach for rainfall prediction. First of all, preprocessing is performed. For machine learning, LASSO regression is used, and for the neural network, an ANN (artificial neural network) approach is used. After computation, the types of errors and the accuracy of both LASSO and ANN are compared, and a conclusion is drawn accordingly. To reduce the system's complexity, the prediction is done with the approach that has better accuracy. The prediction uses a dataset that contains rainfall data from 1901 to 2015 for different regions across the country; it contains month-wise data as well as annual rainfall data. Currently, rainfall prediction has become one of the key factors for most water conservation systems in and across the country. One of the biggest challenges is the complexity present in rainfall data: most rainfall prediction systems nowadays are unable to find the hidden layers or any non-linear patterns present in the data. This project assists in finding all the hidden layers as well as the non-linear patterns, which is useful for performing precise prediction of rainfall [1]. Rainfall prediction is the task of predicting the rainfall in a given region. It can be done in two ways. The first is to analyze the physical laws that affect rainfall; the second is to build a system that discovers the hidden patterns or features that affect the physical factors and the process involved. The second is better because it does not require modeling the underlying physics explicitly and can be useful for complex and non-linear data [2]. Because existing systems do not find the hidden layers and non-linear patterns accurately, predictions turn out to be wrong most of the time, which may lead to huge losses. So, the main objective of this research work is to find a system that can resolve both issues, i.e., handle the complexity as well as the hidden layers present, giving proper and accurate predictions and thereby assisting the country to develop its agriculture and economy.
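The introduction names LASSO regression as the machine-learning approach. The following is a minimal, illustrative sketch of LASSO with scikit-learn; the synthetic features stand in for the real monthly rainfall data, and this is not the project's actual code.

```python
# Minimal sketch of LASSO regression, the machine-learning approach named
# above; purely illustrative, with synthetic monthly-rainfall-like features.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 12))                # e.g. 12 monthly rainfall features
y = X @ rng.normal(size=12) + rng.normal(scale=0.1, size=100)  # synthetic target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
lasso = Lasso(alpha=0.1).fit(X_train, y_train)
print("R^2 on test set:", lasso.score(X_test, y_test))
print("Non-zero coefficients:", np.count_nonzero(lasso.coef_))
```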
2.2 DETAILED DIAGRAM
2.2.1 BACK END MODULE DIAGRAMS:
FRONT END:
2.3 SYSTEM SPECIFICATION
2.3.1 HARDWARE REQUIREMENTS:
The hardware requirements may serve as the basis for a contract for the implementation of the system and should therefore be a complete and consistent specification of the whole system. They are used by software engineers as the starting point for the system design. They state what the system does, not how it should be implemented.

PROCESSOR : Intel i5
RAM : 4 GB
HARD DISK : 40 GB

2.3.2 SOFTWARE REQUIREMENTS:
The software requirements document is the specification of the system. It should include both a definition and a specification of requirements. It is a statement of what the system should do rather than how it should do it. The software requirements provide a basis for creating the software requirements specification. They are useful in estimating cost, planning team activities, performing tasks, and tracking the team's progress throughout the development activity.

PYTHON IDE : Anaconda, Jupyter Notebook
PROGRAMMING LANGUAGE : Python

MODULES:
DATASET
The dataset used in this system contains the rainfall of several regions in and across the country. It contains month-wise rainfall from 1901 to 2015, along with annual rainfall and the rainfall across the transitions between consecutive months. There are in total 4116 rows in the dataset. The dataset was collected from data.gov.in.
Category – Rainfall in India
Released under – NDSAP
Contributor – Ministry of Earth Sciences, IMD
Group – Rainfall
Sectors – Atmosphere science, earth sciences, science & technology

DATA CLEANING:
In this module the data is cleaned. After cleaning, the data is grouped as per requirement; this grouping of data is known as data clustering. The dataset is then checked for missing values; if there is a missing value, it is replaced with a default value. After that, any data that needs a format change is converted. This whole process before prediction is known as data pre-processing. The data is then used for the prediction and forecasting step.

DATA PREDICTION AND FORECASTING:
In this step, the pre-processed data is taken for prediction. The prediction can be done by any of the processes mentioned above, but the linear regression algorithm scores higher prediction accuracy than the other algorithms, so in this project the linear regression method is used for prediction. For that, the pre-processed data is split for training and testing purposes. A predictive object is created and trained on the training values, then used to predict the test values. The object is then used to forecast data for the next few years.

DATA SPLITTING:
For each experiment, we split the entire dataset into a 70% training set and a 30% test set. We used the training set for resampling, hyperparameter tuning, and training the model, and we used the test set to test the performance of the trained model. While splitting the data, we specified a random seed (any fixed number), which ensured the same data split every time the program executed. A minimal sketch of this cleaning-and-splitting step follows.
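As referenced above, here is a minimal sketch of the cleaning and 70/30 splitting steps; the file name rainfall.csv and the ANNUAL column are hypothetical stand-ins for the data.gov.in dataset, and this is not the report's exact code.

```python
# Minimal sketch of the data cleaning and 70/30 split described above.
# 'rainfall.csv' and the 'ANNUAL' column are hypothetical stand-ins for the
# data.gov.in rainfall dataset.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("rainfall.csv")

# Data cleaning: fill missing values with a default (here, the column mean).
df = df.fillna(df.mean(numeric_only=True))

numeric = df.select_dtypes("number")
X = numeric.drop(columns=["ANNUAL"])   # monthly features (hypothetical name)
y = numeric["ANNUAL"]                  # target: annual rainfall

# 70% train / 30% test with a fixed random seed for a reproducible split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(len(X_train), "training rows,", len(X_test), "test rows")
```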
TRAINING AND TESTING:
Algorithms learn from data. They find relationships, develop understanding, make decisions, and evaluate their confidence from the training data they are given, and the better the training data is, the better the model performs. In fact, the quality and quantity of the training data have as much to do with the success of a data project as the algorithms themselves. Even a vast amount of well-structured data might not be labeled in a way that actually works for training a model. For example, autonomous vehicles do not just need pictures of the road, they need labeled images where each car, pedestrian, and street sign is annotated; sentiment analysis projects require labels that help an algorithm understand when someone is using slang or sarcasm; chatbots need entity extraction and careful syntactic analysis, not just raw language. In other words, the data used for training usually needs to be enriched or labeled, or more of it may need to be collected. Data that has simply been stored is rarely ready, as collected, to train classifiers: a great model needs great training data, whether it is images, text, audio, or any other kind of data.
REGRESSION:
Random Forest:
Random forest is a supervised learning algorithm. The "forest" it builds is an ensemble of decision trees, usually trained with the "bagging" method. The general idea of the bagging method is that a combination of learning models improves the overall result. Put simply: random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction. One big advantage of random forest is that it can be used for both classification and regression problems, which form the majority of current machine learning systems. Consider random forest in classification, since classification is sometimes considered the building block of machine learning; the figure below illustrates a random forest with two trees. Random forest has nearly the same hyperparameters as a decision tree or a bagging classifier. Fortunately, there is no need to combine a decision tree with a bagging classifier, because you can simply use the classifier class of random forest. With random forest, you can also deal with regression tasks by using the algorithm's regressor.
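A minimal sketch of random forest used both as a classifier and as a regressor, as discussed above; the synthetic data is illustrative only and stands in for real weather features.

```python
# Minimal sketch of random forest for classification and regression, as
# described above; the synthetic data stands in for real rainfall features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                  # four weather-like features
y_class = (X[:, 0] + X[:, 1] > 0).astype(int)  # binary label: rain / no rain
y_reg = X @ np.array([2.0, 1.0, 0.5, 0.0])     # continuous rainfall amount

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y_class)
reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y_reg)

print("classifier accuracy (train):", clf.score(X, y_class))
print("regressor R^2 (train):", reg.score(X, y_reg))
```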
Random forest adds additional randomness to the model while growing the trees. Instead of searching for the most important feature while splitting a node, it searches for the best feature among a random subset of features. This results in a wide diversity that generally yields a better model. Therefore, in random forest, only a random subset of the features is taken into consideration by the algorithm for splitting a node. You can make trees even more random by additionally using random thresholds for each feature rather than searching for the best possible thresholds (as a normal decision tree does).

Logistic Regression:
Logistic regression is a classification algorithm, not a regression algorithm. It is used to estimate discrete values (binary values like 0/1, yes/no, true/false) based on a given set of independent variables. In simple words, it predicts the probability of occurrence of an event by fitting data to a logit function; hence, it is also known as logit regression. Since it predicts a probability, its output values lie between 0 and 1 (as expected). Mathematically, the log odds of the outcome are modelled as a linear combination of the predictor variables:

odds = p/(1-p) = probability of event occurrence / probability of no event occurrence
logit(p) = ln(p/(1-p)) = b0 + b1X1 + b2X2 + b3X3 + ... + bkXk

As we are classifying on the basis of a wide feature set with a binary output (rain/no rain), a logistic regression (LR) model is used, since it provides an intuitive equation to classify problems into binary or multiple classes. We performed hyperparameter tuning to get the best result for each dataset; multiple parameter settings were tested before obtaining the maximum accuracies from the LR model.
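A minimal sketch of logistic regression producing probabilities in (0, 1) for a binary rain / no-rain label, matching the equations above; the data is synthetic and illustrative, not the project's code.

```python
# Minimal sketch of logistic regression for a binary rain / no-rain label,
# as described above; synthetic data, illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))             # weather-like features
y = (X[:, 1] - X[:, 2] > 0).astype(int)   # 1 = rain, 0 = no rain

lr = LogisticRegression(max_iter=1000).fit(X, y)
print("P(rain) for one sample:", lr.predict_proba(X[:1])[0, 1])  # value in (0, 1)
print("coefficients b1..bk:", lr.coef_[0], "intercept b0:", lr.intercept_[0])
```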
CONFUSION MATRIX:
The confusion matrix is the most commonly used evaluation tool in predictive analysis, mainly because it is very easy to understand and because it can be used to compute other essential metrics such as accuracy, recall, precision, etc. It is an NxN matrix that describes the overall performance of a model on some dataset, where N is the number of class labels in the classification problem. (A short sketch of computing these metrics is given after this section.)

PERFORMANCE EVALUATION:
ACCURACY:
Though the training accuracy proved to be good and peaked at about 90%, the validation results were not satisfying: the model showed better results for the training data than for the test sample. This happens because the model overfits the training data and generalizes poorly to the test data. A model built on data with no preprocessing can cause such overfitting to occur; hence, in certain settings the classifier is subject to overfitting.

MODEL LOSS:
The loss function we considered was binary cross-entropy. With this function, the training set appears to improve in overall loss, but the test data suggests otherwise: the test loss actually increases compared to the training loss. The increase in loss can be attributed to the overfitting of the data. In Figure A.3, as more epochs are considered, the loss decreases on the training set, while the test set starts with a lower loss value but, as more epochs are considered, its loss actually increases. This illustrates the drawback of the architecture and methodology being used. At 10 epochs, the losses of the two sets are almost equal; after 10 epochs, the loss of the training set decreases roughly linearly while the loss of the test set gradually increases. The visual representation and the statistics both confirm the overfitting. However, this can be fixed by performing normalization and preprocessing and by adding dropout layers; after adding a dropout layer and normalizing the feature set, the overfitting is reduced.
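As referenced in the confusion matrix section above, here is a minimal sketch of computing the confusion matrix, accuracy, precision, and recall with scikit-learn; the label vectors are hypothetical.

```python
# Minimal sketch of the confusion matrix and metrics described above,
# using scikit-learn on hypothetical true/predicted labels.
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = rain, 0 = no rain (hypothetical)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))          # 2x2 matrix for N = 2 classes
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
```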
2.5 SYSTEM DESIGN:
Designing a system is the process used to define the interfaces, modules, and data of a system so that it satisfies the specified requirements. System design can be seen as the application of systems theory. The main purpose of designing a system is to develop the system architecture by giving the data and information that are necessary for the implementation of the system.

SYSTEM ARCHITECTURE:

USECASE DIAGRAM:
Use case diagrams are a way to capture the system's functionality and requirements in UML. They capture the dynamic behavior of a live system. A use case diagram consists of use cases and actors.

DETAILED ARCHITECTURE FLOW
CLASS DIAGRAM:
Class diagrams are the main building block in object-oriented modeling. They are used to show the different objects in a system, their attributes, their operations, and the relationships among them.

STATE DIAGRAM:
A state diagram, also known as a state machine diagram or statechart diagram, is an illustration of the states an object can attain, as well as the transitions between those states, in the Unified Modeling Language. All of the possible states are placed in relation to the beginning and the end.

ACTIVITY DIAGRAM:
Activity diagrams describe how activities are coordinated to provide a service, which can be at different levels of abstraction. Typically, an event needs to be achieved by some operations, particularly where the operation is intended to achieve a number of different things that require coordination.
SEQUENCE DIAGRAM:
A sequence diagram is a type of interaction diagram because it describes how, and in what order, a group of objects works together. These diagrams are used by software developers and business professionals to understand the requirements for a new system or to document an existing process.
DATA FLOW DIAGRAM:
Data flow diagrams are used to graphically represent the flow of data in a business information system. A DFD describes the processes that are involved in a system to transfer data from the input to file storage and report generation. Data flow diagrams can be divided into logical and physical. The logical data flow diagram describes the flow of data through a system to perform certain functionality of the business. The physical data flow diagram describes the implementation of the logical data flow.

STATE FLOW DIAGRAM:
CHAPTER 3
SOFTWARE SPECIFICATION
3.1 GENERAL
3.2 ANACONDA
Anaconda is a free and open-source distribution of the Python and R programming languages for scientific computing (data science, machine learning applications, large-scale data processing, predictive analytics, etc.) that aims to simplify package management and deployment. The Anaconda distribution comes with more than 1,500 packages as well as the Conda package and virtual environment manager. It also includes a GUI, Anaconda Navigator, as a graphical alternative to the command line interface (CLI).
The big difference between Conda and the pip package manager is in how package dependencies are managed, which is a significant challenge for Python data science and the reason Conda exists. Pip installs all required Python package dependencies, whether or not they conflict with other packages installed previously. So a working installation of, for example, Google TensorFlow can suddenly stop working when you pip install a different package that needs a different version of the NumPy library. More insidiously, everything might still appear to work, but now you get different results from your data science, or you are unable to reproduce the same results elsewhere because you didn't pip install packages in the same order. Conda analyzes your current environment, everything you have installed, and any version limitations you specify (e.g. you only want tensorflow>=2.0), and figures out how to install compatible dependencies, or it will tell you that what you want can't be done. Pip, by contrast, will just install the thing you wanted along with its dependencies, even if that breaks other things. Open-source packages can be individually installed from the Anaconda repository, Anaconda Cloud (anaconda.org), or your own private repository or mirror, using the conda install command. Anaconda Inc. compiles and builds all the packages in the Anaconda repository itself, and provides binaries for Windows 32/64-bit, Linux 64-bit, and macOS 64-bit.
You can also install anything on PyPI into a Conda environment using pip, and Conda knows what it has installed and what pip has installed. Custom packages can be made using the conda build command and can be shared with others by uploading them to Anaconda Cloud, PyPI, or other repositories. The default installation of Anaconda2 includes Python 2.7, and Anaconda3 includes Python 3.7; however, you can create new environments that include any version of Python packaged with conda.
Anaconda Navigator is a desktop graphical user interface (GUI) included in the Anaconda distribution that allows users to launch applications and manage conda packages, environments, and channels without using command-line commands. Navigator can search for packages on Anaconda Cloud or in a local Anaconda repository, install them in an environment, run the packages, and update them. It is available for Windows, macOS, and Linux. The following applications are available by default in Navigator:
 JupyterLab
 Jupyter Notebook
 QtConsole
 Spyder
 Glueviz
 Orange
 RStudio
 Visual Studio Code
3.3 PYTHON
Python is a powerful multi-purpose programming language created by Guido van Rossum. It has a simple, easy-to-use syntax, making it a good language for someone learning computer programming for the first time.

FEATURES OF PYTHON:
1. Easy to code: Python is a high-level programming language and is very easy to learn compared to languages like C, C#, JavaScript, or Java. It is easy to write code in Python, anybody can learn the basics in a few hours or days, and it is a developer-friendly language.
2. Free and open source: Python is freely available on its official website. Since it is open source, the source code is also available to the public, so you can download it, use it, and share it.
3. Object-oriented language: One of the key features of Python is object-oriented programming. Python supports object-oriented concepts such as classes, objects, and encapsulation.
4. GUI programming support: Graphical user interfaces can be made using modules such as PyQt5, PyQt4, wxPython, or Tk. PyQt5 is one of the most popular options for creating graphical apps with Python.
5. High-level language: Python is a high-level language. When we write programs in Python, we do not need to remember the system architecture or manage memory ourselves.
6. Extensible: Python is an extensible language. We can write some of our code in C or C++ and compile it as a C/C++ extension.
7. Portable: Python is also a portable language. For example, if we have Python code for Windows and we want to run it on another platform such as Linux, Unix, or Mac, we do not need to change it; the same code runs on any platform.
8. Integrated: Python is an integrated language because it can easily be integrated with other languages like C and C++.
9. Interpreted: Python is an interpreted language: code is executed line by line. Unlike languages such as C, C++, and Java, there is no separate compile step, which makes it easier to debug code. The source code of Python is converted into an intermediate form called bytecode.
10. Large standard library: Python has a large standard library that provides a rich set of modules and functions, so you do not have to write your own code for every single thing. There are libraries for regular expressions, unit testing, web browsers, and more.
11. Dynamically typed: Python is a dynamically typed language. The type of a variable (for example int, double, or long) is decided at run time, not in advance, so we do not need to declare variable types; a small example follows.
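A small, illustrative example of the dynamic typing and standard-library features just listed:

```python
# Small illustration of features 9-11 above: dynamic typing and the
# standard library ('re' is available with no third-party install).
import re

x = 42          # x is an int here...
x = "rainfall"  # ...and a str now; the type is decided at run time
print(type(x))  # <class 'str'>

# Standard-library regular expressions:
print(re.findall(r"\d+", "rain 12.5 mm on day 3"))  # ['12', '5', '3']
```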
APPLICATIONS OF PYTHON:
WEB APPLICATIONS
 You can create scalable web apps using frameworks and CMSs (content management systems) built on Python. Popular platforms for creating web apps include Django, Flask, Pyramid, Plone, and Django CMS.
 Sites like Mozilla, Reddit, Instagram, and PBS are written in Python.
SCIENTIFIC AND NUMERIC COMPUTING
 There are numerous libraries available in Python for scientific and numeric computing. Libraries like SciPy and NumPy are used in general-purpose computing, and there are domain-specific libraries like EarthPy for earth science and AstroPy for astronomy.
 The language is also heavily used in machine learning, data mining, and deep learning.
CREATING SOFTWARE PROTOTYPES
 Python is slow compared to compiled languages like C++ and Java, so it might not be a good choice if resources are limited and efficiency is a must.
 However, Python is a great language for creating prototypes. For example, you can use Pygame (a library for creating games) to create a game's prototype first; if you like the prototype, you can use a language like C++ to build the actual game.
GOOD LANGUAGE TO TEACH PROGRAMMING
 Python is used by many companies to teach programming to kids.
 It is a good language with a lot of features and capabilities, yet it is one of the easiest languages to learn because of its simple, easy-to-use syntax.
CHAPTER 4
IMPLEMENTATION
4.1 GENERAL
The implementation uses Python, whose ecosystem of numerical libraries (such as NumPy) makes it straightforward to implement numerical linear algebra routines and numerical algorithms for a wide range of applications. The notation used by these libraries is very similar to standard linear algebra notation, though a few extensions may cause some initial confusion.
4.2 CODE IMPLEMENTATION
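The implementation code in the original document appears as screenshots that are not reproduced here. As a stand-in, the following is a minimal sketch, under stated assumptions, of the pipeline the report describes: load the rainfall data, preprocess it, split 70/30, train Logistic Regression (LR) and Random Forest (RF), and compare their accuracies. The file name rainfall.csv and the RAIN_TOMORROW label column are hypothetical; this is not the report's original code.

```python
# Minimal sketch of the pipeline described in this report: load rainfall data,
# clean it, split 70/30, train LR and RF, and compare their accuracies.
# 'rainfall.csv' and the numeric 0/1 'RAIN_TOMORROW' label column are
# hypothetical stand-ins.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

df = pd.read_csv("rainfall.csv")
df = df.fillna(df.mean(numeric_only=True))        # replace missing values

numeric = df.select_dtypes("number")
X = numeric.drop(columns=["RAIN_TOMORROW"])       # meteorological features
y = numeric["RAIN_TOMORROW"]                      # 1 = rain, 0 = no rain

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42          # fixed seed: reproducible split
)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(n_estimators=100, random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(name, "accuracy:", accuracy_score(y_test, pred))
    print(confusion_matrix(y_test, pred))
```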
CHAPTER 5
CONCLUSION AND REFERENCES
5.1 CONCLUSION
Rainfall forecasting is a daunting task for any algorithm to handle. The algorithms we focused on were Random Forest (RF) and Logistic Regression (LR). The reason we chose RF & LR was their ability to handle larger data, such as the large batch sizes that were input, and to accept various types of data; this was a major factor in our decision. The other reason was that they performed better than other algorithms when handling inconsistencies in the data, such as noise or incomplete records. Inconsistencies can throw off the accuracy of an algorithm by an exceptional margin, but RF & LR were capable of handling such data. The final results agree with our choice, as RF & LR yielded an accuracy of 87%, whereas the other algorithms reached a maximum accuracy of 86%; for extremely large datasets, that 1% can make quite a difference in forecasting. Through our model, we were able to show that RF & LR are viable models for the field of weather forecasting: they can handle large data, handle inconsistencies, and yield higher accuracies. RF & LR are among the leading approaches in the domain of weather forecasting.

FUTURE WORK:
In future research, we intend to incorporate different ensemble techniques to combine the diversity of the models and increase the forecasting ability. We plan to take data from different regions to increase the diversity of the dataset and check which model performs well with such noisy data. The architecture of the network model will be examined further to enhance the accuracy of predictions. We intend to extend our understanding of neural networks by using different neural network models such as the recurrent neural network (LSTM) and the time-delay neural network (TDNN).
The accuracy of a probabilistic model like Naive Bayes will also be examined; in order to do so, we first need to perform discretization.

5.2 REFERENCES
1. Manojit Chattopadhyay and Surajit Chattopadhyay, "Elucidating the role of topological pattern discovery and support vector machine in generating predictive models for Indian summer monsoon rainfall", Theoretical and Applied Climatology, pp. 1-12, July 2015.
2. Kumar Abhishek, Abhay Kumar, Rajeev Ranjan and Sarthak Kumar, "A Rainfall Prediction Model using Artificial Neural Network", 2012 IEEE Control and System Graduate Research Colloquium (ICSGRC 2012), pp. 82-87, 2012.
3. Minghui Qiu, Peilin Zhao, Ke Zhang, Jun Huang, Xing Shi, Xiaoguang Wang, et al., "A Short-Term Rainfall Prediction Model using Multi-Task Convolutional Neural Networks", IEEE International Conference on Data Mining, pp. 395-400, 2017.
4. S. Aswin, P. Geetha and R. Vinayakumar, "Deep Learning Models for the Prediction of Rainfall", International Conference on Communication and Signal Processing, pp. 0657-0661, April 3-5, 2018.
5. Xianggen Gan, Lihong Chen, Dongbao Yang and Guang Liu, "The Research of Rainfall Prediction Models Based on Matlab Neural Network", Proceedings of IEEE CCIS 2011, pp. 45-48, 2011.
6. Sam Cramer, Michael Kampouridis, Alex A. Freitas and Antonis Alexandridis, "Predicting Rainfall in the Context of Rainfall Derivatives Using Genetic Programming", 2015 IEEE Symposium Series on Computational Intelligence, pp. 711-718, 2015.
7. Mohini P. Darji, Vipul K. Dabhi and Harshadkumar B. Prajapati, "Rainfall Forecasting Using Neural Network: A Survey", 2015 International Conference on Advances in Computer Engineering and Applications (ICACEA), pp. 706-713, 2015.
8. Sandeep Kumar Mohapatra, Anamika Upadhyay and Channabasava Gola, "Rainfall Prediction based on 100 Years of Meteorological Data", 2017 International Conference on Computing and Communication Technologies for Smart Nation, pp. 162-166, 2017.
9. Sankhadeep Chatterjee, Bimal Datta, Soumya Sen and Nilanjan Dey, "Rainfall Prediction using Hybrid Neural Network Approach", 2018 2nd International Conference on Recent Advances in Signal Processing, Telecommunications & Computing (SigTelCom), pp. 67-72, 2018.
10. Sunil Navadia, Pintukumar Yadav, Jobin Thomas and Shakila Shaikh, "Weather Prediction: A Novel Approach for Measuring and Analyzing Weather Data", International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC 2017), pp. 414-417, 2017.