My Kaggle Projects

Weather in Australia: How is the weather tomorrow?

Recently, I found a nice dataset on Kaggle. It contains weather data for Australian cities and villages. At each location, weather conditions were recorded daily from 2008 to 2017. Generally, Australia has a warm and sunny climate. The country is home to some of the world’s driest regions, some regions have periods of heavy rainfall, and others have a mild and sunny Mediterranean-style climate. On Kaggle, I published a notebook on this dataset. I select the most popular cities in Australia and start with an exploratory data analysis. Then, I manually add geospatial data to the dataset and plot the average monthly rainfall on a map using the ‘Basemap’ Python package. Lastly, I focus on the weather in Sydney and Melbourne. I build a logistic regression model to predict whether it will rain tomorrow, and I forecast the average rainfall for April 2021 using time series analysis.

Exploring the Data

In my Jupyter Notebook analysis, I start with exploring the data. In each data analysis, it is important to answer some questions first. What does the dataset look like? How can you describe the data? What are the data types? What are the variables or columns included in the dataset? So, let’s import the data into the notebook, and let’s take a look at the first 5 rows of the data. The dataset contains daily weather conditions, ranging from 2008 to 2017. Each row describes one location on one date. For example, what were the weather conditions in Albury on a specific date in the past? Aside from date and location, each column in the dataset represents a weather condition. Many factors contribute to the weather. The recorded conditions include the minimum temperature, the maximum temperature, rainfall, evaporation, sunshine during the day, wind speed, humidity, pressure, and cloudiness (the last four measured at 9 a.m. and 3 p.m.), and the temperature measured at 9 a.m. and 3 p.m. on a specific date.

Correlation of Variables

Looking at the correlation of variables is an essential step in data analysis and building a model. So, let’s make a correlation diagram. A correlation diagram shows how variables are correlated. The correlation between two variables takes a value between -1 and 1. Now, let’s have a look at the correlation between certain weather conditions.

Minimum and maximum temperatures are highly and positively correlated. That sounds logical: days with a higher maximum temperature tend to have a higher minimum temperature too. The same holds for the temperatures measured at 9 a.m. and 3 p.m., which have a strong positive correlation. Evaporation and temperature also have a reasonably strong positive correlation. It is like steam coming from a hot cup of coffee: hot coffee evaporates more quickly than cold coffee. In addition, sunshine and cloudiness are strongly negatively correlated. Other variables have weaker positive or negative correlations.

A situation in which 2 or more explanatory (independent) variables are highly correlated could be problematic in multiple regression. It gives a distorted picture of the statistical significance of an independent variable. The term referring to this problem is ‘multicollinearity’.
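As a rough illustration of the correlation check described above, the sketch below builds a correlation matrix with pandas and lists the pairs of variables that are strongly correlated. The data is synthetic, and the column names and the 0.8 threshold are assumptions made for the example, not values taken from the notebook.

```python
import numpy as np
import pandas as pd

# Small synthetic sample; the column names are assumed, not taken from the notebook.
rng = np.random.default_rng(0)
min_temp = rng.normal(12, 5, 200)
weather = pd.DataFrame({
    "MinTemp": min_temp,
    "MaxTemp": min_temp + rng.normal(10, 2, 200),  # tracks MinTemp closely
    "Humidity": rng.normal(60, 15, 200),           # unrelated to temperature
})

# Pairwise Pearson correlations, each between -1 and 1.
corr = weather.corr()

# Keep each pair once (upper triangle) and flag the strongly correlated ones.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high_pairs = pairs[pairs.abs() > 0.8]
print(high_pairs)
```

Pairs that show up in such a list are candidates for multicollinearity, so it can make sense to keep only one variable of each pair in a regression model.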

Australian Weather Conditions: Correlation Diagram

Dispersion of Average Rainfall

A way to explore the data is to look at the dispersion of the data. A boxplot gives a good graphical representation of the dispersion. You can detect outliers from a boxplot. Also, a boxplot displays the median value, the first quartile, the third quartile, and whiskers that reach to the smallest and largest values that are not classed as outliers. In the dispersion of average rainfall in Australian cities, there are some outliers in Summer. In Summer, the tropical north experiences periods of heavy rainfall.
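The numbers behind a boxplot can also be calculated directly. The sketch below uses made-up rainfall values and the common 1.5 × IQR rule, which is also Matplotlib’s default for drawing whiskers; whether the notebook changes that default is not stated.

```python
import pandas as pd

# Hypothetical average rainfall values (mm) for a handful of cities in one season.
rainfall = pd.Series([1.2, 1.5, 1.8, 2.0, 2.1, 2.4, 2.6, 3.0, 9.5])

q1 = rainfall.quantile(0.25)
median = rainfall.median()
q3 = rainfall.quantile(0.75)
iqr = q3 - q1  # interquartile range: the height of the box

# Points further than 1.5 * IQR from the box are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = rainfall[(rainfall < lower_fence) | (rainfall > upper_fence)]
print(q1, median, q3, outliers.tolist())
```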

Geospatial Data Visualization

As part of the exploratory data analysis, I look at the unique locations in the dataset. I include the most popular cities and villages in this analysis. Next, I manually add map coordinates to the data frame, using a for-loop. For each row, I append the latitude and longitude. Now, the dataset contains geodata for geospatial data visualization.

In addition, the column ‘Dates’ has to be converted to a ‘DateTime’ data type. Afterwards, I retrieve the day, month, and year of each row. I determine the average rainfall per month in each city, and I store the data in a new data frame. Now, we have a data frame that is ready to create a bunch of nice maps.
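The preparation steps above could look roughly like the sketch below. It uses a dictionary lookup instead of an explicit for-loop to attach the coordinates, and the tiny data frame, the column names, and the rounded coordinates are assumptions for illustration only.

```python
import pandas as pd

# Tiny stand-in for the weather data; column names and values are assumptions.
weather = pd.DataFrame({
    "Date": ["2016-01-01", "2016-01-02", "2016-02-01", "2016-01-01", "2016-02-01"],
    "Location": ["Sydney", "Sydney", "Sydney", "Melbourne", "Melbourne"],
    "Rainfall": [4.0, 2.0, 6.0, 1.0, 3.0],
})

# Approximate city coordinates (latitude, longitude), typed in by hand.
coordinates = {"Sydney": (-33.87, 151.21), "Melbourne": (-37.81, 144.96)}

# Add latitude and longitude to every row based on its location.
weather["Latitude"] = weather["Location"].map(lambda city: coordinates[city][0])
weather["Longitude"] = weather["Location"].map(lambda city: coordinates[city][1])

# Convert the date strings to datetime and pull out the calendar parts.
weather["Date"] = pd.to_datetime(weather["Date"])
weather["Day"] = weather["Date"].dt.day
weather["Month"] = weather["Date"].dt.month
weather["Year"] = weather["Date"].dt.year

# Average rainfall per city and month, keeping the coordinates for plotting.
monthly_rainfall = (
    weather.groupby(["Location", "Latitude", "Longitude", "Month"], as_index=False)["Rainfall"]
    .mean()
)
print(monthly_rainfall)
```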

Map Plots with Basemap

Basemap is a Python package for creating map plots. It is an extension of the Matplotlib plotting package. To define the map region, I pass 4 coordinate values: the latitude and longitude of the lower-left corner and of the upper-right corner of the map. I display the average rainfall in Australian cities and villages per month and season. I set the background and the fill color of the map plot to grey. Since the calendar year starts in Summer in the southern hemisphere, I start with plotting the average rainfall in Summer.
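Grouping months into seasons for such plots can be done with a simple lookup. The sketch below assumes the meteorological definition of the southern-hemisphere seasons (December to February as Summer); the notebook may group the months differently, and the function name is made up for this example.

```python
# Meteorological seasons in the southern hemisphere, keyed by month number.
SEASONS = {
    12: "Summer", 1: "Summer", 2: "Summer",
    3: "Autumn", 4: "Autumn", 5: "Autumn",
    6: "Winter", 7: "Winter", 8: "Winter",
    9: "Spring", 10: "Spring", 11: "Spring",
}


def season_of(month: int) -> str:
    """Return the southern-hemisphere season for a month number (1-12)."""
    return SEASONS[month]


print(season_of(1), season_of(7))
```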

During the Summer months, rainfall in Australia is highest in the tropical north. The south has the least amount of average rainfall in Summer. Cities in the tropical north, such as Cairns, Darwin, Katherine, and Townsville, experience lots of heavy rainfall during the wet season. The wet season is between October/November and April. As the legend indicates, the average rainfall in this period reaches up to 18 millimeters. The darker the dot in the map, the higher the average rainfall in a city or village.

After March, the amount of rainfall decreases in the tropical north. The dry season is about to begin. In theory, the tropical north has only 2 seasons, the wet season and the dry season.

It gets wetter in the south during the winter. Cities such as Adelaide, Sydney, and Melbourne have their largest amount of rainfall in June. The wettest month in Perth (in the south-west of Australia) is July. Adelaide is the driest Australian state capital.

Generally, Alice Springs has low amounts of rainfall. The climate in Alice Springs is considered to be arid for most of the year. In October/November, the build-up for the wet season starts in the tropical north again.

Plotting with Folium

There are many ways to visualize geodata in Python. Previously, I created map plots of the rainfall in each month using Basemap. You can also use the Folium or Plotly package for geospatial data visualization. Let’s create a bubble chart of the average rainfall in March, using Folium.

import folium

# latitude and longitude hold the map center and must be defined beforehand
map_australia = folium.Map(location=[latitude, longitude], zoom_start=5)

# Draw one circle per location; the radius scales with the average rainfall
for _, row in rainfall_march.iterrows():
    folium.CircleMarker(
        location=[row['Latitude'], row['Longitude']],
        popup=row['Location'],
        radius=float(row['Rainfall']) * 5,
        color='aqua',
        fill=True,
        fill_color='#00FFFF',
        fill_opacity=0.7,
        parse_html=False).add_to(map_australia)

# Display the map in the notebook
map_australia

Folium Map: Average Rainfall in Australia during March

Sydney and Melbourne: Average Weather Conditions

After geospatial data visualization, I focus on the weather conditions in 2 popular cities in Australia. Sydney and Melbourne both have a mild climate, and these cities have a 4-season climate. So, let’s have a look at the average rainfall, and minimum and maximum temperature each month. Recall that the average weather conditions are based on data ranging from 2008 to 2017.

Logistic Regression model

Let’s predict whether it will rain tomorrow in Australia. The data has been labeled beforehand: the column ‘Rain Tomorrow’ records whether it rained on the following day. Because this target has only two outcomes, yes or no, the problem is well suited to logistic regression. Will it rain in Sydney and Melbourne tomorrow?

Data Preprocessing

Data cleaning and data preprocessing are required to build a logistic regression model. I start with listing the column names, which makes it easy to select the variables for the model. The dependent variable is ‘Rain Tomorrow’. The independent variables include weather conditions, such as temperature, rainfall, humidity, pressure, evaporation, sunshine, cloudiness, and wind speed. These are called ‘features’, and I assign the features to variable x.

Make sure that all missing values are handled. I replace all missing values for weather conditions (independent variables) with the average value. I make 2 separate data frames, one with the data for Sydney and one with the data for Melbourne. The Melbourne data frame has lots of missing values for ‘Rain Tomorrow’, so I drop those rows. Also, I rescale the features so that they are on a comparable scale, known as ‘normalizing the data’. Rescaling does not make the data normally distributed, and logistic regression does not require normally distributed features, but the regularized solver works better when no feature dominates because of its units. Logistic regression is a well-known Machine Learning technique for classification.

Lastly, I split the data into a train and test set. Splitting up the dataset into a train and test set is an essential step to evaluate the performance of the Machine Learning algorithm, or in this case the logistic regression model. I split the data into an 80% train set and a 20% test set.
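The preprocessing steps could be sketched as follows with scikit-learn. The data is synthetic, the column names and the random seed are assumptions, and fitting the scaler on the training rows only is a choice made for this example; the notebook may order these steps differently.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the column names are assumptions.
rng = np.random.default_rng(42)
city = pd.DataFrame({
    "Humidity3pm": rng.normal(55, 15, 100),
    "Pressure3pm": rng.normal(1015, 7, 100),
    "RainTomorrow": rng.choice(["No", "Yes"], size=100, p=[0.75, 0.25]),
})
city.loc[::10, "Humidity3pm"] = np.nan  # some missing feature values
city.loc[5, "RainTomorrow"] = np.nan    # one missing label

features = ["Humidity3pm", "Pressure3pm"]

# Drop rows without a label, then fill missing features with the column mean.
city = city.dropna(subset=["RainTomorrow"])
x = city[features].fillna(city[features].mean())
y = (city["RainTomorrow"] == "Yes").astype(int)

# 80% train, 20% test; the scaler is fitted on the training rows only.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=4)
scaler = StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)
print(x_train.shape, x_test.shape)
```

Fitting the scaler on the training set alone keeps information from the test set out of the model, which gives a more honest evaluation.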

The Model

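I have built a logistic regression model with the ‘liblinear’ solver. ‘liblinear’ was the default solver in scikit-learn before version 0.22; newer versions default to ‘lbfgs’, so I set the solver explicitly. The scikit-learn documentation describes ‘liblinear’ as a good choice for small datasets, which fits the single-city data frames used here. The parameter C=0.01 sets the inverse regularization strength, so a small value means strong regularization.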

from sklearn.linear_model import LogisticRegression

LR_model = LogisticRegression(C=0.01, solver='liblinear').fit(x_train, y_train)
ypred_prob = LR_model.predict_proba(x_test)

There are multiple evaluation metrics to determine the performance of the model. For both the Sydney and Melbourne model, I calculated the log loss and the F-score. The log loss is 0 for perfect predicted probabilities and grows as the predictions get worse, so lower is better; an F-score close to 1 indicates that the model is more accurate. Also, a confusion matrix is a good tool to evaluate the model. Each predicted Y value is compared to the actual value, and the matrix shows how often and in which direction the predictions are wrong. In statistical terms, false positives (rain predicted, but it stayed dry) are named type 1 errors and false negatives (rain missed) are named type 2 errors.

Metric	Sydney	Melbourne
Log Loss	0.405	0.439
F-score	0.820	0.794
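For reference, metrics like the ones reported above can be computed with scikit-learn as in the sketch below. The labels and probabilities are made up, the 0.5 threshold is the usual default, and the weighted F-score is an assumption, since the post does not state which averaging the notebook uses.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, log_loss

# Hypothetical labels (1 = rain tomorrow) and predicted probabilities of rain.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])
prob_rain = np.array([0.1, 0.4, 0.8, 0.3, 0.2, 0.9, 0.6, 0.1])
y_pred = (prob_rain >= 0.5).astype(int)

# Log loss penalises confident wrong probabilities; 0 is a perfect score.
loss = log_loss(y_true, prob_rain)
print("log loss:", loss)

# Weighted F-score: the per-class F1 averaged by class frequency.
f_score = f1_score(y_true, y_pred, average="weighted")
print("F-score:", f_score)

# Rows are actual classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("false positives (type 1):", fp, "false negatives (type 2):", fn)
```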

The model to predict the weather in Sydney is more accurate than the model to predict the weather in Melbourne. The Sydney model has a lower log loss, a higher F-score, and relatively fewer type 1 and type 2 errors. A shortcoming of the logistic regression model above is that the model does not consider time. Therefore, we need time series analysis.

Time Series Analysis

Time series analysis enables us to make future predictions. Let’s build a time series model and forecast the rainfall. I select the date and rainfall as variables for the model. First, it is a crucial step to compare the date ranges. Time series analysis requires a regular frequency in the data: every period has to be present. In the Sydney and Melbourne data frames, I count the number of dates on which rainfall has been recorded. Then, I compare the number of records to the number of days between the minimum and maximum date. Some dates have missing records. I create a new index containing the dates of all periods, and I fill the missing values with the average amount of rainfall. Now, let’s downsample the daily data to monthly averages to make predictions per month.
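A minimal sketch of these steps with pandas is shown below. The dates and rainfall values are made up, and the variable names are chosen for this example only.

```python
import pandas as pd

# Daily rainfall with two missing days (values are made up).
dates = pd.to_datetime(["2017-01-30", "2017-01-31", "2017-02-02", "2017-02-04"])
rain = pd.Series([2.0, 4.0, 0.0, 10.0], index=dates, name="Rainfall")

# Compare the number of records with the number of days in the covered period.
expected_days = (rain.index.max() - rain.index.min()).days + 1
print(len(rain), "records for", expected_days, "days")

# Rebuild the index with every day, then fill the gaps with the mean rainfall.
full_index = pd.date_range(rain.index.min(), rain.index.max(), freq="D")
rain_daily = rain.reindex(full_index).fillna(rain.mean())

# Downsample from daily to monthly averages ("MS" labels each month by its first day).
rain_monthly = rain_daily.resample("MS").mean()
print(rain_monthly)
```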

The Time Series Model

A time series model should meet the condition of being stationary. I test both the Sydney and Melbourne rainfall data for stationarity with the Dickey-Fuller test, a common statistical test for this purpose. For both series, I reject the null hypothesis of nonstationarity, so both are treated as stationary. Also, I check for autocorrelation and partial autocorrelation. Autocorrelation, also known as serial correlation, is the correlation of observations with observations from previous time steps. An autocorrelation plot tells us how present values are correlated with past values in a series of data. A partial autocorrelation plot shows the correlation between present values and values at a given lag after the influence of the shorter lags in between has been removed.
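To make the difference between autocorrelation and partial autocorrelation concrete, the sketch below computes both by hand for a synthetic series in which each value depends only on the previous one. The series, the coefficient 0.7, and the variable names are assumptions for illustration; the notebook uses plots rather than this manual calculation.

```python
import numpy as np
import pandas as pd

# Synthetic series in which each value depends on the previous one (an AR(1) process).
rng = np.random.default_rng(1)
values = [0.0]
for _ in range(299):
    values.append(0.7 * values[-1] + rng.normal())
series = pd.Series(values)

# Autocorrelation at lag k: correlation between the series and itself shifted by k steps.
acf_1 = series.autocorr(lag=1)
acf_2 = series.autocorr(lag=2)

# Partial autocorrelation at lag 2: what is left of the lag-2 correlation
# once the lag-1 relationship has been accounted for.
pacf_2 = (acf_2 - acf_1 ** 2) / (1 - acf_1 ** 2)
print(round(acf_1, 2), round(acf_2, 2), round(pacf_2, 2))
```

For a series like this one, the autocorrelation fades slowly over the lags while the partial autocorrelation drops to around zero after lag 1, which is the pattern that points to a lag length of 1.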

Now, let’s create the ARIMA model. ARIMA is short for Auto-Regressive Integrated Moving Average. The model enables us to understand the data and make a future prediction. In the model, I use a lag length of 1, which the autocorrelation plot suggests as the optimal value. The order (1,1,1) stands for 1 autoregressive lag, first-order differencing, and 1 moving-average term. After building a time series model, we can make our predictions.

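from statsmodels.tsa.arima.model import ARIMA

# order=(p, d, q): autoregressive lags, degree of differencing, moving-average terms
model = ARIMA(sydneyweather_monthly['Rainfall'], order=(1, 1, 1))
model_fit = model.fit()
model_fit.summary()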

Forecasting

I start making predictions after fitting the model. I include all forecasted values until 2022. Then, I create a line plot including the current values and forecasted values.

The forecasted average precipitation for April 2021 is 5.8 mm in Sydney. According to the ‘real life’ weather forecast, the new month starts partly cloudy, but occasional rain showers are expected from Easter Monday. Melbourne is forecast to have approximately 2 mm of rainfall in April 2021. At this time of the year, Melbourne is a better choice than Sydney if you want to avoid rainfall. According to the ‘real life’ weather forecast, the weather will be nice! It will be sunny and partly cloudy in the first weeks of April!

More on this Analysis

Hope you enjoyed this analysis! Would you like to find out more about this analysis? I made a Notebook contribution at Kaggle, and I shared this Jupyter Notebook on GitHub. Please find the links to this notebook contribution here:

GitHub : My Kaggle Projects – Rain in Australia

Kaggle : Rain in Australia – Notebook