Data Visualization in RStudio

Data visualization in data science refers to the graphical representation of data. It makes data easier to understand and helps reveal meaningful insights: a plot gives a broad overview of the data, exposes patterns, and makes outliers easier to spot. Data scientists can use various software to present data visually, for example RStudio. RStudio is an integrated development environment (IDE) for R, a programming language for statistical computing and graphics that is widely used by statisticians and data miners. Plotting in RStudio is fairly simple. Common data visualization methods include scatterplots, boxplots, histograms, violin plots, and heat maps.

Plotting in RStudio

A commonly used plotting function in R is ggplot(), which belongs to the ggplot2 package. To get access to it, first load the package with library(ggplot2); if the package is not yet available, install it once with install.packages("ggplot2"). A call to ggplot() names the dataset and, inside aes(), maps variables to the x-axis and y-axis. A geom layer added with + then determines the type of plot, for example a scatterplot, boxplot, or heat map.

# Load the ggplot2 package for data visualization
> library(ggplot2)

# General form of a ggplot2 command
# (geom_plotname() is a placeholder for a real geom such as geom_point())
> ggplot(data = dataset, aes(x = …, y = …)) + geom_plotname()

# Add a title to the graph
> … + ggtitle("Title")
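As a concrete instance of the general form above, the following is a minimal sketch that uses the mtcars dataset bundled with R instead of one of the datasets discussed in this article. The variable names base, p_box, and p_point and the chosen columns are illustrative assumptions only. It shows that the dataset and axis mappings can be declared once, while the geom layer decides the plot type.

```r
# Load ggplot2 (install it once with install.packages("ggplot2") if needed)
library(ggplot2)

# Data and axis mappings are declared once; mtcars ships with R
base <- ggplot(data = mtcars, aes(x = factor(cyl), y = mpg))

# The geom layer decides the plot type; ggtitle() adds a title
p_box   <- base + geom_boxplot() + ggtitle("Boxplot of mpg by cylinders")
p_point <- base + geom_point()   + ggtitle("Points of mpg by cylinders")

# Printing a plot object draws it (in RStudio, in the Plots pane)
print(p_box)
print(p_point)
```

Swapping geom_boxplot() for geom_point() changes the plot type without touching the data or the aesthetic mappings.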

Scatterplot

A scatterplot provides an overview of how data is distributed by displaying the relationship between two continuous variables. In ggplot2, a scatterplot is drawn with the geom_point() layer. The figure below is an example of a scatterplot made in RStudio. The dataset 'Mall Customers' is available on Kaggle. In the example, the scatterplot shows the relationship between income and spending score. The plot indicates that a higher income does not necessarily imply a higher spending score. Additionally, the color of each dot encodes the age of the customer.

# Make a scatterplot of the mall customer data to explore the relationship between spending score and income
> ggplot(Mall_Customers, aes(x = Income, y = Spending_Score, color = Age)) + geom_point()
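Since the Mall Customers file does not ship with R, the same idea can be tried on the built-in mtcars data. This is only a sketch under that assumption: wt, mpg, and hp merely stand in for income, spending score, and age, and the object name scatter is hypothetical.

```r
library(ggplot2)

# Mapping a third continuous variable to color produces a gradient legend
scatter <- ggplot(mtcars, aes(x = wt, y = mpg, color = hp)) +
  geom_point()
print(scatter)

# A correlation coefficient puts a number on the trend visible in the plot
r <- cor(mtcars$wt, mtcars$mpg)
r
```

A coefficient close to zero would correspond to the kind of weak relationship described for income and spending score; for this stand-in data the relationship is clearly negative.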

Boxplot

Another plot that gives an overview of how data is distributed is the boxplot. A boxplot summarizes the data with a minimum, first quartile, median, third quartile, and maximum. The plot shows how tightly the data is grouped and whether it is skewed. The median is the middle value of the dataset and does not necessarily equal the mean. In the figure below, outliers are depicted as individual points; by default, geom_boxplot() draws a point as an outlier when it lies more than 1.5 times the interquartile range beyond the box. From the plot, you can also read off the median purchase amount for each gender. In general, a boxplot is convenient for comparing summary statistics across groups.

# Draw a simple boxplot of purchases by gender
> ggplot(BlackFriday, aes(x = Gender, y = Purchases)) + geom_boxplot()
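The summary statistics behind a boxplot can also be computed directly. The sketch below uses a small, made-up vector of purchase amounts (not taken from the BlackFriday data) and applies the 1.5 × IQR rule that geom_boxplot() uses by default to flag outliers.

```r
# Hypothetical purchase amounts; not taken from the BlackFriday data
purchases <- c(2, 4, 5, 6, 7, 8, 9, 10, 12, 40)

# First quartile, median, third quartile, and interquartile range
q <- quantile(purchases, c(0.25, 0.5, 0.75))
iqr <- q[3] - q[1]

# Values beyond 1.5 * IQR from the box are treated as outliers
lower <- q[1] - 1.5 * iqr
upper <- q[3] + 1.5 * iqr
outliers <- purchases[purchases < lower | purchases > upper]

median(purchases)  # middle value of the data
mean(purchases)    # pulled upward by the extreme value
outliers           # the extreme value flagged by the rule
```

Comparing the median with the mean illustrates why the two need not coincide: a single extreme value shifts the mean but leaves the median almost unchanged.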

Heat Map

A data plot can also look like an image. A heat map represents data values as colors. The data behind the graph below was retrieved from Kaggle and covers the Australian weather from November 2007 to June 2017. The heat map below only contains data for the year 2017 and only for well-known places in Australia; observations from other locations were removed from the dataset. The heat map displays the temperature at 3 p.m. for each location over time.

# View the Australian weather dataset
> View(Australian_Weather)

# Display the first ten entries
> head(Australian_Weather, n = 10)

# Load the dplyr package for data manipulation (used to filter by year and city)
> library(dplyr)

# Filter the dataset by year (assumes Date is a Date or a yyyy-mm-dd string)
> Australian_Weather_2017 <- Australian_Weather %>% filter(format(as.Date(Date), "%Y") == "2017")

# Filter the dataset by popular cities
> Australian_Weather_Main_Cities <- Australian_Weather_2017 %>% filter(city %in% c("Adelaide", "AliceSprings", …))

# Run the code for the heat map
> ggplot(Australian_Weather_Main_Cities, aes(x = time, y = city, fill = Temperature)) + geom_tile()
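The filter-then-plot steps above can be tried end to end on a toy data frame. Everything in this sketch is an assumption made for illustration: the four rows are invented, the column names only mirror the ones used above, and the Date column is placed on the x-axis in place of the time column.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data; the values are invented for illustration
weather <- data.frame(
  Date = as.Date(c("2016-12-31", "2017-01-01", "2017-01-01", "2017-01-02")),
  city = c("Adelaide", "Adelaide", "Darwin", "Perth"),
  Temperature = c(25.1, 27.3, 31.5, 29.0)
)

# Keep only the 2017 rows for the selected cities
selected <- weather %>%
  filter(format(Date, "%Y") == "2017",
         city %in% c("Adelaide", "Darwin"))

# One tile per date and city, filled according to temperature
heat <- ggplot(selected, aes(x = Date, y = city, fill = Temperature)) +
  geom_tile()
print(heat)
```

Using %in% with a vector of names keeps the city filter readable as the list of cities grows.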

Violin Plot

A violin plot is similar to a boxplot. As the figure below on the Australian weather data illustrates, it summarizes the same kind of grouped data, but in contrast to a boxplot it also shows the estimated probability density. The curved outline traces the shape of the distribution. A wider section, such as the one for the main city 'Darwin' in the figure below, indicates that observations are concentrated there, in this case at a temperature of approximately 31.5 degrees. Like a boxplot, a violin plot can also mark the median and the lower and upper adjacent values.

# Run the code for a violin plot
> ggplot(Australian_Weather_Main_Cities, aes(x = Main_Cities, y = PM_Temperature)) + geom_violin(alpha = 0.1, trim = FALSE)

Figure: Violin plot of the Australian weather dataset (source: Kaggle).
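To see how the width of a violin relates to the density of the data, the sketch below uses the iris dataset bundled with R as a stand-in, with Species and Sepal.Length playing the roles of city and temperature. The overlaid narrow boxplot is an optional addition, not part of the command above, that makes the median and quartiles visible.

```r
library(ggplot2)

# iris ships with R; it stands in for the weather data here
violin <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_violin(alpha = 0.1, trim = FALSE) +
  geom_boxplot(width = 0.1)  # narrow boxplot shows median and quartiles
print(violin)

# The violin outline follows a kernel density estimate;
# the location of its peak corresponds to the widest section
d <- density(iris$Sepal.Length[iris$Species == "setosa"])
peak <- d$x[which.max(d$y)]
peak
```

The value of peak is the sepal length at which the setosa violin is widest, which is the analogue of the temperature read off for Darwin.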

Histogram

Histograms show the distribution of numerical data. A histogram differs from a bar graph: a histogram groups a numerical variable into bins, whereas a bar graph displays categorical data. The plot below shows how often purchase amounts fall into each bin, split by gender. The purchase amounts are probably expressed in dollar cents.

# Run the code for a histogram with bins of width 2000 from 0 to 20000
> ggplot(BlackFriday, aes(Purchases, fill = Gender, col = I("white"))) + geom_histogram(breaks = seq(0, 20000, by = 2000))
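The effect of the breaks argument can be checked by counting the bins by hand. The sketch below generates random purchase amounts as a hypothetical stand-in for the BlackFriday data; the data frame and its values are assumptions for illustration only.

```r
library(ggplot2)

# Hypothetical purchase amounts; not the BlackFriday data
set.seed(1)
purchases <- data.frame(
  Purchases = runif(200, min = 0, max = 20000),
  Gender = rep(c("F", "M"), each = 100)
)

# Bin edges at 0, 2000, ..., 20000 define 10 bins
breaks <- seq(0, 20000, by = 2000)
counts <- table(cut(purchases$Purchases, breaks = breaks, include.lowest = TRUE))
counts

# The same breaks drive the histogram; white outlines separate the bars
hist_plot <- ggplot(purchases, aes(x = Purchases, fill = Gender)) +
  geom_histogram(breaks = breaks, color = "white")
print(hist_plot)
```

Eleven bin edges produce ten bins, and the counts across all bins add up to the number of observations.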