Word Clouds with Python: Big Bang Theory TV Series Dataset

Recently, I found a great dataset on Kaggle containing the transcripts of The Big Bang Theory TV series. The Big Bang Theory is a very popular sitcom with my favorite nerds! The show premiered in 2007 and concluded in 2019. On Kaggle, I submitted a notebook that visualizes the data in Word Clouds. A Word Cloud is a data visualization technique for representing text data. The size of each word in a Word Cloud indicates its frequency or importance in the text. Generating Word Clouds with Python is a handy way to explore text data.

Introduction to Word Clouds

Let’s explore The Big Bang Theory TV Series transcript by visualizing text data in Word Clouds. First of all, I import the dataset into my notebook, along with the packages required to generate a Word Cloud: Pandas, NumPy, and Matplotlib.

import numpy as np                # linear algebra
import pandas as pd               # data processing, CSV file 
import matplotlib.pyplot as plt   # data visualization

# Continue with loading all necessary libraries
from os import path
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

Aside from the Matplotlib package to plot the figure, you need the wordcloud package. It provides the WordCloud class and a built-in set of English stopwords, STOPWORDS.
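The loading step itself is not shown above. Here is a minimal sketch, assuming the transcript is a CSV file with the person_scene and dialogue columns used later in this post. The sample rows are made up so that the snippet runs on its own; in the notebook you would pass the path of the Kaggle file to pd.read_csv instead.

```python
import io
import pandas as pd

# Made-up sample rows standing in for the Kaggle CSV file (assumption:
# the real file has at least the columns person_scene and dialogue).
sample_csv = io.StringIO(
    "person_scene,dialogue\n"
    "Sheldon,That is my spot.\n"
    "Leonard,Here we go again.\n"
    "Scene,The apartment.\n"
)

# In the notebook, replace sample_csv with the path to the dataset file.
df = pd.read_csv(sample_csv)
print(df.shape)
print(df.columns.tolist())
```

Checking the shape and the column names right after loading is a cheap way to confirm that the file was read the way you expect.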

Word Clouds and stopwords

Stopwords are very common words, such as prepositions, conjunctions, and articles, that carry little meaning on their own. Therefore, I exclude them from the analysis. The code below builds the stopword set. Put any additional words you want to exclude from the Word Cloud inside the square brackets.

# Create stopword list:
stopwords = set(STOPWORDS)
stopwords.update([])
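To see what excluding stopwords does to the word counts, here is a small sketch in plain Python. It is only an illustration: the tiny stopword set and the sample sentence are made up, and the notebook itself relies on the STOPWORDS set from the wordcloud package.

```python
import re
from collections import Counter

# Hypothetical mini stopword set, for illustration only.
demo_stopwords = {"the", "a", "is", "to", "and", "my", "in", "that"}

def count_words(text, stopwords):
    """Count lowercase words in text, skipping any word in stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stopwords)

# Made-up sample sentence.
line = "That is my spot. The spot is in a state of eternal dibs."
counts = count_words(line, demo_stopwords)
print(counts.most_common(3))
```

Without the filter, words such as "is" would be counted as often as "spot" and would take up space in the Word Cloud without telling you anything about the character.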

Creating Word Clouds

First, I create lists of the characters I want to include in the Word Clouds: the main characters and some other characters. Then, I create a new dataset containing only the rows for these names.

#list the main characters
persons = ['Sheldon', 'Leonard', 'Raj', 'Penny','Howard','Amy','Bernadette']

#other characters
others = ['Ramona','Beverley']

# keep only the rows spoken by the listed characters
data = df[df.person_scene.isin(persons + others)]
data.head(5)

Now, let’s create the Word Cloud. First, I join all of Sheldon’s lines into one string. I pass the stopword set I created previously so that no stopwords appear in the Word Cloud. I set the background color to white and limit the image to at most 60 words.

# Join all of Sheldon's lines into one string
sheldon = " ".join(dialogue for dialogue in data[data["person_scene"]=="Sheldon"].dialogue)

# Generate a word cloud image
wordcloud = WordCloud(stopwords=stopwords, max_words=60, background_color="white").generate(sheldon)

# Display the generated image:
# the matplotlib way:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
Word Cloud – Vocabulary used by Sheldon in all episodes

Now, let’s take a look at the words Leonard uses across all episodes. To do this, build the text from Leonard’s dialogue, as in the code below.

leonard = " ".join(dialogue for dialogue in data[data["person_scene"]=="Leonard"].dialogue)

# Generate a word cloud image
wordcloud = WordCloud(stopwords=stopwords, max_words=60, background_color="white").generate(leonard)

# Display the generated image:
# the matplotlib way:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
Word Cloud – Vocabulary used by Leonard in all episodes
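Since the same steps repeat for every character, the text-building part can be wrapped in a small helper. This is a sketch and not part of the original notebook: the function name character_text is hypothetical, and the sample rows are made up. It also drops missing dialogue values, because joining would fail on them.

```python
import pandas as pd

def character_text(data, person):
    """Join every line spoken by `person` into one long string."""
    lines = data.loc[data["person_scene"] == person, "dialogue"]
    # Drop missing values and cast to str so that join() never sees a NaN.
    return " ".join(lines.dropna().astype(str))

# Made-up rows for illustration only.
demo = pd.DataFrame({
    "person_scene": ["Sheldon", "Leonard", "Sheldon", "Leonard"],
    "dialogue": ["Bazinga!", "Oh boy.", None, "Here we go."],
})

print(character_text(demo, "Leonard"))   # Oh boy. Here we go.
```

The returned string can be passed to WordCloud(...).generate() in the same way as the sheldon and leonard strings. Note that the helper returns an empty string for a name that does not occur in the data, and a Word Cloud cannot be generated from empty text.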

Word Clouds by episode

Across all episodes, the most frequent words are common everyday words. Now, let’s look at the words the main characters use most frequently in individual episodes.
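To build a Word Cloud for a single episode, the rows have to be filtered by episode as well as by character. The sketch below shows one way to do that. It assumes a column named episode_name that holds the episode title; that column name, its format, the helper name episode_text, and the sample rows are all assumptions for illustration, so check them against the actual dataset.

```python
import pandas as pd

# Made-up rows; the column name episode_name and its format are assumptions.
demo = pd.DataFrame({
    "episode_name": [
        "Series 02 Episode 06 - The Cooper-Nowitzki Theorem",
        "Series 02 Episode 06 - The Cooper-Nowitzki Theorem",
        "Series 02 Episode 15 - The Maternal Capacitance",
    ],
    "person_scene": ["Sheldon", "Ramona", "Leonard"],
    "dialogue": ["I need to work.", "You are brilliant.", "Hello, mother."],
})

def episode_text(data, title, person):
    """Join the lines `person` speaks in episodes whose name contains `title`."""
    in_episode = data["episode_name"].str.contains(title, regex=False)
    by_person = data["person_scene"] == person
    return " ".join(data.loc[in_episode & by_person, "dialogue"].astype(str))

print(episode_text(demo, "The Cooper-Nowitzki Theorem", "Ramona"))
```

Matching on the title with regex=False treats the title as plain text, so punctuation in an episode name is not interpreted as a pattern.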

Season 1 Episode 6: The Middle-Earth Paradigm

In the first season, Penny throws a Halloween party. She invites her friends Leonard, Sheldon, Howard, and Raj. Sheldon dresses up as the Doppler effect. The Doppler effect is the change in the observed frequency of a wave when its source moves relative to the observer: the horn of a train sounds higher in pitch as the train approaches you at a railroad crossing and lower once it has passed. The word clouds for this episode contain many words about costumes and the costume party. The larger a word appears in the figures, the more often it was said during the episode.

Season 2 Episode 15: The Maternal Capacitance

In season 2, Leonard’s mother comes to town. She is an accomplished scientist with a cold and distant personality.

Season 2 Episode 6: The Cooper-Nowitzki Theorem

In this episode, Ramona, a younger student who admires Sheldon’s work, helps him with his scientific breakthrough. The images below show the words used most often during this episode.

Season 4 Episode 22: The Wildebeest Implementation

In this episode, Bernadette goes shopping with the girls, and later on, she has a double date with Leonard and Priya. What words did she use most often during this episode? The answer is in this Word Cloud.

Word Cloud – Vocabulary used by Bernadette in this episode

Season 5 Episode 2: The Infestation Hypothesis

And how about the episode in which Penny puts a chair in her apartment that someone threw away? You can choose the number of words in a Word Cloud with the max_words parameter in the code. You can include only 20 words, but you can also include 100.
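Conceptually, max_words caps the Word Cloud at the N most frequent words. The sketch below imitates that cap with plain Python on a made-up sample. It is not the wordcloud library's own code, which also handles stopwords, plurals, and word pairs; the function name top_words is hypothetical.

```python
import re
from collections import Counter

def top_words(text, max_words):
    """Return the max_words most frequent lowercase words in text."""
    words = re.findall(r"[a-z']+", text.lower())
    return [word for word, _ in Counter(words).most_common(max_words)]

# Made-up sample text.
sample = "chair chair chair couch couch bug"
print(top_words(sample, 2))   # ['chair', 'couch']
```

A small value keeps only the dominant words and gives a cleaner image, while a larger value also shows the rarer vocabulary.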

Season 6 Episode 20: The Tenure Turbulence

In this episode, the guys apply for a tenured position. Let’s see the Word Clouds.

Read More

https://www.kaggle.com/lydia70/big-bang-theory-tv-show

https://github.com/Lydia70/my-kaggle-projects/blob/main/big-bang-theory-tv-show.ipynb

Please upvote if you find the notebook at Kaggle useful 🙂