
A few years ago, there was a Kaggle competition to perform text analytics on the script of the original Star Wars trilogy (Episodes IV, V, and VI). The competition consisted of several text mining and NLP tasks, such as building word clouds and running sentiment analysis.
In 2023, I carried out the same text mining project independently to see what insights I could obtain from the mighty Star Wars. To showcase the project in my portfolio, I asked ChatGPT to come up with a catchy title and description, written in the voices of Yoda and Darth Vader. Both results were interesting, but it was the Vader description that had me rolling on the floor, so I have included it below:
IN VADER’S VOICE:
“Behold, the power of the Star Wars dataset! I have plumbed the depths of this vast trove of words, extracting the frequencies of speech by those who dare to call themselves Jedi or Sith. The numbers, like ants scurrying across the battlefield, tell a tale of dominance and intrigue. Lowercase, tokenization, and the expulsion of insignificant stopwords were mere trifles before me, for I have bent the dataset to my will. Lexicon normalization, the crushing of words to their base forms, served only to amplify their power.
And then, the wordclouds, those haunting mists of language, brought forth the essence of our chosen characters. A storm of words, each holding the weight of their souls. Yet my journey did not end there, for I ventured into the abyss of sentiment analysis. Sith and Jedi, locked in a dance of emotions, their sentiments laid bare before me. The Dark Side clashed with the Light, revealing the turmoil within. Now, I unveil this project, a testament to the power that lies within the words of the Star Wars universe.
Explore it, if you dare, and witness the battle between good and evil, as expressed through the art of text mining and sentiment analysis. The Force is strong in this dataset!”
Description of text mining process
I began by importing the three datasets into Python as pandas DataFrames. The raw files were messy and took some work to import and transform into a usable state. After some trouble getting each dataset down to just two columns, I saved the datasets as .csv files, deleted the ‘index’ column on the left, and re-imported them into Python.
Task 1: Find the characters with the most dialogues in each episode of The Original Trilogy (Episodes IV, V, VI).
For this task, I first created a data frame for each episode that counted how often each character name appeared. I then made tables showing only the ten characters with the most dialogue in each episode, as shown in the notebook.
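A minimal sketch of this counting step, assuming each episode is a DataFrame with a `character` column and a `dialogue` column. The column names, the helper `top_speakers`, and the sample rows are illustrative and not taken from the notebook:

```python
import pandas as pd

# Toy stand-in for one episode; the dialogue strings are made up.
episode_iv = pd.DataFrame(
    {
        "character": ["LUKE", "HAN", "LUKE", "THREEPIO", "LUKE", "HAN"],
        "dialogue": ["line a", "line b", "line c", "line d", "line e", "line f"],
    }
)


def top_speakers(episode: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """Return the n characters with the most lines of dialogue."""
    counts = episode["character"].value_counts().head(n)
    return counts.rename_axis("character").reset_index(name="lines")


top_iv = top_speakers(episode_iv)
print(top_iv)
```

Calling the same function on each of the three episode frames gives one top-ten table per episode.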
Task 2: Plot the number of dialogues according to the character for each episode (i.e. plot the above findings).
The plot shows the ten characters with the most dialogue in each episode. As the chart shows, Luke and Han had the most dialogue in each of the three episodes.
Task 3: Add a new column “episode” to the three datasets (to distinguish between the three episodes) and concatenate them into one dataset.
This was implemented as per challenge instructions.
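One way this step could look in pandas, sketched with three tiny placeholder frames. The names `ep4`, `ep5`, `ep6`, and `trilogy`, and the episode labels, are assumptions made for illustration:

```python
import pandas as pd

# Toy stand-ins for the three episode DataFrames (two columns each).
ep4 = pd.DataFrame({"character": ["LUKE"], "dialogue": ["made-up line one"]})
ep5 = pd.DataFrame({"character": ["YODA"], "dialogue": ["made-up line two"]})
ep6 = pd.DataFrame({"character": ["VADER"], "dialogue": ["made-up line three"]})

# Tag each frame with its episode, then stack them into one dataset.
frames = {"IV": ep4, "V": ep5, "VI": ep6}
trilogy = pd.concat(
    [df.assign(episode=label) for label, df in frames.items()],
    ignore_index=True,
)
print(trilogy)
```

Using `assign` returns a copy, so the original per-episode frames are left without the extra column.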
Task 4: Discover the frequency distribution of words in The Original Trilogy.
Task 5: Create a Frequency Distribution plot of the most repeated words in The Original Trilogy.
Using the full dataset, I used a counter to count how many times each word was used, then saved the counts. I then created a bar plot of the 20 most used words in the trilogy; the results are shown in the notebook. As the plot shows, analyzing the most frequent words at this stage is not useful, since they are mostly stop words, which tell us nothing about what the characters were actually saying. The text needs further cleaning before this process can be repeated.
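The counting step can be sketched with the standard library alone. The three lines below are made up and simply stand in for the dialogue column of the full dataset:

```python
from collections import Counter

# Made-up lines standing in for the dialogue column of the full dataset.
lines = [
    "the force is with you",
    "you are the one",
    "I have you now",
]

# Lowercase and split on whitespace; no cleaning yet, so stop words dominate.
word_counts = Counter(word for line in lines for word in line.lower().split())

# The 20 most frequent words, as (word, count) pairs ready for a bar plot.
top_words = word_counts.most_common(20)
print(top_words[:3])
```

Even in this toy example the top of the list is taken by words like "you" and "the", which is the same problem the real plot ran into.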
Task 6: Perform text-mining operations to prepare your dataset for further text analysis. (Use the NLTK library).
To implement this task, I created a function that performs the various text mining operations on the data and returns the processed text. Using the processed text, I was then able to create a new column, new_script, in the Star Wars trilogy dataset.
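A rough sketch of what such a function might look like with NLTK, not the notebook's actual code. It uses stemming as the normalization step (lemmatization would be an equally plausible choice), and a tiny hand-written stop-word set stands in for NLTK's English list, which needs a one-time corpus download. The names `clean_text` and `STOP_WORDS` are hypothetical:

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

# Tiny illustrative stop-word set (an assumption for this sketch); in practice
# NLTK's English list from nltk.corpus.stopwords would normally be used.
STOP_WORDS = {"the", "is", "a", "i", "you", "to", "of", "and", "with", "it", "are"}

tokenizer = RegexpTokenizer(r"[a-z']+")  # keeps runs of letters and apostrophes
stemmer = PorterStemmer()


def clean_text(text: str) -> str:
    """Lowercase, tokenize, drop stop words, and stem what remains."""
    tokens = tokenizer.tokenize(text.lower())
    kept = [tok for tok in tokens if tok not in STOP_WORDS]
    return " ".join(stemmer.stem(tok) for tok in kept)


cleaned = clean_text("The Force is with you, and the rebels are running!")
print(cleaned)
```

Assuming the combined dataset is a DataFrame called `trilogy` with a `dialogue` column, the new column could then be created with `trilogy["new_script"] = trilogy["dialogue"].apply(clean_text)`.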
Task 7: Repeat steps 4 & 5, but check the frequency distribution of the “new_script” this time.
I created another frequency plot, this time of the 20 most used words in the original trilogy after the text mining operations had been applied. The resulting chart was more informative, showing words that reflect what the characters were actually saying.
Task 8: Use Word Clouds to visually represent the most repeated words for Darth Vader and Yoda.
For this task, I used the wordcloud package as well as the Image package for reading the character masks. The first cloud I made shows the words Vader repeated most across the entire trilogy. As the cloud shows, Vader mentioned ‘emperor’ several times, in reference to Emperor Palpatine. Most of the words in Vader’s cloud make a lot of sense to me. For instance, ‘Skywalker’ is Vader’s nemesis, hence the name is always on Vader’s lips. Vader also mentioned the ‘force’ many times, as well as those he considers to be ‘rebels’. The next cloud shows Yoda’s most repeated words.
Yoda’s word cloud also shows that he repeated several words that are closely tied to his character. As Luke’s teacher, he mentions ‘luke’ a lot, and he also mentions the ‘force’ and the ‘jedi’, of whom he is one. The word ‘must’ suggests that Yoda is often in the position of giving advice, so he frequently says what the Jedi or the Force need to do to succeed. The meaning of the word ‘yes’ is not clear to me, but I can see that Yoda likes to mention that he is ‘old’, and the word ‘hmm’ suggests that Yoda thinks before he speaks.
Task 9: Discover the most relevant words in The Original Trilogy script.
After executing this task, I found the following list of the most relevant words used in the original trilogy script:
        Word     TF-IDF Score
1276    luke     0.227311
931     get      0.215611
946     going    0.202240
1768    right    0.188869
414     come     0.187198
1447    oh       0.185526
1189    know     0.183855
1911    sir      0.150427
2343    well     0.150427
1836    see      0.148755
943     go       0.148755
133     artoo    0.135384
957     got      0.135384
951     good     0.132041
2399    yes      0.130370
This list of words was mostly similar to the list of most mentioned words I produced from the previous tasks.
Task 10: Perform sentiment analysis on the movie scripts
For this task, I selected two Sith characters (Vader and Emperor Palpatine) and two Jedi characters (Luke and Yoda), and also computed the overall sentiment across all characters in the entire trilogy. After calculating the scores (obtained using the VADER package), I tried to interpret them against the sentiments I expected from each character. The following is my analysis:
Sentiment for Darth Vader: 0.058
The sentiment score for Darth Vader was relatively close to neutral (around 0). It suggests that his dialogue in the movie scripts doesn’t strongly lean towards positive or negative sentiments. However, given his association with the Dark Side of the Force, we may interpret this as Vader expressing emotions that are not overwhelmingly negative.
Sentiment for Emperor Palpatine: 0.099
The sentiment score for Emperor Palpatine was slightly positive. This indicates that his dialogue contains some elements that express positive sentiments. However, considering his role as a Sith Lord, known for his malevolence and manipulation, I posit that this positive sentiment may simply have been subtle or rather deceptive in nature.
Sentiment for Luke: 0.044
The sentiment score for Luke is also close to neutral (around 0), suggesting that his dialogue does not strongly lean towards positive or negative sentiments. This aligns with his character’s teachings as a Jedi, where he strives to maintain emotional balance and resist giving in to negative emotions.
Sentiment for Yoda: 0.046
Similar to Luke, the sentiment score for Yoda is close to neutral (around 0), indicating a lack of strong positive or negative sentiments. As a wise and experienced Jedi Master, Yoda emphasises control over emotions and following the path of the Light Side of the Force.
Sentiment for overall trilogy: 0.055
The sentiment score for the overall trilogy is also relatively close to neutral (around 0). This suggests that the movie scripts as a whole do not exhibit strong positive or negative sentiments.
Overall, the differences between the Dark Side characters (Darth Vader and Emperor Palpatine) and the Light Side characters (Luke and Yoda) are subtle. The Dark Side characters have slightly higher sentiment scores (0.058 and 0.099, against 0.044 and 0.046), indicating slightly more positive sentiment in their dialogue than in that of the Light Side characters. This suggests that, although the Sith are associated with negative feelings, their dialogue may still show complexity and depth. This is not what I expected from the sentiment analysis, as I thought the Dark Side characters would be the ones displaying largely negative sentiment. However, while sentiment analysis measures the sentiment of the text itself, it may not capture the full depth and complexity of character emotions and development in the Star Wars universe.
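To make the aggregation behind scores like these concrete, here is a hedged sketch of averaging a per-line sentiment score for one character or for the whole script. How the notebook combined the per-line scores is not stated, so the simple mean is an assumption. To keep the example runnable without extra packages, a toy scorer (`toy_score`, with a made-up lexicon) stands in for VADER, and the column names and dialogue are illustrative:

```python
from typing import Callable, Optional

import pandas as pd


def mean_sentiment(
    script: pd.DataFrame,
    score_line: Callable[[str], float],
    character: Optional[str] = None,
) -> float:
    """Average a per-line sentiment score, optionally for one character."""
    if character is not None:
        script = script[script["character"] == character]
    return float(script["dialogue"].map(score_line).mean())


# Stand-in scorer (an assumption for this sketch). With VADER, score_line
# would return the "compound" value of polarity_scores(text) instead.
TOY_LEXICON = {"good": 0.5, "great": 0.8, "bad": -0.5, "destroy": -0.7}


def toy_score(text: str) -> float:
    hits = [TOY_LEXICON[w] for w in text.lower().split() if w in TOY_LEXICON]
    return sum(hits) / len(hits) if hits else 0.0


script = pd.DataFrame(
    {
        "character": ["VADER", "VADER", "LUKE", "LUKE"],
        "dialogue": ["destroy them", "good", "great shot", "bad feeling"],
    }
)

print(mean_sentiment(script, toy_score, "VADER"))
print(mean_sentiment(script, toy_score))
```

If scores are averaged this way, lines containing no words from the sentiment lexicon contribute 0, which pulls every character's average towards neutral. That may be part of the reason all five scores sit so close to 0.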