{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Sentiment Analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So far, all of the analysis we've done has been fairly generic - looking at counts, creating scatter plots, etc. These techniques could be applied to numeric data as well.\n",
    "\n",
    "When it comes to text data, there are a few popular techniques that we'll be going through in the next few notebooks, starting with sentiment analysis. Here are a few key points to remember about sentiment analysis:\n",
    "\n",
    "1. **TextBlob Module:** Linguistic researchers have labeled the sentiment of words based on their domain expertise. The sentiment of a word can also vary based on where it appears in a sentence. The TextBlob module allows us to take advantage of these labels.\n",
    "2. **Sentiment Labels:** Each word in a corpus is labeled in terms of polarity and subjectivity (there are more labels as well, but we're going to ignore them for now). A corpus's sentiment is the average of these.\n",
    "   * **Polarity**: How positive or negative a word is. -1 is very negative. +1 is very positive.\n",
    "   * **Subjectivity**: How subjective, or opinionated, a word is. 0 is a fact. +1 is very much an opinion.\n",
    "\n",
    "For more info on how TextBlob computes sentiment, check out this write-up on its [sentiment function](https://planspace.org/20150607-textblob_sentiment/).\n",
    "\n",
    "Let's take a look at the sentiment of the various transcripts, both overall and throughout the comedy routine."
   ]
  },
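  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before scoring whole transcripts, here's a quick sanity check of the two scales on a couple of made-up sentences (this assumes the textblob package is installed; the sentences are purely for illustration):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# A quick sanity check of TextBlob's polarity and subjectivity scales\n",
    "# Note: these example sentences are made up for illustration\n",
    "from textblob import TextBlob\n",
    "\n",
    "for sentence in [\"I love this show, it was absolutely hilarious!\",\n",
    "                 \"The special was filmed in 2017.\"]:\n",
    "    blob = TextBlob(sentence)\n",
    "    print(sentence, '->', blob.sentiment.polarity, blob.sentiment.subjectivity)"
   ]
  },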
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Sentiment of Routine"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We'll start by reading in the corpus, which preserves word order\n",
    "import pandas as pd\n",
    "\n",
    "data = pd.read_pickle('corpus.pkl')\n",
    "data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create quick lambda functions to find the polarity and subjectivity of each routine\n",
    "# Terminal / Anaconda Navigator: conda install -c conda-forge textblob\n",
    "from textblob import TextBlob\n",
    "\n",
    "pol = lambda x: TextBlob(x).sentiment.polarity\n",
    "sub = lambda x: TextBlob(x).sentiment.subjectivity\n",
    "\n",
    "data['polarity'] = data['transcript'].apply(pol)\n",
    "data['subjectivity'] = data['transcript'].apply(sub)\n",
    "data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let's plot the results\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "plt.rcParams['figure.figsize'] = [10, 8]\n",
    "\n",
    "for index, comedian in enumerate(data.index):\n",
    "    x = data.polarity.loc[comedian]\n",
    "    y = data.subjectivity.loc[comedian]\n",
    "    plt.scatter(x, y, color='blue')\n",
    "    plt.text(x+.001, y+.001, data['full_name'].iloc[index], fontsize=10)\n",
    "\n",
    "plt.xlim(-.01, .12)\n",
    "plt.title('Sentiment Analysis', fontsize=20)\n",
    "plt.xlabel('<-- Negative -------- Positive -->', fontsize=15)\n",
    "plt.ylabel('<-- Facts -------- Opinions -->', fontsize=15)\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Sentiment of Routine Over Time"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Instead of looking at the overall sentiment, let's see if there's anything interesting about how the sentiment changes over time throughout each routine."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Split each routine into 10 parts\n",
    "import numpy as np\n",
    "import math\n",
    "\n",
    "def split_text(text, n=10):\n",
    "    '''Takes in a string of text and splits it into n equal parts, with a default of 10 equal parts.\n",
    "    Any trailing characters beyond the n full chunks are dropped.'''\n",
    "\n",
    "    # Calculate the length of the text, the size of each chunk and the starting index of each chunk\n",
    "    length = len(text)\n",
    "    size = math.floor(length / n)\n",
    "    start = np.arange(0, length, size)\n",
    "\n",
    "    # Pull out equally sized pieces of text and put them into a list\n",
    "    split_list = []\n",
    "    for piece in range(n):\n",
    "        split_list.append(text[start[piece]:start[piece]+size])\n",
    "    return split_list"
   ]
  },
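  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see what split_text actually returns, here's the same chunking logic inlined on a short made-up string - note that any characters beyond the n full chunks get dropped:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The split_text chunking logic, inlined on a tiny example string\n",
    "import math\n",
    "import numpy as np\n",
    "\n",
    "sample = 'abcdefghijklmnopqrstuvwxyz'  # 26 characters, made up for illustration\n",
    "n = 4\n",
    "size = math.floor(len(sample) / n)  # each chunk is 6 characters\n",
    "starts = np.arange(0, len(sample), size)\n",
    "chunks = [sample[s:s+size] for s in starts[:n]]\n",
    "chunks  # the trailing 'yz' is dropped"
   ]
  },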
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let's take a look at our data again\n",
    "data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let's create a list to hold all of the pieces of text\n",
    "list_pieces = []\n",
    "for t in data.transcript:\n",
    "    split = split_text(t)\n",
    "    list_pieces.append(split)\n",
    "\n",
    "list_pieces"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The list has one element per transcript\n",
    "len(list_pieces)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Each transcript has been split into 10 pieces of text\n",
    "len(list_pieces[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Calculate the polarity for each piece of text\n",
    "polarity_transcript = []\n",
    "for lp in list_pieces:\n",
    "    polarity_piece = []\n",
    "    for p in lp:\n",
    "        polarity_piece.append(TextBlob(p).sentiment.polarity)\n",
    "    polarity_transcript.append(polarity_piece)\n",
    "\n",
    "polarity_transcript"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Show the plot for one comedian\n",
    "plt.plot(polarity_transcript[0])\n",
    "plt.title(data['full_name'].iloc[0])\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Show the plot for all comedians\n",
    "plt.rcParams['figure.figsize'] = [16, 12]\n",
    "\n",
    "for index, comedian in enumerate(data.index):\n",
    "    plt.subplot(3, 4, index+1)\n",
    "    plt.plot(polarity_transcript[index])\n",
    "    plt.plot(np.arange(0, 10), np.zeros(10))\n",
    "    plt.title(data['full_name'].iloc[index])\n",
    "    plt.ylim(-.2, .3)\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Ali Wong stays generally positive throughout her routine. Louis C.K. and Mike Birbiglia show similar patterns.\n",
    "\n",
    "On the other hand, some comedians follow pretty different patterns - Bo Burnham gets happier as time passes, while Dave Chappelle has some pretty down moments in his routine."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Additional Exercises"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. Modify the number of sections the comedy routine is split into and see how the charts over time change."
   ]
  },
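  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a starting point for the exercise, here's a self-contained sketch that chunks a text into n pieces and scores each piece. The polarity_by_section helper is hypothetical (not part of the notebook so far), and the sample text is made up - swap in the real transcripts and your own value of n:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# A hypothetical helper for the exercise: score a text's polarity in n sections\n",
    "# The sample text below is made up; replace it with data.transcript values\n",
    "import math\n",
    "from textblob import TextBlob\n",
    "\n",
    "def polarity_by_section(text, n=20):\n",
    "    size = math.floor(len(text) / n)\n",
    "    return [TextBlob(text[i*size:(i+1)*size]).sentiment.polarity for i in range(n)]\n",
    "\n",
    "sample = ('I love this part of the show. ' * 30) + ('That part was awful and terrible. ' * 30)\n",
    "scores = polarity_by_section(sample, n=20)\n",
    "# then chart the result with plt.plot(scores), as in the plots above\n",
    "scores"
   ]
  },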
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.2"
  },
  "toc": {
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": "block",
   "toc_window_display": false
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}