Introduction

Data that hasn’t gone through some analysis and visualization might not mean much, as you don’t really understand what’s inside. Some machine learning programmers might be able to get away with skipping that step if they have seen previous research or similar tasks before. But that’s not the way to go if you’re starting from scratch.

In this article, we will continue from a previous article about acquiring Reddit discussion data and do some basic data exploration. The main step is visualizing the Reddit data in a Python notebook (Google Colaboratory), where you can see the lines of code and their results on a single page, one after another, as if it were a statistical report. This project uses Matplotlib, as it’s one of the most versatile and customizable Python visualization libraries out there.


Glossary

Some new terms I learned while doing this project:

  • Corpus – Large and structured set of texts.
  • Stemming – The process of reducing inflection in words to their root forms.
  • Lemmatization – The process of reducing inflected words to their root form while ensuring that the root word (the lemma) belongs to the language.
  • Part-of-Speech (POS) – An indicator of how a word functions, in meaning as well as grammatically, within the sentence, such as noun, pronoun, verb, adjective, adverb, preposition, conjunction, and interjection.
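
To make stemming a bit more concrete, here is a minimal sketch using NLTK’s PorterStemmer (the same class that is imported later in this article). The word list is made up for illustration. Lemmatization with WordNetLemmatizer is called in a similar way but needs the WordNet data to be downloaded first, so it’s left out of this sketch.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical sample words, not taken from the Reddit data
words = ["learn", "learned", "learning"]

# Each inflected form is cut down to the same stem
stems = [stemmer.stem(word) for word in words]
print(stems)
```

Keep in mind that a stem is not guaranteed to be a dictionary word, which is the gap that lemmatization is meant to close.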

Requirements

Tools/libraries used:

  • Python
  • Google Colaboratory
  • Google Drive
  • Anytree
  • Pickle
  • Pandas
  • Numpy
  • Matplotlib
  • Wordcloud
  • Regular expression (re)
  • Natural Language Toolkit (NLTK)
  • Scipy’s curve fit

The program used in this tutorial is available in this GitHub Repository. Open the Python Notebook file using Google Colaboratory and clone it to run or modify. It’s generously commented, but it’s not going to be as thorough as this article.

Please always refer to that Notebook while following the text in this article, as this article won’t feature a full copy-paste of the program provided there to maintain readability.


Step-by-step

Without further ado, let’s start the steps to achieve this goal.

1 – Preparing the data to be visualized

As stated before, the data used here is the one obtained from a previous project. It’s recommended to follow the full step-by-step there (and match the filename), but you can directly use the extracted file here. Add it to your Google Drive account, and place it in the main directory (you may place it elsewhere, but you have to modify some stuff in the file-loading program). In case you’re following from scratch, here’s the configuration used for this project:

# Hyperparameters
PAGES = 5   # Number of pages from each subreddit
HTTPS = "https://old.reddit.com/"

SUBREDDIT = ["r/politics", "r/worldnews"]

SORT = "top"    # Sorting type
TIME = "month"   # Sorting timespan

# Change file output name into "save_politic.dat"

Make sure the file is ready in your drive before you load it. Otherwise, you have to restart the Colaboratory runtime (menu Runtime > Restart runtime) and run the code blocks again from the first one.

As you can see, we’re trying to extract something more serious to fully grasp the depth of the branching discussions on Reddit, which is an advantage Reddit has over other platforms. Do note that the result might be different if you scrape the data yourself, because the configuration takes the top posts of the month at the time of scraping.

After you’ve placed the data inside your Google Drive, you can just run the code inside the “Initialization” part of the notebook. Midway through the loading process, there’ll be an instruction to open a link and copy-paste a code. It’s there to give the notebook permission to access your Google Drive.

The first visualization is the one that’s been presented before, which is the recreation of the comment tree itself. Below is a preview of it. Do note that the sentences are cut to prevent overflowing; you can easily modify this in the program.

Reddit discussion tree.

The initialization is similar to the previous project; you can see it yourself in the notebook. To access the content of the scraped data, some simple examples are provided below.

# Simple example for accessing individual data
t_post = loaded[4] # Getting certain post
t_comments = t_post["comments"].children # Getting its comments
t_comments[1].attrs # View a comment & its attributes
t_comments[1].attrs["text"] # View a comment's text
t_comments[1].children  # View the replies of a main comment

Specifically, those lines will give you, in order:

  • The post as an object,
  • The comments of that post as a tree node,
  • The attributes of a certain main comment (comment directly replying to the post),
  • The text/content of that comment, and
  • The replies to that comment (comment of a comment).

As shown in the notebook, you can also use Pandas DataFrame to see every post in the data in the form of a table.


2 – Simple Visualization: Upvote vs Time

Before jumping into the main topic, there are several resources that can help you understand the visualization library we’re going to use, Matplotlib. The links to the original articles can be found at the bottom of this article. Below are two images from them, showing the attributes of a Matplotlib figure and its anatomy.

A handy guide to change certain attributes of a Matplotlib figure.
Anatomy of a Matplotlib figure.

The first part of the visualization and analysis is about upvotes vs the time of posting. On every social media platform, there’s an ideal timeframe to post content, as users aren’t active all the time.

To get the time data, we must convert the timestamp attribute in our data into a more readable format. The lines of code below will do the job. Visualizing how many top posts there are is as easy as plotting them directly in a bar chart, but the data must first be grouped by the day they were posted.

# Add new columns for the real date & weekday (UTC, offset it for another timezone)
df["date"] = df["data-timestamp"].astype("int64").astype("datetime64[ms]")
df["day"] = df["date"].dt.weekday
day = ["MON", "TUE", "WED", "THU", "FRI", "SAT", "SUN"]
## N of posts in the monthly top 250 vs time
# Based on the day of the week
plt.rcParams["figure.figsize"] = (6,4)
ax = df["day"].groupby(df["day"]).count().plot(kind="bar", width=0.8)
ax.set(xlabel="", ylabel="POSTS", title="MONTHLY TOP POSTS vs DAY")
plt.xticks(range(7), day)
plt.show()
Monthly top posts vs posting day.

Switching the grouping to the hour of the day, we get a similar chart but against posting hours. Remember that the timestamp is in UTC; you may offset the value to match a certain timezone.

# Based on the time of the day
ax = df["date"].groupby(df["date"].dt.hour).count().plot(kind="bar", colormap="cool", width=0.8)
ax.set(xlabel="", ylabel="POSTS", title="MONTHLY TOP POSTS vs HOUR") # Default is UTC
plt.show()
Monthly top posts vs posting hours.
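
The timestamp conversion and the timezone offset mentioned above can be sketched on their own as follows. This is only an illustration under assumptions: the two timestamps are made up, the UTC-5 offset is an arbitrary choice, and pd.to_datetime with unit="ms" is used here as an alternative to the astype chain in the notebook.

```python
import pandas as pd

# Hypothetical timestamps in milliseconds since the epoch (UTC)
df = pd.DataFrame({"data-timestamp": [1577836800000, 1577923200000 + 3 * 3600 * 1000]})

# Convert to datetimes; unit="ms" tells pandas how to read the integers
df["date"] = pd.to_datetime(df["data-timestamp"], unit="ms")

# Shift from UTC to a fixed offset, here UTC-5 (an arbitrary choice)
UTC_OFFSET_HOURS = -5
df["local"] = df["date"] + pd.Timedelta(hours=UTC_OFFSET_HOURS)

df["hour_utc"] = df["date"].dt.hour
df["hour_local"] = df["local"].dt.hour
df["day_local"] = df["local"].dt.weekday  # 0 = Monday ... 6 = Sunday
print(df[["date", "local", "hour_utc", "hour_local", "day_local"]])
```

A fixed offset ignores daylight saving time. If that matters for your analysis, pandas’ tz_localize and tz_convert accessors are likely the better tool.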

You can also filter the data shown by using Pandas boolean indexing, that is, by putting a condition inside the brackets. Here, we’re segmenting the bar charts into multiple upvote groups.

# Based on the day of the week, but segmented into upvote count groups
minscr = df["data-score"].min()
maxscr = df["data-score"].max()
print("Upvote range:", minscr, maxscr)

# Manually determine certain segmentation
dfseg = pd.DataFrame(index=range(7), columns=[])
dfseg["<30k"] = df[df["data-score"]<=30000]["day"].groupby(df["day"]).count()
dfseg["30k-50k"] = df[(df["data-score"]>30000) & (df["data-score"]<=50000)]["day"].groupby(df["day"]).count()
dfseg["50k-70k"] = df[(df["data-score"]>50000) & (df["data-score"]<=70000)]["day"].groupby(df["day"]).count()
dfseg["70k-90k"] = df[(df["data-score"]>70000) & (df["data-score"]<=90000)]["day"].groupby(df["day"]).count()
dfseg["90k-110k"] = df[(df["data-score"]>90000) & (df["data-score"]<=110000)]["day"].groupby(df["day"]).count()
dfseg[">110k"] = df[df["data-score"]>110000]["day"].groupby(df["day"]).count()

ax = dfseg.plot(kind="bar", stacked=True, colormap="cool", width=0.8)
ax.set(xlabel="", ylabel="POSTS", title="MONTHLY TOP POSTS vs DAY")
plt.xticks(range(7), day)
plt.show()
Monthly top posts vs posting day, segmented.
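
As a possible alternative to writing one condition per group, the same kind of segmentation can be sketched with pd.cut and pd.crosstab. The scores and weekdays below are made up, and this is not the code used in the notebook, only a shorter variant under the same bin boundaries.

```python
import pandas as pd

# Hypothetical scores and weekdays (0 = Monday), not the real Reddit data
df = pd.DataFrame({
    "data-score": [12000, 30000, 45000, 88000, 120000, 51000],
    "day":        [0,     0,     2,     2,     4,      4],
})

# Bin edges; each bin is open on the left and closed on the right, e.g. (30000, 50000]
edges = [float("-inf"), 30000, 50000, 70000, 90000, 110000, float("inf")]
labels = ["<30k", "30k-50k", "50k-70k", "70k-90k", "90k-110k", ">110k"]
df["segment"] = pd.cut(df["data-score"], bins=edges, labels=labels)

# Rows = weekday, columns = upvote segment, values = number of posts
dfseg = pd.crosstab(df["day"], df["segment"])
print(dfseg)
```

The resulting table can be passed to plot(kind="bar", stacked=True) just like the manually built one. Weekdays without any post do not appear as rows, so you may need to reindex it to range(7) first.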

From those visuals, we can conclude that the best chance to get into the top of the month is to post on Wednesday at 19:00 UTC. However, as you can see, the posting hour has a greater effect than the posting day: there’s a big gap between the worst and best hours (2 vs 20 posts), while the gap between the worst and best days is much smaller (28 vs 48 posts).

The low number of top posts during those hours (3:00-9:00 UTC, with an anomaly at 4:00 UTC) can be explained by comparing them to the timezones in the USA (where the majority of Reddit users are). Those hours correspond to around 0:00-4:00 in the latest US timezone and 21:00-01:00 in the earliest one. Both are, of course, hours when few people are browsing social media.


3 – Simple Visualization: Comment Sorting

Reddit revolves around showing you the optimal content up top (by default, as moderators of a subreddit may change this behavior). This applies not only between different posts but also between the comments on a post.

The majority of subreddits set the comment sorting to “best”. But that doesn’t necessarily mean the most upvoted one, as there’s a separate sorting method for that, which is “top”. Here, we’ll try to analyze how upvotes affect the location (and thus visibility) of a comment in the discussion. As a result, we may discover what kind of comment to make to be more visible among the hundreds of comments there.

To do that, we need to turn the tree into an array. To “unpack” the comment tree into an array of comments, sorted from the topmost, we can use the iterator provided by the Anytree library itself, named PreOrderIter. You can modify “maxlevel” to change how deep into the tree you want to go. Do note that here, we’re getting almost the whole comment tree of every post in our data.

# Upvotes vs how far down the page a comment is
# Even a reply to a reply of the 1st top comment is listed higher than the 2nd top comment
verts = []                      # The scores in top-to-bottom order
for post in loaded:
  cmts = []
  # Unpack the tree into a single array
  for cmt in PreOrderIter(post["comments"], maxlevel=4):
    # maxlevel determines how deep to look relative to the current node:
    # 1 = the post itself, 2 = main comments, 3 = replies to the main comments, etc.

    score = cmt.attrs["data-score"]
    if score is not None:
      cmts.append(score)
      
  verts.append(cmts[1:]) # Skip the first one coz it's the post, not the comment

Then, we can just use a scatter plot to see how the number of upvotes compares to the location of a comment. The color shows which post a comment comes from, with light blue being the first post in our data (most upvoted) and purple the last (least upvoted).

# Scatter
plt.rcParams["figure.figsize"] = (6,6)
n_post = 256  # N of posts to be included
for i, vert in enumerate(verts[0:n_post]):
  # The colormap index (0-255) is normalized across all the posts (0 to n_post);
  # scaling before int() keeps the colors spread out even when n_post is not 256
  plt.scatter(range(len(vert)), vert, color=cm.cool(int(i*256.0/n_post)), s=15)

# Misc parameters
plt.set_cmap("cool")
plt.xlabel("Comment order")
plt.ylabel("Upvote")
cbar = plt.colorbar(orientation = "horizontal", ticks=[0, 1])
cbar.ax.set_xticklabels(['Top Post', 'Bottom Post'])

plt.show()
Upvote vs comment sorting location.

Additionally, we can add a fitted curve to this scatter plot by using SciPy’s curve_fit function. Here, we’re just using a generic exponential-decay model.

# Add a trendline by combining them all (maintaining the comment index)
comb_n = []
comb_i = []
n_post = 256  # N of posts to be included
for vert in verts[0:n_post]:
  idx = range(len(vert))
  comb_n.extend(vert)
  comb_i.extend(idx)

# Model used for exponential decay fit
def model_func(x, a, k, b):
    return a * np.exp(-k*x) + b

# Curve fit
p0 = (1.,1.e-5,1.) # Starting coefficients for the search
opt, pcov = curve_fit(model_func, comb_i, comb_n, p0)
a, k, b = opt
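
To check that this kind of fit behaves as expected before trusting it on noisy Reddit data, it can help to run it on synthetic data where the answer is known. The sketch below makes several assumptions: the true coefficients are arbitrary, the data is noise-free, and the starting guess is derived from the data rather than fixed as above. It also computes R², which is one way to put a number on how well a curve models the points.

```python
import numpy as np
from scipy.optimize import curve_fit

def model_func(x, a, k, b):
    # Exponential decay towards the baseline b
    return a * np.exp(-k * x) + b

# Synthetic, noise-free data with known coefficients (chosen arbitrarily)
true_a, true_k, true_b = 100.0, 0.05, 5.0
x = np.arange(0, 150, dtype=float)
y = model_func(x, true_a, true_k, true_b)

# Rough starting guess: initial height, a guessed decay rate, and the tail level
p0 = (y[0] - y[-1], 0.1, y[-1])
opt, pcov = curve_fit(model_func, x, y, p0=p0)
a, k, b = opt

# Coefficient of determination: 1.0 means the curve explains all the variance
residuals = y - model_func(x, a, k, b)
r_squared = 1.0 - np.sum(residuals**2) / np.sum((y - np.mean(y))**2)
print(a, k, b, r_squared)
```

On real, scattered data the same R² calculation would come out much lower, which is a more objective version of judging the fit by eye.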

Next, adding the lines of code below before the show() command will draw an extra red line on our previous graph.

# Showing the curve fit
x2 = np.linspace(0, 150, 50)
y2 = model_func(x2, a, k, b)
# The model is a*exp(-k*x) + b, so the exponent shown in the legend is -k
plt.plot(x2, y2, color='r', label='Fit. func: $f(x) = %.3f e^{%.3f x} %+.3f$' % (a,-k,b))
plt.legend(loc='best')

plt.show()
Upvote vs comment sorting location.

As expected, the fit can’t model this behavior accurately, as a lot of the scatter points are far away from the line. We can try another parameter: the number of replies to a comment. To simplify the process, we only compare the order of the main comments (comments directly on a post).

To do so, we can modify our previous program a little by counting the nodes under each main comment. This includes replies to a reply to the main comment, and so on. The visualization uses exactly the same program as before.

# Main comments' replies vs the further down the comment
verts = [] # The replies in vertical-wise order
for post in loaded:
  cmts = []
  
  # Unpack the tree into a single array
  for cmt in PreOrderIter(post["comments"], maxlevel=2):
    # We want to get only the main comments here, thus maxlevel=2

    child = 0
    for reply in PreOrderIter(cmt, maxlevel=4):
      # Walk the reply chain under the main comment; the iterator also
      # yields cmt itself, so every count is offset by 1
      child += 1
    cmts.append(child)
    
  verts.append(cmts[1:]) # Skip the first one coz it's the post, not the comment
Replies vs comment sorting location.

The relationship is much more visible here than in our previous graph. There’s a clear exponential-decay characteristic with only a few anomalies, in contrast to what we had before.

As a result, we can conclude that the number of replies to a comment is much more strongly correlated with how the sorting algorithm orders comments. So, to be listed higher in a discussion, you may need to write a more discussion-provoking comment rather than an “all-agree” one. Hive-mind is, unfortunately, a common characteristic of this website.


4 – Complex: Word analytic

Moving on to a more complex (but still fairly simple) analysis. Here, we’ll make use of the content of the data instead of just its attributes. As a result, it’ll be more about words and sentences than numerical data.

Firstly, we can start with the simplest one and get the most and the least used words.

# Common post title words
freq = pd.Series(' '.join(df['text']).split()).value_counts()[:20]
freq

# Uncommon post title words
freq =  pd.Series(' '.join(df['text']).split()).value_counts()[-20:]
freq
Most vs least common words.
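
The same counting idea can be sketched with only the standard library. The titles below are made up, and the lowercasing step is an addition of this sketch (the one-liner above counts “The” and “the” separately).

```python
from collections import Counter

# Hypothetical post titles, not taken from the scraped data
titles = [
    "The senate votes on the bill",
    "The bill passes the senate",
    "Protest against the bill",
]

# Lowercase first so that "The" and "the" are counted as one word
words = " ".join(titles).lower().split()
counts = Counter(words)

print(counts.most_common(3))
```

Even in this tiny example, the most frequent word is a stopword, which is the same pattern that shows up in the real data.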

As expected, we mostly get commonly used words that carry little context, such as prepositions and conjunctions. Still, it’s interesting to see that keywords like “Trump” and “Sanders” can compete with them. Those are what we’re looking for, as we’re handling political discussion data. Hence, we need to do some light data filtering to remove the unnecessary parts.

There are several steps in the filtering. First, we collect every comment into an array. Second, we remove unnecessary symbols such as punctuation, tags, special characters, etc. Third, the NLTK library removes the stopwords. Fourth, the same library normalizes every word. Finally, we get a group of filtered sentences: a corpus.

# Get every post & comment as an array
content = []
for post in loaded:
  content.extend([node.attrs["text"] for node in PreOrderIter(post["comments"], maxlevel=6)])
content[0:20]
# Libraries for text preprocessing
import re
import nltk

# Stopwords: Stop words include the large number of prepositions, pronouns, conjunctions etc in sentences.
nltk.download('stopwords')
from nltk.corpus import stopwords

# Normalization (Stemming & lemmatization): Convert to base word, ex: 
# Stemming = learn, learned, learning, learner > learn
# Lemmatization = better > good, was > be, meeting > meeting
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer
nltk.download('wordnet')
from nltk.stem.wordnet import WordNetLemmatizer

# Obtaining part-of-speech tags:
from nltk import pos_tag
# Creating a list of stop words and adding custom stopwords
stop_words = set(stopwords.words("english"))

# Creating a list of custom stopwords
new_words = ["using", "show", "result", "large", "also", "iv", "one", "two", "new", "previously", "shown"]
stop_words = stop_words.union(new_words)
# Create the lemmatizer once instead of once per sentence
# (only lemmatization is applied; the imported stemmer is not used)
lem = WordNetLemmatizer()

corpus = []
for sentence in content:
  # Remove tags first (while "<" and ">" still exist),
  # then punctuation, special characters and digits
  text = re.sub("</?.*?>", " <> ", sentence)
  text = re.sub("[^a-zA-Z]", " ", text)

  text = text.lower() # Convert to lowercase
  text = text.split() # Convert to list from string

  # Remove stopwords, then lemmatize each remaining word
  text = [lem.lemmatize(word) for word in text if word not in stop_words]
  text = " ".join(text)
  corpus.append(text)

After a short wait, we can finally inspect an item of this corpus. We get a sentence made up of the important words, even if it can be pretty unintelligible. Also, you can see the original sentence next to it for comparison.

# Content[2]
Citing Justice Department sources, CBS reports that more federal arrests targeting neo-Nazi or white nationalists are coming down the pike, including people with ties to networks in Europe, Ukraine, and Russia.

# Corpus[2]
citing justice department source cbs report federal arrest targeting neo nazi white nationalist coming pike including people tie network europe ukraine russia
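
The cleaning steps can also be tried in isolation, without downloading any NLTK data. The sketch below is a simplified stand-in: clean_sentence is a hypothetical helper, the stopword set is a tiny hand-picked one, and no lemmatization is applied, so its output will not exactly match the corpus above.

```python
import re

# Tiny hand-picked stopword set for illustration only;
# the article uses NLTK's much larger English list
STOP_WORDS = {"that", "more", "or", "are", "the", "with", "to", "in", "and"}

def clean_sentence(sentence, stop_words=STOP_WORDS):
    """Lowercase a sentence, keep letters only, and drop stopwords."""
    text = re.sub(r"</?.*?>", " ", sentence)   # strip HTML-like tags
    text = re.sub(r"[^a-zA-Z]", " ", text)     # keep letters only
    words = text.lower().split()
    return " ".join(word for word in words if word not in stop_words)

sample = "More arrests <b>are</b> coming, including people with ties to Europe and Russia."
cleaned = clean_sentence(sample)
print(cleaned)
```

Note that the tags are stripped before the non-letter characters; doing it the other way around would remove the angle brackets and leave the tag names behind as words.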

5 – Complex Visualization: Word cloud

Since the topic of this article is not only analytics but also visualization, we should present the important words we got in a nicer way. One option is a word cloud, which is already widely used.

# Word cloud
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

plt.rcParams["figure.figsize"] = (10, 5) # Frame size
wordcloud = WordCloud(  
  background_color='white',
  stopwords=stop_words,
  max_words=200,
  max_font_size=50, 
  random_state=42,
  # Render resolution
  width=400,
  height=200
  ).generate(" ".join(corpus)) # Join the sentences instead of passing the list's repr
                         
fig = plt.figure(1)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
Word cloud – Filtered corpus.

Despite our filtering, there are still some words that don’t need to be included here. They also take up a lot of space, as they’re used more often than the important words, for example: “would”, “think”, “even”, “like”, etc.

Therefore, the next logical step is to keep only the nouns in the corpus. Note that this is only suitable for analytics, because removing every non-noun leaves you with incoherent sentences if they’re used for NLP (Natural Language Processing).

The NLTK library can differentiate between verbs, nouns, numbers, and so forth. This tagging is called Part-of-Speech tagging, which you can read more about in the “Extra resources” at the bottom of this article. In short, we only keep the words whose tag starts with N, as those are nouns, and recreate the word cloud.

# Removing everything except Noun
# Using part-of-speech (POS) tagging:
# https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

corpus2 = []
for sentence in corpus:
  words = sentence.split()
  tagged = pos_tag(words)

  # Only get the ones with N** tagging
  nouns = [s[0] for s in tagged if s[1][0] == 'N']

  # Revert back the array of words into a sentence
  corpus2.append(' '.join(nouns))
An item from the corpus:
citing justice department source cbs report federal arrest targeting neo nazi white nationalist coming pike including people tie network europe ukraine russia

That item with just the noun:
justice department source cbs report arrest nationalist people network ukraine russia
Word cloud – Only nouns.
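
The filtering step itself is just a check on the first letter of each tag. Here is a sketch with hand-written (word, tag) pairs in the shape that pos_tag returns; the tags are assigned manually for illustration and are not actual NLTK output.

```python
# Hand-written (word, tag) pairs in the shape that pos_tag returns;
# the tags are assigned manually for illustration, not produced by NLTK
tagged = [
    ("citing", "VBG"),
    ("justice", "NN"),
    ("department", "NN"),
    ("federal", "JJ"),
    ("arrest", "NN"),
    ("coming", "VBG"),
    ("people", "NNS"),
    ("russia", "NN"),
]

# Penn Treebank noun tags all start with "N": NN, NNS, NNP, NNPS
nouns = [word for word, tag in tagged if tag.startswith("N")]
print(" ".join(nouns))
```

The same pattern works for other word classes, for example tags starting with “V” for verbs or “J” for adjectives.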

At last, we have a word cloud consisting of the topics talked about, rather than commonly used verbs. The frequent use of government, Trump, people, country, election, China, and many other keywords paints a picture of what the hot topics were that month.


Learning Tools

There’s no single resource I can point to here; some are simply part of what this article covers. Overall, I took a lot of examples from Stack Overflow for more detailed problems, syntax, and functionality of the new libraries I used. The official documentation also helps a lot when it’s lightweight to read (AnyTree’s, for example).


Learning Strategy

The hardest part of this whole process is understanding the Matplotlib library. While it’s very powerful and versatile, it demands a lot and might give beginners a hard time. Searching online for ready-made solutions only gives you a big jump, and you won’t learn much beyond copy-pasting. A lot of that code can be intimidatingly long when, in reality, it’s just a bunch of parameters.

That’s why understanding the basics is the best way to do this. Not only is it necessary for any learning project, but it also helps tremendously in choosing the online solution that best fits exactly what you want.


Reflective Analysis

While visualization and analytics can be very simple, deciding what to analyze can take a long time. The resulting code isn’t much compared to other kinds of projects, but the time spent finding ideas can’t be underestimated.

Despite all of that, being an avid Reddit user myself helped a lot. This data is something I’m personally curious about, rather than a randomly chosen subject, and certain behaviors might not be as apparent to a non-user.


Conclusion & Future

It’s important to fully grasp what is inside our data before we do anything too complicated. A simple behavioral or contextual exploration might give you a head start when utilizing that data.

It’s a shame that I’m still unable to explore the relationship between one comment and another, because that’s what Reddit specializes in. In the future, this might be one area to explore: the contextual relationship between one word and another in a chain of comments. Of course, that’d be more NLP than simple analytics.


Feel free to ask any questions in the comment section; any feedback is also appreciated. As stated above, the notebook is available in this GitHub repository.

You can also check out my other blog post, on a different topic: modularly creating an Android form app.

Project duration (incl. learning from scratch & article): 7 + 4 hours.


Extra resources: