Introduction

Scraping social media data has become a common practice, especially with the growing popularity of machine learning. One example of a modern ML problem is building a conversational bot capable of understanding the context of a discussion.

A huge amount of conversational data must be obtained to achieve this, and the perfect social media platform for deep, branching discussion is Reddit. Not only are topics categorized under the theme of a subreddit, but every post and comment can be upvoted & downvoted.

Scraping Reddit using Google Colaboratory & Google Drive means no extra local processing power or storage capacity is needed for the whole process. Furthermore, using the resulting data is seamless, with no need to upload or download a big file, since Colaboratory can import data directly from Drive.


Glossary

Some new terms I learned while doing this project:

  • Selenium WebDriver – Automate a browser natively as a real user would either locally or on remote machines.
  • WebDriver – Remote control interface for both inspection and control of a browser.
  • AnyTree – Simple and extensible Tree data structure.
  • CSS Selector – Patterns used to select elements in CSS.
  • XPath – XML Path Language, used to navigate through elements and attributes in an XML document.
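
To make the last two entries concrete, here is a minimal stdlib-only sketch (the HTML fragment is invented for illustration; BeautifulSoup and Selenium accept the CSS-selector flavor used later in this tutorial):

```python
import xml.etree.ElementTree as ET

# A tiny invented HTML fragment, well-formed enough for ElementTree
html = "<div><a class='title'>Hello</a><span class='unvoted' title='42'>42</span></div>"
root = ET.fromstring(html)

# XPath (the limited subset the standard library supports):
title = root.find(".//a[@class='title']")
vote = root.find(".//span[@class='unvoted']")
print(title.text)                 # Hello
print(int(vote.attrib["title"]))  # 42

# The equivalent CSS selectors, as used with BeautifulSoup's select()
# later in this tutorial, would be "a.title" and "span.unvoted".
```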

Requirements

  • Python
  • Selenium
  • BeautifulSoup
  • Google Colaboratory
  • Google Drive
  • Anytree
  • Pickle
  • Pandas (optional)

The program used in this tutorial is available in this GitHub repository. Open the Python notebook file in Google Colaboratory and clone it to run or modify it.

The whole notebook is generously commented, but some things are better covered in this blog post: detailed explanations, changes you may need for your own use case, background on Reddit itself, and other extras.


Step-by-step

Without further ado, let’s start the steps to achieve this goal.

1 – Preparation & Initialization

Install the required libraries & the WebDriver of your choice. Below is an example of doing so in Colaboratory. By default, Colaboratory treats every line of a cell as Python code; adding an exclamation mark at the beginning of a line tells Colab to run it as a shell command.

# Install all necessary components
!apt-get update
!pip install selenium
!pip install pandas
!pip install anytree
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin

Selenium was chosen for its versatility: you can do pretty much anything you would with your own browser, instead of just simple data acquisition. On Reddit, that includes logging into your account (certain subreddits are only accessible to certain users).

The main parameters used in this program are presented below. You will want to adjust the number of pages to scrape for each subreddit listed, while the sorting method & time period determine which posts you get first.

# Hyperparameters
PAGES = 3   # Number of pages from each subreddit
HTTPS = "https://old.reddit.com/"
SUBREDDIT = [ "r/all",
              "r/funny",
              "r/jokes"]
# Sorting type, set to hot, new, rising, controversial, top, or gilded
SORT = "top" 
# Sorting timespan, set to hour, week, month, year, or all.
# No effect for the first 3 sorting types above.   
TIME = "week"

The gist of each sorting method:

  • Hot lists the posts Reddit determines to be trending, based on vote count, recency, and activity.
  • New lists posts solely by time of posting. Rising is similar to hot, but weighted heavily toward recency.
  • Controversial lists posts with conflicting amounts of upvotes & downvotes; posts that are almost fully downvoted also tend to rank high here.
  • Top and Gilded list posts solely by the number of upvotes & awards respectively (awards are another means of appreciating a post, requiring purchasable coins, for when upvoting is not enough).
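
The URL pattern the scraper builds from these parameters can be sketched as follows (a hedged stdlib-only sketch; the set of time-ranked sorts follows the comment in the hyperparameter block above):

```python
# Sorts that honor the ?t= timespan parameter, per the tutorial's
# parameter comments (hot, new, and rising ignore it).
TIMED_SORTS = {"controversial", "top", "gilded"}

def listing_url(subreddit, sort, time="week", base="https://old.reddit.com/"):
    # Build the listing URL for one subreddit page
    url = base + subreddit + "/" + sort + "/"
    if sort in TIMED_SORTS:
        url += "?t=" + time
    return url

print(listing_url("r/funny", "top"))  # https://old.reddit.com/r/funny/top/?t=week
print(listing_url("r/funny", "new"))  # https://old.reddit.com/r/funny/new/
```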

To finally initialize a WebDriver & the other libraries, use this.

# Import libraries
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from bs4 import BeautifulSoup as soup
import pandas as pd
import anytree
from anytree import NodeMixin, Node, RenderTree
from pprint import pprint
# Initialize webdriver, set to headless
options = webdriver.ChromeOptions()
options.add_argument("--no-sandbox")
options.add_argument("--headless")
options.add_argument("--disable-dev-shm-usage")
# Open driver with a website, then get result
d = webdriver.Chrome("chromedriver", options=options)
# d.get("https://old.reddit.com/r/all/")
# print(d.page_source) # results

Meanwhile, we also need a custom tree node, which carries an extra parameter for storing Reddit attributes. It is basically an extension of the AnyTree Node.

# Custom anytree node for posts' comments
class NodeCom(NodeMixin):
  def __init__(self, name, attrs=None, parent=None, children=None):
    super(NodeCom, self).__init__()
    self.name = name
    self.attrs = attrs  # The reddit data
    self.parent = parent
    if children:
      self.children = children
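
As a quick illustration of what such a node stores, here is a hedged, stdlib-only stand-in with invented sample data (the real NodeCom additionally inherits anytree's automatic parent/children bookkeeping and tree rendering):

```python
# Minimal stand-in for NodeCom; anytree's NodeMixin does this
# parent/children linking for us automatically.
class MiniNode:
    def __init__(self, name, attrs=None, parent=None):
        self.name = name
        self.attrs = attrs or {}
        self.children = []
        self.parent = parent
        if parent is not None:
            parent.children.append(self)

# Invented sample data: a post with one comment and one reply
post = MiniNode("post", {"text": "A joke", "data-score": 120})
reply = MiniNode("a", {"text": "Funny!", "data-score": 8}, parent=post)
nested = MiniNode("a", {"text": "Agreed", "data-score": 3}, parent=reply)

# Walking up the tree recovers the conversation context
print(nested.parent.parent.attrs["text"])  # A joke
```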

2 – Scraping

2a – Posts

The scraping process starts by collecting the posts with each of their available attributes. Here we look for a pattern/attribute present only on posts and nothing else; in this case, that is the thing inside their class name. Deeper searching is necessary to get titles & vote counts, since those are placed in separate elements.

# Get all posts
posts = []
for sub in SUBREDDIT:
  # Iterate through every subreddit listed
  link = HTTPS + sub + "/" + SORT + "/?t=" + TIME
  print(link)
  d.get(link)
  for i in range(PAGES):
    wdposts = d.find_elements_by_class_name("thing")
    # Iterate through every post found
    for wdpost in wdposts:
      html = wdpost.get_attribute("outerHTML")
      htpost = soup(html, "html.parser").div
      
      # Get post title
      text = htpost.find("a", class_="title").text
      attrs = htpost.attrs # Get everything else (in attribute)
      
      # Append all & create room for the comment tree
      attrs["data-score"] = int(attrs["data-score"])
      attrs.update({"text":text, "comments":None})
      posts.append(attrs)
    
    # Go to next page, cancel if there isn't any
    try:
      next = d.find_element_by_class_name("next-button").get_attribute("outerHTML")
    except NoSuchElementException:
      break # End loop
    page = soup(next, "html.parser").a.attrs["href"]
    print(i+1, page)
    d.get(page)
print("Total posts:", len(posts))
# Create table using pandas DataFrame, just for preview
pd.DataFrame(posts)[0:4]
A small example of attributes acquired from each post.

2b – Comments

Each post is then further scraped for its comments. Here, the structure of the comments needs to be preserved, since we need the full conversation data; hence all comments are stored in a tree data structure.

Unfortunately, no comment carries an attribute referring to its parent, other than the HTML structure itself. Thus the scraping & data storing (linking the tree nodes) need to be done together, recursively, since scraping everything in one flat pass would discard the structure.

# Recursively get comments & create tree by linking nodes
def recursive(listing, parent):
  # End recursion if there's nothing left
  if not listing:
    return
  # listing is the list of matched "div.listing" elements
  htcoms = listing[0].find_all("div", class_="comment", recursive=False)
  # Iterate through every comment found, set to the same parent
  for htcom in htcoms:
    attrs = get_comment(htcom)
    # Create node. It won't be replaced in the next loop iteration,
    # since what matters is the object, not the variable.
    # Deleted comments don't have data-fullname, so a constant name is used.
    node = NodeCom("a", attrs, parent)
    nxlisting = htcom.select("div.listing")
    recursive(nxlisting, node) # Go deeper
  # Nothing to return; this procedure just links the comment nodes

Above is the main recursive function, arguably the core algorithm for getting every comment. Meanwhile, get_comment() is a function for processing the scraped data, similar to the one for posts.

# Get every data from a comment HTML
def get_comment(htcom):
  # Get comment text & vote count. None = deleted
  text = htcom.find("div", class_="usertext-body").text
  vote = htcom.find("span", class_="unvoted")
  # Deleted comments have no vote span; hidden ones lack the title attribute
  if vote is not None and vote.has_attr("title"):
    vote = int(vote.attrs["title"])
  else:
    vote = None
  attrs = htcom.attrs # Get everything else (in attribute)
  attrs.update({"text":text, "data-score":vote})
  return attrs

To start the recursion for every post, only a simple for loop is necessary.

# Get all comments from every post, using recursive scrape & store
# to differentiate between children & parents.
for post in posts:
  d.get(HTTPS + post["data-permalink"])
  # Get the raw text page source
  html = d.page_source
  # Find the div containing the comments list, then get the comments non-recursively.
  htlist = soup(html, "html.parser").select("div.nestedlisting")
  parent = NodeCom(post["data-fullname"], post)  # Set the post as the first node
  recursive(htlist, parent)
  post["comments"] = parent

Below is the method to preview the tree of a single post. I've simplified things by cropping the comments & removing newlines, just to see the general structure of the tree.

# Preview tree of a single post
prev = posts[1]
print(RenderTree(prev["comments"]))
print()
for pre, fill, node in RenderTree(prev["comments"]):
  vote = node.attrs["data-score"]
  if vote is None:
    vote = 0
  treestr = u"%s%d" % (pre, vote)
  # Preview the text cropped, and newlines removed
  print(treestr.ljust(8), node.attrs["text"][:50].replace('\n', ' '))
Preview of a tree from a single post.

Using the data is a simple matter of knowing where the attributes are located, plus your usual Python dictionary usage. In simple terms:

  • The posts array contains multiple posts.
  • Each post is a dictionary of attributes, including comments, which is a tree data structure.
  • The comment tree starts with a head node containing the post itself, followed by the branching comments.
  • Each node of the tree contains name, parent, children, and attrs, where attrs is a dictionary holding the scraped comment data.
Example of how to manually access individual data.
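
The access patterns above can be sketched as follows; a tiny stand-in node class and a mocked posts list replace the real scraped data so the snippet runs on its own:

```python
# Minimal stand-in for NodeCom so this sketch is self-contained
class FakeNode:
    def __init__(self, attrs, parent=None):
        self.attrs, self.children, self.parent = attrs, [], parent
        if parent is not None:
            parent.children.append(self)

# Mocked data: one post, one top-level comment, one reply
root = FakeNode({"text": "Post title", "data-score": 120})
c1 = FakeNode({"text": "First comment", "data-score": 8}, root)
FakeNode({"text": "A reply", "data-score": 3}, c1)
posts = [{"text": "Post title", "data-score": 120, "comments": root}]

# Accessing individual data
first = posts[0]
print(first["text"])                                   # Post title
print(first["comments"].children[0].attrs["text"])     # First comment
print(first["comments"].children[0].children[0].attrs["text"])  # A reply
```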

3 – Data Storage

Coming to the data saving & storage mechanism, we use the pickle library to store the variable in its rawest form, as binary. The notebook is set to store the results in your Google Drive; all you need to do is follow the simple instructions provided for authentication.

In the notebook, this option is also implemented using the form feature. This way, you can modify the parameters without touching the code directly.

#@title Drive Mount { form-width: "30%" }
FILE = "save.dat" #@param {type:"string"}
# Use pickle for file-variable management
import pickle
# Mount Google Drive as folder in "/drive" Collab folder
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
Preview of the code with its form.

All the data is stored in the posts variable, as explained before. Dump it to a file using pickle and copy it to the new directory created after mounting Google Drive. The same file will also appear in your Google Drive, since that folder is interconnected (you can read & write data from/to Drive as if it were just another folder).

# Write binary to a file
with open(FILE, 'wb') as f:
    pickle.dump(posts, f)
copy_cmd = " "+FILE+"  /content/drive/My\ Drive/"
!cp $copy_cmd # Execute the copy command

In the next notebook section, an example is provided for loading the same data back from Google Drive for further usage. You can restart the runtime without running any of the scraping cells, to truly test whether the variables you've saved actually persisted. Then use code similar to before to preview the tree.

# Test load data, do it after the drive is mounted
copy_cmd = "/content/drive/My\ Drive/"+FILE+" "+"/content/"
!cp $copy_cmd
with open(FILE, 'rb') as f:
    loaded = pickle.load(f)
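
To convince yourself that pickle round-trips the structure losslessly, here is a small self-contained sketch using throwaway data in place of the real posts variable:

```python
import os
import pickle
import tempfile

# Throwaway data standing in for the scraped posts list
data = [{"text": "example post", "data-score": 1, "comments": None}]

# Dump and reload from a temporary file
path = os.path.join(tempfile.mkdtemp(), "save.dat")
with open(path, "wb") as f:
    pickle.dump(data, f)
with open(path, "rb") as f:
    loaded = pickle.load(f)

print(loaded == data)  # True
```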

This is ideal for mitigating the limited continuous VM usage duration given by Google, which is 12 hours (and any inactivity terminates the runtime after 90 minutes). As for the size of the data: just 3 pages of popular posts already yield 75 posts and a total of 11 MB. For genuinely useful large-scale datasets, 100x or more of that might be necessary. Hence Google Drive is a good way to save & load without worrying much about available local resources.


Learning Tools

There is no direct example for this project, since a lot of similar projects are just simple Reddit scraping without any of the structure maintained; I decided not to follow them, to keep everything clean & organized.

Overall, I took a lot of examples from Stack Overflow for more detailed problems, syntax, and/or functionality of the new libraries I used. Official documentation also helps a lot when it is lightweight to read (AnyTree's, in this case).


Learning Strategy

The big problem with scraping a website is that there are a lot of ways to do the same thing, and they vary pretty wildly. Searching online will give you just that: some solutions share similar characteristics, some differ greatly.

That is why understanding the basics is the best way to go about it. Not only is it necessary for any learning project, it also helps tremendously in choosing the online solution that best fits exactly what you want.


Reflective Analysis

Acquiring data is a simple matter of reading for a human, but not so much for a computer. A lot has to be precisely defined for a program to understand what is necessary and what is not. While scraping might be a simple process as a whole, it can take a lot of time unless done correctly.

Using Google Colaboratory & Google Drive helps ensure local resources are not a bottleneck in an already slow process. On top of that, scalability and further development become easier, as the file is just one import away from Google Drive in your next Colaboratory project.


Conclusion & Future

As mentioned in the introduction, there are a lot of uses for the Reddit conversational data itself, let alone for the method of scraping and structuring a social media or forum discussion: data analytics, machine learning, lead marketing, etc. Future development using this data can thus be expected from me, as I am personally interested in the wide possibilities it provides.


Ask any questions in the comment section; feedback is also appreciated. As stated above, the notebook is available in this GitHub repository.

Image source: theguardian.com

Project duration (incl. learning from scratch & article): 11 hours.