Sunday, 21 May 2017

Web Scraping Reddit with Google Scripts

Reddit, unlike most websites, allows web scraping as long as the crawler scripts make no more than one request every two seconds to the Reddit servers (see rules). You don’t even need a developer account or an API key to perform scraping on Reddit.



There are popular tools like wget, Site Sucker (Mac) or HTTrack Website Copier (Windows) that can download entire websites for offline use but they are mostly useless for scraping Reddit data since the site doesn’t use page numbers and content of pages is constantly changing. A post maybe listed on the first page of a subreddit but it could find itself on the third page the next second as other posts are voted to the top.

While there exists PHP and Python libraries for scraping Reddit, they are too complicated for the non-techies. Fortunately, there’s always Google Apps Script to the rescue. Here’s what you can do to pull data from any Subreddit on Reddit automatically.

1. Open the Google Sheet and choose File – Make a copy to copy this sheet in your Google Drive.

2. Go to Tools -> Script editor and copy-paste the Script below. You can change “LifeProTips” to any other subreddit name.


/* Replace LifeProTips with the Subreddit Name */
var REDDIT = "LifeProTips";

function run() {
  
   deleteTriggers_();
  
  /* Fetch Reddit posts every 5 minutes to avoid hitting 
     the reddit and Google Script quotas */
  ScriptApp.newTrigger("scrapReddit")
           .timeBased().everyMinutes(5).create();  
}


function scrapReddit() {
  
  // Process 20 Reddit posts in a batch
  var url = "http://www.reddit.com/r/" 
            + REDDIT + "/new.xml?limit=20" + getLastID_(); 

  // Reddit API returns the results in XML format  
  var response = UrlFetchApp.fetch(url);  
  var doc = XmlService.parse(response.getContentText()); 
  var entries = doc.getRootElement()
                   .getChildren('channel')[0].getChildren("item");
  
  var data = new Array();
    
  for (var i=0; i<entries.length; i++) {
        
    /* Extract post date, title, description and link from Reddit */

    var date = entries[i].getChild('pubDate').getText();
    var title = entries[i].getChild('title').getText();
    var desc = entries[i].getChild('description').getText();
    var link = entries[i].getChild('link').getText();
    
    data[i] = new Array(date, title, desc, link);
  }
  
  if (data.length == 0) {
    /* There's no data so stop the background trigger */
    deleteTriggers_();
  } else {    
    writeData_(data);
  }
}


/* Write the scrapped data in a batch to the 
   Google Spreadsheet since this is more efficient */
function writeData_(data) {
  
  if (data.length === 0) {
    return;
  } 
    
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  var sheet = ss.getSheets()[0];
  var row = sheet.getLastRow();
  var col = sheet.getLastColumn();
      
  var range = sheet.getRange(row+1, 1, data.length, 4);
  try {
    range.setValues(data);
  } catch (e) {
    Logger.log(e.toString());
  }
}

/* Use the ID of the last processed post from Reddit as token */
function getLastID_() {

  var ss = SpreadsheetApp.getActiveSpreadsheet();
  var sheet = ss.getSheets()[0];
  var row = sheet.getLastRow();
  var col = sheet.getLastColumn();
  
  var url = sheet.getRange(row, col).getValue().toString();
  var pattern = /.*comments\/([^\/]*).*/; 
  var id = url.match(pattern);
  
  return id ? "&after=t3_" + id[1] : "";

}

/* Posts Extracted, Delete the Triggers */
function deleteTriggers_() {
  var triggers = ScriptApp.getProjectTriggers();
  for (var i=0; i<triggers.length; i++) {
    ScriptApp.deleteTrigger(triggers[i]);
  }
}
3. While in the script editor, choose Run -> Run and authorize the script.

That’s it. The script will run in the background automatically pulling content from Reddit into the Google spreadsheet. And it stops automatically once all the posts* of that Reddit have been fetched.

[*] All Subreddits on Reddit display a maximum of 1000 posts – you can’t go beyond that number even while manually browsing a subreddit.

1 comments:

Anu said... Reply To This Comment

Thank you for great information. Will try it the scripts.

Post a Comment

Related Posts Plugin for WordPress, Blogger...

Labels

404 AdBlock Add-on Airtel GPRS Trick Airtel SMS Trick Alexa Amazon Amazon Kindle Amazon Prime Android Android 8 Android Oreo antivirus Apple Apple Mac ASCII Audacity Audio Authotkey Backup Balance Transfer in Vodafone Battery Bing Blogger Blogging Bookmarklet Browser Camera Chromebook clock Cloud colors command lines Computer Computer Tricks configuration Contact Creative Commons Credit Card CSS devolop DIY Doodle DOS Download Dropbox E-Mail eBook Email Email Attachment Embed Encryption English Error Evernote Eyes Facebook Facebook Tricks Feedburner Flipkart Font Foursquare Free Internet Free sms trick in Vodafone G Mail Gadget Game Getty Images GIF Gists Github Google Google AdSense Google Analytics Google Apps Google Chrome Google Contacts Google Currents Google DNS Google Docs Google Drive Google Earth Google Font Google Forms Google Images Google Map Google Photos Google Play Store Google Plus Google Print Google Reader Google Script Google Sheets Google Spreadsheet Google Translate GPRS Setting GPS Hacking Health App HelloFax Hindi Hoodie HTML Icons idea Image Editing Images IMEI Indian Railways Infographics Instagram Internet Internet Explorer Internet Tricks iOS iPad iPhone IRCTC iTunes iTV JavaScript JioCinema JioTV Junglee Kindle Language Translation Laptop Laptop. TV Life Time FREE GPRS Life-Style Link Linkedln Linux logo Make Money Online Microdoft Powerpoint Microdoft Word Microsoft Office Microsoft Outlook Mobile Mosaic Music Name Networking nexus Notepad OCR Online Shopping Open DNS OS Outlook Password PDF Petya Phillips Hue Lights Photogtraphy Pixel Play Station Podcasts Pokemon Pokemon Go Polls Print Productivity Proxy Server Pushbullet QR Code Ransomware Reddit Reliance Hack GPRS Reliance Jio RGB Ringtone Router RSS Safe Mode Samsung Galaxy S Scrabble Screen Capture Screen Sharing Screencast Secrets Security Send free sms from PC SEO Sierra Skype Slideshare SMBv1 SMS Snapchat Snapdeal Social Media Solution Sound Device Speech Recognition Sql Steam Sync Synology NAS Tata Docomo GPRS trick Teleprompter Torrent Trick Tricks TV Twitter UltraISO Unicode Unknown Extension Unlimited 2GB Unlimited 3GB Unlimited GPRS USB USB Security Key Video Editing virtual desktop Virus attack VLC Vodafone 110% working trick for GPRS Vodafone 3g Vodafone GPRS VPN wallpapers WannaCry Web Design Web Domain Website Wget Whatsapp WiFi Wikipedia Windows Windows 10 Windows 10 S Windows KN Windows Tricks windows updates Winows N Wolfarm Alpha WordPress XBox YouTube Zip
Twitter Delicious Facebook Digg Stumbleupon Favorites More