Reddit, unlike most websites, allows web scraping as long as your crawler sends no more than one request every two seconds to the Reddit servers (see their rules). You don’t even need a developer account or an API key to scrape Reddit.
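If you ever modify a script to fetch several pages in one go, the simplest way to stay within that limit is to pause between requests. Here is a minimal sketch for Google Apps Script; the helper name fetchWithDelay_ and the list of URLs are purely illustrative and not part of the script discussed later.

/* Sketch: wait ~2 seconds between consecutive requests so the
   crawler stays within Reddit's one-request-per-two-seconds rule.
   fetchWithDelay_ is a hypothetical helper name. */
function fetchWithDelay_(urls) {
  var responses = [];
  for (var i = 0; i < urls.length; i++) {
    responses.push(UrlFetchApp.fetch(urls[i]));
    Utilities.sleep(2000); // pause 2 seconds before the next request
  }
  return responses;
}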
There are popular tools like wget, SiteSucker (Mac) or HTTrack Website Copier (Windows) that can download entire websites for offline use, but they are mostly useless for scraping Reddit data since the site doesn’t use page numbers and the content of its pages keeps changing. A post may be listed on the first page of a subreddit one second and find itself on the third page the next, as other posts are voted to the top.
While there are PHP and Python libraries for scraping Reddit, they are too complicated for non-techies. Fortunately, there’s always Google Apps Script to the rescue. Here’s how you can pull data from any subreddit on Reddit automatically.
1. Open the Google Sheet and choose File -> Make a copy to copy this sheet into your Google Drive.
2. Go to Tools -> Script editor and copy-paste the script below. You can change “LifeProTips” to any other subreddit name.
/* Replace LifeProTips with the Subreddit Name */
var REDDIT = "LifeProTips";

function run() {
  deleteTriggers_();
  /* Fetch Reddit posts every 5 minutes to avoid hitting the Reddit and Google Script quotas */
  ScriptApp.newTrigger("scrapReddit")
    .timeBased().everyMinutes(5).create();
}

function scrapReddit() {
  // Process 20 Reddit posts in a batch
  var url = "http://www.reddit.com/r/" + REDDIT + "/new.xml?limit=20" + getLastID_();

  // Reddit returns the results in XML (RSS) format
  var response = UrlFetchApp.fetch(url);
  var doc = XmlService.parse(response.getContentText());
  var entries = doc.getRootElement()
    .getChildren('channel')[0].getChildren("item");

  var data = new Array();

  for (var i = 0; i < entries.length; i++) {
    /* Extract the post date, title, description and link from Reddit */
    var date = entries[i].getChild('pubDate').getText();
    var title = entries[i].getChild('title').getText();
    var desc = entries[i].getChild('description').getText();
    var link = entries[i].getChild('link').getText();
    data[i] = new Array(date, title, desc, link);
  }

  if (data.length == 0) {
    /* There's no more data, so stop the background trigger */
    deleteTriggers_();
  } else {
    writeData_(data);
  }
}

/* Write the scraped data to the Google Spreadsheet in a batch since this is more efficient */
function writeData_(data) {
  if (data.length === 0) {
    return;
  }

  var ss = SpreadsheetApp.getActiveSpreadsheet();
  var sheet = ss.getSheets()[0];
  var row = sheet.getLastRow();
  var range = sheet.getRange(row + 1, 1, data.length, 4);

  try {
    range.setValues(data);
  } catch (e) {
    Logger.log(e.toString());
  }
}

/* Use the ID of the last processed Reddit post as the "after" token */
function getLastID_() {
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  var sheet = ss.getSheets()[0];
  var row = sheet.getLastRow();
  var col = sheet.getLastColumn();
  var url = sheet.getRange(row, col).getValue().toString();
  var pattern = /.*comments\/([^\/]*).*/;
  var id = url.match(pattern);
  return id ? "&after=t3_" + id[1] : "";
}

/* Posts extracted, delete the triggers */
function deleteTriggers_() {
  var triggers = ScriptApp.getProjectTriggers();
  for (var i = 0; i < triggers.length; i++) {
    ScriptApp.deleteTrigger(triggers[i]);
  }
}
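The script above reads the XML (RSS) feed, but Reddit exposes the same listing as JSON at /new.json, which some readers may find easier to work with. The snippet below is only a rough sketch of that variant; fetchRedditJson_ is a hypothetical helper and the field names assume Reddit’s public JSON listing format.

/* Sketch of a JSON-based alternative to the XML feed above.
   fetchRedditJson_ is a hypothetical helper, not part of the original script. */
function fetchRedditJson_() {
  var url = "http://www.reddit.com/r/" + REDDIT + "/new.json?limit=20" + getLastID_();
  var response = UrlFetchApp.fetch(url);
  var json = JSON.parse(response.getContentText());
  var posts = json.data.children;
  var data = [];
  for (var i = 0; i < posts.length; i++) {
    var p = posts[i].data;
    // created_utc is in seconds, so convert it to a JavaScript Date
    data.push([new Date(p.created_utc * 1000), p.title, p.selftext, p.url]);
  }
  return data;
}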
3. While in the script editor, choose Run -> run and authorize the script.
That’s it. The script will run in the background, automatically pulling posts from Reddit into the Google spreadsheet, and it stops automatically once all the posts* of that subreddit have been fetched.
[*] Every subreddit displays a maximum of 1000 posts in its listings; you can’t go beyond that number even while manually browsing the subreddit.
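For readers wondering how the script resumes where it left off: every Reddit listing accepts an after parameter containing the fullname of the last item already seen (the t3_ prefix marks a post), and the next request returns the batch that follows it. Below is a tiny sketch of how such a URL is assembled; buildListingUrl_ is only an illustrative name, not part of the script above.

/* Sketch: build a listing URL that resumes after a given post.
   buildListingUrl_ is a hypothetical helper; t3_ is Reddit's fullname
   prefix for posts (links). */
function buildListingUrl_(subreddit, lastPostId) {
  var url = "http://www.reddit.com/r/" + subreddit + "/new.xml?limit=20";
  if (lastPostId) {
    url += "&after=t3_" + lastPostId;
  }
  return url;
}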