First of all, I'm by no means a professional software engineer, so this won't be the cleanest code you'll see. I'm using this blog post to document my coding process, share my thoughts and the approaches I took to solve problems, and get feedback on what I did right or wrong.

The inspiration for this project came from Wesbos's Twitter and Instagram scraping project.

You can find the repo here: status-scraper

So, what does it do exactly?

It's an API that accepts a social media flag and a username and returns the user's status (e.g. number of followers, following, posts, likes, etc.).

The endpoint is `/scrape/:flag/:username`, and currently `:flag` can be any of the following:

  • t => twitter.com
  • r => reddit.com
  • g => github.com
  • b => behance.net
  • q => quora.com
  • i => instagram.com

So, a call to https://statusscraperapi.herokuapp.com/scrape/t/mkbhd would return the following response:

```json
{
  "user": "mkbhd",
  "status": {
    "twitterStatus": {
      "tweets": "45,691",
      "following": "339",
      "followers": "3,325,617",
      "likes": "25,255"
    }
  }
}
```
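To make the flag lookup concrete, here is a minimal sketch of how a path like `/scrape/t/mkbhd` could be split and its flag resolved to a site. The flag table is copied from the list above; `FLAG_TO_SITE` and `parseScrapePath` are hypothetical names used for illustration, not code from the repo.

```javascript
// Flag table copied from the list of supported sites.
const FLAG_TO_SITE = {
  t: "twitter.com",
  r: "reddit.com",
  g: "github.com",
  b: "behance.net",
  q: "quora.com",
  i: "instagram.com"
};

// Hypothetical helper: split "/scrape/:flag/:username" and resolve the flag.
// Returns null when the path does not match or the flag is unknown.
function parseScrapePath(path) {
  const match = /^\/scrape\/([^/]+)\/([^/]+)\/?$/.exec(path);
  if (!match) return null;
  const [, flag, username] = match;
  if (!Object.prototype.hasOwnProperty.call(FLAG_TO_SITE, flag)) return null;
  return { flag, username, site: FLAG_TO_SITE[flag] };
}

console.log(parseScrapePath("/scrape/t/mkbhd"));
// { flag: 't', username: 'mkbhd', site: 'twitter.com' }
```

Returning `null` for an unknown flag leaves it to the caller to decide how to respond, for example with a 404.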

Tech used

  • Node
  • esm, an ECMAScript module loader
  • Express
  • Axios
  • Cheerio

Server configuration

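```javascript
// lib/server.js
// (import assumed; it was not shown in the original snippet)
import app from "./app";

const PORT = process.env.PORT || 3000;
app.listen(PORT, () => console.log(`Server running on port ${PORT}`));

// lib/app.js
// (imports assumed; they were not shown in the original snippet)
import express from "express";
import cors from "cors";
import helmet from "helmet";
import { Routes } from "./routes/router";

class App {
  constructor() {
    this.app = express();
    this.config();
    // register all routes on the express app
    this.routePrv = new Routes().routes(this.app);
  }

  config() {
    this.app.use(cors());
    this.app.use(helmet());
  }
}

export default new App().app;
```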

Project structure

The app has three modules:

Module 1 - Router:

```javascript
// lib/routes/router.js
// (import assumed; it was not shown in the original snippet)
import Counter from "../scraper/counter";

// all routes have the same structure
export class Routes {
  routes(app) {
    // ....

    // @route  GET /scrape/g/:user
    // @desc   log github user status
    app.get("/scrape/g/:user", async (req, res) => {
      const user = req.params.user;
      try {
        const githubStatus = await Counter.getGithubCount(
          `https://github.com/${user}`
        );
        res.status(200).send({ user, status: { githubStatus } });
      } catch (error) {
        res.status(404).send({ message: "User not found" });
      }
    });

    // ...
  }
}
```

Module 2 - Counter:

  • Acts as a middleware between the route and the actual scraping.
  • It gets the html page and passes it to the scraper module.

```javascript
// lib/scraper/counter.js
// (import assumed, and it assumes Scraper is the default export of scraper.js)
import Scraper from "./scraper";

class Counter extends Scraper {
  // ...

  // Get github count
  async getGithubCount(url) {
    const html = await this.getHTML(url);
    const githubCount = await this.getGithubStatus(html);
    return githubCount;
  }

  // ...
}

export default new Counter();
```

Module 3 - Scraper:

It's where all the work is done, and I'll explain the approach for each social network. Let's start.

Twitter

Twitter response has multiple `<a>` elements that contain all the data we want, and they look like this:

```text
<a class="ProfileNav-stat ProfileNav-stat--link u-borderUserColor u-textCenter js-tooltip js-nav" title="70 Tweets" data-nav="tweets" tabindex=0>
  <span class="ProfileNav-label" aria-hidden="true">Tweets</span>
  <span class="u-hiddenVisually">Tweets, current page.</span>
  <span class="ProfileNav-value" data-count=70 data-is-compact="false">70</span>
</a>
```

The class `ProfileNav-stat--link` is unique to these elements. With cheerio, we can simply get all `<a>` with the class, loop through them, and extract the data of the `title` attribute. Now we have `"70 Tweets"`, just split it and store it as a key-value pair.

```javascript
// lib/scraper/scraper.js

// Get twitter status
async getTwitterStatus(html) {
  try {
    const $ = cheerio.load(html);
    const twitterStatus = {};
    $(".ProfileNav-stat--link").each((i, e) => {
      if (e.attribs.title !== undefined) {
        // "70 Tweets" => ["70", "Tweets"] => { tweets: "70" }
        const data = e.attribs.title.split(" ");
        twitterStatus[data[1].toLowerCase()] = data[0];
      }
    });
    return twitterStatus;
  } catch (error) {
    // note: the error is returned, not thrown
    return error;
  }
}
```
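The title-splitting step can be tried on its own, without cheerio. This is a minimal sketch, assuming every title has the `<count> <label>` shape shown above; `titleToEntry` and `toNumber` are hypothetical helpers, not part of the repo. Unlike a plain `split(" ")`, it keeps a multi-word label whole.

```javascript
// Hypothetical helper: turn a title such as "70 Tweets" into ["tweets", "70"].
// Returns null when the title does not look like "<count> <label>".
function titleToEntry(title) {
  const match = /^([\d,.]+)\s+(.+)$/.exec(title.trim());
  if (!match) return null;
  const [, count, label] = match;
  return [label.toLowerCase(), count];
}

// Optional: convert a formatted count like "3,325,617" into a number.
function toNumber(count) {
  return Number(count.replace(/,/g, ""));
}

const titles = ["70 Tweets", "339 Following", "3,325,617 Followers"];
const status = Object.fromEntries(titles.map(titleToEntry).filter(Boolean));

console.log(status);
// { tweets: '70', following: '339', followers: '3,325,617' }
console.log(toNumber(status.followers)); // 3325617
```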

Reddit

Reddit user page has a `<span id="profile--id-card--highlight-tooltip--karma">` on the right side with the user's total karma, so it's very easy to get. But when hovered over, it displays post/comment karma.

Reddit response has a `<script id="data">` that contains these two pieces of data nested inside an object.

```text
window.___r = {"accountManagerModalData":.... ...."sidebar":{}}}; window.___prefetches = ["https://www....};
```

Just extract the `<script>` data and parse it into JSON. But we need to get rid of `window.___r = ` at the start and `; window.___prefetches....` at the end, along with everything after it.

This could be the laziest/worst thing ever :D I split on `" = "`, counted the number of characters from that `;` to the end (using a web app, of course), and sliced them off the string. Now I have a pure object in a string.

```javascript
// lib/scraper/scraper.js

// Get reddit status
async getRedditStatus(html, user) {
  try {
    const $ = cheerio.load(html);
    const totalKarma = $("#profile--id-card--highlight-tooltip--karma").html();

    // keep the text between the first and second " = ":
    // the object followed by "; window.___prefetches"
    const dataInString = $("#data").html().split(" = ")[1];
    // drop that trailing "; window.___prefetches" (22 characters)
    const pageObject = JSON.parse(dataInString.slice(0, dataInString.length - 22));
    const { commentKarma, postKarma } = pageObject.users.models[user];

    return { totalKarma, commentKarma, postKarma };
  } catch (error) {
    return error;
  }
}
```
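As a possible alternative to counting characters, the object can be cut out by locating the two markers around it. This is only a sketch under the assumption that the script keeps the `window.___r = ...; window.___prefetches = ...` layout described above; `extractPageObject` is a hypothetical helper and the sample data (including the `bio` field) is made up.

```javascript
// Hypothetical helper: pull the JSON assigned to `window.___r` out of the
// script text by locating the markers instead of counting characters.
function extractPageObject(scriptText) {
  const startMarker = "window.___r = ";
  const endMarker = "; window.___prefetches";

  const start = scriptText.indexOf(startMarker);
  // lastIndexOf, so a look-alike inside the JSON is not mistaken for the end.
  const end = scriptText.lastIndexOf(endMarker);
  if (start === -1 || end === -1 || end < start) {
    throw new Error("Unexpected script layout");
  }
  return JSON.parse(scriptText.slice(start + startMarker.length, end));
}

// Made-up sample in the same shape as the script described above.
const sample =
  'window.___r = {"users":{"models":{"someUser":{"commentKarma":10,"postKarma":5,"bio":"a = b"}}}}; ' +
  'window.___prefetches = ["https://example.com"];';

const { commentKarma, postKarma } = extractPageObject(sample).users.models.someUser;
console.log(commentKarma, postKarma); // 10 5
```

Because nothing is split on `" = "`, a value that happens to contain that sequence (like the made-up `bio` here) does not truncate the object.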

LinkedIn

It responded with status code 999! Like, really, LinkedIn?

I tried sending a request with customized headers, which worked for everyone on Stack Overflow, but it did not work for me. Does it have something to do with the csrf-token? I'm not really sure. Anyway, that was a dead end, so on to the next one.

GitHub

This one was fairly easy. There are five `<span class="Counter">` elements that display the number of repositories, stars, etc. Loop through them to extract the data, and with Cheerio I can reach into the element's parent, which is an `<a>` that has what these numbers represent. Store them as key-value pairs and we're ready to go.

```javascript
// lib/scraper/scraper.js

// Get github status
async getGithubStatus(html) {
  try {
    const $ = cheerio.load(html);
    const status = {};
    $(".Counter").each((i, e) => {
      // e.prev is the text node right before the span inside the parent <a>;
      // it holds the label, while the span's own text holds the count
      status[e.prev.data.trim().toLowerCase()] = e.children[0].data.trim();
    });
    return status;
  } catch (error) {
    return error;
  }
}
```

Behance

Also an easy one: there's a `<script id="beconfig-store_state">` holding an object with all the required data. Parse it into JSON and extract the values.
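No code is shown for this one, so here is a rough, dependency-free sketch of the idea, assuming the script body is plain JSON. The regex only stands in for a cheerio lookup such as `$("#beconfig-store_state").html()`, `parseScriptJson` is a hypothetical helper, and the sample fields are made up since the real object layout isn't described here.

```javascript
// Hypothetical helper: return the parsed JSON body of <script id="...">.
// A regex is used only to keep the sketch dependency-free; it assumes the id
// has no regex-special characters and the body contains no "</script>".
function parseScriptJson(html, id) {
  const pattern = new RegExp(
    `<script\\b[^>]*\\bid="${id}"[^>]*>([\\s\\S]*?)</script>`,
    "i"
  );
  const match = pattern.exec(html);
  if (!match) return null;
  return JSON.parse(match[1]);
}

// Made-up page fragment; the real Behance object has its own layout.
const html =
  '<script type="application/json" id="beconfig-store_state">' +
  '{"example":{"followers":12}}' +
  "</script>";

const state = parseScriptJson(html, "beconfig-store_state");
console.log(state.example.followers); // 12
```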

YouTube - you broke my heart

YouTube's response is a huge mess; it has a bunch of `<script>` tags that don't have any ids or classes. I wanted to get the channel's number of subscribers and total video views, both of which can be found in the About tab.

The desired `<script>` is similar to the Reddit one, so I could use the same split, slice, parse thing and be done.

But these two simple numbers are nested like 12 levels deep within the object and there are arrays involved. It's basically hell.

So, I wrote a little helper function that accepts the large JSON/object and the object key to be extracted, and it returns an array of all matches.

```javascript
// lib/_helpers/getNestedObjects.js

export function getNestedObjects(dataObj, objKey) {
  // initialize an empty array to store all matched results
  const results = [];
  getObjects(dataObj, objKey);

  function getObjects(dataObj, objKey) {
    // loop through the key-value pairs on the object/json.
    Object.entries(dataObj).forEach(([key, value]) => {
      // check if the current key matches the required key.
      if (key === objKey) {
        results.push({ [key]: value });
      }

      // check if the current value is an object/array.
      // if the current value is an object, call the function again.
      // if the current value is an array, loop through it, check for an object, and call the function again.
      if (Object.prototype.toString.call(value) === "[object Object]") {
        getObjects(value, objKey);
      } else if (Array.isArray(value)) {
        value.forEach(val => {
          if (Object.prototype.toString.call(val) === "[object Object]") {
            getObjects(val, objKey);
          }
        });
      }
    });
  }

  // return an array of all matches, or return "No match"
  if (results.length === 0) {
    return "No match";
  } else {
    return results;
  }
}
```

As much as I was thrilled that `getNestedObjects` actually works ([try it](https://github.com/AlaaDesouky/getNestedObjects)), it didn't last long. Somehow the received html didn't contain that