I scraped social media platforms and built an api with it, cause why not 🤷‍♂️
(Source/Credits: https://dev.to/alaadesouky/i-scraped-social-media-platforms-and-built-an-api-with-it-cause-why-not-5gio)
First of all, I'm by no means a professional software engineer, so this won't be the cleanest code you'll see. I'm using this blog post to document my coding process, share my thoughts and the approaches I took to solve problems, and to get feedback on what I did right or wrong.
The inspiration for this project came from Wes Bos's Twitter and Instagram scraping project.
You can find the repo here: status-scraper
So, what does it do exactly?
It's an API that accepts a social media flag and a username, and returns the user's status (e.g. # of followers, following, posts, likes, etc.).
The endpoint is `/scrape/:flag/:username`, and currently the `:flag` can be any of the following:
- `t` => twitter.com
- `r` => reddit.com
- `g` => github.com
- `b` => behance.net
- `q` => quora.com
- `i` => instagram.com
So, a call to `https://statusscraperapi.herokuapp.com/scrape/t/mkbhd` would return the following response:
```json
{
  "user": "mkbhd",
  "status": {
    "twitterStatus": {
      "tweets": "45,691",
      "following": "339",
      "followers": "3,325,617",
      "likes": "25,255"
    }
  }
}
```
Tech used
- Node
- esm, an ECMAScript module loader
- Express
- Axios
- Cheerio
Server configuration
```javascript
// lib/server.js
import app from "./app";

const PORT = process.env.PORT || 3000;
app.listen(PORT, () => console.log(`Server running on port ${PORT}`));
```

```javascript
// lib/app.js
import express from "express";
import cors from "cors";
import helmet from "helmet";
import { Routes } from "./routes/router";

class App {
  constructor() {
    this.app = express();
    this.config();
    this.routePrv = new Routes().routes(this.app);
  }

  config() {
    this.app.use(cors());
    this.app.use(helmet());
  }
}

export default new App().app;
```
Project structure
The app has three modules:
Module 1 - Router:
```javascript
// lib/routes/router.js
// all routes have the same structure
export class Routes {
  routes(app) {
    ....
    // @route GET /scrape/g/:user
    // @desc log github user status
    app.get("/scrape/g/:user", async (req, res) => {
      const user = req.params.user;
      try {
        const githubStatus = await Counter.getGithubCount(
          `https://github.com/${user}`
        );
        res.status(200).send({ user, status: { githubStatus } });
      } catch (error) {
        res.status(404).send({
          message: "User not found"
        });
      }
    });
    ...
  }
}
```
Module 2 - Counter:
- Acts as a middleware between the route and the actual scraping.
- It gets the HTML page and passes it to the scraper module.

```javascript
// lib/scraper/counter.js
class Counter extends Scraper {
  ...
  // Get github count
  async getGithubCount(url) {
    const html = await this.getHTML(url);
    const githubCount = await this.getGithubStatus(html);
    return githubCount;
  }
  ...
}

export default new Counter();
```
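The base `Scraper` class (which `Counter` extends) isn't shown above, but since Axios is in the stack, its `getHTML` helper presumably looks something like this. A minimal sketch, assuming axios; the `User-Agent` header is my addition, since some sites serve different markup to obvious bots:

```javascript
// lib/scraper/scraper.js (sketch of the fetching part only)
import axios from "axios";

class Scraper {
  // Fetch a page and return its raw HTML string
  async getHTML(url) {
    const { data: html } = await axios.get(url, {
      // assumption: pretend to be a regular browser
      headers: { "User-Agent": "Mozilla/5.0" }
    });
    return html;
  }
}
```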
Module 3 - Scraper:
It's where all the work is done, and I'll be explaining each social network approach. Let's start.
Twitter
Twitter's response has multiple `<a>` elements that contain all the data we want, and each one looks like this:
```html
<a class="ProfileNav-stat ProfileNav-stat--link u-borderUserColor u-textCenter js-tooltip js-nav" title="70 Tweets" data-nav="tweets" tabindex=0>
  <span class="ProfileNav-label" aria-hidden="true">Tweets</span>
  <span class="u-hiddenVisually">Tweets, current page.</span>
  <span class="ProfileNav-value" data-count=70 data-is-compact="false">70</span>
</a>
```
The class `ProfileNav-stat--link` is unique to these elements. With cheerio, we can simply get all `<a>` elements with that class, loop through them, and extract the data from the `title` attribute. Now we have `"70 Tweets"`; just split it and store it as a key-value pair.
```javascript
// lib/scraper/scraper.js
// Get twitter status
async getTwitterStatus(html) {
  try {
    const $ = cheerio.load(html);
    let twitterStatus = {};
    $(".ProfileNav-stat--link").each((i, e) => {
      if (e.attribs.title !== undefined) {
        // "70 Tweets" -> { tweets: "70" }
        let data = e.attribs.title.split(" ");
        twitterStatus[data[1].toLowerCase()] = data[0];
      }
    });
    return twitterStatus;
  } catch (error) {
    return error;
  }
}
```
Reddit
Reddit's user page has a `<span id="profile--id-card--highlight-tooltip--karma">` on the right side with the user's total karma, so that's very easy to get. But when hovered over, it also displays post/comment karma.
The response has a `<script id="data">` that contains these two pieces of data nested inside an object:
```text
window.___r = {"accountManagerModalData":....
...."sidebar":{}}}; window.___prefetches = ["https://www....};
```
Just extract the `<script>` data and parse it into JSON. But we need to get rid of `window.___r = ` at the start, and of `; window.___prefetches....` and everything after it at the end.
This could be the laziest/worst thing ever :D
I split on `" = "`, counted the # of characters starting from that `;` (using a web app, of course), and sliced them out of the string. Now I have a pure object in a string.
```javascript
// lib/scraper/scraper.js
// Get reddit status
async getRedditStatus(html, user) {
  try {
    const $ = cheerio.load(html);
    const totalKarma = $("#profile--id-card--highlight-tooltip--karma").html();
    const dataInString = $("#data").html().split(" = ")[1];
    const pageObject = JSON.parse(dataInString.slice(0, dataInString.length - 22));
    const { commentKarma, postKarma } = pageObject.users.models[user];
    return { totalKarma, commentKarma, postKarma };
  } catch (error) {
    return error;
  }
}
```
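If you'd rather not hard-code that 22-character offset, a slightly sturdier variant (my suggestion, not from the original post) is to locate the object's boundaries instead of counting characters:

```javascript
// sketch: find the object's boundaries instead of counting characters by hand
const raw = $("#data").html();
const start = raw.indexOf(" = ") + 3;                  // skip past "window.___r = "
const end = raw.lastIndexOf("; window.___prefetches"); // noise starts here
const pageObject = JSON.parse(raw.slice(start, end));
```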
Linkedin
It responded with status code 999! Like, really, LinkedIn? I tried sending a customized request with the headers that work for everyone on Stack Overflow, but it didn't work for me. Does it have something to do with a `csrf-token`? I'm not really sure.
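For reference, the usual Stack Overflow workaround I tried looks roughly like this (a sketch; the exact headers and URL are illustrative, and it still got a 999 back):

```javascript
// sketch: the usual "pretend to be a browser" header trick (didn't get past the 999)
const { data } = await axios.get("https://www.linkedin.com/in/someuser", {
  headers: {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9"
  }
});
```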
Anyway, that was a dead end; moving on to the next one.
Github
This one was fairly easy. There are five `<span class="Counter">` elements that display the # of repositories, stars, etc. Loop through 'em to extract the data, and with Cheerio I can get each element's parent, which is an `<a>` that holds what these numbers represent. Store 'em as key-value pairs and we're ready to go.
```javascript
// lib/scraper/scraper.js
// Get github status
async getGithubStatus(html) {
  try {
    const $ = cheerio.load(html);
    const status = {};
    $(".Counter").each((i, e) => {
      // the parent <a>'s text says what the counter represents
      status[e.children[0].parent.prev.data.trim().toLowerCase()] = e.children[0].data.trim();
    });
    return status;
  } catch (error) {
    return error;
  }
}
```
Behance
Also an easy one: a `<script id="beconfig-store_state">` holds an object with all the required data. Parse it into JSON and extract the fields.
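The post moves on without showing this scraper, but following the same parse-the-embedded-object pattern as Reddit, it would look something like this. A sketch only; the key names inside the parsed object (`profile.stats` and its fields) are my guesses, not Behance's actual shape:

```javascript
// sketch of a Behance scraper (object shape below is an assumption)
async getBehanceStatus(html) {
  try {
    const $ = cheerio.load(html);
    // this script tag holds a plain JSON object, so no slicing is needed
    const pageObject = JSON.parse($("#beconfig-store_state").html());
    // hypothetical path to the profile stats inside the parsed object
    const { followers, following, appreciations } = pageObject.profile.stats;
    return { followers, following, appreciations };
  } catch (error) {
    return error;
  }
}
```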
Youtube - you broke my heart
Youtube's response is a huge mess: it has a bunch of `<script>` tags without any ids or classes. I wanted to get the channel's number of subscribers and total video views, both of which can be found in the About tab.
The desired `<script>` is similar to the Reddit one; I could use the same split, slice, parse routine and be done.
But these two simple numbers are nested like 12 levels deep within the object, and there are arrays involved. It's basically hell.
So, I wrote a little helper function that accepts the large JSON/object and the object key to be extracted, and it returns an array of all matches.

```javascript
// lib/_helpers/getNestedObjects.js
export function getNestedObjects(dataObj, objKey) {
  // initialize an empty array to store all matched results
  let results = [];
  getObjects(dataObj, objKey);

  function getObjects(dataObj, objKey) {
    // loop through the key-value pairs of the object/json
    Object.entries(dataObj).map(entry => {
      const [key, value] = entry;
      // check if the current key matches the required key
      if (key === objKey) {
        results = [...results, { [key]: value }];
      }
      // check if the current value is an object/array:
      // if it's an object, call the function again;
      // if it's an array, loop through it and recurse on every object inside
      if (Object.prototype.toString.call(value) === "[object Object]") {
        getObjects(value, objKey);
      } else if (Array.isArray(value)) {
        value.map(val => {
          if (Object.prototype.toString.call(val) === "[object Object]") {
            getObjects(val, objKey);
          }
        });
      }
    });
  }

  // return an array of all matches, or "No match" if none were found
  if (results.length === 0) {
    return "No match";
  } else {
    return results;
  }
}
```
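Here's a quick usage example (the input object and key are made up for illustration):

```javascript
// hypothetical input: the same key buried at two different depths
const data = {
  header: { stats: { subscriberCount: "1.2M" } },
  tabs: [{ about: { subscriberCount: "1.2M" } }]
};

console.log(getNestedObjects(data, "subscriberCount"));
// => [ { subscriberCount: "1.2M" }, { subscriberCount: "1.2M" } ]
```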
As much as I was thrilled that `getNestedObjects` actually works ([try it](https://github.com/AlaaDesouky/getNestedObjects)), it didn't last for long.
Somehow the received html didn't contain that