First of all, I'm by no means a professional software engineer, so this won't be the cleanest code you'll see. I'm using this blog post to document my coding process and share my thoughts and the approaches I took to solve problems, and I'd also welcome feedback on what I did right or wrong.
The inspiration for this project came from Wes Bos's Twitter and Instagram scraping project.
You can find the repo here: status-scraper
So, what does it do exactly?
It's an API that accepts a social media flag and a username, and returns the user's status (e.g. number of followers, following, posts, likes, etc.).
The endpoint is /scrape/:flag/:username, and currently the :flag can be any of the following:
- t => twitter.com
- r => reddit.com
- g => github.com
- b => behance.net
- q => quora.com
- i => instagram.com
So, a call to https://statusscraperapi.herokuapp.com/scrape/t/mkbhd would return the following response:
{
  user: "mkbhd",
  status: {
    twitterStatus: {
      tweets: "45,691",
      following: "339",
      followers: "3,325,617",
      likes: "25,255"
    }
  }
}
Tech used
- Node
- esm, an ECMAScript module loader
- Express
- Axios
- Cheerio
Server configuration
// lib/server.js
import app from "./app";

const PORT = process.env.PORT || 3000;
app.listen(PORT, () => console.log(`Server running on port ${PORT}`));

// lib/app.js
import express from "express";
import cors from "cors";
import helmet from "helmet";
import { Routes } from "./routes/router";

class App {
  constructor() {
    this.app = express();
    this.config();
    this.routePrv = new Routes().routes(this.app);
  }
  config() {
    this.app.use(cors());
    this.app.use(helmet());
  }
}
export default new App().app;
Project structure
The app has three modules:
Module 1 - Router:
// lib/routes/router.js
// all routes have the same structure
export class Routes {
  routes(app) {
    // ...
    // @route GET /scrape/g/:user
    // @desc log github user status
    app.get("/scrape/g/:user", async (req, res) => {
      const user = req.params.user;
      try {
        const githubStatus = await Counter.getGithubCount(
          `https://github.com/${user}`
        );
        res.status(200).send({ user, status: { githubStatus } });
      } catch (error) {
        res.status(404).send({
          message: "User not found"
        });
      }
    });
    // ...
  }
}
Module 2 - Counter:
- Acts as a middleware between the route and the actual scraping.
- It fetches the HTML page and passes it to the scraper module.
// lib/scraper/counter.js
class Counter extends Scraper {
  // ...
  // Get github count
  async getGithubCount(url) {
    const html = await this.getHTML(url);
    const githubCount = await this.getGithubStatus(html);
    return githubCount;
  }
  // ...
}
export default new Counter();
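The Counter methods rely on a getHTML method inherited from the Scraper base class. The repo uses Axios for this (it's in the tech list); the sketch below is my own minimal take on the idea, using Node's built-in fetch so the snippet runs without dependencies, and the User-Agent header is an assumption, not necessarily what the repo does.

```javascript
// lib/scraper/scraper.js (base class) -- a minimal sketch, not the repo's
// exact code. The real project uses axios; Node's built-in fetch is used
// here so the snippet is dependency-free. The repo exports the class so
// Counter can extend it.
class Scraper {
  // Fetch a page and resolve with its raw HTML string.
  async getHTML(url) {
    const res = await fetch(url, {
      // a browser-like User-Agent helps avoid trivial bot blocking (assumption)
      headers: { "User-Agent": "Mozilla/5.0" }
    });
    if (!res.ok) throw new Error(`Request failed with status ${res.status}`);
    return res.text();
  }
}
```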
Module 3 - Scraper:
It's where all the work is done, and I'll be explaining each social network approach.
Let's start.
Twitter
Twitter's response has multiple <a> elements that contain all the data we want; each looks like this:
<a class="ProfileNav-stat ProfileNav-stat--link u-borderUserColor u-textCenter js-tooltip js-nav" title="70 Tweets" data-nav="tweets" tabindex=0>
<span class="ProfileNav-label" aria-hidden="true">Tweets</span>
<span class="u-hiddenVisually">Tweets, current page.</span>
<span class="ProfileNav-value" data-count=70 data-is-compact="false">70</span>
</a>
The class ProfileNav-stat--link is unique to these elements. With cheerio, we can simply select all <a> elements with that class, loop through them, and extract the data from the title attribute. Now we have "70 Tweets"; just split it and store it as a key-value pair.
// lib/scraper/scraper.js
// Get twitter status
async getTwitterStatus(html) {
  try {
    const $ = cheerio.load(html);
    let twitterStatus = {};
    $(".ProfileNav-stat--link").each((i, e) => {
      if (e.attribs.title !== undefined) {
        // e.g. "70 Tweets" -> { tweets: "70" }
        let data = e.attribs.title.split(" ");
        twitterStatus[data[1].toLowerCase()] = data[0];
      }
    });
    return twitterStatus;
  } catch (error) {
    return error;
  }
}
Reddit
The Reddit user page has a <span id="profile--id-card--highlight-tooltip--karma"> on the right side with the user's total karma, so it's very easy to get. When hovered over, though, it displays post/comment karma separately.
The response has a <script id="data"> that contains these two pieces of data nested inside an object:
window.___r = {"accountManagerModalData":....
...."sidebar":{}}}; window.___prefetches = ["https://www....};
Just extract the <script> data and parse it into JSON. But first we need to get rid of window.___r = at the start, and of ; window.___prefetches.... at the end and everything after it.
This could be the laziest/worst thing ever :D
I split the string on " = ", counted the number of characters starting from that ; (using a web app, of course), and sliced them out of the string. Now I have a pure object in a string.
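As a toy illustration of that split-and-slice trick, with made-up data and using indexOf instead of a hand-counted character offset:

```javascript
// A toy illustration (made-up data) of the split/slice/parse trick used
// on the contents of Reddit's <script id="data">.
const scriptText =
  'window.___r = {"users":{"models":{"someUser":{"commentKarma":12,"postKarma":34}}}}; window.___prefetches = ["https://www"];';

// Drop everything before " = " ...
const afterAssignment = scriptText.split(" = ")[1];
// ...then cut off the trailing "; window.___prefetches = [...]" tail,
// leaving a pure JSON string.
const jsonString = afterAssignment.slice(
  0,
  afterAssignment.indexOf("}; window.") + 1
);
const pageObject = JSON.parse(jsonString);

console.log(pageObject.users.models.someUser.postKarma); // 34
```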
// lib/scraper/scraper.js
// Get reddit status
async getRedditStatus(html, user) {
  try {
    const $ = cheerio.load(html);
    const totalKarma = $("#profile--id-card--highlight-tooltip--karma").html();
    // everything after " = " is the object, plus a trailing tail
    // (counted manually) that gets sliced off
    const dataInString = $("#data").html().split(" = ")[1];
    const pageObject = JSON.parse(dataInString.slice(0, dataInString.length - 22));
    const { commentKarma, postKarma } = pageObject.users.models[user];
    return { totalKarma, commentKarma, postKarma };
  } catch (error) {
    return error;
  }
}
Linkedin
It responded with status code 999! Like, really, LinkedIn?
I tried sending a customized request header that had worked for everyone on Stack Overflow, but it didn't work for me. Does it have something to do with the csrf-token? I'm not really sure.
Anyway, that was a dead end; moving on to the next one.
Github
This one was fairly easy. There are five <span class="Counter"> elements that display the number of repositories, stars, etc. Loop through them to extract the data, and with Cheerio I can get each element's parent, which is an <a> describing what the number represents. Store them as key-value pairs and we're ready to go.
// lib/scraper/scraper.js
// Get github status
async getGithubStatus(html) {
  try {
    const $ = cheerio.load(html);
    const status = {};
    $(".Counter").each((i, e) => {
      // the text node before the counter (inside the parent <a>) is the label
      status[e.children[0].parent.prev.data.trim().toLowerCase()] =
        e.children[0].data.trim();
    });
    return status;
  } catch (error) {
    return error;
  }
}
Behance
Also an easy one: there's a <script id="beconfig-store_state"> that has an object with all the required data. Parse it into JSON and extract them.
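A sketch of that step, with made-up markup and field names (the real keys in Behance's state object will differ; with cheerio the script body would come from $("#beconfig-store_state").html(), hard-coded here so the snippet runs standalone):

```javascript
// The page embeds <script id="beconfig-store_state"> whose body is plain
// JSON. The object shape below is invented for illustration only.
const scriptBody =
  '{"profile":{"owner":{"stats":{"followers":250,"appreciations":1200}}}}';

// No slicing needed here -- the script body is already valid JSON.
const state = JSON.parse(scriptBody);
const { followers, appreciations } = state.profile.owner.stats;

console.log({ followers, appreciations }); // { followers: 250, appreciations: 1200 }
```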
Youtube - you broke my heart
Youtube's response is a huge mess: it has a bunch of <script> tags without any ids or classes. I wanted to get the channel's number of subscribers and total video views, both of which can be found in the About tab.
The desired <script> is similar to the Reddit one; I could use the same split, slice, parse routine and be done.
But these two simple numbers are nested about 12 levels deep within the object, and there are arrays involved; it's basically hell.
So, I wrote a little helper function that accepts the large JSON/object and the object key to be extracted, and it returns an array of all matches.
// lib/_helpers/getNestedObjects.js
export function getNestedObjects(dataObj, objKey) {
  // initialize an empty array to store all matched results
  let results = [];
  getObjects(dataObj, objKey);
  function getObjects(dataObj, objKey) {
    // loop through the key-value pairs of the object.
    Object.entries(dataObj).forEach(entry => {
      const [key, value] = entry;
      // check if the current key matches the required key.
      if (key === objKey) {
        results = [...results, { [key]: value }];
      }
      // if the current value is an object, call the function again.
      // if it's an array, loop through it and recurse into each object element.
      if (Object.prototype.toString.call(value) === "[object Object]") {
        getObjects(value, objKey);
      } else if (Array.isArray(value)) {
        value.forEach(val => {
          if (Object.prototype.toString.call(val) === "[object Object]") {
            getObjects(val, objKey);
          }
        });
      }
    });
  }
  // return an array of all matches, or "No match"
  if (results.length === 0) {
    return "No match";
  }
  return results;
}
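For example (repeating the helper in compact form so the snippet runs on its own; the sample object is made up):

```javascript
// Compact restatement of getNestedObjects, plus a made-up sample object.
function getNestedObjects(dataObj, objKey) {
  let results = [];
  (function walk(obj) {
    Object.entries(obj).forEach(([key, value]) => {
      if (key === objKey) results.push({ [key]: value });
      if (Object.prototype.toString.call(value) === "[object Object]") {
        walk(value);
      } else if (Array.isArray(value)) {
        value.forEach(v => {
          if (Object.prototype.toString.call(v) === "[object Object]") walk(v);
        });
      }
    });
  })(dataObj);
  return results.length === 0 ? "No match" : results;
}

const sample = {
  graphql: {
    user: {
      edge_followed_by: { count: 100 },
      media: [{ edge_liked_by: { count: 7 } }]
    }
  }
};

// Collects every "count" key, however deeply nested.
console.log(getNestedObjects(sample, "count"));
// → [ { count: 100 }, { count: 7 } ]
```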
As thrilled as I was that getNestedObjects actually works (try it!), it didn't last long. Somehow the received HTML didn't contain that <script>, and I have no idea why. I checked whether the numbers were in the response at all, but that was another dead end.
Thanks, Youtube.
Quora
The response has multiple <span class="list_count"> elements, and the approach is exactly the same as for Github.
Instagram
The response literally has a problem from each one above:
- ✅ Multiple <script> tags with the same type="text/javascript"
- ✅ split, slice, parse
- ✅ The numbers are nested very deep within the object
// lib/scraper/scraper.js
// Get instagram status
async getInstagramStatus(html) {
  try {
    const $ = cheerio.load(html);
    // get the script containing the data
    let script;
    $('script[type="text/javascript"]').each((i, e) => {
      if (
        e.children[0] !== undefined &&
        e.children[0].data.includes("window._sharedData =")
      ) {
        return (script = e.children[0].data);
      }
    });
    // get the json-format string
    const dataInString = script.split(" = ")[1];
    // convert to a json object (dropping the trailing ";")
    const pageObject = JSON.parse(dataInString.slice(0, dataInString.length - 1));
    // extract the objects with the status
    const [{ edge_followed_by }] = getNestedObjects(pageObject, "edge_followed_by");
    const [{ edge_follow }] = getNestedObjects(pageObject, "edge_follow");
    const [{ edge_owner_to_timeline_media }] = getNestedObjects(pageObject, "edge_owner_to_timeline_media");
    return {
      followers: edge_followed_by.count,
      following: edge_follow.count,
      posts: edge_owner_to_timeline_media.count
    };
  } catch (error) {
    return error;
  }
}
At least I got to use the helper.
Wrapping up
This was a cool project to build and I learned a lot of stuff making it.
I've also created a frontend app with React and Next that interacts with the API; you can view it here: Status Logger. Maybe I'll write a blog post about it later.
In the meantime, feel free to share your opinion about it, good or bad. Also, let me know if there are any other social media networks worth scraping.