One day I needed a solution that could parse Open Graph meta tags from an input line and produce a title and an icon.
Of course, there are countless libraries for this built on jsoup, but that was not what I needed: I wanted to use Qt and C++.
I thought that as soon as I entered my query, "c++ parser meta tags", I would see all the solutions I was looking for.
But in reality, everything is a little more complicated.
What I did:
1) Preparation step: parse the input and decide whether it contains a valid URL or is just text.
This step looks expensive (not as expensive as trying to load everything from the input, but I still think it is extra work).
    // web_pattern is the URL regex pattern, defined elsewhere
    static bool checkIsContainsHyperlink(const QString &line) {
        // Compiled once and reused on every call
        static const QRegularExpression regex(web_pattern);
        return regex.match(line).hasMatch();
    }
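The value of web_pattern is not shown here. As a rough standalone sketch of the same check, the version below uses std::regex instead of QRegularExpression so that it builds without Qt. The pattern kWebPattern and the function name containsHyperlink are assumptions made for illustration, and the pattern is deliberately loose: it only looks for http:// or https:// followed by at least one non-space character.

```cpp
#include <regex>
#include <string>

// Hypothetical stand-in for web_pattern: "http://" or "https://"
// followed by at least one non-whitespace character.
static const char* const kWebPattern = R"re(https?://\S+)re";

// Returns true if the line contains something that looks like a hyperlink.
bool containsHyperlink(const std::string& line) {
    // Compile the regex once; constructing std::regex is comparatively costly.
    static const std::regex re(kWebPattern, std::regex::icase);
    return std::regex_search(line, re);
}
```

With this pattern, a line such as "see https://example.com/page" is treated as a link, while plain text or a bare "http://" with nothing after it is not.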
2) Download with the ability to handle redirects.
Many sites do not provide the tags on the page you request directly; they often redirect first, for reasons I don't know.
    connect(&m_WebCtrl, SIGNAL(finished(QNetworkReply*)), this, SLOT(fileDownloaded(QNetworkReply*)));
    QNetworkRequest request(url);
    // Follow redirects, except those that go from https to http
    request.setAttribute(QNetworkRequest::RedirectPolicyAttribute, QNetworkRequest::NoLessSafeRedirectPolicy);
    m_WebCtrl.get(request);
3) Saving the downloaded page. It may seem strange, and you are probably wondering why we save the page at all.
The problem is that some sites ban an IP address that makes a lot of requests.
For me it was enough to change 3-5 characters in the URL line and I got banned for a few minutes.
Caching downloaded pages solved this problem.
    // Save the downloaded preview image to disk and remember its local path
    connect(m_downloader_image, &FileDownloader::downloaded, this, [this, imagePathName]() {
        const QByteArray array = m_downloader_image->downloadedData();
        if (!array.isEmpty()) {
            QFile imageFile(imagePathName);
            if (imageFile.open(QIODevice::WriteOnly)) {
                imageFile.write(array);
                m_result.og_image_local_path = imagePathName;
            }
        }
        emit signalParserDone(m_result);
    });
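The snippet above stores the preview image; the page cache itself can follow the same idea. Below is a minimal sketch, not the code from this project: it uses only the C++17 standard library instead of Qt, and the names fnv1a, cachePathFor, loadCachedPage and storePage, as well as the naming scheme (a hash of the URL plus ".html"), are assumptions made for illustration.

```cpp
#include <cstdint>
#include <cstdio>
#include <filesystem>
#include <fstream>
#include <iterator>
#include <optional>
#include <string>
#include <system_error>

// FNV-1a 64-bit hash. Unlike std::hash, its result is the same on every run,
// so it can be used to build file names for an on-disk cache.
std::uint64_t fnv1a(const std::string& s) {
    std::uint64_t h = 14695981039346656037ULL;  // FNV offset basis
    for (unsigned char c : s) {
        h ^= c;
        h *= 1099511628211ULL;                  // FNV prime
    }
    return h;
}

// Maps a URL to a file inside cacheDir (hypothetical naming scheme).
std::filesystem::path cachePathFor(const std::filesystem::path& cacheDir,
                                   const std::string& url) {
    char name[32];
    std::snprintf(name, sizeof name, "%016llx.html",
                  static_cast<unsigned long long>(fnv1a(url)));
    return cacheDir / name;
}

// Returns the cached page for url, or std::nullopt if nothing is stored yet.
std::optional<std::string> loadCachedPage(const std::filesystem::path& cacheDir,
                                          const std::string& url) {
    std::ifstream in(cachePathFor(cacheDir, url), std::ios::binary);
    if (!in) return std::nullopt;
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Stores a downloaded page so later lookups for the same URL skip the network.
bool storePage(const std::filesystem::path& cacheDir, const std::string& url,
               const std::string& html) {
    std::error_code ec;
    std::filesystem::create_directories(cacheDir, ec);
    if (ec) return false;
    std::ofstream out(cachePathFor(cacheDir, url),
                      std::ios::binary | std::ios::trunc);
    out << html;
    out.flush();
    return static_cast<bool>(out);
}
```

The intended use is to call loadCachedPage before sending a request, and only when it returns nothing to download the page and pass it to storePage. This sketch never expires entries, so a real cache would probably also need some age limit.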
4) Parsing.
So we have the web page in a local folder; it's time to parse it and get what we need.
Unfortunately, gumbo-parser turned out to be very unfriendly for me.
So for a start I decided to use regex, hoping to replace it with something else in the future.
    // One regex per Open Graph property; each pattern captures the tag's content in group 1
    QRegularExpression site_name_regex(og_site_name);
    QRegularExpression title_regex(og_title);
    QRegularExpression description_regex(og_description);
    QRegularExpression url_regex(og_url);
    QRegularExpression image_regex(og_image);

    QRegularExpressionMatch match;

    match = site_name_regex.match(html);
    if (match.hasMatch()) {
        res.og_site_name = match.captured(1);
    }

    match = title_regex.match(html);
    if (match.hasMatch()) {
        res.og_title = match.captured(1);
    }

    match = description_regex.match(html);
    if (match.hasMatch()) {
        res.og_description = match.captured(1);
    }

    match = url_regex.match(html);
    if (match.hasMatch()) {
        res.og_url = match.captured(1);
    }

    match = image_regex.match(html);
    if (match.hasMatch()) {
        res.og_image = match.captured(1);
    }
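The patterns og_site_name, og_title and the others are not shown here. As an illustration of what such a pattern might look like, here is a small standalone sketch that uses std::regex instead of QRegularExpression so that it builds without Qt. The function name extractOg and the pattern itself are assumptions. It expects the property attribute to come before content and the property name to contain no regex metacharacters, so it is as fragile as any regex applied to HTML.

```cpp
#include <optional>
#include <regex>
#include <string>

// Returns the content of <meta property="og:NAME" content="...">, if present.
// Assumptions: `name` has no regex metacharacters (title, image, site_name, ...),
// and the property attribute is written before the content attribute.
std::optional<std::string> extractOg(const std::string& html,
                                     const std::string& name) {
    // Group 1 remembers the opening quote so the value ends at the same quote;
    // group 2 is the value itself.
    const std::regex re(
        std::string(R"re(<meta\s+property\s*=\s*["']og:)re") + name +
            R"re(["']\s+content\s*=\s*(["'])(.*?)\1)re",
        std::regex::icase);
    std::smatch m;
    if (std::regex_search(html, m, re)) {
        return m[2].str();
    }
    return std::nullopt;
}
```

Calling extractOg(html, "title") or extractOg(html, "image") then plays the role of one match block from the code above, and an empty result means the tag was not found.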
Finally, we can enter a URL and enjoy the preview and title.