Links move too fast... Your online README.md, link directories, blog posts and so on quickly end up pointing to dead resources.
Like my awesome-like Perl README.md, which contains hundreds of links (go check it out, it is cool!).
My solution is to check periodically that the links are still up!
Basic version
For this very first version, I will take links from a file or a | (pipe), and I will use LWP::Simple.
#!/usr/bin/env perl
use LWP::Simple;
$| = 1; # Autoflush STDOUT so "Checking..." shows up before the request completes
while(<>) {
    chomp; # Remove the trailing newline
    my $link = $_;
    print "Checking [$link]...";
    my $content = get($link);
    if(! defined $content) {
        print " BROKEN !\n";
    } else {
        print " OK\n";
    }
}
I use it with a list of links, in a links.txt file for instance:
http://cpantesters.org
https://img.shields.io/badge/Language-Perl-blue
https://www.perltutorial.org/
http://cpancover.com
And I run it like this:
$ cat links.txt | perl checklinks.pl
# OR
$ perl checklinks.pl links.txt
This is the magic of <>!
It produces an output like the following:
Checking [http://cpantesters.org]... OK
Checking [https://img.shields.io/badge/Language-Perl-blue]... BROKEN !
Checking [https://www.perltutorial.org/]... OK
Checking [http://cpancover.com]... BROKEN !
What ? We have some broken links ?
But the shields.io badge and cpancover.com are actually not down...
And the LWP::Simple documentation clearly states:
"If you need more control or access to the header fields in the requests sent and responses received, then you should use the full object-oriented interface provided by the LWP::UserAgent module."
So... let's go for LWP::UserAgent!
LWP::UserAgent
Here is my new version, based on LWP::UserAgent:
#!/usr/bin/env perl
use LWP::UserAgent ();
my $ua = LWP::UserAgent->new(timeout => 10);
$| = 1;
while(<>) {
    chomp;
    my $link = $_;
    print "Checking [$link]...";
    my $res = $ua->get($link);
    if(! $res->is_success) {
        print " BROKEN !\n";
    } else {
        print " OK\n";
    }
}
I then run it like this :
echo "https://img.shields.io/badge/Language-Perl-blue" | perl checklinks.pl
The shields.io badge is still up for humans but broken for LWP:
Checking [https://img.shields.io/badge/Language-Perl-blue]... BROKEN !
I need to see the status code...
403 forbidden !
To see the status code, I can print $res->status_line:
print " BROKEN ! --> " . $res->status_line . "\n";
And the conclusion is terrible:
BROKEN ! --> 403 Forbidden
A 403 Forbidden means the server is working fine but refuses to serve us, because it detected something it does not like.
Maybe our user agent? By default, LWP::UserAgent announces itself as libwww-perl/<version>.
Adding $ua->agent('Mozilla/5.0'); like in the LWP::UserAgent CPAN doc effectively fixed the problem:
Checking [https://img.shields.io/badge/Language-Perl-blue]... OK
We fixed one problem, but we still have some others.
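To make the diagnosis above easier to reproduce, here is a minimal offline sketch. It prints the agent string LWP::UserAgent uses before and after the override, and it formats result lines from locally built HTTP::Response objects instead of real network answers. The report helper and the example.com URLs are hypothetical, introduced only for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use LWP::UserAgent ();
use HTTP::Response ();

# Unless told otherwise, LWP::UserAgent announces itself as libwww-perl/<version>.
my $ua = LWP::UserAgent->new(timeout => 10);
print "Default agent: ", $ua->agent, "\n";

# Override it, as done in the article.
$ua->agent('Mozilla/5.0');
print "Custom agent:  ", $ua->agent, "\n";

# Hypothetical helper: format one result line from a response object.
sub report {
    my ($link, $res) = @_;
    return "Checking [$link]... OK" if $res->is_success;
    return "Checking [$link]... BROKEN ! --> " . $res->status_line;
}

# Locally built responses stand in for real network answers (no request is sent).
print report('https://example.com/a', HTTP::Response->new(200, 'OK')), "\n";
print report('https://example.com/b', HTTP::Response->new(403, 'Forbidden')), "\n";
```

Keeping the formatting in a small function makes it easy to print the status line for every broken link, not only while debugging.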
406 Not Acceptable
The Perl Tutorial website was OK with the LWP::Simple version but is now BROKEN with a strange status:
Checking [https://www.perltutorial.org/]... BROKEN ! --> 406 Not Acceptable
"Not Acceptable" is supposed to signal a mismatch between what the client accepts (the "Accept" headers) and what the server can provide.
My Firefox browser sends its own set of Accept headers, and I can try to emulate them with push_header like this:
$ua->default_headers->push_header('Accept' => "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
$ua->default_headers->push_header('Accept-Encoding' => "gzip, deflate, br");
$ua->default_headers->push_header('Accept-Language' => "en-US,en;q=0.5");
But that is not the problem here. The problem is that I use a bad agent name ("Mozilla/5.0", taken as-is from the LWP::UserAgent doc).
My feeling is that "Mozilla/5.0" is not an empty agent name, but it is probably too old and looks too much like "a bot with a name".
This change fixes the problem:
$ua->agent('Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/82.0');
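Putting the agent name and the Accept headers together, a minimal sketch of the configuration could look like the following. It uses default_header rather than push_header, on the assumption that replacing a value is safer than appending one if the setup code ever runs twice. Nothing is sent over the network here; the script only prints what the agent is configured to send.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use LWP::UserAgent ();

my $ua = LWP::UserAgent->new(timeout => 10);

# Browser-like agent string, copied from the article.
$ua->agent('Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/82.0');

# default_header sets (and replaces) a header sent with every request;
# push_header would append another value instead.
$ua->default_header('Accept'          => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8');
$ua->default_header('Accept-Encoding' => 'gzip, deflate, br');
$ua->default_header('Accept-Language' => 'en-US,en;q=0.5');

# Inspect the headers that will accompany each request.
print $ua->default_headers->as_string;
```

Printing the default headers is a quick way to compare what the script sends with what the browser's network inspector shows.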
500 Can't connect to ... (certificate verify failed)
One more problem is related to certificate verification.
If you visit builtinperl.com you will get the usual certificate warning.
We can force the visit when using Firefox, but when using my script:
$ echo "http://builtinperl.com" | perl checklinks.pl
It fails hard:
Checking [http://builtinperl.com]... BROKEN ! --> 500 Can't connect to builtinperl.com:443 (certificate verify failed)
But once again, you can tweak LWP::UserAgent to fix this:
use IO::Socket::SSL qw( SSL_VERIFY_NONE );
$ENV{PERL_LWP_SSL_VERIFY_HOSTNAME} = 0;
# And later
# ...
$ua->ssl_opts(SSL_verify_mode => SSL_VERIFY_NONE);
verify_mode => 0 was supposed to do the same as $ENV{PERL_LWP_SSL_VERIFY_HOSTNAME} = 0; but it did not work. If someone knows why, please comment.
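As a possible alternative, here is a sketch that passes the SSL options directly to the constructor. It assumes the key names documented for ssl_opts, verify_hostname (LWP's own option) and SSL_verify_mode (handed to IO::Socket::SSL). As far as I can tell, PERL_LWP_SSL_VERIFY_HOSTNAME is only used to initialize verify_hostname when the object is created, so setting it afterwards would have no effect. The script only reads the options back, it does not connect anywhere.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use LWP::UserAgent ();

# WARNING: this disables certificate checks. Acceptable for a link checker
# that only wants a status code, not for anything handling sensitive data.
my $ua = LWP::UserAgent->new(
    timeout  => 10,
    ssl_opts => {
        verify_hostname => 0,   # LWP's own option
        SSL_verify_mode => 0,   # passed to IO::Socket::SSL; 0 is SSL_VERIFY_NONE
    },
);

# Read the options back to confirm what the agent will use.
print "verify_hostname = ", $ua->ssl_opts('verify_hostname'), "\n";
print "SSL_verify_mode = ", $ua->ssl_opts('SSL_verify_mode'), "\n";
```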
500 read timeout
Shit happens, even to the best of us (CPANTesters).
You can increase the timeout.
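A minimal sketch of raising the timeout, plus a hypothetical looks_like_timeout helper to tell a slow site from a really broken one. The helper matches on the status message text ("read timeout"), which is a heuristic and an assumption on my side, not something LWP guarantees. The responses are built locally, so nothing is sent.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use LWP::UserAgent ();
use HTTP::Response ();

my $ua = LWP::UserAgent->new(timeout => 10);
print "Timeout before: ", $ua->timeout, "s\n";

# Give slow servers more time (30 is an arbitrary value).
$ua->timeout(30);
print "Timeout after:  ", $ua->timeout, "s\n";

# Hypothetical helper: LWP reports its own failures as 500 responses,
# so a timeout shows up as "500 read timeout".
sub looks_like_timeout {
    my ($res) = @_;
    return ($res->code == 500 && $res->message =~ /timeout/i) ? 1 : 0;
}

# Locally built responses stand in for real ones.
my $slow   = HTTP::Response->new(500, 'read timeout');
my $broken = HTTP::Response->new(404, 'Not Found');
print "slow site:   ", (looks_like_timeout($slow)   ? "timeout" : "other failure"), "\n";
print "broken site: ", (looks_like_timeout($broken) ? "timeout" : "other failure"), "\n";
```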
405 Method Not Allowed
I usually use the HEAD method since I don't care about the content, only about the page status. But some links (e.g. CGI) won't answer HEAD requests!
For instance qntm.org answers "405 Method Not Allowed" for HEAD requests, and it's annoying:
$ curl --head https://qntm.org/files/perl/perl.html
HTTP/1.1 405 Method Not Allowed
Date: Mon, 08 Feb 2021 09:27:56 GMT
Server: Apache/2.4.38 (Debian)
Vary: User-Agent
Content-Type: text/html; charset=UTF-8
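One way to live with such servers is to try HEAD first and fall back to GET only when the method itself is refused. The sketch below is one possible approach, not necessarily what the final script does. The check_link helper is hypothetical, and the script takes links as command-line arguments (an assumption made to keep the example short) instead of reading them from <>.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use LWP::UserAgent ();

# Hypothetical helper: try HEAD first; if the server rejects the method
# itself, retry with GET. $ua can be any object offering head() and get().
sub check_link {
    my ($ua, $link) = @_;
    my $res = $ua->head($link);
    # 405 Method Not Allowed / 501 Not Implemented: HEAD is refused, not the page.
    if ($res->code == 405 || $res->code == 501) {
        $res = $ua->get($link);
    }
    return $res;
}

my $ua = LWP::UserAgent->new(timeout => 10);
$| = 1;
for my $link (@ARGV) {
    print "Checking [$link]...";
    my $res    = check_link($ua, $link);
    my $status = $res->is_success ? ' OK' : ' BROKEN ! --> ' . $res->status_line;
    print "$status\n";
}
```

It could be run like this: perl checklinks.pl https://qntm.org/files/perl/perl.html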
Pimp my output
I just added some salt to my script to make it nicer.
Unicode characters:
use open ':std', ':encoding(UTF-8)';
See this StackOverflow thread to understand why this line is needed.
And colors in terminal:
use Term::ANSIColor;
And later:
print color('red') . " \x{2717}" . color('reset') . " --> " . $res->status_line . "\n";
It does not change the output much, but it makes it clearer and nicer.
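Here is a small self-contained sketch of the same idea, with a green check mark for working links and a red cross for broken ones. The status_mark helper and the example.com URLs are hypothetical. It uses colored() from Term::ANSIColor, which wraps a string in the color codes and resets them afterwards.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use open ':std', ':encoding(UTF-8)';   # so the Unicode marks print cleanly
use Term::ANSIColor qw(colored);

# Hypothetical helper: green check mark for success, red cross for failure.
sub status_mark {
    my ($ok) = @_;
    return $ok ? colored("\x{2713}", 'green') : colored("\x{2717}", 'red');
}

print "Checking [https://example.com/ok]... ", status_mark(1), "\n";
print "Checking [https://example.com/ko]... ", status_mark(0), " --> 403 Forbidden\n";
```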
Conclusion
There is more to say here.
For instance, Mojolicious provides a very good framework for the same kind of tasks (it could be perceived as a "more modern" approach).
It is also worth trying to be kind to websites when possible (use the HEAD verb, announce yourself as a bot, do not crawl too often...).
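As a rough sketch of what "being kind" could look like: HEAD requests, an agent name that announces a bot, and a pause between requests. The bot name checklinks-bot/0.1 and the one-second delay are arbitrary assumptions. The trailing space in the agent string relies on the documented LWP::UserAgent behavior of appending libwww-perl/<version> to an agent name that ends with a space. Links are taken from the command line here to keep the example short, and, as seen above, some servers may still refuse such an agent.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use LWP::UserAgent ();

my $ua = LWP::UserAgent->new(timeout => 10);

# Hypothetical bot name; the trailing space makes LWP append libwww-perl/<version>.
$ua->agent('checklinks-bot/0.1 ');
print "Announcing as: ", $ua->agent, "\n";

my $delay = 1;   # seconds to wait between requests (arbitrary)
for my $link (@ARGV) {
    my $res = $ua->head($link);   # HEAD: only the status is needed, not the body
    print "$link: ", $res->status_line, "\n";
    sleep $delay;
}
```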
EDIT1: This blog post has a sequel, see Check markdown links with github action
EDIT2: This blog has another sequel, see Check links with HTTP::Simple