An interesting Nginx problem I ran into on a freelance job: access to a site had to be restricted so that users could only reach it from search engines. Visits from messengers, from other sites, and direct visits had to be blocked, while search robots had to be allowed to index the site without any obstacles (direct access, following links from other sites, etc., i.e., to work without restrictions).
The server ran Ubuntu with Nginx as the web server, and the site itself ran on WordPress, so nothing unusual. The restriction had to be implemented at the web server level: using only Nginx tools, determine where the visitor came from and whether it is a user or a search bot, and then apply the appropriate directive, either granting access to the site or sending the visitor to a stub page.
The changes were made in the site's own Nginx configuration file, e.g. mysite.conf, for a single target host.
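For context, here is a minimal sketch of where these checks end up in the configuration (the file path, server_name and root below are made up for illustration; the actual rules are shown further down):

# /etc/nginx/conf.d/mysite.conf (hypothetical path and names)
server {
    listen 80;
    server_name mysite.com;
    root /var/www/mysite;

    # the referrer / user-agent checks described below go here,
    # at the server level, before the location blocks

    location / {
        try_files $uri $uri/ /index.php?$args;  # usual WordPress routing
    }

    # PHP handling (fastcgi_pass etc.) omitted for brevity
}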
To determine more reliably who is who and where they came from, I used the $http_referer and $http_user_agent variables, regular expressions, the standard list of bot user agents, a custom error page, and rewrite. Several conditions were added for each of these checks; the result is about 12 lines of code (which could still be trimmed a little).
The logic of the nginx rules was as follows:
- If the user came to the site directly or NOT from a search engine, assign them the marker "a"
- If the visitor is not a search bot, assign the marker "b"
- Check for the presence of both markers; if both are present, it was a user who came directly or from another site, for example, so we send them to the stub.

Now let's look at the rules in detail.
The first condition:
if ($http_referer ~* "^$|^((?!(google|yandex|bing|yahoo|mail|duckduckgo)\.[a-z]+).)*$") {
    set $marker a;
}
What is going on here? It's simple: we ask Nginx to check each visitor's referrer, i.e. where they came from. The "^$" part checks whether the referrer is empty; this is how direct visits to the site are detected. Next, we check from the beginning of the string that the user did NOT come from Google, Yandex or the other listed search engines (in the regular expression, the "?!" construction is a negative lookahead, i.e. "not followed by").
I decided to list only the domain names, without their endings, since the ending can differ between different segments of the Internet, so the more universal construction "\.[a-z]+" was used: "[a-z]+" means lowercase letters, one or more of them (from 1 up to... many). The unescaped dot that follows the lookahead group means that absolutely anything can come next. A little earlier there is one more dot, escaped with the "\" character so that it is treated as a literal dot and not as "anything".
If this condition matches, we set the $marker variable to "a" and move on.
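As an aside, the same check could be written without the negative lookahead by simply negating the whole match. This is only a sketch, but it should behave the same way for this list of engines, since an empty referrer also fails to match the pattern:

if ($http_referer !~* "(google|yandex|bing|yahoo|mail|duckduckgo)\.[a-z]+") {
    set $marker a;
}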
The second condition:
Everything here is much simpler and more modest: we need to tell a bot from a user by its user agent. Bot user agents are standard and do not change, which is why they were chosen for the check (besides, there are fewer of them). The condition says that if the $http_user_agent value does not match the mask for Google, Yandex, etc., then it is not a bot.
if ($http_user_agent !~* "googlebot|yandex|yahoo|bing") {
    set $marker "${marker}b";
}
The exclamation mark at the very beginning of the operator means we are checking for a non-match, the asterisk makes the comparison case-insensitive, and then the user agents of the various search engines are listed.
If this condition matches, we add the second value to our $marker variable. The "${marker}b" construction is used so that the result of the first check is not lost; we simply append the result of the second condition to it.
And, of course, after all these checks, it only remains to verify the result with a simple comparison:
if ($marker = ab) {
    return 403;
}
If, after all these manipulations, the variable equals "ab", then everything worked correctly: we have identified a user who came to the site directly or not from a search engine, and special measures need to be applied to them, i.e. they are sent to the stub.
What exactly to do here is up to the administrator: you can return any error code, for example 403, or configure a redirect to a custom page.
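For example, a sketch of the custom page variant could look like this (the /blocked.html name and the root path are hypothetical; these directives sit at the server level):

# show a static stub page instead of the bare 403 response
error_page 403 /blocked.html;

if ($marker = ab) {
    return 403;
}

location = /blocked.html {
    internal;              # reachable only via error_page, not by a direct request
    root /var/www/mysite;  # assumed document root
}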
In my case, the complete solution was as follows:
if ($http_referer ~* "^$|^((?!(google|yandex|bing|yahoo|mail|duckduckgo|website)\.[a-z]+).)*$") {
    set $bot a;
}
if ($http_user_agent !~* "googlebot|yandex|yahoo|bing") {
    set $bot "${bot}b";
}
if ($bot = ab) {
    rewrite ^/(.*)$ https://website.com/index.html permanent;
}
Here "website" is the customer's own site, added so that users can move freely within the site itself.
Top comments (5)
Comment: Using "if"s in nginx is maybe not the best idea... (source)

Author's reply: Possibly yes, problems may appear, but I used this approach on a live project. Before that I tested it on a pre-prod environment, of course, and in the end it worked well, without any issues (in my case).

Comment: It's not like having a single "if" in your nginx.conf will instantly segfault your webapp, server and possibly your fridge too. But the fact that it's so weird and unstable may have a negative effect on maintainability.

Comment: How do you handle additional referrers? Right now it looks like only "google | yandex | bing | yahoo | mail | duckduckgo" are allowed. So each new search engine has to be added and nginx reloaded?

Author's reply: Yep, right, this is a simple "if" condition. Also a "reload" will not take much time, as it is not a full service restart :)
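For those worried about "if is evil", here is a rough sketch of the same logic built on map instead of the marker variable (shown only to illustrate the idea; the map blocks go into the http {} context):

# 1 = the referrer contains one of the allowed search engines, 0 = it does not
map $http_referer $from_search {
    default                                           0;
    "~*(google|yandex|bing|yahoo|mail|duckduckgo)\."  1;
}

# 1 = the user agent looks like a search bot, 0 = a regular user
map $http_user_agent $is_bot {
    default                           0;
    "~*(googlebot|yandex|yahoo|bing)" 1;
}

server {
    # listen, server_name, root, etc. go here as usual

    set $block "${from_search}${is_bot}";

    # block only real users (not bots) who did not come from a search engine
    if ($block = "00") {
        return 403;
    }
}

Adding a new search engine then means editing one regex in the map and reloading Nginx, as mentioned in the comments.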