Discussion:
User-Agent: Mozilla/5.0 for a search engine bot??
Ivan Shmakov
2014-03-20 09:43:34 UTC
[The Apache-specific question is at the end of this posting.]

I wonder: since when did it become a good idea for a major search
engine [1] to use User-Agent: strings like this?

180.76.5.80 - - [20/Mar/2014:08:42:50 +0000] "GET /[...] HTTP/1.1" 304 152 "-"
"Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2"

Until they fix this issue, I've decided to block access to one
of my servers from the respective network:

SetEnvIfNoCase User-Agent (bots?|ezooms|crawler|spider)\b bot_detected
<Directory /var/www/>
    [...]
    # deny the whole network, but let self-identified bots
    # (matched above) back in
    Order deny,allow
    # 2001:db8::f00 is an IP I use to test blocks
    Deny from 2001:db8::f00
    Deny from 180.76.0.0/16
    Allow from env=bot_detected
</Directory>

What makes me curious, however, is whether I can serve a specific
(as in: more detailed) 403 error message (or document) just for
this case. (Alas, I see no way to apply an ErrorDocument [2] based
on the source IP address.)

TIA.

[1] https://en.wikipedia.org/wiki/Baidu
[2] https://httpd.apache.org/docs/2.2/mod/core.html#errordocument
--
FSF associate member #7257
Eli the Bearded
2014-03-20 18:38:40 UTC
Post by Ivan Shmakov
I wonder: since when did it become a good idea for a major search
engine [1] to use User-Agent: strings like this?
Ever since site owners decided to send different content to users based
on user agent. I've seen things like this:

From: googlebot(at)googlebot.com
User-Agent: SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)

From: googlebot(at)googlebot.com
User-Agent: DoCoMo/2.0 N905i(c100;TB;W24H16) (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)

From: googlebot(at)googlebot.com
User-Agent: Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)


In each case the "From:" header makes it clear this is a bot, and
the UA has bot aspects, but the UA also clearly attempts to
trigger user-agent-specific responses.
Post by Ivan Shmakov
What makes me curious, however, is whether I can serve a specific
(as in: more detailed) 403 error message (or document) just for
this case. (Alas, I see no way to apply an ErrorDocument [2] based
on the source IP address.)
I'd think it unlikely that any human will read the error message, so
don't put a lot of effort into it. You can use SetEnvIf and then use
a CGI or PHP script (or even mod_rewrite) to make sophisticated
ErrorDocuments, but why?
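
If you did want one, a minimal sketch (the script path and the
baidu_net variable name are made up): mark the offending network
with SetEnvIf and point ErrorDocument at a CGI. The error script
gets REMOTE_ADDR in its environment as usual, and, if I recall
correctly, variables set on the original request come back to it
prefixed with REDIRECT_, so it can branch on either:

# in addition to the Deny rules already shown
SetEnvIf Remote_Addr ^180\.76\. baidu_net
# 403.cgi can inspect REMOTE_ADDR (or REDIRECT_baidu_net) and
# print a network-specific message
ErrorDocument 403 /cgi-bin/403.cgi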

Elijah
------
avoids mod_rewrite due to frequent security issues
Ivan Shmakov
2014-03-21 07:01:38 UTC
Post by Eli the Bearded
Post by Ivan Shmakov
I wonder: since when did it become a good idea for a major search
engine [1] to use User-Agent: strings like this?
180.76.5.80 - - [20/Mar/2014:08:42:50 +0000] "GET /[...] HTTP/1.1"
304 152 "-" "Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101
Firefox/6.0.2"
Ever since site owners decided to send different content to users
based on user agent. I've seen things like this:
From: googlebot(at)googlebot.com
User-Agent: SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0
Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0
(compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)
[...]
Post by Eli the Bearded
In each case the "From:" header makes it clear this is a bot, and
the UA has bot aspects, but the UA also clearly attempts to
trigger user-agent-specific responses.
Well, that's the point: the User-Agent: they use contains nothing
to suggest that it's a bot. (I haven't looked at From:, as my
access.log doesn't track it. But then, wasn't From: omitted from
the HTTP/1.1 RFC?)
Post by Eli the Bearded
Post by Ivan Shmakov
What makes me curious, however, is whether I can serve a specific
(as in: more detailed) 403 error message (or document) just for
this case. (Alas, I see no way to apply an ErrorDocument [2] based
on the source IP address.)
I'd think it unlikely that any human will read the error message, so
don't put a lot of effort into it. You can use SetEnvIf and then use
a CGI or PHP script (or even mod_rewrite) to make sophisticated
ErrorDocuments, but why?
So as not to be utterly surprised should I ever, for any reason,
stumble on this behavior months (or years) later.

But I guess I'd rather give up on this right now.
Post by Eli the Bearded
Elijah ------ avoids mod_rewrite due to frequent security issues
I hope that the Debian security team takes care of this for me.
--
FSF associate member #7257
Eli the Bearded
2014-03-21 18:52:51 UTC
Post by Ivan Shmakov
Well, that's the point: the User-Agent: they use contains nothing
to suggest that it's a bot. (I haven't looked at From:, as my
access.log doesn't track it. But then, wasn't From: omitted from
the HTTP/1.1 RFC?)
Some people believe (rightly or wrongly) their bots are special.
And if a bot obeys the wildcard rules in robots.txt, I don't
really care.
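
By the wildcard rules I mean the User-agent: * record; something
like this (the path is made up) is what I expect even a "special"
bot to honor:

# hypothetical robots.txt; the * record applies to any robot
# without a more specific User-agent match
User-agent: *
Disallow: /private/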

Guidelines for the From: header in HTTP/1.1:

http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.22

I don't normally log From:, but I have a special bot trap page that
logs all headers of all visitors. I periodically sift through it to
find new bots.
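
Stock Apache can get part of the way there with a conditional
CustomLog, though that only captures headers you name in advance,
which is one reason I ended up writing my own catcher. A sketch
(the trap URL, log path, and header list are only an illustration):

# flag requests for the trap page, then log chosen headers for them
SetEnvIf Request_URI ^/bot-trap bot_trap
LogFormat "%h %t \"%{User-Agent}i\" \"%{From}i\" \"%{Accept-Language}i\"" trap
CustomLog /var/log/apache2/bot-trap.log trap env=bot_trap
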
Post by Ivan Shmakov
So as not to be utterly surprised should I ever, for any reason,
stumble on this behavior months (or years) later.
Comments in the config file are your friend.

[ on avoiding mod_rewrite due to frequent security issues ]
Post by Ivan Shmakov
I hope that the Debian security team takes care of this for me.
It still requires testing in dev and a push to production. Unless
I really need a module (and alas, PHP qualifies on that count),
I'll avoid it.

Elijah
------
wrote a super-basic web server in perl to help catch headers
Ivan Shmakov
2014-03-23 07:09:58 UTC
Post by Eli the Bearded
Post by Ivan Shmakov
But then, wasn't From: omitted from the HTTP/1.1 RFC?
[...]
Post by Eli the Bearded
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.22
ACK, thanks.

[...]
Post by Eli the Bearded
Post by Ivan Shmakov
So as not to be utterly surprised should I ever, for any reason,
stumble on this behavior months (or years) later.
Comments in the config file are your friend.
Server-side comments, however, are not visible from the client side.
Post by Eli the Bearded
[ on avoiding mod_rewrite due to frequent security issues ]
Post by Ivan Shmakov
I hope that the Debian security team takes care of this for me.
It still requires testing in dev and a push to production.
FTR, I've scanned through the DSAs back to 2012, and the only
mod_rewrite issue I was able to find [1] is associated with the
[P] RewriteRule flag, which I don't use anyway. (And even if I
did, ProxyPassMatch was found to be similarly compromised.)

[1] http://www.debian.org/security/2012/dsa-2405
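
As I understand the advisory, the risky configurations are those
that splice the client-supplied path straight into a proxy
target, along the lines of (a reconstruction with a hypothetical
backend host, not something from my configuration):

# the client-supplied path ends up in the proxy target
RewriteRule ^(.*)$ http://backend.example.org$1 [P]

A malformed request line such as "GET @other-host/ HTTP/1.1" can
then yield the target http://backend.example.org@other-host/,
i. e. a proxy request to an arbitrary host.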
Post by Eli the Bearded
Unless I really need a module (and alas, PHP qualifies on that
count), I'll avoid it.
Conversely, the latest DSA for PHP5 [2] was issued earlier this
month, preceded by another one [3] last December.

But I have to admit that I'm biased against PHP /irrespective/
of these. Frankly, I find even Bash a more convenient language
to use, while my preference for Web server-side programming
would be Perl, or perhaps even something non-mainstream, like
SWI-Prolog.

[2] http://www.debian.org/security/2014/dsa-2868
[3] http://www.debian.org/security/2013/dsa-2816
--
FSF associate member #7257