Problem with encoding of filenames - SOLVED, follow-up question
Ivan Shmakov
2017-10-28 08:12:42 UTC
[…]
The strange redirects are due to some experimenting with .htaccess,
I’ll have to fix that, disabled it for now.
(I’ve suspected something along these lines.)
Ivan Shmakov also noted that I claim html4 compliance but should move
to html5 if I want to use “unencoded” UTF-8 in ‘href’. Clicking the
button in the footer seems to indeed validate, so I wonder what the
exact problem is. I vaguely remember that in the past I decided not
to move to html5, but forgot for what reason. Maybe I will for this
reason.
Frankly, I’m unsure if HTML4 allowed whitespace in href (and I’m
pretty sure it didn’t allow UTF-8; hence I suspect that failing
to catch that may be due to a bug in the validator), but at
least the validator at [1] correctly reports space characters as
(HTML5?) errors:

3. Error: Bad value Antropozofi/Valentin Wember – Waar gaan we
eigenlijk heen%3F.pdf for attribute href on element a: Illegal
character in path segment: space is not allowed.

[1] http://validator.nu/?doc=http://hendrikmaryns.name/antro.shtml
Lastly, it turns out the problem had nothing to do with the encoding.
The reason I thought so was the error messages, which contained
Not Found
The requested URL /Antropozofi/Spirituele opgaven België – Luc
Vandecasteele2.pdf was not found on this server.
Additionally, a 404 Not Found error was encountered while trying to
use an ErrorDocument to handle the request.
If I understand this right, the server is misconfigured and I should
fix this by providing my own 404.html file. Can someone point me to
the right place on how to do this?
Actually, the error message above is kind of self-documenting
in that it points to the ‘ErrorDocument’ directive. See [2].

[2] http://httpd.apache.org/docs/2.4/mod/core.html#errordocument

However, the problem is not in the “document,” but rather in the
Content-Type: header, which is:

Content-Type: text/html; charset=iso-8859-1

At the same time, Apache includes the (supposed) filename in the
response “as is”: in UTF-8.
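
A minimal sketch of that mismatch, in Python (not part of the original exchange). It encodes the path from the quoted 404 message as UTF-8, then decodes it with whatever charset the Content-Type: header declares. The helper name declared_charset is hypothetical, and the sketch assumes a client that trusts the header rather than sniffing the body.

```python
from email.message import Message


def declared_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


# Path copied from the 404 message quoted above.
path = "/Antropozofi/Spirituele opgaven België – Luc Vandecasteele2.pdf"

# What the server puts into the response body: the path as UTF-8 bytes.
body = path.encode("utf-8")

# What the response header claims about those bytes.
charset = declared_charset("text/html; charset=iso-8859-1")

# What a header-trusting client ends up rendering.
shown = body.decode(charset)
```

Under these assumptions the word België in shown comes out as BelgiÃ«, since each UTF-8 byte is read as one ISO-8859-1 character. No bytes are lost: shown.encode("iso-8859-1").decode("utf-8") gives back the original path, which is why the damage looks like garbling rather than truncation.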

Curiously, adding ‘AddDefaultCharset utf-8’ [3] to my .htaccess
didn’t seem to have any effect on the 404 response header, so
I’m interested in how it can be fixed, too. (Reading [4]
wasn’t enlightening so far, either. Cross-posting to
news:comp.infosystems.www.servers.unix, as the question is
specific to server software, not HTML.)

[3] http://httpd.apache.org/docs/2.4/mod/core.html#adddefaultcharset
[4] http://httpd.apache.org/docs/2.4/mod/mod_mime.html
--
FSF associate member #7257 np. Flight of the Phoenix — Jumpy
Thomas 'PointedEars' Lahn
2017-10-28 22:46:22 UTC
[Will you *please* stop this amok-crossposting? Usenet is _not_ your
personal private support forum/playground. If you must crosspost, then
crosspost to the *correct* newsgroup (see charters and taglines), and
*set Followup-To*. In particular, Apache is _not_ a UNIX-*only* Web server
(RTFM).

X-Post & F’up2 <news:comp.infosystems.www.authoring.misc>]


Ivan Shmakov wrote in <news:comp.infosystems.www.authoring.html>:

[Fixed quotes; see <http://www.netmeister.org/news/learn2quote.html>]
Post by Ivan Shmakov
The strange redirects are due to some experimenting with .htaccess,
I’ll have to fix that, disabled it for now.
(I’ve suspected something along these lines.)
Me too.
Post by Ivan Shmakov
Ivan Shmakov also noted that I claim html4 compliance but should move
to html5 if I want to use “unencoded” UTF-8 in ‘href’.
That was and is “not even wrong”. Sorry to break this to you, but you have
been listening to a *wannabe*.

<https://unicode.org/faq/>
<https://www.w3.org/TR/html/links.html#element-attrdef-a-href>
Post by Ivan Shmakov
Clicking the button in the footer seems to indeed validate, so I wonder
what the exact problem is. I vaguely remember that in the past I
decided not to move to html5, but forgot for what reason. Maybe I will
for this reason.
Frankly, I’m unsure if HTML4 allowed whitespace in href
It does not, and that is not hard to find out either. Just RTFSpec:

<http://www.w3.org/TR/1999/REC-html401-19991224/struct/links.html#adef-href>
Post by Ivan Shmakov
(and I’m pretty sure it didn’t allow UTF-8;
Percent-encoded characters according to RFC 3986 & children: no problem.

Unescaped non-ASCII characters: *big* problem.
Post by Ivan Shmakov
hence I suspect that failing to catch that may be due to a bug in the
validator),
Sure, blame the Validator for your incompetence. What else is new? :->
Post by Ivan Shmakov
but at least the validator at [1] correctly reports space characters as
It is more likely that an HTML5-supporting validator will catch this error
because HTML5 is not based on a DTD that can be checked against. This
encourages validator developers to check more carefully against the
Specification *prose*.

It certainly is so in the case of the *W3C* Validator. Why are
you not using *it* instead (<https://validator.w3.org/>)? It has been
supporting HTML5 for several years now (although as an implicit switch to
the HTML5 validator – the “Nu Html Checker” at
<https://validator.w3.org/nu/> – when the HTML5 doctype is recognized or
selected).
Post by Ivan Shmakov
3. Error: Bad value Antropozofi/Valentin Wember – Waar gaan we
eigenlijk heen%3F.pdf for attribute href on element a: Illegal
character in path segment: space is not allowed.
Correct. Neither are unescaped non-ASCII characters. Supportive UA
behavior to the contrary is *implementation-dependent*.
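
For illustration only (not part of the original post), here is a small Python sketch of the percent-encoding that makes such an href valid. It assumes the file name on disk contains a literal “?” (the href in the validator message already carries it as %3F) and that the path is encoded as UTF-8 before escaping. The variable names are hypothetical.

```python
from urllib.parse import quote, unquote

# File path as it presumably exists on disk (assumption: literal "?").
raw = "Antropozofi/Valentin Wember – Waar gaan we eigenlijk heen?.pdf"

# Escape everything except unreserved characters and the "/" separator.
# quote() encodes non-ASCII characters as UTF-8 before escaping them.
encoded = quote(raw, safe="/")

print(encoded)
# Antropozofi/Valentin%20Wember%20%E2%80%93%20Waar%20gaan%20we%20eigenlijk%20heen%3F.pdf

# Decoding the escaped form gives back the original name.
assert unquote(encoded) == raw
```

The result is pure ASCII, so it no longer depends on how a given validator or user agent treats spaces and non-ASCII characters in href.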
Post by Ivan Shmakov
[2] http://httpd.apache.org/docs/2.4/mod/core.html#errordocument
However, the problem is not in the “document,” but rather in the
Content-Type: text/html; charset=iso-8859-1
At the same time, Apache includes the (supposed) filename in the
response “as is”: in UTF-8.
Curiously, adding ‘AddDefaultCharset utf-8’ [3] to my .htaccess
didn’t seem to have any effect on the 404 response header,
[3] says

| AllowOverride: FileInfo

On the other hand, if the error message files are UTF-8 encoded – and

| $ file -i /usr/share/apache2/error/HTTP_NOT_FOUND.html.var
| /usr/share/apache2/error/HTTP_NOT_FOUND.html.var: text/html; charset=utf-8
|
| $ dpkg -S /usr/share/apache2/error/HTTP_NOT_FOUND.html.var
| apache2-data: /usr/share/apache2/error/HTTP_NOT_FOUND.html.var
|
| $ dpkg -l apache2-data | awk '/^.i/ {print $3}'
| 2.4.23-4

suggests just that – and “AddDefaultCharset” is stupidly set to “On” (the
previous default) or “iso-8859-1”, and it *works* with the OP, then it would
be no surprise that the error messages are garbled.
Post by Ivan Shmakov
so I’m interested in how it can be fixed, too.
AddDefaultCharset off

or (with Apache 2.4.x+)

# AddDefaultCharset on

(disabling it, therefore falling back to the default, which should be “off”)
in the httpd.conf/apache2.conf. LART that stuck-in-the-1980s server admin
if necessary. (Unicode 1.0.0 was published in 1991.)
Post by Ivan Shmakov
Cross-posting to
news:comp.infosystems.www.servers.unix, as the question is
specific to server software, not HTML.)
,-------------.
: ↑ Go to top :
`-------------'
Post by Ivan Shmakov
[3] http://httpd.apache.org/docs/2.4/mod/core.html#adddefaultcharset
As you can read there, “AddDefaultCharset” != “off” is a *deprecated*
approach:

,-<http://httpd.apache.org/docs/2.4/mod/core.html.en#adddefaultcharset>
|
| […]
| AddDefaultCharset should only be used when all of the text resources to
| which it applies are known to be in that character encoding and it is too
| inconvenient to label their charset individually. One such example is to
| add the charset parameter to resources containing generated content, such
| as legacy CGI scripts, that might be vulnerable to cross-site scripting
| attacks due to user-provided data being included in the output. Note,
| however, that a better solution is to just fix (or delete) those scripts,
| since setting a default charset does not protect users that have enabled
| the "auto-detect character encoding" feature on their browser.

It has been deprecated for more than 10 years:

<https://bz.apache.org/bugzilla/show_bug.cgi?id=23421>

Fun fact: Before the Apache default was changed in 2004 CE, the problem with
this default was *obvious* in the Bugzilla interface (but IIRC using a
different URI then) because the reporter of this bug (Martin Dürst) has a
name that contains a non-ASCII character which Bugzilla properly served
UTF-8-encoded, but Apache’s header field default caused HTML UAs to
interpret it as ISO-8859-1 regardless of the correct Content-Type “meta”
element (IIRC); so his name was displayed as “Martin DÃ¼rst” there for quite
some time.
Post by Ivan Shmakov
(Reading [4] wasn’t enlightening so far, either.
[4] http://httpd.apache.org/docs/2.4/mod/mod_mime.html
This module has nothing to do with the problem.


PointedEars
--
Sometimes, what you learn is wrong. If those wrong ideas are close to the
root of the knowledge tree you build on a particular subject, pruning the
bad branches can sometimes cause the whole tree to collapse.
-- Mike Duffy in cljs, <news:***@94.75.214.39>
Ivan Shmakov
2017-10-31 14:35:55 UTC
[…]
If you must crosspost, then crosspost to the *correct* newsgroup (see
charters and taglines), and *set Followup-To*. In particular, Apache
is _not_ a UNIX-*only* Web server (RTFM).
Do you care to suggest any “Unix-only” HTTP server? All those
I have any experience with (Apache, GNU Libmicrohttpd, Lighttpd,
Nginx, Perl HTTP::Daemon) appear to be quite cross-platform.

BusyBox httpd, perhaps? But surely we don’t have a whole
newsgroup to discuss just a single server – and a somewhat
rarely used one at that?

As I see it (but feel free to quote the part of the charter that
says that only “Unix-only” servers are to be discussed), the
.servers.unix newsgroup’s purpose is to discuss issues related
to running Web servers /on/ Unix-like systems. As such, and
given that both the servers in question run on such systems (one
likely, and the other definitely), I believe that the discussion
at hand is appropriate for news:comp.infosystems.www.servers.unix
(to which I hereby set Followup-To:.)

That said, I admit that it would have made sense for me to
respond with /two/ articles instead of one — the one covering
HTML issues staying in .authoring.html, and the Apache-related
one cross-posted (and Followup-To: set) to .servers.unix.

It did feel like microposting at the time, though. But just in
case, I’m doing it now.
X-Post & F’up2 <news:comp.infosystems.www.authoring.misc>
I don’t see how .authoring.misc is (more) relevant for either of
the topics brought up in this discussion.

[…]
Post by Ivan Shmakov
[2] http://httpd.apache.org/docs/2.4/mod/core.html#errordocument
However, the problem is not in the “document,” but rather in the
Content-Type: text/html; charset=iso-8859-1
At the same time, Apache includes the (supposed) filename in the
response “as is”: in UTF-8.
Curiously, adding ‘AddDefaultCharset utf-8’ [3] to my .htaccess
didn’t seem to have any effect on the 404 response header,
(Not that it should’ve, on second thought.)
[3] says
| AllowOverride: FileInfo
It is turned on for the directory in question, so that shouldn’t
be a problem.

AllowOverride FileInfo Indexes AuthConfig Limit
On the other hand, if the error message files are UTF-8 encoded – and
| $ file -i /usr/share/apache2/error/HTTP_NOT_FOUND.html.var
| /usr/share/apache2/error/HTTP_NOT_FOUND.html.var: text/html; charset=utf-8
[…]
suggests just that
AFAICT, the /usr/share/apache2/error/HTTP_*.html.var files are
only used if ‘localized-error-pages’ is enabled (as in:
# a2enconf -- localized-error-pages.) Otherwise, I suppose
Apache just uses the error messages built into the binary –
whose Content-Type: may just as well be hardcoded.
– “AddDefaultCharset” is stupidly set to “On” (the previous default)
or “iso-8859-1” and it *works* with the OP, then it would be no
surprise that the error messages are garbled.
It doesn’t seem to be the case for my server, however. For one
thing, grep(1) reveals only commented-out lines:

$ grep -rF --exclude=\*~ -- AddDefaultCharset /etc/apache2/
…/mods-available/proxy.conf: # AddDefaultCharset off
…/conf-available/charset.conf:# Read the documentation before enabling AddDefaultCharset.
…/conf-available/charset.conf:#AddDefaultCharset UTF-8
$

Then again, the files whose suffixes lack explicit AddCharset
are served with Content-Type: having no ‘charset’ altogether.
Post by Ivan Shmakov
so I’m interested in how it can be fixed, too.
AddDefaultCharset off
or (with Apache 2.4.x+)
# AddDefaultCharset on
(disabling it, therefore falling back to the default, which should be
“off”) in the httpd.conf/apache2.conf. LART that stuck-in-the-1980s
server admin if necessary.
Indeed. And don’t forget the namecalling thing; after all, it’s
the only way to ensure that the request is dealt with
immediately. (And not, say, at the earliest admin’s convenience
– as surely will be the case otherwise.)
Post by Ivan Shmakov
Cross-posting to news:comp.infosystems.www.servers.unix, as the
question is specific to server software, not HTML.)
↑ Go to top
Explained my reasoning there.
Post by Ivan Shmakov
[3] http://httpd.apache.org/docs/2.4/mod/core.html#adddefaultcharset
As you can read there, “AddDefaultCharset” != “off” is a *deprecated*
http://httpd.apache.org/docs/2.4/mod/core.html.en#adddefaultcharset
AddDefaultCharset should only be used when all of the text resources
to which it applies are known to be in that character encoding and it
is too inconvenient to label their charset individually.
Which is, incidentally, exactly my case.
One such example is to add the charset parameter to resources
containing generated content, such as legacy CGI scripts, that might
be vulnerable to cross-site scripting attacks due to user-provided
data being included in the output. Note, however, that a better
solution is to just fix (or delete) those scripts, since setting a
default charset does not protect users that have enabled the
“auto-detect character encoding” feature on their browser.
I see no formal deprecation notice in the text above. And while
the wording suggests that using it for scripts is a work-around
rather than a proper solution, it wasn’t quite my intent.
https://bz.apache.org/bugzilla/show_bug.cgi?id=23421
I agree with the reasoning that the specific ‘AddDefaultCharset
iso-8859-1’ default setting discussed there rarely makes sense,
but I see no reason to avoid AddDefaultCharset (utf-8 or other
encoding; and especially for specific directories) in general.

[…]
--
FSF associate member #7257 http://am-1.org/~ivan/