Installing Tidy
From DreamHost
The instructions provided in this article or section are considered advanced. You are expected to be knowledgeable in the UNIX shell.
NOTICE: These instructions are a work in progress and may not be fully tested. They may in fact be downright wrong. Try them at your own risk.
HTML Tidy is both a standalone program and a library that cleans up HTML documents. One possible use is in web apps that allow users to post HTML. In particular, if you want to give your users a rich text editor like TinyMCE or FCKeditor, you'll want to run their input through Tidy. You might not think it's necessary--so what if the article your user wrote is malformed? Here's an example that can get you into big trouble, because the link is never closed:
<p>Hahaha! Now everything on your page below my post is going to be a <a href="http://evil.com">link!
To use Tidy from PHP, there are two solutions: installing the PHP extension, or the hackish way of calling the command-line program (either with shell_exec() or with proc_open()). I'll address the hackish way first.
The Hackish Way: Use shell_exec()
On my server (wilshire), Tidy is installed as a Linux program in the path, but not as a PHP extension. So, to run it from PHP, I can call it from the Linux shell using PHP's Swiss Army Knife: the shell_exec() function.
From the PHP documentation:
shell_exec — Execute command via shell and return the complete output as a string
Let's say you have user-supplied HTML coming from $_POST or something like that, and it's stored in $bad_html. This method hands Tidy a file, so you'll need file_put_contents() and file_get_contents() as intermediaries: write the HTML to a temporary file, let Tidy fix it, then read the result back. The -m option tells Tidy to modify the source file in place, rather than just writing to stdout. (Tidy can also read from stdin, which avoids the temp file entirely; see the proc_open() section below.) The --show-body-only config option causes Tidy to output the contents of the body tag only. Without that option, Tidy would wrap everything in html and body tags, and that's no good if we want to display the user's content inside our own page.
$file = tempnam(sys_get_temp_dir(), 'tidy'); // Create a unique temp file (rand() could collide with another request)
file_put_contents($file, $bad_html);
shell_exec('tidy -m --show-body-only yes ' . escapeshellarg($file)); // -m: fix the file in place
$good_html = file_get_contents($file);
unlink($file); // Clean up after ourselves
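The snippet above assumes that a tidy binary is on the PATH, as it was on the author's server. A minimal sketch of checking for that first is shown here; the function name find_tidy_binary() is hypothetical, and it assumes shell_exec() has not been disabled and that the shell is POSIX-like.

```php
<?php
// Hypothetical helper (not from the original article): returns the path to
// the tidy binary, or null if the shell cannot find one on the PATH.
function find_tidy_binary()
{
    // "command -v" is the POSIX way to look up a command. 2>/dev/null hides
    // the shell's own error message when nothing is found.
    $path = shell_exec('command -v tidy 2>/dev/null');

    // shell_exec() returns null when there is no output, so cast before trimming.
    $path = trim((string) $path);

    return $path === '' ? null : $path;
}

$tidy = find_tidy_binary();
if ($tidy === null) {
    echo "tidy is not installed on this server\n";
} else {
    echo "tidy found at $tidy\n";
}
```

If this reports that tidy is missing, the shell_exec() and proc_open() methods will not work, and building your own PHP with the extension is the remaining option.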
The Less So, But Still Quite Hackish Way: Use proc_open()
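This is very similar to the previous method, but doesn't need a temp file at all, because Tidy reads the HTML from stdin and writes the result to stdout. proc_open() requires PHP 4.3.0 or later; stream_get_contents(), used below, requires PHP 5.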
$descriptorspec = array(
    0 => array("pipe", "r"), // stdin is a pipe that the child will read from
    1 => array("pipe", "w"), // stdout is a pipe that the child will write to
    2 => array("file", "/dev/null", "a") // stderr is discarded; use a log file here if you want Tidy's warnings
);
// No -m option here: there is no source file to modify, so Tidy writes to stdout.
$process = proc_open('tidy --show-body-only yes', $descriptorspec, $pipes);
if (is_resource($process)) {
    // $pipes now looks like this:
    // 0 => writeable handle connected to child stdin
    // 1 => readable handle connected to child stdout
    // (there is no $pipes[2], because stderr goes to a file rather than a pipe)

    // Write the bad HTML to the tidy process, which reads it from stdin.
    fwrite($pipes[0], $bad_html);
    fclose($pipes[0]);

    // Read the good HTML that the tidy process writes to stdout.
    $good_html = stream_get_contents($pipes[1]);
    fclose($pipes[1]);

    // It is important that you close any pipes before calling
    // proc_close in order to avoid a deadlock
    $return_value = proc_close($process);
}
// now use $good_html for whatever
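The same steps can be wrapped in a function that also looks at the exit status, so the caller can tell when Tidy failed. This is only a sketch: the name tidy_clean_fragment() is hypothetical, stderr is thrown away via /dev/null (my choice, not something the steps above require), and it relies on the tidy command exiting with 0 for success, 1 for warnings, and 2 for errors.

```php
<?php
// Hypothetical helper (not from the original article): pipes an HTML fragment
// through the tidy command-line program. Returns the cleaned fragment, or
// null if tidy could not be run or reported errors.
function tidy_clean_fragment($bad_html)
{
    $descriptorspec = array(
        0 => array('pipe', 'r'),              // child reads its stdin from us
        1 => array('pipe', 'w'),              // child writes its stdout to us
        2 => array('file', '/dev/null', 'a'), // assumption: diagnostics are discarded
    );

    $process = proc_open('tidy --show-body-only yes', $descriptorspec, $pipes);
    if (!is_resource($process)) {
        return null;
    }

    // Send the input, then close stdin so tidy knows the document is complete.
    // The @ hides the notice PHP raises if the child has already exited.
    @fwrite($pipes[0], $bad_html);
    fclose($pipes[0]);

    // Collect everything tidy prints, then close the pipe before proc_close().
    $good_html = stream_get_contents($pipes[1]);
    fclose($pipes[1]);

    $status = proc_close($process);

    // 0 = clean, 1 = warnings (output is still usable); anything else is
    // treated as a failure, including 127 when the shell cannot find tidy.
    if ($status !== 0 && $status !== 1) {
        return null;
    }
    return $good_html;
}

$good_html = tidy_clean_fragment('<p>Hahaha! <a href="http://evil.com">link!');
```

Most user-supplied fragments will produce warnings (exit status 1), so treating that status as success is deliberate here.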
Install your own PHP
If you want to use the PHP bindings, you'll have to install Tidy as a PHP extension.
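Once the extension is available, the cleanup becomes a single function call with no shell involved. A minimal sketch follows; the wrapper name tidy_repair_fragment() is hypothetical, and the 'utf8' encoding is an assumption about your input.

```php
<?php
// Hypothetical helper (not from the original article): repairs an HTML
// fragment with the tidy extension, or returns null when the extension is
// not loaded or the repair fails.
function tidy_repair_fragment($bad_html)
{
    if (!extension_loaded('tidy')) {
        return null;
    }

    // 'show-body-only' is the same config option passed on the command line
    // in the shell_exec() and proc_open() methods.
    $config = array('show-body-only' => true);

    // Assumption: the input is UTF-8. Adjust the encoding to match your data.
    $good_html = tidy_repair_string($bad_html, $config, 'utf8');

    return $good_html === false ? null : $good_html;
}

$good_html = tidy_repair_fragment('<p>Hahaha! <a href="http://evil.com">link!');
```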
Get Prep Script
The prep script is available on the wiki. It downloads and unpacks source code for PHP and various extensions.
Install libtoolize
Compiling Tidy requires libtoolize, which is part of GNU Libtool (one of the GNU Autotools). Unfortunately, it's not installed on DreamHost servers, so you'll have to download it, install it, and add it to your $PATH.
Install Tidy
It's very hard to find the right package. The CVS repository doesn't seem to provide libtidy; it only has the standalone program. To get libtidy, I went to http://tidy.sourceforge.net/src/old/ and grabbed the last one on the list.
Modify & Run PHP Install Script
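The PHP 5 install scripts on the wiki have to be modified: add the --with-tidy option to the arguments passed to PHP's configure script. If Tidy was installed somewhere other than a standard system location, pass its install prefix as --with-tidy=DIR.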
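After the build finishes, it is worth confirming that the new PHP actually picked up Tidy. A small sketch of such a check, run with the newly built PHP; the function name tidy_build_report() is hypothetical.

```php
<?php
// Hypothetical helper (not from the original article): describes whether this
// PHP build has the tidy extension compiled in or loaded.
function tidy_build_report()
{
    if (!extension_loaded('tidy')) {
        return 'tidy extension: not loaded';
    }

    // tidy_get_release() returns the release date of the libtidy that PHP
    // was linked against.
    return 'tidy extension: loaded, libtidy release ' . tidy_get_release();
}

echo tidy_build_report(), "\n";
```

If it reports that the extension is not loaded, the usual suspects are a configure run that did not include --with-tidy, or the web server still using the system PHP instead of your own build.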

