一起学习PHP中的Tidy扩展库

这个扩展估计很多同学可能都没听说过，这可不是泰迪熊呀，而是一个处理 HTML 相关操作的扩展，主要是可以用于 HTML 、 XHTML 、 XML 这类数据格式内容的格式化及展示。

关于 Tidy 库

Tidy 库扩展是随 PHP 一起发布的，也就是说，我们可以在编译安装 PHP 时加上 --with-tidy 来一起安装这个扩展，也可以在事后通过源码包中 ext/ 文件夹下的 tidy 目录中的源码来进行安装。同时，Tidy 扩展还需要依赖一个 tidy 函数库，我们需要在操作系统上安装，如果是 CentOS 的话，直接 yum install libtidy-devel 就可以了。

Tidy 格式化

首先我们来看一下如何通过这个 Tidy 扩展库来格式化一段 HTML 代码。

$content = <<<EOF
<html><head><title>test</title></head> <body><p>error<br>another line</i></body>
</html>
EOF;

$tidy = new Tidy();
$config = [
        'indent'=>true,
        'output-xhtml'=>true,
];
$tidy->parseString($content, $config);
$tidy->cleanRepair();

echo $tidy, PHP_EOL;
// <html xmlns="http://www.w3.org/1999/xhtml">
//   <head>
//     <title>
//       test
//     </title>
//   </head>
//   <body>
//     <p>
//       error<br />
//       another line
//     </p>
//   </body>
// </html>

我们定义的 $content 中的这段 HTML 代码是没有任何格式的非常不规范的一段 HTML 代码。通过实例化一个 Tidy 对象之后，使用 parseString() 方法，并执行 cleanRepair() 方法之后，再直接打印 $tidy 对象，我们就获得了格式化之后的 HTML 代码。看起来是不是非常地规范，不管是 xmlns 还是缩进格式都非常标准。

parseString() 方法有两个参数，第一个参数就是需要格式化的字符串。第二个参数是格式化的配置，这个配置接收的是一个数组，同时它内部的内容也必须是 Tidy 组件中所定义的那些配置信息。这些配置信息我们可以在文后的第二条链接中进行查询。这里我们只配置了两个内容， indent 表示是否应用缩进块级，output-xhtml 表示是否输出为 xhtml 。

cleanRepair() 方法用于对已解析的内容执行清除和修复的操作，其实也就是格式化的清理工作。

注意我们在测试代码中是直接打印的 Tidy 对象，也就是说，这个对象实现了 __toString() ，而它真正的样子其实是这样的。

var_dump($tidy);
// object(tidy)#1 (2) {
//     ["errorBuffer"]=>
//     string(112) "line 1 column 1 - Warning: missing <!DOCTYPE> declaration
//   line 1 column 70 - Warning: discarding unexpected </i>"
//     ["value"]=>
//     string(195) "<html xmlns="http://www.w3.org/1999/xhtml">
//     <head>
//       <title>
//         test
//       </title>
//     </head>
//     <body>
//       <p>
//         error<br />
//         another line
//       </p>
//     </body>
//   </html>"
//   }

各种属性信息获取

var_dump($tidy->isXml()); // bool(false)

var_dump($tidy->isXhtml()); // bool(false)

var_dump($tidy->getStatus()); // int(1)

var_dump($tidy->getRelease());  // string(10) "2017/11/25"

var_dump($tidy->getHtmlVer()); // int(500)

我们可以通过 Tidy 对象的属性获取一些关于待处理文档的信息，比如是否是 XML ，是否是 XHTML 内容。

getStatus() 返回的是 Tidy 对象的状态信息，当前这个 1 表示的是有警告或辅助功能错误的信息，从上面打印的 Tidy 对象的内容我们就可以看出，在这个对象的 errorBuffer 属性中是有 warning 报警信息的。

getRelease() 返回的是当前 Tidy 组件的版本信息，也就是你在操作系统上安装的那个 tidy 组件的信息。getHtmlVer() 返回的是检测到的 HTML 版本，这里的 500 没有更多的说明和介绍资料，不知道这个 500 是什么意思。

除了上面的这些内容之后，我们还可以获得前面 $config 中的配置信息及相关的说明。

var_dump($tidy->getOpt('indent')); // int(1)

var_dump($tidy->getOptDoc('output-xhtml'));
// string(489) "This option specifies if Tidy should generate pretty printed output, writing it as extensible HTML. <br/>This option causes Tidy to set the DOCTYPE and default namespace as appropriate to XHTML, and will use the corrected value in output regardless of other sources. <br/>For XHTML, entities can be written as named or numeric entities according to the setting of <code>numeric-entities</code>. <br/>The original case of tags and attributes will be preserved, regardless of other options. "

getOpt() 方法需要一个参数，也就是需要查询的 $config 中配置的信息内容，如果是查看我们没有在 $config 中配置的参数的话，那么返回就都是默认的配置值。getOptDoc() 非常贴心，它返回的是关于某个参数的说明文档。

最后，是更加干货的一些方法，可以直接操作节点。

echo $tidy->head(), PHP_EOL;
// <head>
//   <title>
//   test
// </title>
// </head>

$body = $tidy->body();

var_dump($body);
// object(tidyNode)#2 (9) {
//     ["value"]=>
//     string(60) "<body>
//     <p>
//       error<br />
//       another line
//     </p>
//   </body>"
//     ["name"]=>
//     string(4) "body"
//     ["type"]=>
//     int(5)
//     ["line"]=>
//     int(1)
//     ["column"]=>
//     int(40)
//     ["proprietary"]=>
//     bool(false)
//     ["id"]=>
//     int(16)
//     ["attribute"]=>
//     NULL
//     ["child"]=>
//     array(1) {
//       [0]=>
//       object(tidyNode)#3 (9) {
//         ["value"]=>
//         string(37) "<p>
// ………………
// ………………

echo $tidy->html(), PHP_EOL;
// <html xmlns="http://www.w3.org/1999/xhtml">
//   <head>
//     <title>
//       test
//     </title>
//   </head>
//   <body>
//     <p>
//       error<br />
//       another line
//     </p>
//   </body>
// </html>

echo $tidy->root(), PHP_EOL;
// <html xmlns="http://www.w3.org/1999/xhtml">
//   <head>
//     <title>
//       test
//     </title>
//   </head>
//   <body>
//     <p>
//       error<br />
//       another line
//     </p>
//   </body>
// </html>

相信不需要过多地解释就能够看出，head() 返回的就是 <head> 标签里面的内容，而 body() 、html() 也都是对应的相关标签，root() 返回的则是根结点的全部内容，可以看作是整个文档内容。

这些方法函数返回的内容其实都是一个 TidyNode 对象，这个我们在后面再详细地说明。

直接转换为字符串

上面的操作代码我们都是基于 parseString() 这个方法。它没有返回值，或者说返回的只是一个布尔类型的成功失败标识。如果我们需要获取格式化之后的内容，只能直接将对象当做字符串或者使用 root() 来获得所有的内容。其实，还有一个方法直接就是返回一个格式化后的字符串的。

$tidy = new Tidy();
$repair = $tidy->repairString($content, $config);

echo $repair, PHP_EOL;
// <html xmlns="http://www.w3.org/1999/xhtml">
//   <head>
//     <title>
//       test
//     </title>
//   </head>
//   <body>
//     <p>
//       error<br />
//       another line
//     </p>
//   </body>
// </html>

repairString() 方法的参数和 parseString() 是一模一样的，唯一不同的就是它是返回的一个字符串，而不是在 Tidy 对象内部进行操作。

转换错误信息

在最开始的测试代码中，我们使用 var_dump() 打印 Tidy 对象时就看到了 errorBuffer 这个变量里是有错误信息的。这回我们再来一个有更多问题的 HTML 代码片断。

$html = <<<HTML
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<p>paragraph</p>
HTML;
$tidy = new Tidy();
$tidy->parseString($html);
$tidy->cleanRepair();

echo $tidy->errorBuffer, PHP_EOL;
// line 4 column 1 - Warning: <p> isn't allowed in <head> elements
// line 4 column 1 - Info: <head> previously mentioned
// line 4 column 1 - Warning: inserting implicit <body>
// line 4 column 1 - Warning: inserting missing 'title' element

$tidy ->diagnose();
echo $tidy->errorBuffer, PHP_EOL;
// line 4 column 1 - Warning: <p> isn't allowed in <head> elements
// line 4 column 1 - Info: <head> previously mentioned
// line 4 column 1 - Warning: inserting implicit <body>
// line 4 column 1 - Warning: inserting missing 'title' element
// Info: Doctype given is "-//W3C//DTD XHTML 1.0 Strict//EN"
// Info: Document content looks like XHTML 1.0 Strict
// Tidy found 3 warnings and 0 errors!

在这段测试代码中，我们又使用了一个新的 diagnose() 方法，它的作用是对文档进行诊断测试，并且在 errorBuffer 这个对象变量中添加有关文档的更多信息。

TidyNode 操作

之前我们说到过，head()、html()、body()、root() 这几个方法返回的都是一个 TidyNode 对象，那么这个对象有什么特殊的地方吗？

$html = <<<EOF
<html><head>
<?php echo '<title>title</title>'; ?>
<#
  /* JSTE code */
  alert('Hello World');
#>
</head>
<body>

<?php
  // PHP code
  echo 'hello world!';
?>

<%
  /* ASP code */
  response.write("Hello World!")
%>

<!-- Comments -->
Hello World
</body></html>
Outside HTML
EOF;

$tidy = new Tidy();
$tidy->parseString($html);

$tidyNode = $tidy->html();

showNodes($tidyNode);

function showNodes($node){

    if($node->isComment()){
        echo '========', PHP_EOL,'This is Comment Node :"', $node->value, '"', PHP_EOL;
    }
    if($node->isText()){
        echo '--------', PHP_EOL,'This is Text Node :"', $node->value, '"', PHP_EOL;
        }
    if($node->isAsp()){
        echo '++++++++', PHP_EOL,'This is Asp Script :"', $node->value, '"', PHP_EOL;
        }
    if($node->isHtml()){
        echo '********', PHP_EOL,'This is HTML Node :"', $node->value, '"', PHP_EOL;
        }
    if($node->isPhp()){
        echo '########', PHP_EOL,'This is PHP Script :"', $node->value, '"', PHP_EOL;
        }
    if($node->isJste()){
        echo '@@@@@@@@', PHP_EOL,'This is JSTE Script :"', $node->value, '"', PHP_EOL;
    }

    if($node->name){
        // getParent()
        if($node->getParent()){
            echo '&&&&&&&& ', $node->name ,' getParent is : ', $node->getParent()->name, PHP_EOL;
        }

        // hasSiblings
        echo '^^^^^^^^ ', $node->name, ' has siblings is : ';
        var_dump($node->hasSiblings());
        echo PHP_EOL;
    }

    if($node->hasChildren()){
        foreach($node->child as $child){
            showNodes($child);
        }
    }
}

// ………………
// ………………
// ********
// This is HTML Node :"<head>
// <?php echo '<title>title</title>'; ><#
//   /* JSTE code */
//   alert('Hello World');
// #>
// <title></title>
// </head>
// "
// &&&&&&&& head getParent is : html
// ^^^^^^^^ head has siblings is : bool(true)
// ………………
// ………………
// ++++++++
// This is Asp Script :"<%
//   /* ASP code */
//   response.write("Hello World!")
// %>" 
// ………………
// ………………

这段代码具体的测试步骤和各个函数的解释就不详细地一一列举说明了。大家通过代码就可以看出来，我们的 TidyNode 对象可以判断各个节点的内容，比如是否还有子结点、是否有兄弟结点。对象结点内容，可以判断结点的格式，是否是注释、是否是文本、是否是 JS 代码、是否是 PHP 代码、是否是 ASP 代码之类的内容。不知道看到这里的你是什么感觉，反正我是觉得这个玩意就非常有意思了，特别是判断 PHP 代码这些的方法。

信息统计函数

最后我们再来看一下 Tidy 扩展库中的一些统计函数。

$html = <<<EOF
<p>test</i>
<bogustag>bogus</bogustag>
EOF;
$config = array('accessibility-check' => 3,'doctype'=>'bogus');
$tidy = new Tidy();
$tidy->parseString($html, $config);

echo 'tidy access count: ', tidy_access_count($tidy), PHP_EOL;
echo 'tidy config count: ', tidy_config_count($tidy), PHP_EOL;
echo 'tidy error count: ', tidy_error_count($tidy), PHP_EOL;
echo 'tidy warning count: ', tidy_warning_count($tidy), PHP_EOL;

// tidy access count: 4
// tidy config count: 2
// tidy error count: 1
// tidy warning count: 6

其实它们返回的这些数量都是一些错误信息的数量。tidy_access_count() 表示的是遇到的辅助功能警告数量，tidy_config_count() 是配置信息错误的数量，另外两个从名字就看出来了，也就不用我多说了。