Implementing a Web Crawler with PHP and XML

Author: ╭ァ你不懂的悲殇 | Published: 2023-10-18

Below is a simple example of a web crawler implemented with PHP and XML:

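<?php
// Create a new XML document
$xml = new DOMDocument('1.0', 'UTF-8');
$xml->formatOutput = true;

// Create the root element
$root = $xml->createElement("urls");
$xml->appendChild($root);

// URLs already crawled, so that no page is fetched twice
$visited = [];

// Crawler function; $depth limits how many levels of links are followed
// (the default of 2 is an arbitrary choice, adjust as needed)
function crawl($url, $depth = 2) {
    global $xml, $root, $visited;

    // Stop at the depth limit or when the URL has already been crawled
    if ($depth < 0 || isset($visited[$url])) {
        return;
    }
    $visited[$url] = true;

    // Create a new url element and store the address in its link attribute
    // (setAttribute escapes characters such as & in query strings)
    $urlElement = $xml->createElement("url");
    $urlElement->setAttribute("link", $url);
    $root->appendChild($urlElement);

    // Fetch the page content; skip the page if the request fails
    $content = @file_get_contents($url);
    if ($content === false) {
        return;
    }

    // Match the href value of every <a> tag with a regular expression
    preg_match_all('/<a\s[^>]*?href=["\']([^"\']*)["\']/i', $content, $matches);

    // Loop over all matched links
    foreach ($matches[1] as $link) {
        // Only follow absolute http:// or https:// links
        if (preg_match('#^https?://#i', $link)) {
            crawl($link, $depth - 1);
        }
    }
}

// Start the crawl
crawl("https://example.com");

// Save the XML document
$xml->save("urls.xml");
?>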

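This example starts crawling at the given URL, uses a regular expression to find the links on each page, and then crawls those links in turn. The crawled URLs are finally saved to an XML file. Because links are followed recursively, the crawl only terminates if it is bounded, for example by a depth limit and a record of URLs already visited.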
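
Regular expressions are a fragile way to pick links out of HTML, since attribute order, quoting, and line breaks all vary between pages. As a possible alternative, the sketch below lets PHP's DOM extension parse the markup and then reads the href attribute of each <a> element. The function name extractLinks and the sample markup are made up for illustration and are not part of the original example.

```php
<?php
// Hypothetical helper (not part of the original example): returns the
// absolute http(s) links found in an HTML string, without duplicates.
function extractLinks(string $html): array {
    // DOMDocument::loadHTML rejects an empty string, so handle it up front
    if ($html === '') {
        return [];
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely well-formed; collect parser warnings
    // internally instead of printing them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        // Keep only absolute http:// or https:// URLs
        if (preg_match('#^https?://#i', $href)) {
            $links[$href] = true; // array keys remove duplicates
        }
    }
    return array_keys($links);
}

// Sample markup made up for illustration
$html = '<p><a class="nav" href="https://example.com/a">A</a>'
      . '<a href="/relative">B</a>'
      . "<a href='http://example.com/b'>C</a>"
      . '<a href="https://example.com/a">A again</a></p>';

print_r(extractLinks($html));
```

For the sample markup, the function should return the two absolute URLs (https://example.com/a and http://example.com/b) and skip both the relative link and the repeated one. Inside a crawler, the returned array could be used in place of the regular-expression matches, with each entry passed on to the crawl function.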
