Python Web Crawler

Author: 城若幻影 | Published: 2026-03-11

import requests
from bs4 import BeautifulSoup

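# Define a simple crawler function that fetches and parses one web page
def simple_web_crawler(url):
    try:
        # Send an HTTP GET request; the timeout keeps the call from hanging indefinitely
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # Raise HTTPError for 4xx/5xx responses

        # Parse the HTML with BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')

        # Print the page title; soup.title is None when the page has no <title>
        title = soup.title.string if soup.title else None
        print("Page title:", title)

        # Return the text content of the page
        return soup.get_text()

    except requests.exceptions.RequestException as e:
        print("Request failed:", e)
        return None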

# Example URL
url = 'https://example.com'

# Call the crawler function and print the result
result = simple_web_crawler(url)
if result:
    print("Page content:\n", result[:500])  # Print only the first 500 characters

Explanation:

Imports:
- requests: sends HTTP requests.
- BeautifulSoup: parses HTML content.
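
To see what BeautifulSoup does on its own, the parsing step can be tried on an in-memory HTML string, with no network request. This is only a minimal sketch: the helper name extract_title_and_text and the sample HTML are made up for illustration and are not part of the original example.

```python
from bs4 import BeautifulSoup

def extract_title_and_text(html):
    """Return (title, text) for an HTML string; title is None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    # soup.title is None when the document has no <title> element
    title = soup.title.get_text(strip=True) if soup.title else None
    # separator and strip join the stripped text nodes with single spaces
    text = soup.get_text(separator=" ", strip=True)
    return title, text

# Hypothetical sample document, used only for this illustration
sample_html = (
    "<html><head><title>Demo page</title></head>"
    "<body><p>Hello, crawler.</p></body></html>"
)
title, text = extract_title_and_text(sample_html)
print(title)
print(text)
```

Keeping parsing separate from fetching also makes the parsing logic easy to check against saved HTML.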
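Defining the function simple_web_crawler:
- Takes a URL as its argument.
- Sends an HTTP GET request with requests.get() to fetch the page.
- Parses the HTML with BeautifulSoup and extracts the page title and text content.
- Handles exceptions that may occur, such as network errors or failed requests.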
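
The exception handling above catches every requests error in one branch. If it helps to tell the failure modes apart, something like the following sketch could be used. The names describe_request_error and fetch_html, the label strings, and the 10-second timeout are assumptions chosen for illustration; only the requests exception classes themselves come from the library.

```python
import requests

def describe_request_error(exc):
    """Map a requests exception to a short label (labels are illustrative)."""
    # Check Timeout before ConnectionError: ConnectTimeout inherits from both
    if isinstance(exc, requests.exceptions.Timeout):
        return "timeout"
    if isinstance(exc, requests.exceptions.HTTPError):
        # raise_for_status() attaches the response; it is None if raised by hand
        status = exc.response.status_code if exc.response is not None else "unknown"
        return f"http error (status {status})"
    if isinstance(exc, requests.exceptions.ConnectionError):
        return "connection failed"
    return "request failed"

def fetch_html(url, timeout=10):
    """Return the page HTML, or None after printing what went wrong."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
        return response.text
    except requests.exceptions.RequestException as exc:
        print(f"Could not fetch {url}: {describe_request_error(exc)}")
        return None
```

The order of the isinstance checks matters here, because a connect timeout is both a Timeout and a ConnectionError in requests.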
Example call:
- Sets an example URL (https://example.com).
- Calls the crawler function and prints the result, showing only the first 500 characters of the content.
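
The text returned by get_text() can contain long runs of newlines and spaces, so the first 500 characters may be mostly whitespace. One possible refinement, sketched here with a hypothetical helper named preview_text, is to collapse whitespace before truncating.

```python
def preview_text(text, limit=500):
    """Collapse whitespace runs to single spaces, then keep the first `limit` characters."""
    # str.split() with no arguments splits on any run of whitespace
    collapsed = " ".join(text.split())
    return collapsed[:limit]

# Made-up input that imitates the blank lines get_text() can produce
print(preview_text("Example Domain\n\n\n   This domain is for use in examples.\n", limit=30))
```

In the example call, this would replace result[:500] with preview_text(result).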

This example shows how to write a simple web crawler in Python that fetches a web page and extracts its content.
