Web Scraping - BeautifulSoup

Posted on 2019-12-22. Updated on 2025-06-30. Categories: Python, Python tutorials, web scraping.

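bs4 Overview

Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits on top of an HTML or XML parser and provides Pythonic idioms for iterating over, searching, and modifying the parse tree.

Installation

Installing Beautiful Soup

pip install beautifulsoup4

Installing a parser

Beautiful Soup supports the HTML parser in Python's standard library, and it also supports several third-party parsers. One of them is the lxml parser.

pip install lxml

Another option is html5lib, a pure-Python parser that parses HTML the same way a web browser does.

pip install html5lib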

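The table below summarizes the strengths and weaknesses of each parser:

Parser	Typical usage	Advantages	Disadvantages
Python standard library (html.parser)	`BeautifulSoup(markup, "html.parser")`	Ships with Python; decent speed; lenient	Slower than lxml; less lenient than html5lib
lxml HTML parser	`BeautifulSoup(markup, "lxml")`	Fast; lenient	External C dependency
lxml XML parser	`BeautifulSoup(markup, "lxml-xml")` or `BeautifulSoup(markup, "xml")`	Fast; the only XML parser supported	External C dependency
html5lib	`BeautifulSoup(markup, "html5lib")`	Most lenient; parses documents the way a browser does; produces valid HTML5	Slow; external Python dependency

If you can, use lxml for its speed.

Note that if a document is not well-formed, different parsers can build different Beautiful Soup trees from it. See "Differences between parsers" in the Beautiful Soup documentation for details.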
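To make the note about malformed documents concrete, here is a minimal sketch that parses one invalid fragment with each parser and prints the resulting tree. The fragment `<a></p>` and the helper name `parse_with_each` are illustrative choices. lxml and html5lib are optional installs, so the loop skips any parser that is not available.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# An invalid fragment chosen for illustration: a closing </p> with no opening <p>.
broken = "<a></p>"


def parse_with_each(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser_name: serialized_tree} for every parser that is installed."""
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # The parser library is not installed; skip it.
            continue
    return results


trees = parse_with_each(broken)
for name, tree in trees.items():
    print(f"{name:12} -> {tree}")
```

Comparing the printed lines shows how each installed parser repairs the same broken input.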

Using Beautiful Soup

Basic usage

from bs4 import BeautifulSoup
import requests

# Build a soup from a fetched web page
url = "http://example.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Build a soup from a string
html_doc = "<html><head><title>Example</title></head><body><p>Content</p></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

# Build a soup from an open file
with open("index.html") as fp:
    soup = BeautifulSoup(fp, "html.parser")

Searching the Tree

`find()` and `find_all()`

# Find every <a> tag
soup.find_all('a')

# Find the tag whose id is "link2"
soup.find(id="link2")

# Find every tag whose class is "sister"
soup.find_all(class_="sister")

# Search with a regular expression
import re
soup.find_all(re.compile("^b"))  # every tag whose name starts with "b"
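The calls above assume a document that contains `link2` and `sister`. A self-contained sketch follows, using sample markup modeled on the "three sisters" document from the Beautiful Soup documentation. The markup is an assumption added for illustration.

```python
import re
from bs4 import BeautifulSoup

# Sample markup assumed for illustration.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

links = soup.find_all("a")                # list of every <a> tag
second = soup.find(id="link2")            # first match only, or None
sisters = soup.find_all(class_="sister")  # class_ avoids the Python keyword
b_tags = soup.find_all(re.compile("^b"))  # tag names starting with "b"

print(len(links))                # 3
print(second.get_text())         # Lacie
print([t.name for t in b_tags])  # ['body', 'b']
print(soup.find("table"))        # None
```

`find()` returns `None` when nothing matches, while `find_all()` returns an empty list, so check the result of `find()` before using it.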

CSS Selectors

Tag selectors

# Select every <p> tag
soup.select('p')

# Select every <div> and every <p> tag
soup.select('div, p')

Class selectors

# Select every element with class "content"
soup.select('.content')

# Select elements that have both the "main" and "article" classes
soup.select('.main.article')

# Select <p> tags with class "highlight"
soup.select('p.highlight')

ID selectors

# Select the element with id "header"
soup.select('#header')

# Select the <div> tag with id "sidebar"
soup.select('div#sidebar')

Attribute selectors

# Select every <a> tag that has an href attribute
soup.select('a[href]')

# Select <a> tags whose href is exactly "http://example.com"
soup.select('a[href="http://example.com"]')

# Select <a> tags whose href contains "example"
soup.select('a[href*="example"]')

# Select <a> tags whose href starts with "https"
soup.select('a[href^="https"]')

# Select <a> tags whose href ends with ".com"
soup.select('a[href$=".com"]')
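As a quick check of the attribute selectors, the sketch below runs each one against a small hand-written document. The markup and the helper `texts` are hypothetical and exist only for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = """
<a href="http://example.com">exact</a>
<a href="https://docs.example.org/page">docs</a>
<a href="https://shop.test.com">shop</a>
<a name="anchor-only">no href</a>
"""
soup = BeautifulSoup(html, "html.parser")


def texts(selector):
    """Return the text of every element matched by a CSS selector."""
    return [tag.get_text() for tag in soup.select(selector)]


print(texts('a[href]'))                       # ['exact', 'docs', 'shop']
print(texts('a[href="http://example.com"]'))  # ['exact']
print(texts('a[href*="example"]'))            # ['exact', 'docs']
print(texts('a[href^="https"]'))              # ['docs', 'shop']
print(texts('a[href$=".com"]'))               # ['exact', 'shop']
```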

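Combinators

# Select every <p> tag anywhere inside a <div>
soup.select('div p')

# Select every element with class "article" inside the element with id "main"
soup.select('#main .article')

# Select <li> elements that are direct children of a <ul>
soup.select('ul > li')

# Select <a> tags that are direct children of the element with id "nav"
soup.select('#nav > a')

# Select the <p> element that immediately follows an <h1>
soup.select('h1 + p')

# Select every <p> sibling that comes after an <h2>
soup.select('h2 ~ p')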
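The difference between the descendant, child, and sibling combinators is easiest to see side by side. This is a small sketch on hypothetical markup. The `texts` helper is an assumption of the example, not part of bs4.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = """
<div id="main">
  <h1>Title</h1>
  <p class="article">first</p>
  <section><p class="article">nested</p></section>
  <h2>Sub</h2>
  <p>after h2</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")


def texts(selector):
    """Return the text of every element matched by a CSS selector."""
    return [tag.get_text() for tag in soup.select(selector)]


print(texts('#main .article'))    # ['first', 'nested']   any depth
print(texts('#main > .article'))  # ['first']             direct children only
print(texts('h1 + p'))            # ['first']             next sibling element
print(texts('h1 ~ p'))            # ['first', 'after h2'] all later siblings
print(texts('h2 ~ p'))            # ['after h2']
```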

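Pseudo-class selectors

# Select the first <p> among the children of each parent
soup.select('p:first-of-type')

# Select the last <li> among the children of each parent
soup.select('li:last-of-type')

# Select the second <p> among the children of each parent
soup.select('p:nth-of-type(2)')

# Select <li> tags at odd positions (1st, 3rd, ...) among their <li> siblings
soup.select('li:nth-of-type(odd)')

# Select <tr> tags at even positions (2nd, 4th, ...) among their <tr> siblings
soup.select('tr:nth-of-type(even)')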
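The `-of-type` selectors count only siblings with the same tag name, which is easy to confuse with `:nth-child`. The sketch below contrasts the two on hypothetical markup. The `texts` helper is an assumption of the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = """
<div><h3>heading</h3><p>a</p><p>b</p></div>
<ul><li>one</li><li>two</li><li>three</li><li>four</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")


def texts(selector):
    """Return the text of every element matched by a CSS selector."""
    return [tag.get_text() for tag in soup.select(selector)]


print(texts('li:first-of-type'))     # ['one']
print(texts('li:last-of-type'))      # ['four']
print(texts('li:nth-of-type(odd)'))  # ['one', 'three']
print(texts('li:nth-of-type(even)')) # ['two', 'four']

# The first <p> is the second child of the <div>, because <h3> comes first.
print(texts('p:first-of-type'))      # ['a']
print(texts('p:nth-child(1)'))       # []
print(texts('p:nth-child(2)'))       # ['a']
```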

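Getting Tag Attributes

Dictionary-style access

# Get a single attribute value; raises KeyError if the attribute is missing
tag['attribute_name']

# Example: get the href attribute of an <a> tag
link = soup.find('a')
href = link['href']  # read the href attribute
print(href)

Getting attributes with `get()`

# A safer lookup that accepts a default value
tag.get('attribute_name', default=None)

# Example: safely read the src attribute of an image
img = soup.find('img')
src = img.get('src', 'default_image.jpg')  # returns the default if src is missing
print(src)

Getting all attributes

# Get a dictionary of all attributes on the tag
tag.attrs

# Example: get every attribute of a <div>
div = soup.find('div')
attributes = div.attrs  # a dict such as {'class': ['container'], 'id': 'main'}
print(attributes)

Checking whether an attribute exists

# Check whether the tag has a given attribute
'attribute_name' in tag.attrs

# Example: check for a class attribute
has_class = 'class' in tag.attrs
print(has_class)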
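The four access patterns above can be compared on a single tag. This is a minimal sketch on hypothetical markup. Note that `class` comes back as a list, because HTML treats it as a multi-valued attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = '<a id="home" class="nav active" href="/index.html">Home</a>'
tag = BeautifulSoup(html, "html.parser").a

print(tag["href"])            # /index.html
print(tag["class"])           # ['nav', 'active']
print(tag.get("title"))       # None
print(tag.get("title", "-"))  # -
print("href" in tag.attrs)    # True
print(tag.has_attr("title"))  # False

try:
    tag["title"]              # a missing attribute raises KeyError with []
except KeyError as exc:
    print("missing attribute:", exc)
```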

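Getting Tag Content

Getting text

The string attribute

# Returns the single string inside the tag; it is None when the tag has more than one child
tag.string

# Example:
title = soup.find('title').string
print(title)

get_text() and the text property

# Get the text of the tag and all of its descendants (the most common choice)
tag.get_text(separator=' ', strip=False)
tag.text  # property, equivalent to get_text() with default arguments

# Example:
paragraph = soup.find('p')
print(paragraph.get_text())  # all text, including text inside child tags
print(paragraph.get_text('|', strip=True))  # strip each piece of text and join the pieces with |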
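A short sketch of when `.string` returns `None` and `get_text()` is the better choice. The markup is hypothetical and chosen so that each `<p>` shows a different case.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one plain <p>, one with mixed content, one with a single child tag.
html = "<div><p>Only text</p><p>Mixed <b>bold</b> text</p><p><i>wrapped</i></p></div>"
soup = BeautifulSoup(html, "html.parser")
single, mixed, wrapped = soup.find_all("p")

print(single.string)                     # Only text
print(mixed.string)                      # None (more than one child)
print(wrapped.string)                    # wrapped (looks through the single child tag)
print(mixed.get_text())                  # Mixed bold text
print(mixed.get_text("|", strip=True))   # Mixed|bold|text
print(mixed.text == mixed.get_text())    # True
```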

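Getting multiple text nodes

# Iterate over every string inside the tag
tag.strings  # generator over all strings, including whitespace-only ones
tag.stripped_strings  # generator over stripped strings; whitespace-only strings are skipped

# Example:
div = soup.find('div')
for string in div.stripped_strings:
    print(string)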
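To see what the two generators actually yield, the sketch below collects both into lists for a small indented document. The markup is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with indentation, so whitespace-only strings appear.
html = """<div>
  <h2> Title </h2>
  <p>Body <em>text</em></p>
</div>"""
div = BeautifulSoup(html, "html.parser").div

raw = list(div.strings)             # keeps newlines and indentation
clean = list(div.stripped_strings)  # stripped, blanks dropped

print(raw)
print(clean)  # ['Title', 'Body', 'text']
```

Joining `raw` with an empty string gives the same result as `div.get_text()`.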

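Getting content with its HTML markup

# Get content including the HTML tags
str(tag)  # includes the tag itself
tag.decode_contents()  # only the tag's children

# Example:
div = soup.find('div')
print(str(div))  # prints the whole <div>...</div>
print(div.decode_contents())  # prints only what is inside the <div>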
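The two serializations are sometimes called "outer" and "inner" HTML. A minimal sketch on hypothetical markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = '<div id="box"><p>Hello <b>world</b></p></div>'
div = BeautifulSoup(html, "html.parser").div

outer = str(div)               # the tag and everything inside it
inner = div.decode_contents()  # only the children

print(outer)  # <div id="box"><p>Hello <b>world</b></p></div>
print(inner)  # <p>Hello <b>world</b></p>
```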

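Modifying the Tree

`replace_with()` - replace a node

Purpose: replace the current tag or node entirely with new content.
Characteristics:

The original tag and everything inside it are removed from the tree
The replacement can be a tag or a string (a string is inserted as text, not parsed as HTML)
Returns the node that was replaced

Example:

from bs4 import BeautifulSoup

html = "<p>Original paragraph</p>"
soup = BeautifulSoup(html, 'html.parser')
p = soup.p

# Replace with a new tag
new_div = soup.new_tag("div")
new_div.string = "New content"
p.replace_with(new_div)
print(soup)  # Output: <div>New content</div>

# Replace with a string. p is no longer part of the tree at this point,
# so call replace_with() on the node that is currently in the tree.
new_div.replace_with("Plain text content")
print(soup)  # Output: Plain text content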
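One detail worth checking is what happens to the old node. The sketch below shows that `replace_with()` returns the node it removed, that the node is then detached from the tree, and that calling `replace_with()` on a detached node raises `ValueError`. The markup is hypothetical.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>old</p></div>", "html.parser")
p = soup.p

new_tag = soup.new_tag("span")
new_tag.string = "new"
replaced = p.replace_with(new_tag)

print(soup)           # <div><span>new</span></div>
print(replaced is p)  # True
print(p.parent)       # None

raised = False
try:
    p.replace_with("again")  # p is detached, so there is nothing to replace
except ValueError as exc:
    raised = True
    print("cannot replace a detached node:", exc)
```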

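`unwrap()` - remove a tag but keep its contents

Purpose: remove the current tag while keeping its contents.
Characteristics:

Equivalent to "unpacking" the current tag
The child nodes take the tag's place in its parent
Returns the tag that was removed

Example:

html = "<div><p><b>Bold text</b></p></div>"
soup = BeautifulSoup(html, 'html.parser')
p_tag = soup.p

# Unwrap the <p> tag
removed_tag = p_tag.unwrap()

print("Current document:")
print(soup)  # Output: <div><b>Bold text</b></div>

print("\nReturn type:", type(removed_tag))  # Output: <class 'bs4.element.Tag'>
print("Return value:", removed_tag)          # Output: <p></p> (an empty tag)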

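`decompose()` - destroy a node

Purpose: remove the current tag and everything inside it from the tree.
Characteristics:

The node is destroyed and cannot be recovered
Returns nothing
Differs from extract(), which returns the removed node

Example:

html = "<div><p>Keep this</p><p>Remove this</p></div>"
soup = BeautifulSoup(html, 'html.parser')
to_remove = soup.find_all('p')[1]

to_remove.decompose()
print(soup)  # Output: <div><p>Keep this</p></div>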
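A common use of `decompose()` is stripping elements you never want in the output, such as `<script>` and `<style>`, before extracting text. This is a minimal sketch on hypothetical markup. The tag names to remove are an assumption of the example.

```python
from bs4 import BeautifulSoup

# Hypothetical page for illustration.
html = """
<html><head><style>p { color: red; }</style></head>
<body><p>Visible text</p><script>console.log("hidden");</script></body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# Passing a list to find_all() matches any of the listed tag names.
for tag in soup.find_all(["script", "style"]):
    tag.decompose()

text = soup.get_text(strip=True)
print(text)                 # Visible text
print(soup.find("script"))  # None
```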

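`clear()` - empty a tag

Purpose: remove all child nodes and content of the current tag.
Characteristics:

The tag itself is kept
All child nodes and text are removed
Often used to "reset" a tag's content

Example:

html = "<div><p>Content 1</p><p>Content 2</p></div>"
soup = BeautifulSoup(html, 'html.parser')
div = soup.div

div.clear()
print(soup)  # Output: <div></div>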

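`insert()` - insert at a given position

Purpose: insert content at a given position in the tag's list of children.
Parameters:

position: where to insert (0-based index)
new_content: the content to insert

Characteristics:

Accepts a string or a tag
Existing content is not replaced; the new node is only inserted

Example:

html = "<ul><li>Item one</li><li>Item three</li></ul>"
soup = BeautifulSoup(html, 'html.parser')
ul = soup.ul

# Insert a new item at position 1
new_li = soup.new_tag("li")
new_li.string = "Item two"
ul.insert(1, new_li)
print(soup)
# Output: <ul><li>Item one</li><li>Item two</li><li>Item three</li></ul>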
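When you hold a reference to a neighbouring node and not an index, `insert_before()` and `insert_after()` are often more convenient than `insert()`. The sketch below shows both, and also that the position passed to `insert()` counts text nodes as children. The markup and the helper `make_li` are hypothetical.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>B</li></ul><p>one<b>two</b></p>", "html.parser")


def make_li(text):
    """Build a detached <li> holding the given text (helper for this example)."""
    li = soup.new_tag("li")
    li.string = text
    return li


# Keep a reference to the existing item; soup.li would change once "A" is inserted.
li_b = soup.li
li_b.insert_before(make_li("A"))
li_b.insert_after(make_li("C"))
print(soup.ul)  # <ul><li>A</li><li>B</li><li>C</li></ul>

# insert() indexes into .contents, which includes text nodes:
# the children of <p> are "one" (index 0) and <b> (index 1).
soup.p.insert(1, "X")
print(soup.p)   # <p>oneX<b>two</b></p>
```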

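`append()` - add content at the end

Purpose: add new content at the end of the current tag.
Characteristics:

Accepts a string or a tag
Works like append() on a Python list
Modifies the document in place

Example:

html = "<div>Start</div>"
soup = BeautifulSoup(html, 'html.parser')
div = soup.div

# Append a string
div.append("Middle")

# Append a new tag
new_span = soup.new_tag("span")
new_span.string = "End"
div.append(new_span)

print(soup)  # Output: <div>StartMiddle<span>End</span></div>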

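`extract()` - extract a node

Purpose: remove a tag or string from the tree and return it.
Characteristics:

Complete removal: the node and everything inside it leave their original position
Reusable: the returned node stays intact and can be inserted elsewhere
Destructive: the structure of the original document changes
Compared with unwrap():
- unwrap() removes only the tag shell; the contents stay in place
- extract() removes the whole node (tag and contents)

Example:

from bs4 import BeautifulSoup

html = "<div><p>Paragraph 1</p><p id='target'>Paragraph 2</p></div>"
soup = BeautifulSoup(html, "html.parser")

# Extract one specific paragraph
target = soup.find(id="target").extract()

print("Remaining document:")
print(soup)  # Output: <div><p>Paragraph 1</p></div>

print("\nExtracted node:")
print(target)  # Output: <p id="target">Paragraph 2</p>

# Create a new container and put the extracted node inside it
new_section = soup.new_tag("section")
new_section.append(target)

# Insert the container at the very start of the soup
soup.insert(0, new_section)
print("\nAfter insertion:")
print(soup)  # <section><p id="target">Paragraph 2</p></section><div><p>Paragraph 1</p></div>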

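`wrap()` - wrap a node

Purpose: wrap the current tag or string in a new parent tag.
Characteristics:

Non-destructive: the existing content is left unchanged
Returns the wrapper: the return value is the new wrapper tag
Chainable: repeated calls build several layers
Opposite of unwrap(): adds structure instead of removing it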

Example:

html = "<p>Content to emphasize</p>"
soup = BeautifulSoup(html, "html.parser")
p = soup.p

# Basic wrap
container = soup.new_tag("div")
container["class"] = "container"  # set the class attribute
p.wrap(container)

# Chained wrap (builds a nested structure)
p.wrap(soup.new_tag("article")).wrap(soup.new_tag("section"))

print(soup.prettify())
"""
Output:
<div class="container">
 <section>
  <article>
   <p>
    Content to emphasize
   </p>
  </article>
 </section>
</div>
"""

bs4 Examples

Example 1: Scraping Romance of the Three Kingdoms (三国演义)

import urllib.request
import urllib.parse
from bs4 import BeautifulSoup
import time


def handle_request(url):
    # Build a request object with a browser-like User-Agent header
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'
    }
    request = urllib.request.Request(url, headers=headers)
    return request


# Download the text of one chapter
def download_text(href):
    # Build the request object
    request = handle_request(href)
    # Fetch the page content
    content = urllib.request.urlopen(request).read().decode()
    soup = BeautifulSoup(content, 'lxml')
    # Locate the chapter body
    odiv = soup.find('div', class_="chapter_content")
    return odiv.text


def parse_content(content):
    # Parse the table-of-contents page
    soup = BeautifulSoup(content, 'lxml')
    # Equivalent regular expressions, for reference:
    # <div class="book-mulu">(.*?)</div>
    # <li><a href="(/book/sanguoyanyi/\d+\.html)">(.*?)</a></li>
    a_list = soup.select('.book-mulu > ul > li > a')
    # Walk the list: read each title and href, then download the chapter
    for element in a_list:
        # Text of the link is the chapter title
        title = element.text
        # The href attribute is a site-relative path
        href = 'http://www.shicimingju.com' + element['href']
        # Download the chapter behind href
        print('Start downloading---%s' % title)
        text = download_text(href)
        print('Finished downloading---%s' % title)
        string = title + '\n' + text + '\n'
        # Append the title and text to the output file
        with open('三国演义.txt', 'a', encoding='utf8') as fp:
            fp.write(string)
        # Pause between requests
        time.sleep(2)


def main():
    url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
    # Build the request object
    request = handle_request(url)
    # Send the request and read the response
    content = urllib.request.urlopen(request).read().decode()
    # Parse the content
    parse_content(content)


if __name__ == '__main__':
    main()
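The parsing steps of this example can be tried offline, without touching the site. The sketch below separates parsing from downloading and returns `None` when the expected `<div>` is missing, so that `.text` is never read from a missing node. The function names `extract_links` and `extract_chapter` and the two sample HTML strings are hypothetical. They only mirror the selectors used above, and `html.parser` is used so the sketch runs without lxml.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = "http://www.shicimingju.com"


def extract_links(html, base=BASE_URL):
    """Return (title, absolute_url) pairs from a table-of-contents page."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (a.get_text(strip=True), urljoin(base, a["href"]))
        for a in soup.select(".book-mulu > ul > li > a")
    ]


def extract_chapter(html):
    """Return the chapter text, or None when the expected <div> is missing."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.find("div", class_="chapter_content")
    if node is None:
        return None
    return node.get_text("\n", strip=True)


# Hand-written sample pages, assumed for illustration only.
catalog = '<div class="book-mulu"><ul><li><a href="/book/sanguoyanyi/1.html">Chapter 1</a></li></ul></div>'
chapter = '<div class="chapter_content"><p>Line one.</p><p>Line two.</p></div>'

print(extract_links(catalog))
print(extract_chapter(chapter))
print(extract_chapter("<p>unexpected page</p>"))  # None
```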

Example 2: Scraping Zhilian (zhaopin.com) job listings

import urllib.request
import urllib.parse
from bs4 import BeautifulSoup
import json
import time


# http://sou.zhaopin.com/jobs/searchresult.ashx

class ZhiLianSpider(object):
    # Fixed part of the URL; the query parameters are appended to it
    url = 'http://sou.zhaopin.com/jobs/searchresult.ashx?'

    def __init__(self, jl, kw, start_page, end_page):
        # Store the arguments as instance attributes
        self.jl = jl
        self.kw = kw
        self.start_page = start_page
        self.end_page = end_page
        # List that collects every job record
        self.items = []

    # Build the URL for a given page and return a request object
    def handle_request(self, page):
        data = {
            'jl': self.jl,
            'kw': self.kw,
            'p': page
        }
        url_now = self.url + urllib.parse.urlencode(data)
        # Build the request object
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'
        }
        request = urllib.request.Request(url=url_now, headers=headers)
        return request

    # Parse one result page
    def parse_content(self, content):
        # Build the soup
        soup = BeautifulSoup(content, 'lxml')
        # Approach: each job posting is one table, so collect all tables, skip the
        # first one (the header), and use select() on each table to read its fields
        table_list = soup.select('#newlist_list_content_table > table')[1:]
        # Walk the tables and read each record
        for table in table_list:
            # Job title
            zwmc = table.select('.zwmc > div > a')[0].text
            # Company name
            gsmc = table.select('.gsmc > a')[0].text
            # Monthly salary
            zwyx = table.select('.zwyx')[0].text
            # Work location
            gzdd = table.select('.gzdd')[0].text
            # Date posted or updated
            gxsj = table.select('.gxsj > span')[0].text
            # Store the record in a dict (keys: job title, company name,
            # monthly salary, work location, update time)
            item = {
                '职位名称': zwmc,
                '公司名称': gsmc,
                '职位月薪': zwyx,
                '工作地点': gzdd,
                '更新时间': gxsj,
            }
            # Add the dict to the list
            self.items.append(item)

    # Run the crawl
    def run(self):
        # Loop over the requested page range
        for page in range(self.start_page, self.end_page + 1):
            print('Start crawling page %s' % page)
            request = self.handle_request(page)
            # Send the request and read the content
            content = urllib.request.urlopen(request).read().decode()
            # Parse the content
            self.parse_content(content)
            print('Finished crawling page %s' % page)
            time.sleep(2)

        # Save the collected records to a file
        string = json.dumps(self.items, ensure_ascii=False)
        with open('zhilian.txt', 'w', encoding='utf8') as fp:
            fp.write(string)


def main():
    jl = input('Enter the work location: ')
    kw = input('Enter the job keyword: ')
    start_page = int(input('Enter the first page number: '))
    end_page = int(input('Enter the last page number: '))

    # Create the spider and start crawling
    spider = ZhiLianSpider(jl, kw, start_page, end_page)
    spider.run()


if __name__ == '__main__':
    main()
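Indexing `select(...)[0]` raises `IndexError` as soon as one field is missing from a row. A more tolerant variant can be sketched with `select_one()`, which returns `None` when nothing matches. The helper names `text_or_none` and `parse_rows`, the English dict keys, and the sample markup below are hypothetical. They only mirror the class names used in the example above, and `html.parser` is used so the sketch runs without lxml.

```python
from bs4 import BeautifulSoup


def text_or_none(node, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    found = node.select_one(selector)
    return found.get_text(strip=True) if found is not None else None


def parse_rows(html):
    """Parse every job table after the header table into a list of dicts."""
    soup = BeautifulSoup(html, "html.parser")
    tables = soup.select("#newlist_list_content_table > table")[1:]
    return [
        {
            "title": text_or_none(table, ".zwmc > div > a"),
            "company": text_or_none(table, ".gsmc > a"),
            "salary": text_or_none(table, ".zwyx"),
            "location": text_or_none(table, ".gzdd"),
            "updated": text_or_none(table, ".gxsj > span"),
        }
        for table in tables
    ]


# Hand-written sample page, assumed for illustration; the date cell is left out on purpose.
sample = """
<div id="newlist_list_content_table">
  <table><tr><th>header</th></tr></table>
  <table><tr>
    <td class="zwmc"><div><a>Python Developer</a></div></td>
    <td class="gsmc"><a>Example Co.</a></td>
    <td class="zwyx">10000-15000</td>
    <td class="gzdd">Beijing</td>
  </tr></table>
</div>
"""
rows = parse_rows(sample)
print(rows)
```

A missing field then shows up as `None` in the record and does not abort the whole page.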
