[Python] Using BeautifulSoup


title: [Python] BeautifulSoup Basics

type: categories

date: 2017-02-24 14:26:55

categories: Python


tags:


BeautifulSoup is a powerful Python library for parsing HTML, XML, and similar documents.


Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.


from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>這個是b標簽</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

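# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html_doc, "html.parser")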

1. Tag objects


tag = soup.p
print(type(tag)) # <class 'bs4.element.Tag'>

A tag's attributes can be added, deleted, or modified; they behave like dictionary entries.


tag['class'] = "very"   # modify an attribute value
tag['id'] = 1           # add an id attribute
print(tag)              # <p class="very" id="1"><b>這個是b標簽</b></p>

del tag['id']           # delete the id attribute
print(tag)              # <p class="very"><b>這個是b標簽</b></p>

Multi-valued attributes (such as class) are returned as a list.


css_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")
print(css_soup.p['class']) # ['body', 'strikeout']

Attributes that are not multi-valued (such as id) are returned as a single string.


n_css_soup = BeautifulSoup('<p id="body name"></p>', "html.parser")
print(n_css_soup.p['id']) # body name

2. NavigableString: the string inside a tag
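
The two cases above can be put side by side on a single tag. The following is a minimal sketch, assuming the built-in "html.parser"; the names attr_soup and p are only illustrative. It also shows Tag.get(), which returns None for a missing attribute instead of raising KeyError the way tag['name'] does.

```python
from bs4 import BeautifulSoup

# Hypothetical sample tag carrying one multi-valued and one single-valued attribute
attr_soup = BeautifulSoup('<p class="body strikeout" id="body name"></p>', "html.parser")
p = attr_soup.p

print(p.attrs)          # every attribute of the tag, as a dict
print(p.get("class"))   # ['body', 'strikeout']  (class is multi-valued)
print(p.get("id"))      # body name              (id is a single string)
print(p.get("title"))   # None                   (missing attribute, no KeyError)
```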


print(type(tag.string))         # <class 'bs4.element.NavigableString'>
print(tag.string) # 這個是b標簽
tag.string.replace_with('you are beautiful')
print(tag.string) # you are beautiful

3. BeautifulSoup object


print(type(soup))           # <class 'bs4.BeautifulSoup'>
print(soup.name) # [document]

4. Comment and other special strings


A Comment object is a special type of NavigableString.


markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
mark_soup = BeautifulSoup(markup, "html.parser")
print(type(mark_soup.b.string)) # <class 'bs4.element.Comment'>
print(mark_soup.b.string) # Hey, buddy. Want to buy a used parser?
print(mark_soup.b.prettify()) # a tag containing a comment can also be pretty-printed with prettify()
#<b>
# <!--Hey, buddy. Want to buy a used parser?-->
#</b>
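
Because Comment is a subclass of NavigableString, an isinstance() check against NavigableString alone will also match comments. A minimal sketch of telling the two apart follows; the markup and the names demo_soup, first, and visible are illustrative, not from the original post.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical markup: a comment followed by ordinary text inside one tag
demo_markup = "<b><!--Hey, buddy. Want to buy a used parser?-->visible text</b>"
demo_soup = BeautifulSoup(demo_markup, "html.parser")

first = demo_soup.b.contents[0]
is_comment = isinstance(first, Comment)            # True
is_navigable = isinstance(first, NavigableString)  # also True: Comment subclasses it

# Keep only ordinary text nodes by filtering Comment instances out
visible = [str(node) for node in demo_soup.b.contents
           if isinstance(node, NavigableString) and not isinstance(node, Comment)]
print(is_comment, is_navigable)  # True True
print(visible)                   # ['visible text']
```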

contents


Returns a tag's direct children as a list.


print(soup.head)                # <head><title>The Dormouse's story</title></head>
print(soup.head.contents) # [<title>The Dormouse's story</title>]
print(soup.head.contents[0]) # <title>The Dormouse's story</title>

children


An iterator over a tag's direct children.


print(type(soup.head.children))     # <class 'list_iterator'> (the exact type may vary by bs4 version)
for child in soup.head.children:
    print(child)

descendants


A generator over all descendants: it recursively yields the children, then their children, and so on.


print(type(soup.head.descendants))  # <class 'generator'>
for child in soup.head.descendants:
    print(child)
# <title>The Dormouse's story</title>   (child node)
# The Dormouse's story                  (grandchild node)
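
To make the difference between children and descendants concrete, here is a small sketch on a made-up fragment; tree_soup, div, direct, and everything are illustrative names only.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two levels of nesting
tree_soup = BeautifulSoup("<div><p>one <b>two</b></p><p>three</p></div>", "html.parser")
div = tree_soup.div

direct = list(div.children)         # only the two <p> tags
everything = list(div.descendants)  # depth-first walk: tags and strings at every level

print(len(direct))      # 2
print(len(everything))  # 6: <p>, 'one ', <b>, 'two', <p>, 'three'
```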

strings


Yields every string inside the tags, whitespace included.


for s in soup.strings:   # use a name other than str to avoid shadowing the built-in
    print(repr(s))
# '\n'
# "The Dormouse's story"
# '\n'
# '\n'
# 'you are beautiful'
# '\n'
# 'Once upon a time there were three little sisters; and their names were\n'
# 'Elsie'
# ',\n'
# 'Lacie'
# ' and\n'
# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '\n'
# '...'
# '\n'

stripped_strings


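Like strings, but strips leading and trailing whitespace from each string and skips strings that are whitespace only.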


for s in soup.stripped_strings:   # use a name other than str to avoid shadowing the built-in
    print(repr(s))
# "The Dormouse's story"
# 'you are beautiful'
# 'Once upon a time there were three little sisters; and their names were'
# 'Elsie'
# ','
# 'Lacie'
# 'and'
# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '...'
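
A common use of stripped_strings is to collapse a tag's text into one clean line. The sketch below assumes a made-up fragment (text_soup, joined, and via_get_text are illustrative names) and shows that get_text() with a separator and strip=True gives the same result.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with stray whitespace around the text
text_soup = BeautifulSoup("<p>\n  Hello, <b> world </b>\n</p>", "html.parser")

joined = " ".join(text_soup.stripped_strings)
via_get_text = text_soup.get_text(" ", strip=True)

print(joined)        # Hello, world
print(via_get_text)  # Hello, world
```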

http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/#id7
