Python Web Scraping: Getting Page Element Attributes with Selenium

From CloudWiki

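Now that we have several ways to locate elements on a page, the next step is to read the attributes of those elements. This is especially useful when Selenium is used for web scraping.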

Getting attributes with get_attribute

Take the logo on the Baidu home page as an example and read its attributes.

<img hidefocus="true" id="s_lg_img" class="index-logo-src" src="//www.baidu.com/img/PCtm_d9c8750bed0b3c7d089fa7d55720d6cf.png" width="270" height="129" onerror="this.src='//www.baidu.com/img/flexible/logo/pc/index.png';this.onerror=null;" usemap="#mp">
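
As a quick offline check of which attribute names this tag carries, the sketch below parses the snippet with Python's standard html.parser module. It is only an illustration and not part of the Selenium workflow: the class name AttrCollector is made up for this example, and the onerror attribute is left out for brevity. The parser reports the raw values from the markup, whereas Selenium's get_attribute may return a resolved value for attributes such as src.

```python
from html.parser import HTMLParser

# The logo tag from above, with the onerror attribute omitted for brevity.
LOGO_HTML = (
    '<img hidefocus="true" id="s_lg_img" class="index-logo-src" '
    'src="//www.baidu.com/img/PCtm_d9c8750bed0b3c7d089fa7d55720d6cf.png" '
    'width="270" height="129" usemap="#mp">'
)


class AttrCollector(HTMLParser):
    """Hypothetical helper: remembers the attributes of the last <img> tag seen."""

    def __init__(self):
        super().__init__()
        self.attrs = {}

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "img":
            self.attrs = dict(attrs)


collector = AttrCollector()
collector.feed(LOGO_HTML)
collector.close()

for name, value in collector.attrs.items():
    print(f"{name} = {value}")
```

Any of the printed names (id, class, src, width, height, and so on) can be passed to get_attribute on the located element.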

Get the logo's image URL:

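from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time

# Raw string, so the backslashes in the Windows path are not read as escapes
service = Service(r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe")
browser = webdriver.Chrome(service=service)

browser.get('https://www.baidu.com')
time.sleep(2)  # crude wait for the page to finish loading

# The find_element_by_* helpers were removed in Selenium 4; use find_element with By
logo = browser.find_element(By.CLASS_NAME, 'index-logo-src')
print(logo)
print(logo.get_attribute('src'))

time.sleep(2)

# Close the browser
# browser.close()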

Output:

<selenium.webdriver.remote.webelement.WebElement (session="e95b18c43a330745af019e0041f0a8a4", element="7dad5fc0-610b-45b6-b543-9e725ee6cc5d")>
https://www.baidu.com/img/PCtm_d9c8750bed0b3c7d089fa7d55720d6cf.png
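
Note that the src in the markup is protocol-relative (//www.baidu.com/...), while the printed value is a full https:// URL: for URL attributes such as src, the value shown in the output above is the resolved one. The sketch below reproduces that resolution with urllib.parse.urljoin from the standard library, assuming the page URL is https://www.baidu.com/. It illustrates the resolution rule only and does not claim to be how Selenium or the browser implements it.

```python
from urllib.parse import urljoin

# Assumed page URL (the address loaded by browser.get above)
page_url = "https://www.baidu.com/"

# Raw src value as written in the HTML: protocol-relative, no scheme
raw_src = "//www.baidu.com/img/PCtm_d9c8750bed0b3c7d089fa7d55720d6cf.png"

# urljoin borrows the scheme from the base URL
resolved = urljoin(page_url, raw_src)
print(resolved)
```

The same call is handy when a scraper collects raw href or src values from page source and has to turn them into absolute URLs itself.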

Getting text and links

Take the hot search list as an example and read the text and link of an entry.

<a class="title-content tag-width c-link c-font-medium c-line-clamp1" href="https://www.baidu.com/s?cl=3&tn=baidutop10&fr=top1000&wd=%E5%90%84%E5%9C%B0%E8%B4%AF%E5%BD%BB%E5%8D%81%E4%B9%9D%E5%B1%8A%E5%85%AD%E4%B8%AD%E5%85%A8%E4%BC%9A%E7%B2%BE%E7%A5%9E%E7%BA%AA%E5%AE%9E&rsv_idx=2&rsv_dl=fyb_n_homepage&sa=fyb_n_homepage&hisfilter=1" target="_blank"><span class="title-content-index c-index-single c-index-single-hot1">1</span><span class="title-content-title">各地贯彻十九届六中全会精神纪实</span></a>

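The text of an entry is available through the text property, which can be read directly. It returns the visible text of the element and all of its descendants, so the rank number from the first span is included in the result.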

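from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time

# Raw string, so the backslashes in the Windows path are not read as escapes
service = Service(r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe")
browser = webdriver.Chrome(service=service)

browser.get('https://www.baidu.com')
time.sleep(2)  # crude wait for the page to finish loading

# First entry of the hot search list, located by XPath
hot_item = browser.find_element(By.XPATH, '//*[@id="hotsearch-content-wrapper"]/li[1]/a')
# Equivalent CSS selector:
# hot_item = browser.find_element(By.CSS_SELECTOR, '#hotsearch-content-wrapper > li:nth-child(1) > a')

print(hot_item.text)
print(hot_item.get_attribute('href'))

time.sleep(2)

# Close the browser
# browser.close()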

Output:

1各地贯彻十九届六中全会精神纪实
https://www.baidu.com/s?cl=3&tn=baidutop10&fr=top1000&wd=%E5%90%84%E5%9C%B0%E8%B4%AF%E5%BD%BB%E5%8D%81%E4%B9%9D%E5%B1%8A%E5%85%AD%E4%B8%AD%E5%85%A8%E4%BC%9A%E7%B2%BE%E7%A5%9E%E7%BA%AA%E5%AE%9E&rsv_idx=2&rsv_dl=fyb_n_homepage&sa=fyb_n_homepage&hisfilter=1
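
The printed text starts with the rank "1" because it is concatenated with the title. In this particular link the wd query parameter carries the title on its own, percent-encoded. As a minimal sketch, the standard urllib.parse module can split the href and decode that parameter. It assumes the link keeps this query layout, which the site may change at any time.

```python
from urllib.parse import urlsplit, parse_qs

# The href printed above
href = (
    "https://www.baidu.com/s?cl=3&tn=baidutop10&fr=top1000"
    "&wd=%E5%90%84%E5%9C%B0%E8%B4%AF%E5%BD%BB%E5%8D%81%E4%B9%9D%E5%B1%8A"
    "%E5%85%AD%E4%B8%AD%E5%85%A8%E4%BC%9A%E7%B2%BE%E7%A5%9E%E7%BA%AA%E5%AE%9E"
    "&rsv_idx=2&rsv_dl=fyb_n_homepage&sa=fyb_n_homepage&hisfilter=1"
)

# parse_qs maps each parameter name to a list of percent-decoded values
query = parse_qs(urlsplit(href).query)
keyword = query["wd"][0]

print(keyword)
```

Within Selenium itself, another option is to locate the inner span with class title-content-title and read its text instead.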

Getting other properties

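Besides attributes and text, an element also exposes properties such as its id, location, tag name, and size. Note that id here is the internal element reference assigned by WebDriver, not the HTML id attribute (s_lg_img for the logo).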

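from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time

# Raw string, so the backslashes in the Windows path are not read as escapes
service = Service(r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe")
browser = webdriver.Chrome(service=service)

browser.get('https://www.baidu.com')
time.sleep(2)  # crude wait for the page to finish loading

logo = browser.find_element(By.CLASS_NAME, 'index-logo-src')
print(logo.id)        # internal WebDriver element reference
print(logo.location)  # top-left corner of the element, in pixels
print(logo.tag_name)  # HTML tag name
print(logo.size)      # rendered height and width, in pixels

# Close the browser
# browser.close()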

Output:

6af39c9b-70e8-4033-8a74-7201ae09d540
{'x': 490, 'y': 46}
img
{'height': 129, 'width': 270}
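
The location and size dictionaries together describe the element's bounding box. The sketch below combines them to get the box edges and the center point, using the values printed above. The function name bounding_box is made up for this example. Selenium's WebElement also has a rect property that returns position and size in a single dictionary, which may be more convenient in real code.

```python
def bounding_box(location, size):
    """Hypothetical helper: combine Selenium-style location and size dicts."""
    left, top = location["x"], location["y"]
    return {
        "left": left,
        "top": top,
        "right": left + size["width"],
        "bottom": top + size["height"],
        # Center point, e.g. as a target for a click at coordinates
        "center": (left + size["width"] / 2, top + size["height"] / 2),
    }


# Values taken from the output above
box = bounding_box({'x': 490, 'y': 46}, {'height': 129, 'width': 270})
print(box)
```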