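Python Web Scraping: Getting Page Element Attributes with Selenium
From CloudWiki
Now that we have several ways to locate elements on a page, the next step is to read those elements' attributes, which is especially useful when using Selenium for web scraping.
Getting attributes with get_attribute
Taking the logo on the Baidu homepage as an example, we read the logo's attributes: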
<img hidefocus="true" id="s_lg_img" class="index-logo-src" src="//www.baidu.com/img/PCtm_d9c8750bed0b3c7d089fa7d55720d6cf.png" width="270" height="129" onerror="this.src='//www.baidu.com/img/flexible/logo/pc/index.png';this.onerror=null;" usemap="#mp">
Get the logo's image URL:
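    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    import time

    # Raw string so the backslashes in the Windows path are not treated as escapes
    service = Service(r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe")
    browser = webdriver.Chrome(service=service)
    browser.get('https://www.baidu.com')
    time.sleep(2)  # crude wait for the page to load

    # find_element_by_class_name was removed in newer Selenium 4 releases;
    # find_element(By.CLASS_NAME, ...) is the replacement
    logo = browser.find_element(By.CLASS_NAME, 'index-logo-src')
    print(logo)
    print(logo.get_attribute('src'))
    time.sleep(2)

    # Close the browser
    # browser.close()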
Output:
    <selenium.webdriver.remote.webelement.WebElement (session="e95b18c43a330745af019e0041f0a8a4", element="7dad5fc0-610b-45b6-b543-9e725ee6cc5d")>
    https://www.baidu.com/img/PCtm_d9c8750bed0b3c7d089fa7d55720d6cf.png
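The src value in the HTML is protocol-relative (//www.baidu.com/...), while the printed value starts with https://. The value returned for src has been resolved against the page address. When working with raw HTML outside the browser, the same resolution can be sketched with the standard library's urljoin. This is a minimal illustration, assuming the page URL is https://www.baidu.com. The helper name resolve_src is hypothetical and not part of Selenium.

```python
from urllib.parse import urljoin


def resolve_src(page_url: str, src: str) -> str:
    """Resolve a possibly relative src/href against the page URL (hypothetical helper)."""
    return urljoin(page_url, src)


# Values taken from the logo example: the page address and the raw src attribute
page_url = "https://www.baidu.com"
raw_src = "//www.baidu.com/img/PCtm_d9c8750bed0b3c7d089fa7d55720d6cf.png"

resolved = resolve_src(page_url, raw_src)
print(resolved)
```

The printed URL matches what logo.get_attribute('src') returned in the output.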
Getting text and links
Taking the hot-search list as an example, we read an entry's text and link:
<a class="title-content tag-width c-link c-font-medium c-line-clamp1" href="https://www.baidu.com/s?cl=3&tn=baidutop10&fr=top1000&wd=%E5%90%84%E5%9C%B0%E8%B4%AF%E5%BD%BB%E5%8D%81%E4%B9%9D%E5%B1%8A%E5%85%AD%E4%B8%AD%E5%85%A8%E4%BC%9A%E7%B2%BE%E7%A5%9E%E7%BA%AA%E5%AE%9E&rsv_idx=2&rsv_dl=fyb_n_homepage&sa=fyb_n_homepage&hisfilter=1" target="_blank"><span class="title-content-index c-index-single c-index-single-hot1">1</span><span class="title-content-title">各地贯彻十九届六中全会精神纪实</span></a>
To get the entry's text, use the text property. It is read directly as an attribute of the element, not called as a method.
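    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    import time

    # Raw string so the backslashes in the Windows path are not treated as escapes
    service = Service(r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe")
    browser = webdriver.Chrome(service=service)
    browser.get('https://www.baidu.com')
    time.sleep(2)  # crude wait for the page to load

    # First entry of the hot-search list, located by XPath
    hot_item = browser.find_element(By.XPATH, '//*[@id="hotsearch-content-wrapper"]/li[1]/a')
    # Equivalent CSS selector:
    # hot_item = browser.find_element(By.CSS_SELECTOR, '#hotsearch-content-wrapper > li:nth-child(1) > a')
    print(hot_item.text)
    print(hot_item.get_attribute('href'))
    time.sleep(2)

    # Close the browser
    # browser.close()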
Output:
    1各地贯彻十九届六中全会精神纪实
    https://www.baidu.com/s?cl=3&tn=baidutop10&fr=top1000&wd=%E5%90%84%E5%9C%B0%E8%B4%AF%E5%BD%BB%E5%8D%81%E4%B9%9D%E5%B1%8A%E5%85%AD%E4%B8%AD%E5%85%A8%E4%BC%9A%E7%B2%BE%E7%A5%9E%E7%BA%AA%E5%AE%9E&rsv_idx=2&rsv_dl=fyb_n_homepage&sa=fyb_n_homepage&hisfilter=1
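The text of the link concatenates the rank from the first span and the title from the second, and the href carries the search keyword percent-encoded in its wd parameter. A scraper usually wants these as separate fields. The following is a minimal sketch using only the standard library. The helper names split_rank_and_title and search_keyword are hypothetical, and the inputs are the values printed in the output.

```python
import re
from urllib.parse import urlsplit, parse_qs


def split_rank_and_title(text: str):
    """Split text such as '1Title' into (rank, title); rank is None if no leading digits."""
    match = re.match(r"(\d+)\s*(.*)", text, re.S)
    if match:
        return int(match.group(1)), match.group(2)
    return None, text


def search_keyword(href: str):
    """Return the decoded 'wd' query parameter of a search URL, or None if absent."""
    values = parse_qs(urlsplit(href).query).get("wd")
    return values[0] if values else None


text = "1各地贯彻十九届六中全会精神纪实"
href = (
    "https://www.baidu.com/s?cl=3&tn=baidutop10&fr=top1000"
    "&wd=%E5%90%84%E5%9C%B0%E8%B4%AF%E5%BD%BB%E5%8D%81%E4%B9%9D%E5%B1%8A"
    "%E5%85%AD%E4%B8%AD%E5%85%A8%E4%BC%9A%E7%B2%BE%E7%A5%9E%E7%BA%AA%E5%AE%9E"
    "&rsv_idx=2&rsv_dl=fyb_n_homepage&sa=fyb_n_homepage&hisfilter=1"
)

rank, title = split_rank_and_title(text)
keyword = search_keyword(href)
print(rank, title)
print(keyword)
```

Splitting on leading digits is only a heuristic: a title that itself begins with a digit would be split incorrectly. Locating the inner span with class title-content-title and reading its text avoids that ambiguity.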
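Getting other properties
Besides attributes and text, an element also exposes properties such as id, location, tag name and size. Note that logo.id is Selenium's internal reference to the element within the session, not the HTML id attribute (s_lg_img here), which is read with get_attribute('id').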
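    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    import time

    # Raw string so the backslashes in the Windows path are not treated as escapes
    service = Service(r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe")
    browser = webdriver.Chrome(service=service)
    browser.get('https://www.baidu.com')
    time.sleep(2)  # crude wait for the page to load

    logo = browser.find_element(By.CLASS_NAME, 'index-logo-src')
    print(logo.id)        # Selenium's internal element reference
    print(logo.location)  # top-left corner as {'x': ..., 'y': ...}
    print(logo.tag_name)  # HTML tag name
    print(logo.size)      # {'height': ..., 'width': ...}

    # Close the browser
    # browser.close()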
Output:
    6af39c9b-70e8-4033-8a74-7201ae09d540
    {'x': 490, 'y': 46}
    img
    {'height': 129, 'width': 270}
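The location and size dictionaries together describe the element's bounding box, which is useful, for example, when cropping the element out of a screenshot. The sketch below derives the box and its center from the two dictionaries printed in the output. The helper names bounding_box and element_center are hypothetical and not part of Selenium.

```python
def bounding_box(location: dict, size: dict) -> tuple:
    """Return (left, top, right, bottom) from Selenium-style location and size dicts."""
    left, top = location["x"], location["y"]
    return (left, top, left + size["width"], top + size["height"])


def element_center(location: dict, size: dict) -> tuple:
    """Return the (x, y) center point of the element."""
    return (location["x"] + size["width"] / 2, location["y"] + size["height"] / 2)


# Values copied from the logo output
location = {'x': 490, 'y': 46}
size = {'height': 129, 'width': 270}

box = bounding_box(location, size)
center = element_center(location, size)
print(box)     # (490, 46, 760, 175)
print(center)  # (625.0, 110.5)
```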