Python: lxml XPath to extract content


The code below is able to extract the P/E from the first Reuters link below. However, the method is not robust: the webpage for another stock has 2 fewer lines, which results in a shift of the data. How can I overcome this problem? I would like to point straight to the P/E part of the page to extract the data, but I do not know how to do it.

Link 1: http://www.reuters.com/finance/stocks/financialhighlights?symbol=myeg.kl
Link 2: http://www.reuters.com/finance/stocks/financialhighlights?symbol=annj.kl

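    import requests
    from lxml import html

    page2 = requests.get('http://www.reuters.com/finance/stocks/financialhighlights?symbol=myeg.kl')
    treea = html.fromstring(page2.content)
    # Collects the text of every td that has a class attribute
    tree4 = treea.xpath('//td[@class]/text()')
    # Fragile: the position shifts when the page has fewer rows
    pe = tree4[37]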

This is the part I wish the code could extract, so that changes elsewhere on the webpage do not affect the result:

    <tr class="stripe">
        <td>p/e ratio (ttm)</td>
        <td class="data">36.79</td>
        <td class="data">25.99</td>
        <td class="data">21.70</td>
    </tr>

Use the text to find the first td, then extract its sibling td's:

 treea.xpath('//td[contains(.,"p/e ratio")]/following-sibling::td/text()') 

That will work regardless of where the row sits on the page:

    In [8]: page2 = requests.get('http://www.reuters.com/finance/stocks/financialhighlights?symbol=myeg.kl')

    In [9]: treea = html.fromstring(page2.content)

    In [10]: tree4 = treea.xpath('//td[contains(.,"p/e ratio")]/following-sibling::td/text()')

    In [11]: print(tree4)
    ['36.79', '25.99', '21.41']

    In [12]: page2 = requests.get('http://www.reuters.com/finance/stocks/financialhighlights?symbol=annj.kl')

    In [13]: treea = html.fromstring(page2.content)

    In [14]: tree4 = treea.xpath('//td[contains(.,"p/e ratio")]/following-sibling::td/text()')

    In [15]: print(tree4)
    ['--', '25.49', '17.30']
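To try the same idea without hitting the network, here is a minimal sketch that runs the XPath against an inline HTML string. The `SAMPLE` markup is made up for illustration (modelled on the row from the question), and `row_values` and `to_float` are hypothetical helper names, not part of lxml. The label is passed in as an XPath variable, which lxml supports through keyword arguments to `xpath()`.

```python
from lxml import html

# Hypothetical sample markup, modelled on the row shown in the question.
SAMPLE = """
<table>
  <tr><td>some other row</td><td class="data">1.00</td></tr>
  <tr class="stripe">
    <td>p/e ratio (ttm)</td>
    <td class="data">36.79</td>
    <td class="data">--</td>
    <td class="data">21.70</td>
  </tr>
</table>
"""


def row_values(tree, label):
    """Return the text of each td that follows the td containing `label`."""
    cells = tree.xpath(
        '//td[contains(., $label)]/following-sibling::td/text()',
        label=label,
    )
    return [cell.strip() for cell in cells]


def to_float(text):
    """Convert a cell to float; placeholders such as '--' become None."""
    try:
        return float(text.replace(',', ''))
    except ValueError:
        return None


tree = html.fromstring(SAMPLE)
raw = row_values(tree, 'p/e ratio')
values = [to_float(v) for v in raw]
print(raw)
print(values)
```

Two things to keep in mind: `contains()` in XPath 1.0 is case-sensitive, so the label has to match the page text exactly, and a label that is not on the page simply yields an empty list rather than an error, so it is worth checking for that before indexing into the result.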
