XPath and Python

Let's try some scraping. We'll start with the standard combination: urllib and lxml.

Code

#!/usr/bin/env python3
# coding: UTF-8

from urllib import request
from lxml import html

URL = "http://ellidanus-b.ddo.jp/illyasviel/"
html2 = request.urlopen(URL).read()  # raw response body (bytes)
print(type(html2))
data3 = html.fromstring(html2)  # never pass str(html2) here; the text gets garbled

# Print the text and href of every link whose text is longer than 3 characters
ttls = data3.xpath("//a")
for x in ttls:
    if isinstance(x.text, str) and len(x.text) > 3:
        #print(type(x))
        xname = x.text
        xlink = x.get('href')  # None if the <a> has no href, instead of a KeyError
        print(xname, "\t", xlink)

Output

<class 'bytes'>
FrontPage        http://ellidanus-b.ddo.jp/illyasviel/?plugin=related&page=FrontPage
バックアップ     http://ellidanus-b.ddo.jp/illyasviel/?cmd=backup&page=FrontPage
リロード         http://ellidanus-b.ddo.jp/illyasviel/
単語検索         http://ellidanus-b.ddo.jp/illyasviel/?cmd=search
最終更新         http://ellidanus-b.ddo.jp/illyasviel/?RecentChanges
ログイン         http://ellidanus-b.ddo.jp/illyasviel/?plugin=loginform&pcmd=login&page=FrontPage
→戦国メモ       http://ellidanus-b.ddo.jp/illyasviel/quiz01/
→バイク専科     http://ellidanus-b.ddo.jp/motorcycles/
今さらPython     http://ellidanus-b.ddo.jp/illyasviel/?%E4%BB%8A%E3%81%95%E3%82%89Python
:
:

Watch out for the line with the comment. I actually made the following mistake:

data3 = html.fromstring(str(html2))

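Do that and the text comes out looking like mojibake. The reason is that str() on a bytes object does not decode anything; it returns the repr, i.e. b'...' with every non-ASCII byte spelled out as a \x escape, and lxml then parses that literal text.
I didn't notice, and lost a lot of time on it.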
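To see what went wrong without touching the network, here is a minimal sketch using only the standard library. The sample bytes and the names raw, wrong and right are made up for illustration; they are not from the original script.

```python
# Hypothetical sample: the UTF-8 bytes of a tiny HTML fragment,
# standing in for what request.urlopen(URL).read() returns.
raw = "<p>今さらPython</p>".encode("utf-8")

wrong = str(raw)              # repr of the bytes object, nothing is decoded
right = raw.decode("utf-8")   # real decoding, only if you actually need a str

print(wrong)
print(right)
```

The first print shows the b'...' wrapper and the \x escapes, which is exactly the "garbled" text that ends up in the parsed tree.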

Reading from a file

To read from a local file instead of the network, do this:

from lxml import html

fn = "0.html"
with open(fn, mode='rb') as f:  # read raw bytes; no decode/encode round trip
    html2 = f.read()

print(type(html2))
data3 = html.fromstring(html2)  # do not wrap in str(); the text gets garbled

ttls = data3.xpath("//dl//dd//a")
:
:

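Note what you feed to fromstring: pass the bytes exactly as you read them (a correctly decoded str also works), but never the result of calling str() on bytes.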
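If you already know the encoding of the bytes, one option is to tell lxml explicitly instead of relying on detection. This is only a sketch: the sample markup is made up, and choosing UTF-8 here is an assumption about the input, not something the original page states.

```python
from lxml import html

# Hypothetical sample: UTF-8 bytes with no charset declaration in the markup.
raw = "<html><body><p>バイク専科</p></body></html>".encode("utf-8")

# Assumption: we know the input is UTF-8, so we say so explicitly.
parser = html.HTMLParser(encoding="utf-8")
doc = html.fromstring(raw, parser=parser)

text = doc.xpath("//p")[0].text
print(text)
```

The bytes still go in untouched; only the parser is told how to interpret them.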

Also, when you want to match on something like div id=..., write it like this:

for d in data3.xpath('//div[@id="novel_honbun"]/p'):
   :
   :
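Here is a small self-contained version of that id match, as a sketch you can run offline. The sample HTML and the names SAMPLE, doc and paragraphs are invented for illustration; only the id novel_honbun and the XPath expression come from the text above.

```python
from lxml import html

# Hypothetical sample page with two divs; only one has the id we want.
SAMPLE = b"""
<html><body>
<div id="other"><p>skip me</p></div>
<div id="novel_honbun"><p>first line</p><p>second line</p></div>
</body></html>
"""

doc = html.fromstring(SAMPLE)

# Select only the <p> elements that are direct children of the matching div.
paragraphs = [p.text_content() for p in doc.xpath('//div[@id="novel_honbun"]/p')]
print(paragraphs)
```

Using single quotes around the Python string avoids having to backslash-escape the double quotes inside the XPath predicate.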

I wrote it the same way with Ruby + Nokogiri, so if you've done it over there you'll recognize this right away, won't you?
