Notes on Scraping

2024-11-30#Tech

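Introduction

Scraping is the technique or process of automatically collecting data from websites using scripts or similar tools.

It is used for purposes such as the following.

Collecting product information and prices from shopping sites
Collecting comments and similar data from social media and blogs

Because scraping fetches data automatically, it can put load on the target server.
For that reason, follow the rules of each site, such as its terms of service and robots.txt.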
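One way to check a site's rules before fetching is to read its robots.txt. The following is a minimal sketch using only the Python standard library. The robots.txt content, the user agent name "my-scraper", and the example.com URLs are all hypothetical; a real script would download the file from the target site instead of using an inline string.

```python
import urllib.robotparser

# Hypothetical robots.txt content (assumption for illustration);
# a real script would fetch it from https://<site>/robots.txt
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

# Parse the rules from the text above instead of from the network
robots = urllib.robotparser.RobotFileParser()
robots.parse(SAMPLE_ROBOTS_TXT.splitlines())

# "my-scraper" is a made-up user agent name
allowed = robots.can_fetch("my-scraper", "https://example.com/items/1")
blocked = robots.can_fetch("my-scraper", "https://example.com/private/data")

# Seconds the site asks crawlers to wait between requests (None if not set)
delay = robots.crawl_delay("my-scraper")

print(allowed, blocked, delay)
```

If can_fetch returns False, the page should be skipped. If a crawl delay is set, waiting that long between requests (for example with time.sleep) keeps the load on the server down.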

Static sites

For sites where the server returns complete HTML, fetching that HTML over HTTP and parsing it is the usual way to collect data.

This time we use the following Python libraries.

requests
beautifulsoup4

As an example, the script retrieves the site's title.

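Script

import requests
from bs4 import BeautifulSoup

url = "https://www.bkds-hi.com"
# Fetch the page; the timeout keeps the script from hanging on a slow server
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop here on a 4xx/5xx response

# Parse the returned HTML and print the text of the <title> element
soup = BeautifulSoup(response.text, "html.parser")
print(soup.find("title").text)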
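For reference, the same title lookup can be approximated with only the standard library, which shows roughly what the parsing step involves. This is a minimal sketch, not how BeautifulSoup works internally. TitleParser and SAMPLE_HTML are illustrative names that are not part of the original script, and the HTML is an inline string rather than a real response.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Start collecting only for the first <title> we see
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""

    def handle_data(self, data):
        # Text may arrive in several chunks, so append each one
        if self._in_title:
            self.title += data

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False


# Hypothetical HTML standing in for response.text
SAMPLE_HTML = "<html><head><title>Example Shop</title></head><body><p>Hello</p></body></html>"

title_parser = TitleParser()
title_parser.feed(SAMPLE_HTML)
print(title_parser.title)
```

For anything beyond a single element, BeautifulSoup is usually much more convenient because it handles malformed HTML and offers search helpers such as find.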

Dynamic sites

Sites whose content is changed by JavaScript, or that require login, are generally scraped with WebDriver.

This time we use the following Python library.

selenium

In addition to the Python library, Chrome and a matching driver need to be installed.

# Install Chrome
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo apt install ./google-chrome-stable_current_amd64.deb

# Install ChromeDriver
# The version in the URL should match the installed Chrome (check with: google-chrome --version)
wget https://storage.googleapis.com/chrome-for-testing-public/129.0.6668.89/linux64/chromedriver-linux64.zip
unzip chromedriver-linux64.zip
sudo mv chromedriver-linux64/chromedriver /usr/local/bin/

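Script

from selenium import webdriver

# Launch Chrome through ChromeDriver
driver = webdriver.Chrome()
try:
    driver.get("https://bkds-hi.com")
    print(driver.title)
finally:
    # Always close the browser, even if loading the page fails
    driver.quit()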

Running the script above opens a browser window, as shown below.

[Image: selenium_gui]

Data collection does not need a GUI, so the headless option is often applied.

from selenium import webdriver

# Run Chrome without opening a window
options = webdriver.ChromeOptions()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://bkds-hi.com")
    print(driver.title)
finally:
    driver.quit()
