# Web crawler - basic tools
<font size="1pt">This Jupyter notebook was created for the course Introduction to Data Science at the University of Ljubljana. @szitnik</font>

## Fetching data from the Web

The Python standard library already includes the [*urllib* package](https://docs.python.org/3/library/urllib.html), which enables easy communication using HTTP requests. For those who prefer a more feature-rich library, take a look at the [*Requests* library](https://requests.readthedocs.io/en/master/). It is a high-level HTTP library, recommended in the Python documentation as a higher-level HTTP client interface, and it supports multiple connections, session handling, proxies, etc.

Let's retrieve [http://evem.gov.si](http://evem.gov.si) using the *urllib* library:

In [None]:
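# Import the submodule explicitly; a bare "import urllib" does not guarantee urllib.request is loaded
import urllib.request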

WEB_PAGE_ADDRESS = "http://evem.gov.si"

print(f"Retrieving web page URL '{WEB_PAGE_ADDRESS}'")

request = urllib.request.Request(
    WEB_PAGE_ADDRESS, 
    headers={'User-Agent': 'fri-ieps-TEST'}
)

with urllib.request.urlopen(request) as response: 
    html = response.read().decode("utf-8")
    print(f"Retrieved Web content: \n\n'\n{html}\n'")
    

We can observe that the Web content we receive is not what we expected. The HTML includes JavaScript (JS) code that a Web browser would normally execute. The JS code above would *redirect* the browser to [*http://evem.gov.si/evem/drzavljani/zacetna.evem*](http://evem.gov.si/evem/drzavljani/zacetna.evem).
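To make the redirect visible without a browser, a crawler can at least try to spot it in the static HTML. The following is a minimal sketch under stated assumptions: the helper `find_js_redirect`, the pattern `JS_REDIRECT_PATTERN`, and the `SAMPLE_HTML` body are hypothetical illustrations, not the actual response of the eVem server, and the pattern only covers the simplest `location = "..."` idioms.

```python
import re
from urllib.parse import urljoin

# Matches simple assignments such as window.location = "..." or location.href = '...'.
# Redirects built dynamically (or via location.replace(...)) are NOT covered.
JS_REDIRECT_PATTERN = re.compile(
    r"""(?:window\.|document\.)?location(?:\.href)?\s*=\s*["']([^"']+)["']"""
)


def find_js_redirect(html, base_url):
    """Return the absolute redirect target found in the HTML, or None."""
    match = JS_REDIRECT_PATTERN.search(html)
    if match is None:
        return None
    # Resolve relative targets against the URL the page was fetched from.
    return urljoin(base_url, match.group(1))


# Hypothetical page body, standing in for the kind of response described above.
SAMPLE_HTML = """
<html><head><script type="text/javascript">
  window.location = "/evem/drzavljani/zacetna.evem";
</script></head><body></body></html>
"""

print(find_js_redirect(SAMPLE_HTML, "http://evem.gov.si"))
```

Pattern matching of this kind is brittle, which is why the general solution is to let a real browser execute the JavaScript.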

To retrieve the content a user would actually see, a library must execute JavaScript code automatically. For this purpose, such libraries generally automate a real browser, such as Google Chrome or Firefox. An example of such a library is [Selenium](https://www.selenium.dev/) ([Python API](https://selenium-python.readthedocs.io/)).

Selenium supports multiple browser drivers, so let's download and use [ChromeDriver](https://chromedriver.chromium.org/downloads). After that, we can try to visit the eVem Web page again, this time using Selenium.

In [None]:
!conda install -y selenium webdriver-manager

In [None]:
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options

# Alternative: let webdriver-manager download a matching ChromeDriver
#from webdriver_manager.chrome import ChromeDriverManager
#
#driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
#driver.get("https://www.google.com")


# Adjust this path to the location of ChromeDriver on your machine
WEB_DRIVER_LOCATION = "/Users/slavkoz/Downloads/chromedriver-mac-arm64/chromedriver"
WEB_PAGE_ADDRESS = "https://spot.gov.si/spot/drzavljani/zacetna.evem"
TIMEOUT = 5

chrome_options = Options()
# Uncomment the following line to run Chrome without a visible browser window
#chrome_options.add_argument("--headless")

# Set a custom user agent
chrome_options.add_argument("user-agent=fri-ids-TEST")

print(f"Retrieving web page URL '{WEB_PAGE_ADDRESS}'")
driver = webdriver.Chrome(service=Service(WEB_DRIVER_LOCATION), options=chrome_options)
driver.get(WEB_PAGE_ADDRESS)

# Fixed delay that gives the Web page time to render (an explicit wait is more robust)
time.sleep(TIMEOUT)

html = driver.page_source

print(f"Retrieved Web content (truncated to first 900 chars): \n\n'\n{html[:900]}\n'\n")

page_msg = driver.find_element(By.CSS_SELECTOR, ".inside-text")

print(f"Web page message: '{page_msg.text}'")

# quit() closes all windows and also stops the ChromeDriver process
driver.quit()

Also check the *[WebDriverWait](https://en.wikipedia.org/wiki/Selenium_(software)#Selenium_WebDriver)* object, which waits until the desired Web page has loaded instead of sleeping for a fixed time. Get familiar with the different [options for locating elements](https://selenium-python.readthedocs.io/locating-elements.html).
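The idea behind such an explicit wait is to poll a condition until it holds or a timeout expires. The sketch below illustrates that pattern in plain Python; it is not Selenium's implementation, and `wait_until`, `page_ready`, and `checks` are hypothetical names introduced only for this illustration.

```python
import time


def wait_until(condition, timeout=5.0, poll_interval=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass.

    Returns the truthy value, or raises TimeoutError once the deadline is reached.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for a page that becomes "ready" on the third check (hypothetical).
checks = {"count": 0}


def page_ready():
    checks["count"] += 1
    return "loaded" if checks["count"] >= 3 else None


print(wait_until(page_ready, timeout=2.0, poll_interval=0.01))
```

With Selenium itself, the same role is played by `WebDriverWait(driver, TIMEOUT).until(EC.presence_of_element_located((By.CSS_SELECTOR, ".inside-text")))`, where `WebDriverWait` is imported from `selenium.webdriver.support.ui` and `EC` is `selenium.webdriver.support.expected_conditions`. Unlike a fixed `time.sleep`, this returns as soon as the element is present.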

The code above outputs a deprecation warning. You should update the code to use *[webdriver-manager](https://pypi.org/project/webdriver-manager/)* (note that there is an issue on the M1 processor architecture due to driver renaming).