Scenario:
I work in a company and my paylips are downloadable in an aspx portal. One by one, not in block.
I needed them all, for burocaracy reasons and in order to archive them.
How: py, selenium. I tried with beutifulsoup but it didn't work.
Explenation and Code
Web Driver
I used webdriver Chrome with some options in order to save the pdf-files when browser opens it. Look pref in code below.
Creating a class
class PaylipsScaper:
# Init
def __init__(self, username, password):
self.username = username
self.password = password
# Options
chrome_options = webdriver.ChromeOptions()
prefs = {
"plugins.always_open_pdf_externally": True,
"download.default_directory": "C:\\tmp", # folder save files
"download.prompt_for_download": False,
"download.directory_upgrade": True,
"safebrowsing.enabled": True
}
chrome_options.add_experimental_option("prefs",prefs)
chrome_options.headless = True # If True hide browser
self.driver = webdriver.Chrome(executable_path='chromedriver.exe', options=chrome_options)
Login
The login phase is quite easy, just select by id the area and insert the value. I just kept attention to iframe, because in that case you have to use switch_to.frame before.
# Manage login page
def login(self, url):
driver = self.driver
driver.get(url)
driver.switch_to.frame("FunArea")
username = driver.find_element_by_id("login")
password = driver.find_element_by_id("pwd")
username.send_keys(self.username)
time.sleep(1)
password.send_keys(self.password)
driver.find_element_by_id("CmdInvia").click()
Loop table using XPATH
I created a class that wrap the selenium driver in order to keep all cleans.
I just reproduced the clicks done by myself.
At the beginning I tried with CSS selector but for the structure of the pages was a better solution to use XPATH
(to get the XPATH with Chrome see here)
By the way I don't like the time.sleep but it was useful to avoid navigations problems during the process.
# Inside PaylipsScaper class
def get_num_rows(self, num_rows = 1):
driver = self.driver
self.click_to_payslips_area()
num_rows = len(driver.find_elements_by_xpath("//table[@id='ContTab']/tbody/tr/td/div/table/tbody/tr"))
return num_rows
[...other stuff...]
try:
bot = PaylipsScaper(username, password)
bot.login(url_website)
wait = WebDriverWait(bot.driver, 10)
num_rows = bot.get_num_rows()
for row in range(1,num_rows+1):
paylip_year = bot.get_val_in_cedolino_row(row, 4)
paylip_month = bot.get_val_in_cedolino_row(row, 5)
paylip_type = bot.get_val_in_cedolino_row(row, 7)
bot.driver.execute_script("arguments[0].click();", WebDriverWait(bot.driver, 20).until(EC.element_to_be_clickable((By.XPATH, "/html/body/form/div/table/tbody/tr/td/div/table/tbody/tr["+str(row)+"]/td[10]/img"))))
time.sleep(2)
filepdf= dirpath + "\\*.pdf"
list_of_files = glob.glob(filepdf)
file_name = max(list_of_files, key=os.path.getctime)
current_paylip = Paylip(paylip_year, paylip_month, paylip_type, file_name)
bot.rename_and_move (current_paylip)
print("Downloaded:")
print(current_paylip)
Save pdf file in folder and rename it
It's quite a brute solution anyway I got the last pdf saved in a folder and renamed it with the informations from the website.
Then I moved the files in sub-folders by year.
def rename_and_move(self, urrent_paylip):
if current_paylip.paylip_month == "" :
new_file_name='Cud_'+str(current_paylip.paylip_year)+'_'+str(current_paylip.paylip_type).replace(" ", "_").replace("Completo", "").replace("NORMALE", "")+'.pdf'
elif "TREDICESIMA" in current_paylip.paylip_type:
new_file_name = 'Cedolino_'+str(current_paylip.paylip_year)+'_'+str(current_paylip.paylip_month)+'_Tredicesima.pdf'
else:
new_file_name = 'Cedolino_'+str(current_paylip.paylip_year)+'_'+str(current_paylip.paylip_month)+'.pdf'
print(new_file_name)
new_file_name = os.path.join(dirpath, new_file_name)
# Rename file and move it in the year-directory
os.rename(current_paylip.file_name, new_file_name)
current_paylip.file_name = new_file_name
# Check if path with year directory exist otherwise create it
dirin=os.path.split(new_file_name)
newdir=dirin[0]+'\\'+current_paylip.paylip_year
if os.path.exists(newdir)==False:
# Create directory
os.mkdir(newdir)
# Move file in the year-directory
if os.path.exists(newdir+"\\"+dirin[1]):
# If file already exist, delete it
os.remove(newdir+"\\"+dirin[1])
shutil.move (current_paylip.file_name,newdir+"\\"+dirin[1])
return
Final situation
Got it. I have a folder with subfolders by year and in each one all the paylips with a standard name format.
What I learned:
- Use of Selenium in py.
- Simple automation can save a lot of time and avoid manual boring tasks.
- How to write my first article here.(it's a personal task so not so useful for you but better than nothing after all)
Future improvements:
- input parameters
- (re)try to use css selector instead of xpath selector
- (re)try to use BeautifulSoup
- save last paylips saved in order, next run, to save only the not already saved paylips
- read pdf and report data in file(eg google sheets)
Of course the code is useful just for me and my colleagues. Anyway I hope that the idea and process can be a good idea to someone else.
Top comments (0)