Python webscraping for beginners
Introduction
This tutorial is for people who have done almost no programming before. It will teach you how to send data to, and get data from, websites using Python.
This has many uses in security. In this example, we will be using it to abuse a job application website. THIS IS OKAY BECAUSE WE HAVE PERMISSION FROM THE WEBSITE OWNER (us). OTHERWISE, THIS WOULD BE ILLEGAL.
Setting up Python
You might have Python installed already. To check:
- On Windows: from the Start menu, search for “cmd” and open it. Then type
python --version
and press Enter.
- On Mac/Linux: open a terminal (by searching for it in the search bar) and type
python --version
and press Enter. (On some Mac/Linux systems the command is python3 instead of python.)
For the first part of the tutorial we will be using the interactive Python shell. To open it, type python in the terminal and press Enter.
If you do not have Python installed, you can use Replit to run Python code in your browser. To do this, go to the Replit website and sign up. Then click on the “new repl” button and select “python” from the list of languages. For the first part of the tutorial we will be using the interactive Python shell. To open it, click on the “shell” button in the top right.
Python Programming basics
Please run the code examples in the interactive python shell. After each example, try changing the code to see what happens. Even though it takes longer, typing the code out yourself is better than copying and pasting. It will help you remember the code better.
Whitespace
In Python, whitespace at the start of a line (indentation) is used to group lines into blocks of code. You can use tabs or spaces, but you must be consistent. If you mix tabs and spaces, you will get an error. If you choose to use spaces, you should use 4 spaces for each level of indentation.
Comments
Comments are lines that are not executed. They are used to explain what the code does. In Python, comments start with a #.
# This is a comment
Printing
You can output data to the console using the print function.
print("Hello, world!")
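or, to print the value of a variable (this only works once server_url has been created, as shown in the Variables section):
print(server_url)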
Functions
Sometimes you will want to reuse code. You can do this by creating a function. In python, you can call a function by typing the name of the function, followed by parentheses. Any information you want to pass to the function goes inside the parentheses, separated by commas. The print function is an example of a function.
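As a rough sketch of what creating your own function can look like, the example below defines two small functions and then calls them. The names greet and add are made up for this illustration; they are not built into Python.

```python
# Define a function called greet that takes one piece of information (name).
def greet(name):
    # The indented lines are the body of the function.
    # return hands a value back to whoever called the function.
    return "Hello, " + name + "!"


# Call the function by typing its name followed by parentheses.
print(greet("world"))


# Functions can take several arguments, separated by commas.
def add(a, b):
    return a + b


print(add(2, 3))
```

Try calling greet with your own name, or add with different numbers.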
Variables
Variables are used to store data. In python, you can create a variable by typing the name of the variable, followed by an equals sign, followed by the value you want to store in the variable.
Variable names must:
- Start with a letter or underscore
- Contain only letters, numbers, and underscores
server_url = "https://jobs.luhack.uk"
Normally, you should use descriptive variable names. Avoid single letter variable names like x and y. The only exception is when you are using a loop, where i is commonly used as a variable name.
In Python, there is a strong convention to use snake_case for variable names. This means that variable names should be lowercase, with words separated by underscores. (This might not seem important, but it will make your life much easier.)
Variable Types
There are many different types of variables in python. Some of the most common are:
- Strings: Text data ("hello", "123")
- Integers: Whole numbers (1, 2, 3)
- Lists (called arrays in some other languages): Ordered collections of strings, integers, or similar, used to store multiple values ([1, 2, 3], ["a", "b", "c"])
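If you are ever unsure what type a value is, the built-in type function will tell you. A small sketch:

```python
# type() reports what kind of value something is.
print(type("hello"))    # <class 'str'>  (a string)
print(type(123))        # <class 'int'>  (an integer)
print(type([1, 2, 3]))  # <class 'list'> (a list)

# "123" is a string, not a number, because of the quotes.
print(type("123"))      # <class 'str'>
```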
Casting
You can convert between different types of variables using casting. In Python, you can cast a variable to a string using the str function, to an integer using the int function, and to a list using the list function.
print(str(123))
print(int("123"))
print(list("abc"))
Concatenation
Concatenation is used to combine two strings. In Python, you can concatenate two strings using the + operator.
print("Hello" + "world")
If you try to concatenate a string and an integer, you will get an error. You can fix this by casting the integer to a string.
print("Hello" + str(123))
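Two details are easy to miss here: + does not add a space for you, and the error you get when mixing a string and an integer is a TypeError. The sketch below shows both; the variable name job_count is just an example.

```python
# + joins strings exactly as they are, so add the space yourself.
greeting = "Hello" + " " + "world"
print(greeting)  # Hello world

job_count = 3
# "Jobs found: " + job_count would stop with a TypeError,
# because job_count is an integer, not a string.
message = "Jobs found: " + str(job_count)
print(message)   # Jobs found: 3
```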
Comparisons
Comparisons are used to compare two values. In python, the most common comparisons are:
- ==: Equal to
- !=: Not equal to
- >: Greater than
- <: Less than
You can try these out in the interactive python shell.
print(1 == 1)
print(1 != 2)
print(1 < 2)
print(1 > 2)
Conditionals
NOTE: you may find it easier to write your code in a file and then run it using python filename.py from here on.
Conditionals are used to run code based on whether a condition is true or false. In Python, you can use the if and else keywords to create conditionals.
server_url = "https://jobs.luhack.uk"
if server_url == "https://jobs.luhack.uk":
    print("Correct URL")
else:
    print("Incorrect URL")
Loops
Loops are used to run code multiple times. In Python, you can use the for keyword to create loops. In the example below, range(5) produces the numbers 0 to 4.
for i in range(5):
    print(i)
Using Conditionals to Control Loops
You can use conditionals to control loops. In Python, you can use the break keyword to stop a loop, and the continue keyword to skip the rest of the code in the loop and start the next iteration.
for i in range(5):
    if i == 3:
        break
    print(i)
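For comparison, here is a small sketch of the same loop using continue instead of break. Only the value 3 is skipped; the loop then carries on with 4.

```python
for i in range(5):
    if i == 3:
        # Skip the print below for this value only, then carry on with 4.
        continue
    print(i)
# Prints 0, 1, 2 and 4 (3 is skipped, but the loop keeps going).
```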
List Operations
As lists have an order, you can access individual elements using their index. The index starts at 0. (In computer science, we often start counting at 0, not 1.)
my_list = ["a", "b", "c"]
print(my_list[0])
You can also add elements to a list using the append function.
my_list = ["a", "b", "c"]
my_list.append("d")
print(my_list)
You can also remove elements from a list using the remove function.
my_list = ["a", "b", "c"]
my_list.remove("b")
print(my_list)
You can also count the number of elements in a list using the len function.
my_list = ["a", "b", "c"]
print(len(my_list))
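Lists and loops work well together. As a sketch, the example below goes through a list in two ways: once item by item, and once by index using len.

```python
my_list = ["a", "b", "c"]

# Loop over the items directly.
for item in my_list:
    print(item)

# Loop over the indexes 0, 1, 2 when you also need the position.
for i in range(len(my_list)):
    print(str(i) + ": " + my_list[i])
```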
Strings to Lists
You can split a string into a list using the split function.
my_string = "a,b,c"
my_list = my_string.split(",")
print(my_list)
Lists to Strings
You can join a list into a string using the join function.
my_list = ["a", "b", "c"]
my_string = ",".join(my_list)
print(my_string)
Libraries
Sometimes you will need to use code that is not built into Python. You can do this by importing a library. In Python, you can import a library using the import keyword.
import random
print(random.randint(1, 10))
(Try running this code multiple times)
Dot Notation
You might notice that we use random.randint instead of just randint. This is because randint is a function that is part of the random library. In Python, you can access functions that are part of a library using dot notation.
We will not be getting into this in this tutorial. We will give you the code you need to use, just take it as given for now.
Actually building a web scraper
Now that you know the basics of Python, we can start building a web scraper. We will be using the requests library to find and then apply for jobs on LUHack’s job board.
Installing the requests library
The requests library does not come with Python itself, but it may already be installed. To check, run the following code in the interactive Python shell.
import requests
If you get an error, you will need to install the requests library. You can do this by running the following command in the terminal.
pip install requests
Getting the job board
To get the job board, we will use the requests.get function. This function takes a URL as an argument and returns a response object.
import requests
server_url = "https://jobs.luhack.uk"
response = requests.get(server_url)
print(response.text)
# 200 means that the request was successful, 404 means that the page was not found, etc.
print(response.status_code)
View a job listing
Given that the first job is listed at https://jobs.luhack.uk/0, what is the first job listed on the board?
Getting fields from the job listing
Job listings always return the same fields, in the same format.
title,company,location,salary
Print out the items as a list of strings. (The output will be returned to you as a single string.)
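One possible approach is to use split, as in the Strings to Lists section. The sketch below works on a made-up listing so that it runs without a network connection; the job details are invented for illustration, and in your own code you would use response.text instead.

```python
# Hypothetical example of what response.text might contain for one job.
# The real values on the job board will be different.
listing_text = "Security Analyst,Example Corp,Lancaster,45000"

# strip() removes any stray whitespace or newline at the ends,
# then split(",") cuts the string at each comma.
fields = listing_text.strip().split(",")
print(fields)

# Each field can then be picked out by its position.
title = fields[0]
salary = fields[3]
print(title)
print(salary)
```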
Going to an incorrect job listing
What happens if you go to https://jobs.luhack.uk/some/random/path?
How many jobs?
Assuming that the jobs are listed at https://jobs.luhack.uk/0, https://jobs.luhack.uk/1, https://jobs.luhack.uk/2, etc.
How many jobs are there?
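One way to approach this is to keep requesting job numbers until the server stops answering with status code 200. The sketch below assumes that a missing job gives a status code other than 200 (for example 404); check this against what you saw for the incorrect listing. The function names count_jobs, live_status and fake_status are made up for this example, and fake_status only pretends to be the server so that the code can be tried offline.

```python
def count_jobs(get_status, limit=1000):
    """Count job numbers 0, 1, 2, ... until get_status gives something other than 200."""
    count = 0
    for job_id in range(limit):
        if get_status(job_id) != 200:
            break
        count = count + 1
    return count


def live_status(job_id):
    # Asks the real job board. Imported here so the offline demo
    # below still runs if requests is not installed.
    import requests
    response = requests.get("https://jobs.luhack.uk/" + str(job_id))
    return response.status_code


def fake_status(job_id):
    # Offline stand-in for the server: pretends jobs 0, 1 and 2 exist.
    if job_id < 3:
        return 200
    return 404


print(count_jobs(fake_status))  # 3
# For the real job board, run: print(count_jobs(live_status))
```

The limit argument is a safety net so the loop cannot run forever if the server answers 200 for every path.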
Applying for a job
To apply for a job, we need to send a POST request to the job listing. We will use the requests.post function to do this.
The server expects the data to be in the following format:
name,email,cv_url
import requests
server_url = "https://jobs.luhack.uk"
response = requests.post(server_url + "/0", data="John Doe,[email protected],http://example.com/john_doe_cv.pdf")
print(response.text)
Applying for all jobs
Now that you know how to apply for a job, can you apply for all the jobs on the job board?
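If you get stuck, the sketch below shows one possible shape for a solution. It assumes you already know how many jobs there are, and the names application_urls and apply_to_all are invented for this example. Only the list of URLs is printed here; nothing is sent until you call apply_to_all yourself.

```python
def application_urls(base_url, job_count):
    """Build the list of job URLs: base_url/0, base_url/1, and so on."""
    urls = []
    for job_id in range(job_count):
        urls.append(base_url + "/" + str(job_id))
    return urls


def apply_to_all(base_url, job_count, application):
    # Sends one POST request per job, using the name,email,cv_url format.
    import requests
    for url in application_urls(base_url, job_count):
        response = requests.post(url, data=application)
        print(url, response.status_code)


# 3 is a placeholder; use the number of jobs you actually found.
print(application_urls("https://jobs.luhack.uk", 3))
```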
Applying for only well paid jobs
Can you apply for only the jobs that pay more than £50,000?
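The main extra step here is turning the salary field into a number so it can be compared. The sketch below assumes the salary is the fourth field and is a whole number, possibly with a £ sign in front and no commas inside it (commas already separate the fields); check this against the real listings. The sample listings and the function names are made up for this example.

```python
def salary_as_int(listing_text):
    """Return the salary from a 'title,company,location,salary' string as an integer."""
    fields = listing_text.strip().split(",")
    # Remove a leading pound sign if there is one, then cast to an integer.
    salary_text = fields[3].replace("£", "")
    return int(salary_text)


def is_well_paid(listing_text, threshold=50000):
    return salary_as_int(listing_text) > threshold


# Hypothetical listings, standing in for response.text from the job board.
sample_listings = [
    "Security Analyst,Example Corp,Lancaster,45000",
    "Penetration Tester,Example Ltd,Remote,£60000",
]

for listing in sample_listings:
    if is_well_paid(listing):
        print("Would apply for: " + listing)
```

In your own code, the print inside the loop would be replaced by the requests.post call from the Applying for a job section.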