Python web scraping for beginners

Introduction

This tutorial is for people who have done almost no programming before. It will teach you how to get data from, and send data to, websites using Python.

This has many uses in security. In this example, we will be using it to abuse a job application website. THIS IS OKAY BECAUSE WE HAVE PERMISSION FROM THE WEBSITE OWNER (us). OTHERWISE, THIS WOULD BE ILLEGAL.

Setting up Python

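You might have Python installed already. To check, run python --version in the terminal (on some systems the command is python3 --version). If a version number is printed, Python is installed.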

For the first part of the tutorial we will be using the interactive python shell. To open it, type python in the terminal and press enter.

If you do not have python installed, you can use replit to run python code in your browser. To do this, go to the link and sign up. Then click on the “new repl” button and select “python” from the list of languages. For the first part of the tutorial we will be using the interactive python shell. To open it, click on the “shell” button in the top right.

Python Programming basics

Please run the code examples in the interactive python shell. After each example, try changing the code to see what happens. Even though it takes longer, typing the code out yourself is better than copying and pasting. It will help you remember the code better.

Whitespace

In Python, indentation (the whitespace at the start of a line) is used to group lines into blocks of code. You can use tabs or spaces, but you must be consistent: if you mix tabs and spaces, you will get an error. If you choose to use spaces, you should use 4 spaces per level of indentation.
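
As a minimal sketch of what a block looks like, the example below uses the if keyword (explained later, under Conditionals). The variable names and values are made up for illustration.

```python
temperature = 30  # made-up example value
message = "Mild"

if temperature > 25:
    # These two lines are indented by 4 spaces, so they both belong to the if block.
    message = "Hot"
    print("Drink some water")

# This line is not indented, so it is outside the block and always runs.
print(message)
```

Try changing temperature to 10 and see which lines still run.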

Comments

Comments are lines of code that are not executed. They are used to explain what the code does. In python, comments start with a #.

# This is a comment

Printing

You can output data to the console using the print function.

print("Hello, world!")

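or, to print the value stored in a variable (variables are explained below; this line only works once server_url has been created):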

print(server_url)

Functions

Sometimes you will want to reuse code. You can do this by creating a function. In python, you can call a function by typing the name of the function, followed by parentheses. Any information you want to pass to the function goes inside the parentheses, separated by commas. The print function is an example of a function.
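
The paragraph above talks about creating a function but does not show how. A minimal sketch, assuming a made-up function called greet: the def keyword creates the function, and return hands a value back to the code that called it.

```python
# Create a function called greet that takes one piece of information (a name).
def greet(name):
    # return gives a value back to whoever called the function.
    return "Hello, " + name + "!"

# Call the function twice with different arguments, reusing the same code.
print(greet("Alice"))  # prints: Hello, Alice!
print(greet("Bob"))    # prints: Hello, Bob!

# print is also a function; it accepts several arguments separated by commas.
print("Two", "arguments")  # prints: Two arguments
```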

Variables

Variables are used to store data. In python, you can create a variable by typing the name of the variable, followed by an equals sign, followed by the value you want to store in the variable.

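Variable names must start with a letter or an underscore, contain only letters, digits and underscores, and not be a reserved Python keyword such as if or for. They are also case-sensitive, so server_url and Server_URL are different variables.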

server_url = "https://jobs.luhack.uk"

Normally, you should use descriptive variable names. Avoid single-letter variable names like x and y. The only exception is when you are using a loop, where i is commonly used as a variable name. In Python, there is a strong convention to use snake_case for variable names. This means that variable names should be lowercase, with words separated by underscores. (This might not seem important, but it will make your life much easier.)

Variable Types

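There are many different types of values in Python. Some of the most common are strings (text, such as "hello"), integers (whole numbers, such as 42), floats (numbers with a decimal point, such as 3.14), booleans (True or False), and lists (ordered collections, such as ["a", "b", "c"]).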
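
If you are unsure what type a value has, the built-in type function will tell you. A small sketch with made-up variable names and values:

```python
job_title = "Pen tester"    # str: text
salary = 50000              # int: a whole number
hourly_rate = 24.5          # float: a number with a decimal point
is_remote = True            # bool: either True or False
skills = ["python", "web"]  # list: an ordered collection of values

print(type(job_title))    # prints: <class 'str'>
print(type(salary))       # prints: <class 'int'>
print(type(hourly_rate))  # prints: <class 'float'>
print(type(is_remote))    # prints: <class 'bool'>
print(type(skills))       # prints: <class 'list'>
```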

Casting

You can convert between different types using casting. In Python, you can cast a value to a string using the str function, to an integer using the int function, and to a list using the list function.

print(str(123))
print(int("123"))
print(list("abc"))

Concatenation

Concatenation is used to combine two strings. In python, you can concatenate two strings using the + operator.

print("Hello" + "world")

If you try to concatenate a string and an integer, you will get an error. You can fix this by casting the integer to a string.

print("Hello" + str(123))

Comparisons

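Comparisons are used to compare two values. In Python, the most common comparisons are == (equal to), != (not equal to), < (less than), > (greater than), <= (less than or equal to) and >= (greater than or equal to). Each comparison gives back either True or False.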

You can try these out in the interactive python shell.

print(1 == 1)
print(1 != 2)
print(1 < 2)
print(1 > 2)

Conditionals

Note: from here on, you may find it easier to write your code in a file and then run it using python filename.py.

Conditionals are used to run code based on whether a condition is true or false. In python, you can use the if and else keywords to create conditionals.

if server_url == "https://jobs.luhack.uk":
    print("Correct URL")
else:
    print("Incorrect URL")

Loops

Loops are used to run code multiple times. In python, you can use the for keyword to create loops.

for i in range(5):
    print(i)

Using Conditionals to Control Loops

You can use conditionals to control loops. In python, you can use the break keyword to stop a loop, and the continue keyword to skip the rest of the code in the loop and start the next iteration.

for i in range(5):
    if i == 3:
        break
    print(i)
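
The example above shows break. As a companion sketch, here is the same loop using continue instead; the printed_count variable is only there to show how many times print ran.

```python
printed_count = 0  # counts how many numbers were printed

for i in range(5):
    if i == 3:
        # Skip the rest of this iteration and carry on with i = 4.
        continue
    print(i)
    printed_count = printed_count + 1

# The loop prints 0, 1, 2 and 4, so four numbers in total.
print(printed_count)
```

With break the loop stops completely at 3; with continue only the 3 is skipped.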

List Operations

As lists have an order, you can access individual elements using their index. The index starts at 0. (In computer science, we often start counting at 0, not 1.)

my_list = ["a", "b", "c"]
print(my_list[0])

You can also add elements to a list using the append function.

my_list = ["a", "b", "c"]
my_list.append("d")
print(my_list)

You can also remove elements from a list using the remove function.

my_list = ["a", "b", "c"]
my_list.remove("b")
print(my_list)

You can also count the number of elements in a list using the len function.

my_list = ["a", "b", "c"]
print(len(my_list))

Strings to Lists

You can split a string into a list using the split function.

my_string = "a,b,c"
my_list = my_string.split(",")
print(my_list)

Lists to Strings

You can join a list into a string using the join function.

my_list = ["a", "b", "c"]
my_string = ",".join(my_list)
print(my_string)

Libraries

Sometimes you will need to use code that is not built into python. You can do this by importing a library. In python, you can import a library using the import keyword.

import random

print(random.randint(1, 10))

(Try running this code multiple times)

Dot Notation

You might notice that we use random.randint instead of just randint. This is because randint is a function that is part of the random library. In python, you can access functions that are part of a library using dot notation.

We will not go into more detail on this in this tutorial. We will give you the code you need to use; just take it as given for now.

Actually building a web scraper

Now that you know the basics of python, we can start building a web scraper. We will be using the requests library to find and then apply for jobs on LUHack’s job board.

Installing the requests library

The requests library does not come with Python itself, but it is often already installed. To check, run the following code in the interactive Python shell.

import requests

If you get an error, you will need to install the requests library. You can do this by running the following command in the terminal.

pip install requests

Getting the job board

To get the job board, we will use the requests.get function. This function takes a URL as an argument and returns a response object.

import requests

server_url = "https://jobs.luhack.uk"
response = requests.get(server_url)
print(response.text)
# 200 means that the request was successful, 404 means that the page was not found, etc.
print(response.status_code)
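
The status code is an ordinary integer, so you can compare it like any other number. A minimal sketch, assuming a made-up helper called describe_status; it does not contact the server, so you can try it without an internet connection and later pass it response.status_code.

```python
# Hypothetical helper: turns a status code into a short description.
def describe_status(status_code):
    if status_code == 200:
        return "success"
    if status_code == 404:
        return "page not found"
    # Any other code is reported as-is; str is needed to join it to the text.
    return "other status code: " + str(status_code)

print(describe_status(200))  # prints: success
print(describe_status(404))  # prints: page not found
print(describe_status(500))  # prints: other status code: 500
```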

View a job listing

Given that the first job is listed at https://jobs.luhack.uk/0, what is the first job listed on the board?

Getting fields from the job listing

Job listings always return the same fields, in the same format.

title,company,location,salary

The server returns these fields to you as a single string. Print out the items as a list of strings.
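
As a hint, here is a sketch using a made-up listing. The values in listing_text are invented for illustration; the real job board will return something different in response.text.

```python
# Made-up stand-in for response.text, in the title,company,location,salary format.
listing_text = "Security Analyst,Example Corp,Lancaster,45000"

# split turns the single string into a list of strings.
fields = listing_text.split(",")
print(fields)  # prints: ['Security Analyst', 'Example Corp', 'Lancaster', '45000']

# Individual fields can then be picked out by index.
title = fields[0]
salary = fields[3]
print(title)   # prints: Security Analyst
print(salary)  # prints: 45000
```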

Going to an incorrect job listing

What happens if you go to https://jobs.luhack.uk/some/random/path?

How many jobs?

Assuming that the jobs are listed at https://jobs.luhack.uk/0, https://jobs.luhack.uk/1, https://jobs.luhack.uk/2, and so on, how many jobs are there?
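
As a hint, the URLs can be built in a loop by casting the loop counter to a string. This sketch only builds and prints the first three URLs; it does not send any requests, and the job_url name is made up for illustration.

```python
server_url = "https://jobs.luhack.uk"

for i in range(3):
    # i is an integer, so it must be cast to a string before concatenating.
    job_url = server_url + "/" + str(i)
    print(job_url)
```

Combining this with requests.get and the status code is left to you.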

Applying for a job

To apply for a job, we need to send a POST request to the job listing. We will use the requests.post function to do this.

The server expects the data to be in the following format:

name,email,cv_url

import requests

server_url = "https://jobs.luhack.uk"
# Placeholder applicant details; replace them with your own.
response = requests.post(server_url + "/0", data="John Doe,john.doe@example.com,http://example.com/john_doe_cv.pdf")
print(response.text)

Applying for all jobs

Now that you know how to apply for a job, can you apply for all the jobs on the job board?

Applying for only well paid jobs

Can you apply for only the jobs that pay more than £50,000?
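
As a hint, remember that fields taken from the response are strings, and in Python 3 comparing a string with a number using > gives an error. A minimal sketch, assuming the salary field contains only digits (this is an assumption; if the real value contains a £ sign or commas, int will fail and the text needs cleaning first):

```python
# Made-up salary field, as it might look after splitting a job listing.
salary_text = "62000"

# Cast the string to an integer so it can be compared with a number.
salary = int(salary_text)

if salary > 50000:
    is_well_paid = True
    print("Well paid")
else:
    is_well_paid = False
    print("Not well paid")
```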