
Scrapy masterclass: Python web scraping and data pipelines


Ahmed Elfakharany

5:42:36

  • 1.1 Resources.html
  • 1. Introduction.mp4
    06:59
  • 2. Scrapy installation.html
  • 1.1 xpath_node_types.zip
  • 1. Xpath 101 node types.mp4
    07:03
  • 2.1 XPath 102 Cheat Sheet.pdf
  • 2. Xpath 102 basic syntax.mp4
    10:37
  • 3.1 XPath 103 Cheat Sheet Axes (node relations).pdf
  • 3. XPath 103 Axes (Node Relations).mp4
    06:59
  • 4. Revisiting our real-estate web scraping example.mp4
    07:08
  • 1. What is a web bot Is it ethical.mp4
    06:37
  • 2. The Scrapy Shell.mp4
    09:36
  • 3.1 Create your own Scrapy project.html
  • 3. Creating your first Scrapy project.mp4
    08:48
  • 4.1 Create your own Scrapy spider.html
  • 4. Creating your first Scrapy spider.mp4
    09:32
  • 5.1 Combining XPath queries.html
  • 5. Handling combined queries using the getall() method.mp4
    05:29
  • 6.1 Item Loaders.html
  • 6.2 The Scrapy project.html
  • 6. Data cleansing using Item Loaders.mp4
    13:21
  • 7.1 Crawl Spiders.html
  • 7. Pagination and link-following using Crawl Spiders.mp4
    09:06
  • 1.1 Login bot.html
  • 1. Login to websites.mp4
    07:29
  • 2. Changing the user-agent.mp4
    03:05
  • 3.1 Handling AJAX requests.html
  • 3. Handling AJAX requests 1.mp4
    08:17
  • 4.1 Handling AJAX requests.html
  • 4. Handling AJAX requests 2.mp4
    04:52
  • 5.1 Handling AJAX requests.html
  • 5. Handling AJAX requests 3.mp4
    03:24
  • 6. Caching responses.mp4
    06:08
  • 7. Image harvesting.mp4
    17:17
  • 8.1 Images storage to S3 and FTP.html
  • 8. Scraped images storage in FTP and AWS S3.mp4
    05:37
  • 1.1 Classifieds Ads project.html
  • 1. Introduction and sample project (classifieds ads scraping).mp4
    15:51
  • 2.1 Remove duplicates pipeline.html
  • 2.2 Removing duplicates pipeline.html
  • 2. Removing ads with duplicate titles.mp4
    04:28
  • 3.1 Dropping Ads with no phones pipeline.html
  • 3. Removing ads with no phone numbers.mp4
    02:45
  • 1.1 MongoDB pipeline.html
  • 1. Storing scraped data in MongoDB.mp4
    10:09
  • 2.1 MySQL Pipeline.html
  • 2. Storing scraped data in MySQL.mp4
    09:30
  • 3.1 Using Vault to store sensitive data for Scrapy.html
  • 3. Using Vault to store sensitive Scrapy settings.mp4
    09:11
  • 4.1 S3 Pipeline.html
  • 4. Storing data to AWS S3 bucket.mp4
    08:08
  • 5. Using Amazon Glue and Athena to query the data from S3 (extra lecture).mp4
    06:39
  • 1.1 Phone Models Project.html
  • 1. Phone-models project and spider rate-limiting.mp4
    12:41
  • 2.1 Rotating user-agents project.html
  • 2. Rotating user-agents middleware.mp4
    05:42
  • 3.1 Rotating proxies.html
  • 3. Rotating proxies middleware.mp4
    10:21
  • 1. What is Splash.mp4
    05:31
  • 2. Introduction to Docker (optional).mp4
    09:22
  • 3. Test-driving Splash.mp4
    05:45
  • 4.1 Wikipedia with Splash.html
  • 4. Integrating Scrapy with Splash.mp4
    12:58
  • 5.1 Handling scrolling pages with Splash.html
  • 5. Dealing with infinitely-scrolling pages using Splash.mp4
    15:13
  • 1. What is Selenium.mp4
    08:14
  • 2.1 firefox-how-to.pdf
  • 2.2 Revisiting infinitely-scrolling pages (medium.com).html
  • 2. Revisiting infinitely-scrolling pages (medium.com).mp4
    20:09
  • 3.1 Clicking buttons (Yahoo Finance).html
  • 3. Clicking buttons (Yahoo Finance).mp4
    12:35


    Work on 7 real-world web-scraping projects using Scrapy, Splash, and Selenium. Build data pipelines locally and on AWS

    What You'll Learn?


    • Extract data from the most difficult web sites using Scrapy
    • Build ETL pipelines and store data in CSV, JSON, MySQL, MongoDB, and S3
    • Avoid getting banned and evade bot-protection techniques
    • Use Splash for scraping JavaScript-powered websites
    • Harness the power of Selenium browser automation to scrape any website
    • Deploy your Scrapy bots in local and AWS environments

    Who is this for?


  • Anyone who wants to automate data collection from websites (web scraping) using Scrapy
  • Anyone who wants to build a business around web scraping and data collection
  • Data engineers, data scientists, ML engineers who want to master web scraping for their data collection needs
  • Developers, DevOps engineers or IT professionals who want to switch careers to data engineering
  • Python programmers who want to know more about Scrapy or web scraping in general


    Description

    This is the era of data!

    Everyone is telling you what to do with the data that you already have. But how can you "have" this data?

    Most Data Engineering / Data Science discussions today focus on how to analyze and process datasets to draw useful information out of them. However, they all assume that those datasets are already available to you, that they've been collected somehow. They spend little time showing how you can obtain those datasets firsthand. This course fills that gap.

    This course is all about walking you through the process of extracting data of interest from websites. True, there are a lot of datasets already available for you to consume, either for free or at some cost. However, what if those datasets are outdated? What if they don't address your specific needs? You'd better know how to build your own dataset from scratch, no matter how unstructured your data source is.

    Scrapy is a Python web scraping framework. Thousands of companies and professionals use it to collect data and build datasets. Then they can sell them or use them in their own projects. Today, you can be one of those professionals. Even build your own business around data harvesting!

    Today, data scientists and data engineers are among the most highly paid in the industry. Yet, if they don't have enough data to work on, they can do nothing.

    In this class, I'll show you how to obtain, organize, and store unstructured data from within websites' HTML, CSS, and JavaScript. Having mastered that skill, you can start your data engineering/data science career with an extra skillset under your belt: web scraping.
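    As a taste of what that extraction looks like, here is a minimal sketch of pulling structured fields out of raw HTML with XPath. Scrapy's selectors support full XPath 1.0; Python's standard-library ElementTree understands only a subset, but it is enough to illustrate the idea without installing anything. The HTML snippet and field names below are made up for illustration, not taken from the course.

```python
import xml.etree.ElementTree as ET

# A toy, well-formed snippet standing in for a scraped listing page.
html = """
<html><body>
  <div class="listing">
    <h2>3-bedroom apartment</h2>
    <span class="price">1200</span>
  </div>
  <div class="listing">
    <h2>Studio flat</h2>
    <span class="price">650</span>
  </div>
</body></html>
"""

root = ET.fromstring(html)

# ElementTree accepts a limited XPath dialect; in Scrapy you would
# write response.xpath("//div[@class='listing']/h2/text()") instead.
listings = [
    {
        "title": div.find("h2").text,
        "price": int(div.find("span[@class='price']").text),
    }
    for div in root.iter("div")
    if div.get("class") == "listing"
]

print(listings)
# → [{'title': '3-bedroom apartment', 'price': 1200},
#    {'title': 'Studio flat', 'price': 650}]
```

    On a real page you would let Scrapy's `response.xpath()` do this work, since real-world HTML is rarely well-formed enough for an XML parser.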

    You will also learn the next steps after you obtain your data. ETL (Extract, Transform, and Load) starts with Scrapy (Extract), and this course covers the other two aspects (Transform and Load) as well. Using Scrapy pipelines, we'll see how we can store our data in SQL and NoSQL databases, Elasticsearch clusters, event brokers like Kafka, object storage like S3, and message queues like AWS SQS.
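    To make the Transform step concrete: a Scrapy pipeline is just a Python class with a `process_item(item, spider)` method, registered in the project's `ITEM_PIPELINES` setting. The sketch below shows the shape of a dedup-and-normalise step; in a real project you would raise `scrapy.exceptions.DropItem`, which is replaced here with a stand-in exception so the example runs without Scrapy installed. The item fields are invented for illustration.

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem so the sketch runs standalone."""

class DedupAndCleanPipeline:
    """Transform step of the ETL: normalise fields, drop duplicate titles."""

    def __init__(self):
        self.seen_titles = set()

    def process_item(self, item, spider):
        title = item["title"].strip().lower()
        if title in self.seen_titles:
            raise DropItem(f"duplicate title: {title!r}")
        self.seen_titles.add(title)
        item["title"] = title
        item["price"] = float(item["price"])  # normalise before the Load step
        return item

# Feeding a few fake items through the pipeline by hand:
pipeline = DedupAndCleanPipeline()
kept = []
for raw in [{"title": "Phone A", "price": "100"},
            {"title": "phone a ", "price": "100"},
            {"title": "Phone B", "price": "250"}]:
    try:
        kept.append(pipeline.process_item(raw, spider=None))
    except DropItem:
        pass  # the near-duplicate is discarded

print(kept)  # two items survive
```

    The Load step is another pipeline of the same shape whose `process_item` writes the cleaned item to MongoDB, MySQL, or S3, which is exactly what the pipeline lectures in this course build.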

    Even if you know nothing about web scraping or data harvesting, even if all of this seems new to you, you've come to the right place.

    I've designed this class for total beginners. It will walk you from "What is web scraping? What is Scrapy? Why should I learn and use it?" all the way up to "Now I have several gigabytes of web-scraped data from dozens of websites. Let's figure out how we can put them to effective use".

    Web scraping can be as easy as extracting some text from a single HTML page, or as involved as going several levels deep across several websites, crawling each link and hopping from one page to another. It can also get incredibly challenging when websites place blockers to keep web bots out. Don't worry: we'll address all of these use cases and, together, figure out how to overcome them.
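    Conceptually, that link-following crawl is a breadth-first walk with a visited set; Scrapy's `CrawlSpider` with `LinkExtractor` rules does this for you (plus scheduling, request dedup, and politeness). A toy sketch over an in-memory "site" (the page graph is invented for illustration):

```python
from collections import deque

# A fake site: each page maps to the links it contains.
SITE = {
    "/": ["/page1", "/page2"],
    "/page1": ["/page2", "/page3"],
    "/page2": ["/"],
    "/page3": [],
}

def crawl(start):
    """Breadth-first crawl: visit every reachable page exactly once."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)          # a real spider would parse the page here
        for link in SITE[url]:     # CrawlSpider extracts these via LinkExtractor
            if link not in seen:   # the visited set prevents infinite loops
                seen.add(link)
                queue.append(link)
    return order

print(crawl("/"))  # → ['/', '/page1', '/page2', '/page3']
```

    Note how `/page2` links back to `/` without trapping the crawler: the visited set is what keeps a link-following bot from looping forever, and Scrapy maintains one for you automatically.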


    Ahmed Elfakharany
    Ahmed is a senior DevOps engineer who has helped many companies make the transition to the cloud. He has worked with small startups as well as large enterprises. Kubernetes and microservices are his specialty. AWS is his cloud provider of choice, although he also uses Azure and Google Cloud when necessary. Ahmed has taught several thousand students the basics of Docker and DevOps tools (Ansible, Vagrant, Terraform, Packer, CI/CD, and many others). He has recently added data engineering and MLOps to his skill set so that he can provide even more learning value to students who want to pursue that path.
    • Language: English
    • Training sessions: 39
    • Duration: 5:42:36
    • Release date: 2022/12/24
