Review
"Wow--this book will help you to bring your data science projects from idea all the way
to production. Chris and Antje have covered all of the important concepts and the
key AWS services, with plenty of real-world examples to get you started
on your data science journey."
--Jeff Barr,
Vice President & Chief Evangelist,
Amazon Web Services
"It's very rare to find a book that comprehensively covers the full end-to-end process of
model development and deployment! If you're an ML practitioner, this book is a must!"
--Ramine Tinati,
Managing Director/Chief Data Scientist Applied Intelligence,
Accenture
"This book is a great resource for building scalable machine learning solutions on AWS
cloud. It includes best practices for all aspects of model building, including training,
deployment, security, interpretability, and MLOps."
--Geeta Chauhan,
AI/PyTorch Partner Engineering Head,
Facebook AI
"The landscape of tools on AWS for data scientists and engineers can be absolutely
overwhelming. Chris and Antje have done the community a service by providing a map
that practitioners can use to orient themselves, find the tools they need to get the
job done and build new systems that bring their ideas to life."
--Josh Wills,
Author, Advanced Analytics with Spark (O'Reilly)
"Successful data science teams know that data science isn't just modeling but needs a
disciplined approach to data and production deployment. We have an army of tools for all
of these at our disposal in major clouds like AWS. Practitioners will appreciate this
comprehensive, practical field guide that demonstrates not just how to apply
the tools but which ones to use and when."
--Sean Owen,
Principal Solutions Architect,
Databricks
From the Author
With this practical book, AI and machine learning (ML) practitioners will learn how
to successfully build and deploy data science projects on Amazon Web Services
(AWS). The Amazon AI and ML stack unifies data science, data engineering, and
application development to help level up your skills. This guide shows you how to
build and run pipelines in the cloud, then integrate the results into applications in
minutes instead of days. Throughout the book, authors Chris Fregly and Antje Barth
demonstrate how to reduce cost and improve performance.
* Apply the Amazon AI and ML stack to real-world use cases for natural language
processing, computer vision, fraud detection, conversational devices, and more.
* Use automated ML (AutoML) to implement a specific subset of use cases with
Amazon SageMaker Autopilot.
* Dive deep into the complete model development life cycle for a BERT-based natural
language processing (NLP) use case including data ingestion and analysis,
and more.
* Tie everything together into a repeatable ML operations (MLOps) pipeline.
* Explore real-time ML, anomaly detection, and streaming analytics on real-time
data streams with Amazon Kinesis and Amazon Managed Streaming for Apache
Kafka (Amazon MSK).
* Learn security best practices for data science projects and workflows, including
AWS Identity and Access Management (IAM), authentication, authorization, and
more.
Overview of the Chapters
Chapter 1 provides an overview of the broad and deep Amazon AI and ML stack, an
enormously powerful and diverse set of services, open source libraries, and infrastructure
to use for data science projects of any complexity and scale.
Chapter 2 describes how to apply the Amazon AI and ML stack to real-world use
cases for recommendations, computer vision, fraud detection, natural language
understanding (NLU), conversational devices, cognitive search, customer support,
industrial predictive maintenance, home automation, Internet of Things (IoT),
healthcare, and quantum computing.
Chapter 3 demonstrates how to use AutoML to implement a specific subset of these
use cases with SageMaker Autopilot.
Chapters 4-9 dive deep into the complete model development life cycle (MDLC) for a
BERT-based NLP use case, including data ingestion and analysis, feature selection
and engineering, model training and tuning, and model deployment with SageMaker,
Amazon Athena, Amazon Redshift, Amazon EMR, TensorFlow, PyTorch, and serverless
Apache Spark.
Chapter 10 ties everything together into repeatable pipelines using MLOps with
SageMaker Pipelines, Kubeflow Pipelines, Apache Airflow, MLflow, and TFX.
Chapter 11 demonstrates real-time ML, anomaly detection, and streaming analytics
on real-time data streams with Amazon Kinesis and Apache Kafka.
Chapter 12 presents a comprehensive set of security best practices for data science
projects and workflows, including IAM, authentication, authorization, network isolation,
data encryption at rest, post-quantum network encryption in transit, governance,
and auditability.
Throughout the book, we provide tips to reduce cost and improve performance for
data science projects on AWS.
Who Should Read This Book
This book is for anyone who uses data to make critical business decisions. The guidance
here will help data analysts, data scientists, data engineers, ML engineers,
research scientists, application developers, and DevOps engineers broaden their
understanding of the modern data science stack and level up their skills in the cloud.
The Amazon AI and ML stack unifies data science, data engineering, and application
development to help users level up their skills beyond their current roles. We show
how to build and run pipelines in the cloud, then integrate the results into applications
in minutes instead of days.
To get the most out of this book, we suggest readers have the following
knowledge:
* Basic understanding of cloud computing
* Basic programming skills with Python, R, Java/Scala, or SQL
* Basic familiarity with data science tools such as Jupyter Notebook, pandas,
NumPy, or scikit-learn
From the Inside Flap
This is the most extensive resource I know about ML on AWS, unequaled in breadth and depth. While ML literature often focuses on science, Antje and Chris take a different approach and dive deep into practical architectural concepts needed to serve science in production, such as security, data engineering, monitoring, CI/CD, and cost management. The book is state-of-the-art on the science as well: It presents advanced concepts such as Transformer architectures, AutoML, online learning, distillation, compilation, Bayesian model tuning, and bandits. It stands out by providing both a business-friendly description of services and concepts and low-level implementation tips and instructions. A must-read for individuals and organizations building ML systems on AWS or willing to improve their knowledge of the AWS data science stack.
Olivier Cruchant
Principal ML Specialist Solutions Architect at AWS
This book is a great resource to not only understand the end-to-end machine learning workflow in detail but also build operationally efficient machine learning workloads at scale on AWS. Highly recommend this book for anyone building machine learning workloads on AWS!
Shelbee Eigenbrode
AI/ML Specialist Solutions Architect, Amazon Web Services
This book is a comprehensive resource for diving into data science on AWS. The authors provide a good balance of theory, discussion, and hands-on examples to guide the reader through implementing all phases of machine learning applications using AWS services. A great resource to not just get started but to scale and secure end-to-end ML applications.
Sireesha Muppala, PhD
Principal Solutions Architect, AI/ML, Amazon Web Services
Implementing a robust end-to-end machine learning workflow is a daunting challenge, complicated by the wide range of tools and technologies available; the authors do an impressive job of guiding both novice and expert practitioners through this task leveraging the power of AWS services.
Brent Rabowsky
Data Scientist, AWS
Using real-world examples, Chris and Antje provide indispensable and comprehensive guidance for building and managing ML and AI applications in AWS.
Dean Wampler
Author, Programming Scala
Doing MLOps and Data Science on AWS is exciting and intimidating due to the vast quantity of services and methodologies available. This book is a welcome guide to getting Machine Learning into production on the AWS platform, whether you want to do ML with AWS Lambda or with Amazon SageMaker.
Noah Gift
Duke Faculty and Founder Pragmatic AI Labs
Data Science on AWS provides an in-depth look at the modern data science stack on AWS. Machine learning practitioners will learn about the services, open-source libraries, and infrastructure they can leverage during each phase of the ML pipeline and how to tie it all together using MLOps. This book is a great resource and a definite must-read for anyone looking to level up their ML skills using AWS.
Kesha Williams, A Cloud Guru
As AWS continues to generate explosive growth, the data science practitioner today needs to know how to operate in the cloud. This book takes the practitioner through key topics in cloud-based data science such as SageMaker, AutoML, Model Deployment, and MLOps cloud security best practices. It's a bookshelf must-have for those looking to keep pace with machine learning on AWS.
Josh Patterson
Author, Kubeflow Operations Guide
AWS is an extremely powerful tool, a visionary and leader in cloud computing. The variety of available services can be impressive, which is where this book is a big deal. Antje and Chris have crafted a complete AWS guide to building ML/AI pipelines complying with best-in-class practices. Allow yourself to keep calm and go to production.
Andy Petrella
CEO and Founder of Kensu
This book is a must-have for anyone who wants to learn how to organize a data science project in production on AWS. It covers the full journey from research to production and covers the AWS tools and services that could be used for each step along the way.
Rustem Feyzkhanov
Machine Learning Engineer at Instrumental, AWS ML Hero
Chris and Antje manage to compress all of AWS AI in this great book. If you plan to build AI using AWS, this book has you covered from the beginning to the end and more. Well done!
Francesco Mosconi
Author and Founder @ Zero to Deep Learning
Chris and Antje expertly guide ML practitioners through the complex and sometimes overwhelming landscape of managed cloud services on AWS. Because this book serves as a comprehensive atlas of services and their interactions toward the completion of end-to-end data science workflows from data ingestion to predictive application, you'll quickly find a spot for it on your desk as a vital quick reference!
Benjamin Bengfort
Rotational Labs
This book covers the different AWS tools for data science and how to select the right ones and make them work together.
Holden Karau
Author, Learning Spark and Kubeflow for Machine Learning
Chris and Antje have done a phenomenal job in translating business use cases and case studies into implementable solutions using AWS. The book is written in such a way that it is easy to read along with a tested and well managed code base to accompany it. This book is highly recommended to anyone interested in Data Science, Data Engineering and Machine Learning Engineering at Scale.
Shreenidhi Bharadwaj
Sr Principal, Private Equity/Venture Capital Advisory (M&A), West Monroe Partners
Chris and Antje have written the definitive guide on how to build AI/ML & Data Science solutions using AWS. They describe all the different steps of the AI/ML life cycle from development through production in comprehensive detail, and AI/ML practitioners of all levels will greatly benefit from reading this book. Highly recommended!
Jan Neumann
Executive Director, Machine Learning, Comcast
From the Back Cover
Benefits of Cloud Computing
Cloud computing enables the on-demand delivery of IT resources via the internet
with pay-as-you-go pricing. So instead of buying, owning, and maintaining our own
data centers and servers, we can acquire technology such as compute power, storage,
databases, and other services on an as-needed basis. Similar to a power company
sending electricity instantly when we flip a light switch in our home, the cloud provisions
IT resources on-demand with the click of a button or invocation of an API.
"There is no compression algorithm for experience" is a famous quote by Andy Jassy,
CEO, Amazon Web Services. The quote expresses the company's long-standing experience
in building reliable, secure, and performant services since 2006.
AWS has been continually expanding its service portfolio to support virtually any
cloud workload, including many services and features in the area of artificial intelligence
and machine learning. Many of these AI and machine learning services stem
from Amazon's pioneering work in recommender systems, computer vision, speech/text,
and neural networks over the past 20 years. A paper from 2003 titled
"Amazon.com Recommendations: Item-to-Item Collaborative Filtering" recently won the
Institute of Electrical and Electronics Engineers award as a paper that withstood the
"test of time." Let's review the benefits of cloud computing in the context of data science
projects on AWS.
Agility
Cloud computing lets us spin up resources as we need them. This enables us to
experiment quickly and frequently. Maybe we want to test a new library to run
data-quality checks on our dataset, or speed up model training by leveraging the newest
generation of GPU compute resources. We can spin up tens, hundreds, or even thousands
of servers in minutes to perform those tasks. If an experiment fails, we can
always deprovision those resources without any risk.
Cost Savings
Cloud computing allows us to trade capital expenses for variable expenses. We only
pay for what we use with no need for upfront investments in hardware that may
become obsolete in a few months. If we spin up compute resources to perform our
data-quality checks, data transformations, or model training, we only pay for the time
those compute resources are in use. We can achieve further cost savings by leveraging
Amazon EC2 Spot Instances for our model training. Spot Instances let us take advantage
of unused EC2 capacity in the AWS cloud and come with up to a 90% discount
compared to on-demand instances. Reserved Instances and Savings Plans allow us to
save money by committing to a given amount of usage over a fixed term.
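To make the pay-per-use arithmetic concrete, here is a minimal sketch that compares the cost of a training job at the on-demand rate with the best-case Spot discount mentioned above. The hourly rate, the job duration, and the function name are hypothetical values chosen for illustration; they are not real AWS prices, and actual Spot discounts vary and may be lower than 90%.

```python
def training_cost(hours: float, on_demand_rate: float, discount: float = 0.0) -> float:
    """Return the cost of one instance running for `hours` at a discounted hourly rate."""
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount must be between 0 and 1")
    return hours * on_demand_rate * (1.0 - discount)


# Hypothetical numbers, for illustration only (not real AWS prices).
HOURS = 10.0           # assumed length of the training job
ON_DEMAND_RATE = 4.00  # assumed USD per instance-hour

on_demand_cost = training_cost(HOURS, ON_DEMAND_RATE)
# Best case only: the text says "up to" a 90% discount.
spot_cost_best_case = training_cost(HOURS, ON_DEMAND_RATE, discount=0.90)

print(f"On-demand: ${on_demand_cost:.2f}")
print(f"Spot (best case): ${spot_cost_best_case:.2f}")
```

Because we pay only while the instance runs, halving the training time halves the bill in this model, regardless of which pricing option we choose.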
Elasticity
Cloud computing enables us to automatically scale our resources up or down to
match our application needs. Let's say we have deployed our data science application
to production and our model is serving real-time predictions. We can now automatically
scale up the model hosting resources in case we observe a peak in model
requests. Similarly, we can automatically scale down the resources when the number
of model requests drops. There is no need to overprovision resources to handle peak
loads.
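The scaling behavior described above can be sketched as a simple rule: pick the smallest number of instances that keeps the load per instance at or below a target, within fixed bounds. This is a simplified illustration, not the actual AWS auto scaling algorithm; the function name, the per-instance target, and the instance limits are assumptions made for the example.

```python
import math


def desired_instance_count(
    requests_per_second: float,
    target_per_instance: float,
    min_instances: int = 1,
    max_instances: int = 10,
) -> int:
    """Return how many instances keep per-instance load at or below the target."""
    if target_per_instance <= 0:
        raise ValueError("target_per_instance must be positive")
    # Round up so that no instance exceeds the target load.
    needed = math.ceil(requests_per_second / target_per_instance)
    # Clamp to the configured bounds (hypothetical limits).
    return max(min_instances, min(max_instances, needed))


# Hypothetical traffic levels: quiet period, normal load, and a peak.
for load in (20, 450, 5000):
    print(load, "requests/s ->", desired_instance_count(load, target_per_instance=100))
```

The lower bound keeps the model available during quiet periods, and the upper bound caps spending during unexpected spikes.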
Innovate Faster
Cloud computing allows us to innovate faster as we can focus on developing applications
that differentiate our business, rather than spending time on the undifferentiated
heavy lifting of managing infrastructure. The cloud helps us experiment with
new algorithms, frameworks, and hardware in seconds versus months.
Deploy Globally in Minutes
Cloud computing lets us deploy our data science applications globally within
minutes. In our global economy, it is important to be close to our customers. AWS
has the concept of a Region, which is a physical location around the world where
AWS clusters data centers. Each group of logical data centers is called an Availability
Zone (AZ). Each AWS Region consists of multiple, isolated, and physically separate
AZs within a geographic area. The number of AWS Regions and AZs is continuously
growing.
We can leverage the global footprint of AWS Regions and AZs to deploy our data science
applications close to our customers, improve application performance with
ultra-fast response times, and comply with the data-privacy restrictions of each
Region.
Smooth Transition from Prototype to Production
One of the benefits of developing data science projects in the cloud is the smooth
transition from prototype to production. We can switch from running model prototyping
code in our notebook to running data-quality checks or distributed model
training across petabytes of data within minutes. And once we are done, we can
deploy our trained models to serve real-time or batch predictions for millions of
users across the globe.
Prototyping often happens in single-machine development environments using
Jupyter Notebook, NumPy, and pandas. This approach works fine for small datasets.
When scaling out to work with large datasets, we will quickly exceed the single
machine's CPU and RAM resources. Also, we may want to use GPUs--or multiple
machines--to accelerate our model training. This is usually not possible with a single
machine.
The next challenge arises when we want to deploy our model (or application) to production.
We also need to ensure our application can handle thousands or millions of
concurrent users at global scale.
Production deployment often requires a strong collaboration between various teams
including data science, data engineering, application development, and DevOps. And
once our application is successfully deployed, we need to continuously monitor and
react to model performance and data-quality issues that may arise after the model is
pushed to production.
Developing data science projects in the cloud enables us to transition our models
smoothly from prototyping to production while removing the need to build out our
own physical infrastructure. Managed cloud services provide us with the tools to
automate our workflows and deploy models into a scalable and highly performant
production environment.
About the Author
Chris Fregly, Principal Developer Advocate, AI and Machine Learning @ AWS (San Francisco)
Chris Fregly is a Principal Developer Advocate for AI and Machine Learning at Amazon Web Services (AWS) based in San Francisco, California. He is co-author of the O'Reilly Book, "Data Science on AWS."
Chris is also the Founder of many AI-focused global meetups including the global "Data Science on AWS" Meetup. He regularly speaks at AI and Machine Learning conferences across the world including O'Reilly AI, Open Data Science Conference (ODSC), and Nvidia GPU Technology Conference (GTC).
Previously, Chris was Founder at PipelineAI where he worked with many AI-first startups and enterprises to continuously deploy ML/AI Pipelines using Spark ML, Kubernetes, TensorFlow, Kubeflow, Amazon EKS, and Amazon SageMaker.
Antje Barth, Senior Developer Advocate, AI and Machine Learning @ AWS (Düsseldorf)
Antje Barth is a Senior Developer Advocate for AI and Machine Learning at Amazon Web Services (AWS) based in Düsseldorf, Germany. She is co-author of the O'Reilly Book, "Data Science on AWS."
Antje is also co-founder of the Düsseldorf chapter of Women in Big Data. She frequently speaks at AI and Machine Learning conferences and meetups around the world, including the O'Reilly AI and Strata conferences. Besides ML/AI, Antje is passionate about helping developers leverage Big Data, container and Kubernetes platforms in the context of AI and Machine Learning.
Previously, Antje worked in technical evangelism and solutions engineering at MapR and Cisco where she worked with many companies to build and deploy cloud-based AI solutions using AWS and Kubernetes.