
Cleaning Data for Effective Data Science: Doing the other 80% of the work with Python, R, and command-line tools
Publication
Packt Publishing
Data cleaning is the all-important first step to successful data science, data analysis, and machine learning. If you work with any kind of data, this book is your go-to resource, arming you with the insights and heuristics experienced data scientists had to learn the hard way.
In a light-hearted and engaging exploration of different tools, techniques, and datasets real and fictitious, Python veteran David Mertz teaches you the ins and outs of data preparation and the essential questions you should be asking of every piece of data you work with.
Using a mixture of Python, R, and common command-line tools, Cleaning Data for Effective Data Science follows the data cleaning pipeline from start to end, focusing on helping you understand the principles underlying each step of the process. You'll look at data ingestion of a vast range of tabular, hierarchical, and other data formats, impute missing values, detect unreliable data and statistical anomalies, and generate synthetic features. The long-form exercises at the end of each chapter let you get hands-on with the skills you've acquired along the way, also providing a valuable resource for academic courses.
Review
"Far more time is usually spent in extracting, cleaning, normalizing, or fixing data that ultimately feeds a data scientist's models than is spent on the "data science" itself. Despite this, data cleaning has so far lacked a comprehensive resource to teach newcomers about the practices that some of us have had to learn the hard way over many years. Cleaning Data for Effective Data Science is the first book I've seen that really meets that need. It's well-written and literate, with coherent and understandable explanations of both the structures used in handling real-world data and the many ways things can go wrong. When I give talks about data cleaning, I'm often asked to recommend a book on this topic, and I've never had a really good answer. No more! I predict that this book will be a standard for a rising generation of data engineers, and deservedly so."
--Naomi Ceder, Former Chair, Python Software Foundation, Co-Founder/Organizer Trans*Code Hackday
"The subject of Cleaning Data for Effective Data Science is vital yet, sadly, neglected in the literature. I and my fellow practitioners have learned most of what this book teaches on the job by trial and error and/or mentoring by more experienced peers, rather than academic courses or books. This book vastly surpassed my high expectations; I found it to be highly pragmatic yet quite usefully structured and sequenced. I met almost all the topics I would expect, plus I've actually LEARNED stuff I could have used myself (for example, usable heuristics to detect, and possibly correct, sampling bias in one's input data sets). Some topics covered are elementary to intermediate, while others are wickedly advanced (such as t-SNE: the book does not delve into its theoretical underpinnings; it just shows how to use relevant Python and R packages). To summarize, I strongly urge every colleague I interact with, who has anything to do with data processing, to buy this book. It WILL be well worth their time and energy!"
--Alex Martelli, Fellow, Python Software Foundation, Co-author of Python Cookbook and Python in a Nutshell
"In data science, as in Victorian life, cleanliness is next to godliness. David takes us as Data Scientists (or, perhaps more accurately, Data Janitors) on an all-encompassing tour of the landscape of data cleaning. Intrepidly and meticulously, he marches us through the muddy waters of data sources, formats, and flaws, and sifts through the myriad of tools, algorithms, and methods available for handling those. If your data is dirty (which it is), then this book will be a quintessential reference atlas."
--Stéfan van der Walt, Senior Research Data Scientist, Berkeley Institute for Data Science, Founder of scikit-image, and Co-author of Elegant SciPy
"Exploring and validating data is famously not the most exciting part of data science. But it is important, and it can make the difference between the success or failure of a data science project. In Cleaning Data for Effective Data Science, David Mertz has curated a powerful set of tools and skills; he presents them clearly, and with just the right level of detail. If you work with data, this book will help you do your job better. I should add, it's a surprisingly fun read!"
--Allen Downey, Professor at Olin College, Author of Think Python, Think Bayes, and Think DSP
About the Author
David Mertz, Ph.D. is the founder of KDM Training, a partnership dedicated to educating developers and data scientists in machine learning and scientific computing. He created a data science training program for Anaconda Inc. and was a senior trainer for them. With the advent of deep neural networks, he has turned to training our robot overlords as well.
He previously worked for 8 years with D. E. Shaw Research and was also a Director of the Python Software Foundation for 6 years. David remains co-chair of its Trademarks Committee and Scientific Python Working Group. His columns, Charming Python and XML Matters, were once the most widely read articles in the Python world.
What You Will Learn
- Ingest and work with common data formats like JSON, CSV, SQL and NoSQL databases, PDF, and binary serialized data structures
- Understand how and why we use tools such as pandas, SciPy, scikit-learn, Tidyverse, and Bash
- Apply useful rules and heuristics for assessing data quality and detecting bias, like Benford's law and the 68-95-99.7 rule
- Identify and handle unreliable data and outliers, examining z-score and other statistical properties
- Impute sensible values into missing data and use sampling to fix imbalances
- Use dimensionality reduction, quantization, one-hot encoding, and other feature engineering techniques to draw out patterns in your data
- Work carefully with time series data, performing de-trending and interpolation
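As a rough illustration of the z-score and 68-95-99.7 items in the list above, here is a minimal sketch in Python using only the standard library. The function name `zscore_outliers`, the threshold of 3 standard deviations, and the sample readings are assumptions made for this example; they are not taken from the book.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return (value, z) pairs whose absolute z-score exceeds threshold.

    Under a normal distribution roughly 99.7% of values fall within
    3 standard deviations of the mean, so |z| > 3 is a common
    (but not universal) cutoff for "worth a second look".
    """
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical: nothing can be an outlier
    scored = ((v, (v - mean) / stdev) for v in values)
    return [(v, z) for v, z in scored if abs(z) > threshold]


# Hypothetical sensor readings clustered near 20, plus one suspicious entry.
readings = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 19.7, 20.4, 20.0, 19.9,
            20.1, 20.2, 19.8, 20.0, 20.3, 19.9, 20.1, 20.0, 19.8, 20.2,
            95.0]

outliers = zscore_outliers(readings)
for value, z in outliers:
    print(f"value={value} z={z:.2f}")
```

One caveat with this simple approach: an extreme value inflates both the mean and the standard deviation it is measured against. With the population standard deviation, no z-score in a sample of n points can exceed the square root of n - 1, so a cutoff of 3 can never fire on 10 or fewer points.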
Who This Book Is For
This book is designed to benefit software developers, data scientists, aspiring data scientists, teachers, and students who work with data. If you want to improve your rigor in data hygiene or are looking for a refresher, this book is for you.
Basic familiarity with statistics, general concepts in machine learning, knowledge of a programming language (Python or R), and some exposure to data science are helpful.
Table of Contents
- Data Ingestion - Tabular Formats
- Data Ingestion - Hierarchical Formats
- Data Ingestion - Repurposing Data Sources
- The Vicissitudes of Error - Anomaly Detection
- The Vicissitudes of Error - Data Quality
- Rectification and Creation - Value Imputation
- Rectification and Creation - Feature Engineering
- Ancillary Matters - Closure/Glossary