Informatyka
Michael Walker
Many individuals who know how to run machine learning algorithms do not have a good sense of the statistical assumptions they make and how to match the properties of the data to the algorithm for the best results.As you start with this book, models are carefully chosen to help you grasp the underlying data, including in-feature importance and correlation, and the distribution of features and targets. The first two parts of the book introduce you to techniques for preparing data for ML algorithms, without being bashful about using some ML techniques for data cleaning, including anomaly detection and feature selection. The book then helps you apply that knowledge to a wide variety of ML tasks. You’ll gain an understanding of popular supervised and unsupervised algorithms, how to prepare data for them, and how to evaluate them. Next, you’ll build models and understand the relationships in your data, as well as perform cleaning and exploration tasks with that data. You’ll make quick progress in studying the distribution of variables, identifying anomalies, and examining bivariate relationships, as you focus more on the accuracy of predictions in this book.By the end of this book, you’ll be able to deal with complex data problems using unsupervised ML algorithms like principal component analysis and k-means clustering.
Data Engineering Best Practices. Architect robust and cost-effective data solutions in the cloud era
Richard J. Schiller, David Larochelle
Revolutionize your approach to data processing in the fast-paced business landscape with this essential guide to data engineering. Discover the power of scalable, efficient, and secure data solutions through expert guidance on data engineering principles and techniques. Written by two industry experts with over 60 years of combined experience, it offers deep insights into best practices, architecture, agile processes, and cloud-based pipelines. You’ll start by defining the challenges data engineers face and understand how this agile and future-proof comprehensive data solution architecture addresses them. As you explore the extensive toolkit, mastering the capabilities of various instruments, you’ll gain the knowledge needed for independent research. Covering everything you need, right from data engineering fundamentals, the guide uses real-world examples to illustrate potential solutions. It elevates your skills to architect scalable data systems, implement agile development processes, and design cloud-based data pipelines. The book further equips you with the knowledge to harness serverless computing and microservices to build resilient data applications.By the end, you'll be armed with the expertise to design and deliver high-performance data engineering solutions that are not only robust, efficient, and secure but also future-ready.
Data Engineering Best Practices. Architect robust and cost-effective data solutions in the cloud era
Richard J. Schiller, David Larochelle
Revolutionize your approach to data processing in the fast-paced business landscape with this essential guide to data engineering. Discover the power of scalable, efficient, and secure data solutions through expert guidance on data engineering principles and techniques. Written by two industry experts with over 60 years of combined experience, it offers deep insights into best practices, architecture, agile processes, and cloud-based pipelines. You’ll start by defining the challenges data engineers face and understand how this agile and future-proof comprehensive data solution architecture addresses them. As you explore the extensive toolkit, mastering the capabilities of various instruments, you’ll gain the knowledge needed for independent research. Covering everything you need, right from data engineering fundamentals, the guide uses real-world examples to illustrate potential solutions. It elevates your skills to architect scalable data systems, implement agile development processes, and design cloud-based data pipelines. The book further equips you with the knowledge to harness serverless computing and microservices to build resilient data applications.By the end, you'll be armed with the expertise to design and deliver high-performance data engineering solutions that are not only robust, efficient, and secure but also future-ready.
Data Engineering with Alteryx. Helping data engineers apply DataOps practices with Alteryx
Paul Houghton
Alteryx is a GUI-based development platform for data analytic applications.Data Engineering with Alteryx will help you leverage Alteryx’s code-free aspects which increase development speed while still enabling you to make the most of the code-based skills you have.This book will teach you the principles of DataOps and how they can be used with the Alteryx software stack. You’ll build data pipelines with Alteryx Designer and incorporate the error handling and data validation needed for reliable datasets. Next, you’ll take the data pipeline from raw data, transform it into a robust dataset, and publish it to Alteryx Server following a continuous integration process.By the end of this Alteryx book, you’ll be able to build systems for validating datasets, monitoring workflow performance, managing access, and promoting the use of your data sources.
Gareth Eagar
This book, authored by a Senior Data Architect with 25 years of experience, helps you gain expertise in the AWS ecosystem for data engineering. This revised edition updates every chapter to cover the latest AWS services and features, provides a refreshed view on data governance, and introduces a new section on building modern data platforms. You will learn how to implement a data mesh, work with open-table formats such as Apache Iceberg, and apply DataOps practices for automation and observability.You will begin by exploring core concepts and essential AWS tools used by data engineers, along with modern data management approaches. You will then design and build data pipelines, review raw data sources, transform data, and understand how it is consumed by various stakeholders. The book also covers data governance, populating data marts and warehouses, and how a data lakehouse fits into the architecture. You will explore AWS tools for analysis, SQL queries, visualizations, and learn how AI and machine learning generate insights from data. Later chapters cover transactional data lakes, data meshes, and building a complete AWS data platform.By the end, you will be able to confidently implement data engineering pipelines on AWS.*Email sign-up and proof of purchase required
Gareth Eagar
This book, authored by a Senior Data Architect with 25 years of experience, helps you gain expertise in the AWS ecosystem for data engineering. This revised edition updates every chapter to cover the latest AWS services and features, provides a refreshed view on data governance, and introduces a new section on building modern data platforms. You will learn how to implement a data mesh, work with open-table formats such as Apache Iceberg, and apply DataOps practices for automation and observability.You will begin by exploring core concepts and essential AWS tools used by data engineers, along with modern data management approaches. You will then design and build data pipelines, review raw data sources, transform data, and understand how it is consumed by various stakeholders. The book also covers data governance, populating data marts and warehouses, and how a data lakehouse fits into the architecture. You will explore AWS tools for analysis, SQL queries, visualizations, and learn how AI and machine learning generate insights from data. Later chapters cover transactional data lakes, data meshes, and building a complete AWS data platform.By the end, you will be able to confidently implement data engineering pipelines on AWS.*Email sign-up and proof of purchase required
Gareth Eagar
Written by a Senior Data Architect with over twenty-five years of experience in the business, Data Engineering for AWS is a book whose sole aim is to make you proficient in using the AWS ecosystem. Using a thorough and hands-on approach to data, this book will give aspiring and new data engineers a solid theoretical and practical foundation to succeed with AWS.As you progress, you’ll be taken through the services and the skills you need to architect and implement data pipelines on AWS. You'll begin by reviewing important data engineering concepts and some of the core AWS services that form a part of the data engineer's toolkit. You'll then architect a data pipeline, review raw data sources, transform the data, and learn how the transformed data is used by various data consumers. You’ll also learn about populating data marts and data warehouses along with how a data lakehouse fits into the picture. Later, you'll be introduced to AWS tools for analyzing data, including those for ad-hoc SQL queries and creating visualizations. In the final chapters, you'll understand how the power of machine learning and artificial intelligence can be used to draw new insights from data.By the end of this AWS book, you'll be able to carry out data engineering tasks and implement a data pipeline on AWS independently.
Dmitry Foshin, Dmitry Anoshin, Tonya Chernyshova, Sergii...
Data Engineering with Azure Databricks is your essential guide to building scalable, secure, and high-performing data pipelines using the powerful Databricks platform on Azure. Designed for data engineers, architects, and developers, this book demystifies the complexities of Spark-based workloads, Delta Lake, Unity Catalog, and real-time data processing.Beginning with the foundational role of Azure Databricks in modern data engineering, you’ll explore how to set up robust environments, manage data ingestion with Auto Loader, optimize Spark performance, and orchestrate complex workflows using tools like Azure Data Factory and Airflow.The book offers deep dives into structured streaming, Delta Live Tables, and Delta Lake’s ACID features for data reliability and schema evolution. You’ll also learn how to manage security, compliance, and access controls using Unity Catalog, and gain insights into managing CI/CD pipelines with Azure DevOps and Terraform.With a special focus on machine learning and generative AI, the final chapters guide you in automating model workflows, leveraging MLflow, and fine-tuning large language models on Databricks. Whether you're building a modern data lakehouse or operationalizing analytics at scale, this book provides the tools and insights you need.
Dmitry Foshin, Dmitry Anoshin, Tonya Chernyshova, Sergii...
Data Engineering with Azure Databricks is your essential guide to building scalable, secure, and high-performing data pipelines using the powerful Databricks platform on Azure. Designed for data engineers, architects, and developers, this book demystifies the complexities of Spark-based workloads, Delta Lake, Unity Catalog, and real-time data processing.Beginning with the foundational role of Azure Databricks in modern data engineering, you’ll explore how to set up robust environments, manage data ingestion with Auto Loader, optimize Spark performance, and orchestrate complex workflows using tools like Azure Data Factory and Airflow.The book offers deep dives into structured streaming, Delta Live Tables, and Delta Lake’s ACID features for data reliability and schema evolution. You’ll also learn how to manage security, compliance, and access controls using Unity Catalog, and gain insights into managing CI/CD pipelines with Azure DevOps and Terraform.With a special focus on machine learning and generative AI, the final chapters guide you in automating model workflows, leveraging MLflow, and fine-tuning large language models on Databricks. Whether you're building a modern data lakehouse or operationalizing analytics at scale, this book provides the tools and insights you need.
Pulkit Chadha
Written by a Senior Solutions Architect at Databricks, Data Engineering with Databricks Cookbook will show you how to effectively use Apache Spark, Delta Lake, and Databricks for data engineering, starting with comprehensive introduction to data ingestion and loading with Apache Spark.What makes this book unique is its recipe-based approach, which will help you put your knowledge to use straight away and tackle common problems. You’ll be introduced to various data manipulation and data transformation solutions that can be applied to data, find out how to manage and optimize Delta tables, and get to grips with ingesting and processing streaming data. The book will also show you how to improve the performance problems of Apache Spark apps and Delta Lake. Advanced recipes later in the book will teach you how to use Databricks to implement DataOps and DevOps practices, as well as how to orchestrate and schedule data pipelines using Databricks Workflows. You’ll also go through the full process of setup and configuration of the Unity Catalog for data governance.By the end of this book, you’ll be well-versed in building reliable and scalable data pipelines using modern data engineering technologies.
Pulkit Chadha
Written by a Senior Solutions Architect at Databricks, Data Engineering with Databricks Cookbook will show you how to effectively use Apache Spark, Delta Lake, and Databricks for data engineering, starting with comprehensive introduction to data ingestion and loading with Apache Spark.What makes this book unique is its recipe-based approach, which will help you put your knowledge to use straight away and tackle common problems. You’ll be introduced to various data manipulation and data transformation solutions that can be applied to data, find out how to manage and optimize Delta tables, and get to grips with ingesting and processing streaming data. The book will also show you how to improve the performance problems of Apache Spark apps and Delta Lake. Advanced recipes later in the book will teach you how to use Databricks to implement DataOps and DevOps practices, as well as how to orchestrate and schedule data pipelines using Databricks Workflows. You’ll also go through the full process of setup and configuration of the Unity Catalog for data governance.By the end of this book, you’ll be well-versed in building reliable and scalable data pipelines using modern data engineering technologies.
Roberto Zagni
dbt Cloud helps professional analytics engineers automate the application of powerful and proven patterns to transform data from ingestion to delivery, enabling real DataOps.This book begins by introducing you to dbt and its role in the data stack, along with how it uses simple SQL to build your data platform, helping you and your team work better together. You’ll find out how to leverage data modeling, data quality, master data management, and more to build a simple-to-understand and future-proof solution. As you advance, you’ll explore the modern data stack, understand how data-related careers are changing, and see how dbt enables this transition into the emerging role of an analytics engineer. The chapters help you build a sample project using the free version of dbt Cloud, Snowflake, and GitHub to create a professional DevOps setup with continuous integration, automated deployment, ELT run, scheduling, and monitoring, solving practical cases you encounter in your daily work.By the end of this dbt book, you’ll be able to build an end-to-end pragmatic data platform by ingesting data exported from your source systems, coding the needed transformations, including master data and the desired business rules, and building well-formed dimensional models or wide tables that’ll enable you to build reports with the BI tool of your choice.
Adi Wijaya, António Vilares
The second edition of Data Engineering with Google Cloud builds upon the success of the first edition by offering enhanced clarity and depth to data professionals navigating the intricate landscape of data engineering.Beyond its foundational lessons, this new edition delves into the essential realm of data governance within Google Cloud, providing you with invaluable insights into managing and optimizing data resources effectively. Written by a Data Strategic Cloud Engineer at Google, this book helps you stay ahead of the curve by guiding you through the latest technological advancements in the Google Cloud ecosystem. You’ll cover essential aspects, from exploring Cloud Composer 2 to the evolution of Airflow 2.5. Additionally, you’ll explore how to work with cutting-edge tools like Dataform, DLP, Dataplex, Dataproc Serverless, and Datastream to perform data governance on datasets.By the end of this book, you'll be equipped to navigate the ever-evolving world of data engineering on Google Cloud, from foundational principles to cutting-edge practices.
Adi Wijaya
With this book, you'll understand how the highly scalable Google Cloud Platform (GCP) enables data engineers to create end-to-end data pipelines right from storing and processing data and workflow orchestration to presenting data through visualization dashboards.Starting with a quick overview of the fundamental concepts of data engineering, you'll learn the various responsibilities of a data engineer and how GCP plays a vital role in fulfilling those responsibilities. As you progress through the chapters, you'll be able to leverage GCP products to build a sample data warehouse using Cloud Storage and BigQuery and a data lake using Dataproc. The book gradually takes you through operations such as data ingestion, data cleansing, transformation, and integrating data with other sources. You'll learn how to design IAM for data governance, deploy ML pipelines with the Vertex AI, leverage pre-built GCP models as a service, and visualize data with Google Data Studio to build compelling reports. Finally, you'll find tips on how to boost your career as a data engineer, take the Professional Data Engineer certification exam, and get ready to become an expert in data engineering with GCP.By the end of this data engineering book, you'll have developed the skills to perform core data engineering tasks and build efficient ETL data pipelines with GCP.
Eric Tome, Rupam Bhattacharjee, David Radford
Most data engineers know that performance issues in a distributed computing environment can easily lead to issues impacting the overall efficiency and effectiveness of data engineering tasks. While Python remains a popular choice for data engineering due to its ease of use, Scala shines in scenarios where the performance of distributed data processing is paramount.This book will teach you how to leverage the Scala programming language on the Spark framework and use the latest cloud technologies to build continuous and triggered data pipelines. You’ll do this by setting up a data engineering environment for local development and scalable distributed cloud deployments using data engineering best practices, test-driven development, and CI/CD. You’ll also get to grips with DataFrame API, Dataset API, and Spark SQL API and its use. Data profiling and quality in Scala will also be covered, alongside techniques for orchestrating and performance tuning your end-to-end pipelines to deliver data to your end users. By the end of this book, you will be able to build streaming and batch data pipelines using Scala while following software engineering best practices.
Eric Tome, Rupam Bhattacharjee, David Radford
Most data engineers know that performance issues in a distributed computing environment can easily lead to issues impacting the overall efficiency and effectiveness of data engineering tasks. While Python remains a popular choice for data engineering due to its ease of use, Scala shines in scenarios where the performance of distributed data processing is paramount.This book will teach you how to leverage the Scala programming language on the Spark framework and use the latest cloud technologies to build continuous and triggered data pipelines. You’ll do this by setting up a data engineering environment for local development and scalable distributed cloud deployments using data engineering best practices, test-driven development, and CI/CD. You’ll also get to grips with DataFrame API, Dataset API, and Spark SQL API and its use. Data profiling and quality in Scala will also be covered, alongside techniques for orchestrating and performance tuning your end-to-end pipelines to deliver data to your end users. By the end of this book, you will be able to build streaming and batch data pipelines using Scala while following software engineering best practices.
Data Governance Handbook. A practical approach to building trust in data
Wendy S. Batchelder
2.5 quintillion bytes! This is the amount of data being generated every single day across the globe. As this number continues to grow, understanding and managing data becomes more complex. Data professionals know that it’s their responsibility to navigate this complexity and ensure effective governance, empowering businesses with the right data, at the right time, and with the right controls.If you are a data professional, this book will equip you with valuable guidance to conquer data governance complexities with ease. Written by a three-time chief data officer in global Fortune 500 companies, the Data Governance Handbook is an exhaustive guide to understanding data governance, its key components, and how to successfully position solutions in a way that translates into tangible business outcomes.By the end, you’ll be able to successfully pitch and gain support for your data governance program, demonstrating tangible outcomes that resonate with key stakeholders.*Email sign-up and proof of purchase required
Data Governance Handbook. A practical approach to building trust in data
Wendy S. Batchelder
2.5 quintillion bytes! This is the amount of data being generated every single day across the globe. As this number continues to grow, understanding and managing data becomes more complex. Data professionals know that it’s their responsibility to navigate this complexity and ensure effective governance, empowering businesses with the right data, at the right time, and with the right controls.If you are a data professional, this book will equip you with valuable guidance to conquer data governance complexities with ease. Written by a three-time chief data officer in global Fortune 500 companies, the Data Governance Handbook is an exhaustive guide to understanding data governance, its key components, and how to successfully position solutions in a way that translates into tangible business outcomes.By the end, you’ll be able to successfully pitch and gain support for your data governance program, demonstrating tangible outcomes that resonate with key stakeholders.*Email sign-up and proof of purchase required
Gláucia Esppenchutz
Data Ingestion with Python Cookbook offers a practical approach to designing and implementing data ingestion pipelines. It presents real-world examples with the most widely recognized open source tools on the market to answer commonly asked questions and overcome challenges.You’ll be introduced to designing and working with or without data schemas, as well as creating monitored pipelines with Airflow and data observability principles, all while following industry best practices. The book also addresses challenges associated with reading different data sources and data formats. As you progress through the book, you’ll gain a broader understanding of error logging best practices, troubleshooting techniques, data orchestration, monitoring, and storing logs for further consultation.By the end of the book, you’ll have a fully automated set that enables you to start ingesting and monitoring your data pipeline effortlessly, facilitating seamless integration with subsequent stages of the ETL process.
Pradeep Pasupuleti, Beulah Salome Purra
A Data Lake is a highly scalable platform for storing huge volumes of multistructured data from disparate sources with centralized data management services. This book explores the potential of Data Lakes and explores architectural approaches to building data lakes that ingest, index, manage, and analyze massive amounts of data using batch and real-time processing frameworks. It guides you on how to go about building a Data Lake that is managed by Hadoop and accessed as required by other Big Data applications.This book will guide readers (using best practices) in developing Data Lake's capabilities. It will focus on architect data governance, security, data quality, data lineage tracking, metadata management, and semantic data tagging. By the end of this book, you will have a good understanding of building a Data Lake for Big Data.
Data Lake for Enterprises. Lambda Architecture for building enterprise data systems
Vivek Mishra, Tomcy John, Pankaj Misra
The term Data Lake has recently emerged as a prominent term in the big data industry. Data scientists can make use of it in deriving meaningful insights that can be used by businesses to redefine or transform the way they operate. Lambda architecture is also emerging as one of the very eminent patterns in the big data landscape, as it not only helps to derive useful information from historical data but also correlates real-time data to enable business to take critical decisions. This book tries to bring these two important aspects — data lake and lambda architecture—together.This book is divided into three main sections. The first introduces you to the concept of data lakes, the importance of data lakes in enterprises, and getting you up-to-speed with the Lambda architecture. The second section delves into the principal components of building a data lake using the Lambda architecture. It introduces you to popular big data technologies such as Apache Hadoop, Spark, Sqoop, Flume, and ElasticSearch. The third section is a highly practical demonstration of putting it all together, and shows you how an enterprise data lake can be implemented, along with several real-world use-cases. It also shows you how other peripheral components can be added to the lake to make it more efficient.By the end of this book, you will be able to choose the right big data technologies using the lambda architectural patterns to build your enterprise data lake.
Angelika Klidas, Kevin Hanegan
Data is more than a mere commodity in our digital world. It is the ebb and flow of our modern existence. Individuals, teams, and enterprises working with data can unlock a new realm of possibilities. And the resultant agility, growth, and inevitable success have one origin—data literacy.This comprehensive guide is written by two data literacy pioneers, each with a thorough footprint within the data and analytics commercial world and lectures at top universities in the US and the Netherlands. Complete with best practices, practical models, and real-world examples, Data Literacy in Practice will help you start making your data work for you by building your understanding of data literacy basics and accelerating your journey to independently uncovering insights.You’ll learn the four-pillar model that underpins all data and analytics and explore concepts such as measuring data quality, setting up a pragmatic data management environment, choosing the right graphs for your readers, and questioning your insights.By the end of the book, you'll be equipped with a combination of skills and mindset as well as with tools and frameworks that will allow you to find insights and meaning within your data for data-informed decision making.