Analiza danych
John Sukup
Trusted by data scientists, ML engineers, and software developers alike, scikit-learn offers a versatile, user-friendly framework for implementing a wide range of ML algorithms, enabling the efficient development and deployment of predictive models in real-world applications. This third edition of scikit-learn Cookbook will help you master ML with real-world examples and scikit-learn 1.5 features.This updated edition takes you on a journey from understanding the fundamentals of ML and data preprocessing, through implementing advanced algorithms and techniques, to deploying and optimizing ML models in production. Along the way, you’ll explore practical, step-by-step recipes that cover everything from feature engineering and model selection to hyperparameter tuning and model evaluation, all using scikit-learn.By the end of this book, you’ll have gained the knowledge and skills needed to confidently build, evaluate, and deploy sophisticated ML models using scikit-learn, ready to tackle a wide range of data-driven challenges.*Email sign-up and proof of purchase required
Luiz Felipe Martins, V Kishore Ayyadevara, Ruben...
With the SciPy Stack, you get the power to effectively process, manipulate, and visualize your data using the popular Python language. Utilizing SciPy correctly can sometimes be a very tricky proposition. This book provides the right techniques so you can use SciPy to perform different data science tasks with ease.This book includes hands-on recipes for using the different components of the SciPy Stack such as NumPy, SciPy, matplotlib, and pandas, among others. You will use these libraries to solve real-world problems in linear algebra, numerical analysis, data visualization, and much more. The recipes included in the book will ensure you get a practical understanding not only of how a particular feature in SciPy Stack works, but also of its application to real-world problems. The independent nature of the recipes also ensure that you can pick up any one and learn about a particular feature of SciPy without reading through the other recipes, thus making the book a very handy and useful guide.
Vishal Pathak, Subramanya Vajiraya, Noritaka Sekiyama, Tomohiro...
Organizations these days have gravitated toward services such as AWS Glue that undertake undifferentiated heavy lifting and provide serverless Spark, enabling you to create and manage data lakes in a serverless fashion. This guide shows you how AWS Glue can be used to solve real-world problems along with helping you learn about data processing, data integration, and building data lakes.Beginning with AWS Glue basics, this book teaches you how to perform various aspects of data analysis such as ad hoc queries, data visualization, and real-time analysis using this service. It also provides a walk-through of CI/CD for AWS Glue and how to shift left on quality using automated regression tests. You’ll find out how data security aspects such as access control, encryption, auditing, and networking are implemented, as well as getting to grips with useful techniques such as picking the right file format, compression, partitioning, and bucketing. As you advance, you’ll discover AWS Glue features such as crawlers, Lake Formation, governed tables, lineage, DataBrew, Glue Studio, and custom connectors. The concluding chapters help you to understand various performance tuning, troubleshooting, and monitoring options.By the end of this AWS book, you’ll be able to create, manage, troubleshoot, and deploy ETL pipelines using AWS Glue.
Aaron Ploetz, Devram Kandhare, Sudarshan Kadambi, Xun...
This is the golden age of open source NoSQL databases. With enterprises having to work with large amounts of unstructured data and moving away from expensive monolithic architecture, the adoption of NoSQL databases is rapidly increasing. Being familiar with the popular NoSQL databases and knowing how to use them is a must for budding DBAs and developers.This book introduces you to the different types of NoSQL databases and gets you started with seven of the most popular NoSQL databases used by enterprises today. We start off with a brief overview of what NoSQL databases are, followed by an explanation of why and when to use them. The book then covers the seven most popular databases in each of these categories: MongoDB, Amazon DynamoDB, Redis, HBase, Cassandra, In?uxDB, and Neo4j. The book doesn't go into too much detail about each database but teachesyou enough to get started with them.By the end of this book, you will have a thorough understanding of the different NoSQL databases and their functionalities, empowering you to select and use the rightdatabase according to your needs.
Siatka danych. Nowoczesna koncepcja samoobsługowej infrastruktury danych
Zhamak Dehghani
Dostęp do danych jest warunkiem rozwoju niejednej organizacji. Aby w pełni skorzystać z ich potencjału i uzyskać dzięki nim konkretną wartość, konieczne jest odpowiednie zarządzanie danymi. Obecnie stosowane rozwiązania w tym zakresie nie nadążają już za złożonością dzisiejszych organizacji, rozprzestrzenianiem się źródeł danych i rosnącymi aspiracjami inżynierów, którzy rozwijają techniki sztucznej inteligencji i analizy danych. Odpowiedzią na te potrzeby może być siatka danych (Data Mesh), jednak praktyczna implementacja tej koncepcji wymaga istotnej zmiany myślenia. Ta książka szczegółowo wyjaśnia paradygmat siatki danych, a przy tym koncentruje się na jego praktycznym zastosowaniu. Zgodnie z tym nowatorskim podejściem dane należy traktować jako produkt, a dziedziny - jako główne zagadnienie. Poza wyjaśnieniem paradygmatu opisano tu zasady projektowania wysokopoziomowej architektury komponentów siatki danych, a także przedstawiono wskazówki i porady dotyczące ewolucyjnej realizacji siatki danych w organizacji. Tematyka ta została potraktowana wszechstronnie: omówiono kwestie technologiczne, organizacyjne, jak również socjologiczne i kulturowe. Dzięki temu jest to cenna lektura zarówno dla architektów i inżynierów, jak i dla badaczy, analityków danych, wreszcie dla liderów i kierowników zespołów. W książce: wyczerpujące wprowadzenie do paradygmatu siatki danych siatka danych i jej komponenty projektowanie architektury siatki danych opracowywanie i realizacja strategii siatki danych zdecentralizowany model własności danych przejście z hurtowni i jezior danych do rozproszonej siatki danych Siatka danych: kolejny etap rozwoju technologii big data!
Sakti Mishra
Amazon EMR, formerly Amazon Elastic MapReduce, provides a managed Hadoop cluster in Amazon Web Services (AWS) that you can use to implement batch or streaming data pipelines. By gaining expertise in Amazon EMR, you can design and implement data analytics pipelines with persistent or transient EMR clusters in AWS.This book is a practical guide to Amazon EMR for building data pipelines. You'll start by understanding the Amazon EMR architecture, cluster nodes, features, and deployment options, along with their pricing. Next, the book covers the various big data applications that EMR supports. You'll then focus on the advanced configuration of EMR applications, hardware, networking, security, troubleshooting, logging, and the different SDKs and APIs it provides. Later chapters will show you how to implement common Amazon EMR use cases, including batch ETL with Spark, real-time streaming with Spark Streaming, and handling UPSERT in S3 Data Lake with Apache Hudi. Finally, you'll orchestrate your EMR jobs and strategize on-premises Hadoop cluster migration to EMR. In addition to this, you'll explore best practices and cost optimization techniques while implementing your data analytics pipeline in EMR.By the end of this book, you'll be able to build and deploy Hadoop- or Spark-based apps on Amazon EMR and also migrate your existing on-premises Hadoop workloads to AWS.
Anindita Mahapatra, Doug May
Delta helps you generate reliable insights at scale and simplifies architecture around data pipelines, allowing you to focus primarily on refining the use cases being worked on. This is especially important when you consider that existing architecture is frequently reused for new use cases.In this book, you’ll learn about the principles of distributed computing, data modeling techniques, and big data design patterns and templates that help solve end-to-end data flow problems for common scenarios and are reusable across use cases and industry verticals. You’ll also learn how to recover from errors and the best practices around handling structured, semi-structured, and unstructured data using Delta. After that, you’ll get to grips with features such as ACID transactions on big data, disciplined schema evolution, time travel to help rewind a dataset to a different time or version, and unified batch and streaming capabilities that will help you build agile and robust data products.By the end of this Delta book, you’ll be able to use Delta as the foundational block for creating analytics-ready data that fuels all AI/BI use cases.