Hadoop in Practice


Author: Alex Holmes
Publisher: Manning Publications
ISBN: 9781617292224
Category: Computers
Page: 487
View: 5651
DOWNLOAD NOW »
Annotation 'Summary Hadoop in Practice' provides over 100 tested, instantly useful techniques that will help you conquer big data, using Hadoop.

Hadoop in Action


Author: Chuck Lam,Mark Davis,Ajit Gaddam
Publisher: Manning Publications
ISBN: 9781617291227
Category:
Page: 525
View: 559
DOWNLOAD NOW »
The massive datasets required for most modern businesses are too large to safely store and efficiently process on a single server. Hadoop is an open source data processing framework that provides a distributed file system that can manage data stored across clusters of servers and implements the MapReduce data processing model so that users can effectively query and utilize big data. The new Hadoop 2.0 is a stable, enterprise-ready platform supported by a rich ecosystem of tools and related technologies such as Pig, Hive, YARN, Spark, Tez, and many more. Hadoop in Action, Second Edition, provides a comprehensive introduction to Hadoop and shows how to write programs in the MapReduce style. It starts with a few easy examples and then moves quickly to show how Hadoop can be used in more complex data analysis tasks. It covers how YARN, new in Hadoop 2, simplifies and supercharges resource management to make streaming and real-time applications more feasible. Included are best practices and design patterns of MapReduce programming. The book expands on the first edition by enhancing coverage of important Hadoop 2 concepts and systems, and by providing new chapters on data management and data science that reinforce a practical understanding of Hadoop. Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications.

Professional Hadoop Solutions


Author: Boris Lublinsky,Kevin T. Smith,Alexey Yakubovich
Publisher: John Wiley & Sons
ISBN: 1118824180
Category: Computers
Page: 504
View: 8457
DOWNLOAD NOW »
The go-to guidebook for deploying Big Data solutions with Hadoop Today's enterprise architects need to understand how the Hadoop frameworks and APIs fit together, and how they can be integrated to deliver real-world solutions. This book is a practical, detailed guide to building and implementing those solutions, with code-level instruction in the popular Wrox tradition. It covers storing data with HDFS and Hbase, processing data with MapReduce, and automating data processing with Oozie. Hadoop security, running Hadoop with Amazon Web Services, best practices, and automating Hadoop processes in real time are also covered in depth. With in-depth code examples in Java and XML and the latest on recent additions to the Hadoop ecosystem, this complete resource also covers the use of APIs, exposing their inner workings and allowing architects and developers to better leverage and customize them. The ultimate guide for developers, designers, and architects who need to build and deploy Hadoop applications Covers storing and processing data with various technologies, automating data processing, Hadoop security, and delivering real-time solutions Includes detailed, real-world examples and code-level guidelines Explains when, why, and how to use these tools effectively Written by a team of Hadoop experts in the programmer-to-programmer Wrox style Professional Hadoop Solutions is the reference enterprise architects and developers need to maximize the power of Hadoop.

Apache Hadoop YARN

Moving beyond MapReduce and Batch Processing with Apache Hadoop 2
Author: Arun Murthy,Vinod Vavilapalli,Douglas Eadline,Joseph Niemiec,Jeff Markham
Publisher: Addison-Wesley Professional
ISBN: 0133441911
Category: Computers
Page: 400
View: 872
DOWNLOAD NOW »
“This book is a critically needed resource for the newly released Apache Hadoop 2.0, highlighting YARN as the significant breakthrough that broadens Hadoop beyond the MapReduce paradigm.” —From the Foreword by Raymie Stata, CEO of Altiscale The Insider’s Guide to Building Distributed, Big Data Applications with Apache Hadoop™ YARN Apache Hadoop is helping drive the Big Data revolution. Now, its data processing has been completely overhauled: Apache Hadoop YARN provides resource management at data center scale and easier ways to create distributed applications that process petabytes of data. And now in Apache Hadoop™ YARN, two Hadoop technical leaders show you how to develop new applications and adapt existing code to fully leverage these revolutionary advances. YARN project founder Arun Murthy and project lead Vinod Kumar Vavilapalli demonstrate how YARN increases scalability and cluster utilization, enables new programming models and services, and opens new options beyond Java and batch processing. They walk you through the entire YARN project lifecycle, from installation through deployment. You’ll find many examples drawn from the authors’ cutting-edge experience—first as Hadoop’s earliest developers and implementers at Yahoo! and now as Hortonworks developers moving the platform forward and helping customers succeed with it. Coverage includes YARN’s goals, design, architecture, and components—how it expands the Apache Hadoop ecosystem Exploring YARN on a single node Administering YARN clusters and Capacity Scheduler Running existing MapReduce applications Developing a large-scale clustered YARN application Discovering new open source frameworks that run under YARN

Hadoop Real-World Solutions Cookbook


Author: Tanmay Deshpande
Publisher: Packt Publishing Ltd
ISBN: 1784398004
Category: Computers
Page: 290
View: 6057
DOWNLOAD NOW »
Over 90 hands-on recipes to help you learn and master the intricacies of Apache Hadoop 2.X, YARN, Hive, Pig, Oozie, Flume, Sqoop, Apache Spark, and Mahout About This Book Implement outstanding Machine Learning use cases on your own analytics models and processes. Solutions to common problems when working with the Hadoop ecosystem. Step-by-step implementation of end-to-end big data use cases. Who This Book Is For Readers who have a basic knowledge of big data systems and want to advance their knowledge with hands-on recipes. What You Will Learn Installing and maintaining Hadoop 2.X cluster and its ecosystem. Write advanced Map Reduce programs and understand design patterns. Advanced Data Analysis using the Hive, Pig, and Map Reduce programs. Import and export data from various sources using Sqoop and Flume. Data storage in various file formats such as Text, Sequential, Parquet, ORC, and RC Files. Machine learning principles with libraries such as Mahout Batch and Stream data processing using Apache Spark In Detail Big data is the current requirement. Most organizations produce huge amount of data every day. With the arrival of Hadoop-like tools, it has become easier for everyone to solve big data problems with great efficiency and at minimal cost. Grasping Machine Learning techniques will help you greatly in building predictive models and using this data to make the right decisions for your organization. Hadoop Real World Solutions Cookbook gives readers insights into learning and mastering big data via recipes. The book not only clarifies most big data tools in the market but also provides best practices for using them. The book provides recipes that are based on the latest versions of Apache Hadoop 2.X, YARN, Hive, Pig, Sqoop, Flume, Apache Spark, Mahout and many more such ecosystem tools. This real-world-solution cookbook is packed with handy recipes you can apply to your own everyday issues. Each chapter provides in-depth recipes that can be referenced easily. This book provides detailed practices on the latest technologies such as YARN and Apache Spark. Readers will be able to consider themselves as big data experts on completion of this book. This guide is an invaluable tutorial if you are planning to implement a big data warehouse for your business. Style and approach An easy-to-follow guide that walks you through world of big data. Each tool in the Hadoop ecosystem is explained in detail and the recipes are placed in such a manner that readers can implement them sequentially. Plenty of reference links are provided for advanced reading.

Hadoop MapReduce Cookbook


Author: Srinath Perera
Publisher: Packt Publishing Ltd
ISBN: 1849517290
Category: Computers
Page: 300
View: 6800
DOWNLOAD NOW »
Individual self-contained code recipes. Solve specific problems using individual recipes, or work through the book to develop your capabilities. If you are a big data enthusiast and striving to use Hadoop to solve your problems, this book is for you. Aimed at Java programmers with some knowledge of Hadoop MapReduce, this is also a comprehensive reference for developers and system admins who want to get up to speed using Hadoop.

Managing Data in Motion

Data Integration Best Practice Techniques and Technologies
Author: April Reeve
Publisher: Newnes
ISBN: 0123977916
Category: Computers
Page: 204
View: 3682
DOWNLOAD NOW »
Managing Data in Motion describes techniques that have been developed for significantly reducing the complexity of managing system interfaces and enabling scalable architectures. Author April Reeve brings over two decades of experience to present a vendor-neutral approach to moving data between computing environments and systems. Readers will learn the techniques, technologies, and best practices for managing the passage of data between computer systems and integrating disparate data together in an enterprise environment. The average enterprise's computing environment is comprised of hundreds to thousands computer systems that have been built, purchased, and acquired over time. The data from these various systems needs to be integrated for reporting and analysis, shared for business transaction processing, and converted from one format to another when old systems are replaced and new systems are acquired. The management of the "data in motion" in organizations is rapidly becoming one of the biggest concerns for business and IT management. Data warehousing and conversion, real-time data integration, and cloud and "big data" applications are just a few of the challenges facing organizations and businesses today. Managing Data in Motion tackles these and other topics in a style easily understood by business and IT managers as well as programmers and architects. Presents a vendor-neutral overview of the different technologies and techniques for moving data between computer systems including the emerging solutions for unstructured as well as structured data types Explains, in non-technical terms, the architecture and components required to perform data integration Describes how to reduce the complexity of managing system interfaces and enable a scalable data architecture that can handle the dimensions of "Big Data"

Data-intensive Text Processing with MapReduce


Author: Jimmy Lin,Chris Dyer
Publisher: Morgan & Claypool Publishers
ISBN: 1608453421
Category: Computers
Page: 165
View: 6912
DOWNLOAD NOW »
Our world is being revolutionized by data-driven methods: access to large amounts of data has generated new insights and opened exciting new opportunities in commerce, science, and computing applications. Processing the enormous quantities of data necessary for these advances requires large clusters, making distributed computing paradigms more crucial than ever. MapReduce is a programming model for expressing distributed computations on massive datasets and an execution framework for large-scale data processing on clusters of commodity servers. The programming model provides an easy-to-understand abstraction for designing scalable algorithms, while the execution framework transparently handles many system-level details, ranging from scheduling to synchronization to fault tolerance. This book focuses on MapReduce algorithm design, with an emphasis on text processing algorithms common in natural language processing, information retrieval, and machine learning. We introduce the notion of MapReduce design patterns, which represent general reusable solutions to commonly occurring problems across a variety of problem domains. This book not only intends to help the reader "think in MapReduce", but also discusses limitations of the programming model as well. This volume is a printed version of a work that appears in the Synthesis Digital Library of Engineering and Computer Science. Synthesis Lectures provide concise, original presentations of important research and development topics, published quickly, in digital and print formats. For more information visit www.morganclaypool.com

Big Data For Dummies


Author: Judith Hurwitz,Alan Nugent,Fern Halper,Marcia Kaufman
Publisher: John Wiley & Sons
ISBN: 1118644174
Category: Computers
Page: 336
View: 7755
DOWNLOAD NOW »
Find the right big data solution for your business or organization Big data management is one of the major challenges facing business, industry, and not-for-profit organizations. Data sets such as customer transactions for a mega-retailer, weather patterns monitored by meteorologists, or social network activity can quickly outpace the capacity of traditional data management tools. If you need to develop or manage big data solutions, you'll appreciate how these four experts define, explain, and guide you through this new and often confusing concept. You'll learn what it is, why it matters, and how to choose and implement solutions that work. Effectively managing big data is an issue of growing importance to businesses, not-for-profit organizations, government, and IT professionals Authors are experts in information management, big data, and a variety of solutions Explains big data in detail and discusses how to select and implement a solution, security concerns to consider, data storage and presentation issues, analytics, and much more Provides essential information in a no-nonsense, easy-to-understand style that is empowering Big Data For Dummies cuts through the confusion and helps you take charge of big data solutions for your organization.

Apache Hive Cookbook


Author: Hanish Bansal,Saurabh Chauhan,Shrey Mehrotra
Publisher: Packt Publishing Ltd
ISBN: 1782161090
Category: Computers
Page: 268
View: 3535
DOWNLOAD NOW »
Easy, hands-on recipes to help you understand Hive and its integration with frameworks that are used widely in today's big data world About This Book Grasp a complete reference of different Hive topics. Get to know the latest recipes in development in Hive including CRUD operations Understand Hive internals and integration of Hive with different frameworks used in today's world. Who This Book Is For The book is intended for those who want to start in Hive or who have basic understanding of Hive framework. Prior knowledge of basic SQL command is also required What You Will Learn Learn different features and offering on the latest Hive Understand the working and structure of the Hive internals Get an insight on the latest development in Hive framework Grasp the concepts of Hive Data Model Master the key concepts like Partition, Buckets and Statistics Know how to integrate Hive with other frameworks such as Spark, Accumulo, etc In Detail Hive was developed by Facebook and later open sourced in Apache community. Hive provides SQL like interface to run queries on Big Data frameworks. Hive provides SQL like syntax also called as HiveQL that includes all SQL capabilities like analytical functions which are the need of the hour in today's Big Data world. This book provides you easy installation steps with different types of metastores supported by Hive. This book has simple and easy to learn recipes for configuring Hive clients and services. You would also learn different Hive optimizations including Partitions and Bucketing. The book also covers the source code explanation of latest Hive version. Hive Query Language is being used by other frameworks including spark. Towards the end you will cover integration of Hive with these frameworks. Style and approach Starting with the basics and covering the core concepts with the practical usage, this book is a complete guide to learn and explore Hive offerings.

Hadoop: The Definitive Guide


Author: Tom White
Publisher: "O'Reilly Media, Inc."
ISBN: 1449338771
Category: Computers
Page: 688
View: 1643
DOWNLOAD NOW »
Ready to unlock the power of your data? With this comprehensive guide, you’ll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters. You’ll find illuminating case studies that demonstrate how Hadoop is used to solve specific problems. This third edition covers recent changes to Hadoop, including material on the new MapReduce API, as well as MapReduce 2 and its more flexible execution model (YARN). Store large datasets with the Hadoop Distributed File System (HDFS) Run distributed computations with MapReduce Use Hadoop’s data and I/O building blocks for compression, data integrity, serialization (including Avro), and persistence Discover common pitfalls and advanced features for writing real-world MapReduce programs Design, build, and administer a dedicated Hadoop cluster—or run Hadoop in the cloud Load data from relational databases into HDFS, using Sqoop Perform large-scale data processing with the Pig query language Analyze datasets with Hive, Hadoop’s data warehousing system Take advantage of HBase for structured and semi-structured data, and ZooKeeper for building distributed systems

Learning Akka


Author: Jason Goodwin
Publisher: Packt Publishing Ltd
ISBN: 1784393541
Category: Computers
Page: 274
View: 907
DOWNLOAD NOW »
Build fault tolerant concurrent and distributed applications with Akka About This Book Build networked applications that self-heal Scale out your applications to handle more traffic faster An easy-to-follow guide with a number of examples to ensure you get the best start with Akka Who This Book Is For This book is intended for beginner to intermediate Java or Scala developers who want to build applications to serve the high-scale user demands in computing today. If you need your applications to handle the ever-growing user bases and datasets with high performance demands, then this book is for you. Learning Akka will let you do more for your users with less code and less complexity, by building and scaling your networked applications with ease. What You Will Learn Use Akka to overcome the challenges of concurrent programming Resolve the issues faced in distributed computing with the help of Akka Scale applications to serve a high number of concurrent users Make your system fault-tolerant with self-healing applications Provide a timely response to users with easy concurrency Reduce hardware costs by building more efficient multi-user applications Maximise network efficiency by scaling it In Detail Software today has to work with more data, more users, more cores, and more servers than ever. Akka is a distributed computing toolkit that enables developers to build correct concurrent and distributed applications using Java and Scala with ease, applications that scale across servers and respond to failure by self-healing. As well as simplifying development, Akka enables multiple concurrency development patterns with particular support and architecture derived from Erlang's concept of actors (lightweight concurrent entities). Akka is written in Scala, which has become the programming language of choice for development on the Akka platform. Learning Akka aims to be a comprehensive walkthrough of Akka. This book will take you on a journey through all the concepts of Akka that you need in order to get started with concurrent and distributed applications and even build your own. Beginning with the concept of Actors, the book will take you through concurrency in Akka. Moving on to networked applications, this book will explain the common pitfalls in these difficult problem areas while teaching you how to use Akka to overcome these problems with ease. The book is an easy to follow example-based guide that will strengthen your basic knowledge of Akka and aid you in applying the same to real-world scenarios. Style and approach An easy-to-follow, example-based guide that will take you through building several networked-applications that work together while you are learning concurrent and distributed computing concepts. Each topic is explained while showing you how to design with Akka and how it is used to overcome common problems in applications. By showing Akka in context to the problems, it will help you understand what the common problems are in distributed applications and how to overcome them.

Programming Hive


Author: Edward Capriolo,Dean Wampler,Jason Rutherglen
Publisher: "O'Reilly Media, Inc."
ISBN: 1449319335
Category: Computers
Page: 328
View: 1605
DOWNLOAD NOW »
Describes the features and functions of Apache Hive, the data infrastructure for Hadoop.

MapReduce Design Patterns

Building Effective Algorithms and Analytics for Hadoop and Other Systems
Author: Donald Miner,Adam Shook
Publisher: "O'Reilly Media, Inc."
ISBN: 1449341985
Category: Computers
Page: 250
View: 5427
DOWNLOAD NOW »
Until now, design patterns for the MapReduce framework have been scattered among various research papers, blogs, and books. This handy guide brings together a unique collection of valuable MapReduce patterns that will save you time and effort regardless of the domain, language, or development framework you’re using. Each pattern is explained in context, with pitfalls and caveats clearly identified to help you avoid common design mistakes when modeling your big data architecture. This book also provides a complete overview of MapReduce that explains its origins and implementations, and why design patterns are so important. All code examples are written for Hadoop. Summarization patterns: get a top-level view by summarizing and grouping data Filtering patterns: view data subsets such as records generated from one user Data organization patterns: reorganize data to work with other systems, or to make MapReduce analysis easier Join patterns: analyze different datasets together to discover interesting relationships Metapatterns: piece together several patterns to solve multi-stage problems, or to perform several analytics in the same job Input and output patterns: customize the way you use Hadoop to load or store data "A clear exposition of MapReduce programs for common data processing patterns—this book is indespensible for anyone using Hadoop." --Tom White, author of Hadoop: The Definitive Guide

Learning Apache Drill


Author: Charles Givre,Paul Rogers
Publisher: O'Reilly Media
ISBN: 9781492032793
Category: Computers
Page: 175
View: 4777
DOWNLOAD NOW »
Apache Drill enables interactive analysis of massively large datasets, allowing you to execute SQL queries against data in many different data sources--including Hadoop and MongoDB clusters, HBase, or even your local file system--and get results quickly. With this practical guide, analysts and data scientists focused on business or research applications will learn how to incorporate Drill capabilities into complex programs, including how to use Drill queries to replace some MapReduce operations in a large-scale program. Drill committers Charles Givre and Paul Rogers provide an introduction to Drill and its ability to handle large files containing data in flexible formats with nested data structures and tables. You'll discover how this capability fills a gap in the Hadoop ecosystem. Additional topics show you how to: Prepare and organize data to maximize Drill performance Set expectations for Drill performance on different data types and volumes Reconcile Drill's schema-free features with schema-full JDBC and ODBC clients

Apache Oozie

The Workflow Scheduler for Hadoop
Author: Mohammad Kamrul Islam,Aravind Srinivasan
Publisher: "O'Reilly Media, Inc."
ISBN: 1449369774
Category: Computers
Page: 272
View: 9036
DOWNLOAD NOW »
Get a solid grounding in Apache Oozie, the workflow scheduler system for managing Hadoop jobs. With this hands-on guide, two experienced Hadoop practitioners walk you through the intricacies of this powerful and flexible platform, with numerous examples and real-world use cases. Once you set up your Oozie server, you’ll dive into techniques for writing and coordinating workflows, and learn how to write complex data pipelines. Advanced topics show you how to handle shared libraries in Oozie, as well as how to implement and manage Oozie’s security capabilities. Install and configure an Oozie server, and get an overview of basic concepts Journey through the world of writing and configuring workflows Learn how the Oozie coordinator schedules and executes workflows based on triggers Understand how Oozie manages data dependencies Use Oozie bundles to package several coordinator apps into a data pipeline Learn about security features and shared library management Implement custom extensions and write your own EL functions and actions Debug workflows and manage Oozie’s operational details

Machine Learning in Action


Author: Peter Harrington
Publisher: Manning Publications
ISBN: 9781617290183
Category: Computers
Page: 354
View: 2849
DOWNLOAD NOW »
Provides information on the concepts of machine theory, covering such topics as statistical data processing, data visualization, and forecasting.

Delivering Business Analytics

Practical Guidelines for Best Practice
Author: Evan Stubbs
Publisher: John Wiley & Sons
ISBN: 1118559444
Category: Business & Economics
Page: 368
View: 9083
DOWNLOAD NOW »
AVOID THE MISTAKES THAT OTHERS MAKE – LEARN WHAT LEADS TO BEST PRACTICE AND KICKSTART SUCCESS This groundbreaking resource provides comprehensive coverage across all aspects of business analytics, presenting proven management guidelines to drive sustainable differentiation. Through a rich set of case studies, author Evan Stubbs reviews solutions and examples to over twenty common problems spanning managing analytics assets and information, leveraging technology, nurturing skills, and defining processes. Delivering Business Analytics also outlines the Data Scientist’s Code, fifteen principles that when followed ensure constant movement towards effective practice. Practical advice is offered for addressing various analytics issues; the advantages and disadvantages of each issue’s solution; and how these solutions can optimally create organizational value. With an emphasis on real-world examples and pragmatic advice throughout, Delivering Business Analytics provides a reference guide on: The economic principles behind how business analytics leads to competitive differentiation The elements which define best practice The Data Scientist’s Code, fifteen management principles that when followed help teams move towards best practice Practical solutions and frequent missteps to twenty-four common problems across people and process, systems and assets, and data and decision-making Drawing on the successes and failures of countless organizations, author Evan Stubbs provides a densely packed practical reference on how to increase the odds of success in designing business analytics systems and managing teams of data scientists. Uncover what constitutes best practice in business analytics and start achieving it with Delivering Business Analytics.

R in a Nutshell

A Desktop Quick Reference
Author: Joseph Adler
Publisher: "O'Reilly Media, Inc."
ISBN: 1449358225
Category: Computers
Page: 724
View: 7660
DOWNLOAD NOW »
If you’re considering R for statistical computing and data visualization, this book provides a quick and practical guide to just about everything you can do with the open source R language and software environment. You’ll learn how to write R functions and use R packages to help you prepare, visualize, and analyze data. Author Joseph Adler illustrates each process with a wealth of examples from medicine, business, and sports. Updated for R 2.14 and 2.15, this second edition includes new and expanded chapters on R performance, the ggplot2 data visualization package, and parallel R computing with Hadoop. Get started quickly with an R tutorial and hundreds of examples Explore R syntax, objects, and other language details Find thousands of user-contributed R packages online, including Bioconductor Learn how to use R to prepare data for analysis Visualize your data with R’s graphics, lattice, and ggplot2 packages Use R to calculate statistical fests, fit models, and compute probability distributions Speed up intensive computations by writing parallel R programs for Hadoop Get a complete desktop reference to R

Tika in Action


Author: Chris A. Mattmann,Jukka L. Zitting
Publisher: Manning Publications Company
ISBN: 9781935182856
Category: Computers
Page: 229
View: 6917
DOWNLOAD NOW »
The information trapped in text files, PDFs, and other digital content is a valuable information asset that can be very difficult to discover and use. Apache Tika is an open source toolkit that makes it easy for search engines, content management systems and other applications to detect and extract content from digital documents in all major file formats. Tika in Actionis a hands-on guide for developers working with search engines, content management systems and other similar applications who want to exploit the information locked in digital documents. It introduces the world of mining text and binary documents as well as other information sources. The book shows where Tika fits within this landscape and how readers can use Tika to build and extend applications. The book's many case studies give real-world experience from domains ranging from search engines to digital asset management and scientific data processing.