PySpark - Python Spark Hadoop coding framework & testing
Quick Facts
| Particular | Details |
| --- | --- |
| Medium of instructions | English |
| Mode of learning | Self study |
| Mode of delivery | Video and text based |
Course overview
Apache Spark is an open-source distributed computing framework and collection of libraries for real-time, large-scale data processing, and PySpark is its Python API. Learning PySpark helps individuals build more configurable pipelines and analyses. The Hands-On PySpark for Big Data Analysis online certification was developed by Packt Publishing and is made available through Udemy, an education platform that offers programs to help participants advance their technical knowledge.
The Hands-On PySpark for Big Data Analysis online course is a short-term program comprising 3.5 hours of learning material and 26 downloadable resources. It is intended for participants who want to learn methods for analyzing big data sets and building big data platforms for machine learning models and business intelligence applications. The training covers topics such as data wrangling, data analysis, data cleaning, and structured data operations, and explains the functionality of Spark notebooks, Spark SQL, and resilient distributed datasets.
The highlights
- Certificate of completion
- Self-paced course
- 3.5 hours of pre-recorded video content
- 26 downloadable resources
Program offerings
- Online course
- Learning resources
- 30-day money-back guarantee
- Unlimited access
- Accessible on mobile devices and TV
Course and certificate fees
Fees information
- Certificate availability: Yes
- Certificate providing authority: Udemy
Who it is for
What you will learn
After completing the Hands-On PySpark for Big Data Analysis certification course, participants will understand the functionality of PySpark for big data analytics. Participants will explore patterns with Spark SQL to improve their business intelligence work and increase productivity. They will learn the concepts involved in data wrangling, data cleaning, and data analysis of big data, acquire techniques for structured data operations, and study Spark notebooks, MLlib, and resilient distributed datasets.
The syllabus
Introduction
- Introduction
- What is Big Data Spark?
Setting up Hadoop Spark development environment
- Environment setup steps
- Installing Python
- Installing PyCharm
- Creating a project in the main Python environment
- Installing JDK
- Installing Spark 3 & Hadoop
- Running PySpark in the Console
- PyCharm PySpark Hello DataFrame
- PyCharm Hadoop Spark programming
- Special instructions for Mac users
- Quick tips - winutils permission
- Python basics
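As a quick check that the environment above works, here is a minimal "Hello DataFrame" sketch of the kind this module builds in PyCharm; the app name and sample data are illustrative, not the course's exact code.

```python
# Minimal "Hello DataFrame" sketch (illustrative, not the course's exact code).
# Assumes Python, the JDK, Spark 3/Hadoop, and the pyspark package are
# installed as covered above (Windows users also need winutils configured).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("HelloDataFrame")
         .master("local[*]")   # run Spark locally on all available cores
         .getOrCreate())

# Build a small DataFrame from an in-memory list of tuples and print it
df = spark.createDataFrame([("alice", 1), ("bob", 2)], ["name", "id"])
df.show()

spark.stop()
```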
Creating a PySpark coding framework
- Structuring code with classes and methods
- How Spark works
- Creating and reusing SparkSession
- Spark DataFrame
- Separating out Ingestion, Transformation and Persistence code
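One plausible shape for such a framework is sketched below; the class and method names are hypothetical, not taken from the course.

```python
# Sketch of separating ingestion, transformation, and persistence into
# classes that share one SparkSession (names are hypothetical).
from pyspark.sql import SparkSession, DataFrame


def get_spark_session(app_name: str = "Pipeline") -> SparkSession:
    # getOrCreate() returns the running session if one exists, so every
    # component can call this helper and reuse the same SparkSession.
    return SparkSession.builder.appName(app_name).getOrCreate()


class Ingest:
    def __init__(self, spark: SparkSession):
        self.spark = spark

    def read(self) -> DataFrame:
        # Stand-in for reading from a real source such as Hive
        return self.spark.createDataFrame(
            [("alice", 30), ("bob", 10)], ["name", "age"])


class Transform:
    def run(self, df: DataFrame) -> DataFrame:
        return df.filter(df.age > 18)


class Persist:
    def save(self, df: DataFrame) -> None:
        df.write.mode("overwrite").parquet("output/adults")


if __name__ == "__main__":
    spark = get_spark_session()
    Persist().save(Transform().run(Ingest(spark).read()))
```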
Logging and Error Handling
- Python Logging
- Managing log level through a configuration file
- Having custom logger for each Python class
- Error Handling with try except and raise
- Logging using log4p and log4python packages
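A minimal sketch of per-class loggers and try/except/raise using only the standard-library logging module follows; the course also covers the log4p and log4python packages, which are not shown here.

```python
# Per-class loggers plus try/except/raise, standard library only.
import logging
import logging.config

# The log level could instead come from a configuration file, e.g.:
#   logging.config.fileConfig("logging.conf")
logging.basicConfig(level=logging.INFO)


class Transformer:
    def __init__(self):
        # One logger per class, named after the class, so each log line
        # identifies the component that produced it.
        self.logger = logging.getLogger(self.__class__.__name__)

    def transform(self, rows):
        try:
            self.logger.info("Transforming %d rows", len(rows))
            if not rows:
                raise ValueError("no rows to transform")
            return [row.upper() for row in rows]
        except ValueError:
            self.logger.exception("Transformation failed")
            raise  # re-raise so the caller decides how to recover


print(Transformer().transform(["a", "b"]))
```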
Creating a Data Pipeline with Hadoop Spark and PostgreSQL
- Ingesting data from Hive
- Transforming ingested data
- Installing PostgreSQL
- Spark PostgreSQL interaction with Psycopg2 adapter
- Spark PostgreSQL interaction with JDBC driver
- Persisting transformed data in PostgreSQL
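The overall shape of this pipeline might look like the sketch below; the table names, credentials, and JDBC URL are placeholders, and the PostgreSQL JDBC driver jar must be on Spark's classpath (for example via spark-submit --jars).

```python
# Read from Hive, transform, and persist to PostgreSQL over JDBC
# (all names and connection details are placeholders).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("HiveToPostgres")
         .enableHiveSupport()   # needed to read Hive tables
         .getOrCreate())

# Ingest: pull a Hive table into a DataFrame
df = spark.sql("SELECT * FROM course_db.enrollments")

# Transform: a simple aggregation as a stand-in for real logic
summary = df.groupBy("course_id").count()

# Persist: write the result to PostgreSQL through the JDBC data source
(summary.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/coursedb")
    .option("dbtable", "enrollment_summary")
    .option("user", "postgres")
    .option("password", "postgres")
    .option("driver", "org.postgresql.Driver")
    .mode("overwrite")
    .save())
```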
Reading configuration from properties file
- Organizing code further
- Reading configuration from a property file
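One way to read Java-style key=value .properties files from Python is sketched below using the standard-library configparser (the course may use a different parser; the keys shown are hypothetical).

```python
# .properties files have no [section] headers, so prepend a dummy one
# before handing the text to configparser.
import configparser


def read_properties(path: str) -> dict:
    parser = configparser.ConfigParser()
    with open(path) as f:
        parser.read_string("[default]\n" + f.read())
    return dict(parser["default"])


# Example pipeline.properties contents (hypothetical keys):
#   hive.table = course_db.enrollments
#   postgres.url = jdbc:postgresql://localhost:5432/coursedb
config = read_properties("pipeline.properties")
print(config["postgres.url"])
```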
Unit testing PySpark application
- Python unittest framework
- Unit testing PySpark transformation logic
- Unit testing an error
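A minimal sketch of this style of test with Python's unittest framework is shown below; the transformation under test is a stand-in for the project's real logic.

```python
# Unit tests for PySpark transformation logic, plus a test of an
# expected error, using a shared local SparkSession.
import unittest
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException


class TransformTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # One local SparkSession shared by every test in the class
        cls.spark = (SparkSession.builder
                     .master("local[1]")
                     .appName("tests")
                     .getOrCreate())

    @classmethod
    def tearDownClass(cls):
        cls.spark.stop()

    def test_filters_out_minors(self):
        df = self.spark.createDataFrame(
            [("alice", 30), ("bob", 10)], ["name", "age"])
        result = df.filter(df.age > 18).collect()
        self.assertEqual([row.name for row in result], ["alice"])

    def test_missing_column_raises(self):
        df = self.spark.createDataFrame([("alice", 30)], ["name", "age"])
        with self.assertRaises(AnalysisException):
            df.select("no_such_column").collect()


if __name__ == "__main__":
    unittest.main()
```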
spark-submit
- PySpark spark-submit
- Thank you
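For reference, a job script written to be launched with spark-submit can be as small as the sketch below; the file name, jar, and flags in the comment are illustrative.

```python
# Launched with something like (illustrative):
#   spark-submit --master "local[*]" \
#       --jars postgresql-42.6.0.jar \
#       pipeline_job.py
from pyspark.sql import SparkSession


def main():
    # spark-submit supplies the runtime; the script only builds a session
    spark = SparkSession.builder.appName("PipelineJob").getOrCreate()
    spark.createDataFrame([("ok", 1)], ["status", "code"]).show()
    spark.stop()


if __name__ == "__main__":
    main()
```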
Appendix - PySpark on Colab and DataFrame deep dive
- Running Python Spark 3 on Google Colab
- Spark SQL and DataFrame deep dive on Colab
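Getting PySpark running in a Colab notebook can be as simple as the sketch below; the install line is a Colab shell command, and the demo query is illustrative.

```python
# In a Colab cell, install PySpark first (shell command):
#   !pip install pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ColabDemo").getOrCreate()

# Quick Spark SQL / DataFrame check: register a view and query it
df = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "value"])
df.createOrReplaceTempView("demo")
spark.sql("SELECT key, value * 10 AS scaled FROM demo").show()
```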
Appendix - Big Data Hadoop Hive for beginners
- Big Data concepts
- Hadoop concepts
- Hadoop Distributed File System (HDFS)
- Understanding Google Cloud (GCP) Dataproc
- Signing up for a Google Cloud free trial
- Storing a file in HDFS
- MapReduce and YARN
- Hive
- Querying HDFS data using Hive
- Deleting the Cluster
- Analyzing a billion records with Hive
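As a bridge back to PySpark, Hive tables like the ones queried in this module can also be read through spark.sql once Hive support is enabled; the database and table names below are hypothetical.

```python
# Query Hive-managed data from PySpark (names are hypothetical).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("HiveQuery")
         .enableHiveSupport()   # lets spark.sql() see the Hive metastore
         .getOrCreate())

# Roughly equivalent HiveQL in the hive shell:
#   SELECT category, COUNT(*) FROM sales.orders GROUP BY category;
spark.sql("""
    SELECT category, COUNT(*) AS order_count
    FROM sales.orders
    GROUP BY category
""").show()
```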