Introduction
Spark 2.0 features a new Dataset API. Now that Datasets support a full range of operations, you can avoid working with low-level RDDs in most cases. In 2.0, DataFrames no longer exist as a separate class; instead, DataFrame is defined as a special case of Dataset. Here is some example code to get you started with Spark 2.0 Datasets and DataFrames. Part 1 focuses on Datasets, which provide compile-time type safety; Part 2 focuses on DataFrames, which offer untyped operations.
Part 1: Datasets: Type-safe operations. (This blog post)
Part 2: DataFrame: Untyped operations. (Next blog post)
Dataset vs. DataFrame
A Dataset[T] is a parameterized type, where the type T is specified by the user and is associated with each element of the Dataset. A DataFrame, on the other hand, has no explicit type associated with it at compile time, from the user's point of view. Internally, a DataFrame is defined as a Dataset[Row], where Row is a generic row type defined by Spark SQL.
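To make the relationship concrete, here is a minimal sketch (not from the original post) that round-trips between the two in the spark-shell:
> val ds = Seq(("Alice", "Math", 21L)).toDS() // typed: each element is a (String, String, Long)
ds: org.apache.spark.sql.Dataset[(String, String, Long)]
> val df = ds.toDF("name", "dept", "age") // untyped view: DataFrame = Dataset[Row]
df: org.apache.spark.sql.DataFrame
> val ds2 = df.as[(String, String, Long)] // back to a typed Dataset
ds2: org.apache.spark.sql.Dataset[(String, String, Long)]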
Language
This blog post refers to the Scala API.
Outline
- Reading Data In
- Data Exploration
- Statistics
- Functional Transformations
- Caching
- Getting Data Out
Reading Data In
Spark supports a number of input formats, including Hive, JDBC, Parquet, CSV, and JSON. Below is an example of reading JSON data into a Dataset.
JSON example
Suppose you have this example JSON data, with one object per line:
{"name":"Alice", "dept":"Math", "age":21}
{"name":"Bob", "dept":"CS", "age":23}
{"name":"Carl", "dept":"Math", "age":25}
To read a JSON data file, first use the SparkSession object as an entry point, and access its DataFrameReader to read data into a DataFrame:
> val df = spark.read.json("/path/to/file.json") // "spark" is a SparkSession object
df: org.apache.spark.sql.DataFrame
Then convert the DataFrame into Dataset[Student]:
> case class Student(name: String, dept: String, age: Long)
> val ds = df.as[Student]
ds: org.apache.spark.sql.Dataset[Student]
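Reading other formats follows the same pattern through the DataFrameReader; for example (the paths below are illustrative, not from the original post):
> val parquetDf = spark.read.parquet("/path/to/file.parquet")
parquetDf: org.apache.spark.sql.DataFrame
> val csvDf = spark.read.option("header", "true").csv("/path/to/file.csv")
csvDf: org.apache.spark.sql.DataFrame
As with JSON, the resulting DataFrame can be converted to a typed Dataset with "as[T]" once a matching case class is defined.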
Data Exploration
When you first look into a new data set, you can explore its contents by printing out the schema, counting the number of rows, and displaying some of those rows.
Print Schema
To explore what is in this Dataset, you can print out the schema:
> ds.printSchema()
root
|-- age: long (nullable = true)
|-- dept: string (nullable = true)
|-- name: string (nullable = true)
Count Rows
To count the number of rows:
> ds.count()
res2: Long = 3
Display Rows
To display the first few rows in tabular format:
> ds.show()
|age|dept| name|
+---+----+-----+
| 21|Math|Alice|
| 23| CS| Bob|
| 25|Math| Carl|
Sample Rows
To get a sample of the data:
> val sample = ds.sample(withReplacement=false, fraction=0.3)
sample: org.apache.spark.sql.Dataset[Student]
|age|dept|name|
+---+----+----+
| 25|Math|Carl|
Statistics
A number of statistics functions are available for Datasets.
Summary Statistics
To get summary statistics on numerical fields, call "describe":
> val summary = ds.describe()
summary: org.apache.spark.sql.DataFrame
|summary| age|
+-------+----+
| count| 3|
| mean|23.0|
| stddev| 2.0|
| min| 21|
| max| 25|
Additional Statistical Functions, Approximate Frequent Items
The "stat" method returns a DataFrameStatFunctions object for statistical functions:> ds.stat
res11: org.apache.spark.sql.DataFrameStatFunctions
For example, "stat.freqItems" returns approximate frequent items for the given columns:
> val approxFreqItems = ds.stat.freqItems(Seq("dept"))
approxFreqItems: org.apache.spark.sql.DataFrame
|dept_freqItems|
+--------------+
|    [CS, Math]|
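Other statistics are available on the same object; for example, "stat.crosstab" computes a contingency table of two columns (an illustrative call, not from the original post):
> val deptByName = ds.stat.crosstab("name", "dept")
deptByName: org.apache.spark.sql.DataFrame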
Functional Transformations
The Dataset API supports functional transformations, such as "filter" and "map", much like the RDD API. These operators transform one Dataset[T] into another Dataset[U], where T and U are user-specified types. These operations have compile-time type safety, in the sense that each row is associated with a Scala object of a fixed type T (or U). This is in contrast to DataFrames, which are untyped. "Reduce" is an action that combines the elements of a Dataset into a single value.
Filter
To filter for rows that satisfy a given predicate:
> val youngStudents = ds.filter(_.age < 22)
youngStudents: org.apache.spark.sql.Dataset[Student]
|age|dept| name|
+---+----+-----+
| 21|Math|Alice|
Map
To map over rows with a given lambda function:
> val names = ds.map(_.name)
names: org.apache.spark.sql.Dataset[String]
|value|
+-----+
|Alice|
|  Bob|
| Carl|
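"map" can also return a user-defined type, turning a Dataset[T] into a Dataset[U]; a minimal sketch (the NameAge case class is illustrative, not from the original post):
> case class NameAge(name: String, age: Long)
> val nameAges = ds.map(s => NameAge(s.name, s.age))
nameAges: org.apache.spark.sql.Dataset[NameAge]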
Reduce
To reduce the elements of a Dataset with a given reducer function:
> val totalAge = ds.map(_.age).reduce(_ + _)
totalAge: Long = 69
Join
You can join two Datasets. Suppose you want to join the "Students" Dataset with a new "Department" Dataset:
> case class Department(name: String, building: Int)
> val depts = Seq(Department("Math", 125), Department("CS", 110)).toDS()
|name|building|
+----+--------+
|Math|     125|
|  CS|     110|
To join the Students" Dataset with the new "Department" Dataset:
> val joined = ds.joinWith(depts, ds("dept") === depts("name"))
joined: org.apache.spark.sql.Dataset[(Student, Department)]
|             _1|        _2|
+---------------+----------+
|[21,Math,Alice]|[Math,125]|
|    [23,CS,Bob]|  [CS,110]|
| [25,Math,Carl]|[Math,125]|
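"joinWith" keeps each side as a typed object and returns a Dataset of pairs. If you prefer a flat, column-oriented result, the plain "join" method is also available, but it returns an untyped DataFrame (a sketch, not from the original post):
> val joinedDf = ds.join(depts, ds("dept") === depts("name"))
joinedDf: org.apache.spark.sql.DataFrame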
GroupByKey, Aggregation
To group elements of a Dataset and aggregate within each group:
> val deptSizes = ds.groupByKey(_.dept).count()
deptSizes: org.apache.spark.sql.Dataset[(String, Long)]
|value|count(1)|
+-----+--------+
| Math|       2|
|   CS|       1|
Additional aggregation functions are available in the "functions" object. The "avg" function calculates an average for each group:
> import org.apache.spark.sql.functions._
> val avgAge = ds.groupByKey(_.dept)
.agg(avg($"age").as[Double])
avgAge: org.apache.spark.sql.Dataset[(String, Double)]
|value|avg(age)|
+-----+--------+
| Math|    23.0|
|   CS|    23.0|
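The typed "agg" accepts several aggregate columns at once and returns a tuple per group; a sketch (not from the original post):
> val deptStats = ds.groupByKey(_.dept)
    .agg(avg($"age").as[Double], max($"age").as[Long])
deptStats: org.apache.spark.sql.Dataset[(String, Double, Long)]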
OrderBy
To order by a given set of fields:
> val ordered = ds.orderBy("dept", "name")
ordered: org.apache.spark.sql.Dataset[Student]
|age|dept| name|
+---+----+-----+
| 23|  CS|  Bob|
| 21|Math|Alice|
| 25|Math| Carl|
Caching
To persist a Dataset at the default storage level (memory and disk):
> ds.cache()
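To use a different storage level, call "persist" with an explicit StorageLevel; "unpersist" removes the Dataset from the cache (a sketch, not from the original post):
> import org.apache.spark.storage.StorageLevel
> ds.persist(StorageLevel.MEMORY_ONLY)
> ds.unpersist()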
Getting Data Out
Into an Array
To collect data into a Scala Array, use "collect". Note that this will collect all rows into the Driver node, and thus could be a memory- and IO-intensive operation.
> val studentArr = ds.collect()
studentArr: Array[Student] = Array(Student(Alice,Math,21), Student(Bob,CS,23), Student(Carl,Math,25))
To collect only the first few rows into a Scala Array:
> val firstTwo = ds.head(2)
firstTwo: Array[Student] = Array(Student(Alice,Math,21), Student(Bob,CS,23))
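If the full Dataset is too large to collect at once, "toLocalIterator" returns an iterator that pulls one partition at a time to the driver (a sketch, not from the original post):
> import scala.collection.JavaConverters._
> ds.toLocalIterator().asScala.foreach(println)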
Into an RDD
To convert into an RDD:
> val studentRdd = ds.rdd
studentRdd: org.apache.spark.rdd.RDD[Student]
Into a File
To write a Dataset into a file, use "write". A number of output formats are supported. Here is an example of writing in JSON format:
> ds.write.json("/path/to/file.json")
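Other output formats work the same way through the DataFrameWriter, and a save mode controls what happens if the path already exists (illustrative path, not from the original post):
> ds.write.mode("overwrite").parquet("/path/to/output")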