class: center, middle, inverse, title-slide

.title[
# Reading and Writing Streams with Spark Structured Streaming
]
.author[
### Justin Post
]

---
layout: false
class: title-slide-section-red, middle

# Reading and Writing Streams with Spark Structured Streaming

Justin Post

---
layout: true

<div class="my-footer"><img src="data:image/png;base64,#img/logo.png" style="height: 60px;"/></div>

---

# Recap

We'll use Spark Structured Streaming to handle our streaming data ([Guide](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html))

- Create a Spark session

1. **Read in a stream**
    + Stream from a file, the terminal, or something like Kafka
2. Set up transformations/aggregations to do (mostly using SQL-type functions)
    + Perhaps over windows
3. Set up **writing of the query** to an output source
    + Console (for debugging)
    + File (say .csv)
    + Database
4. `.start()` the query!
    + Continues listening until terminated (`query.stop()`)

(A code sketch of this workflow appears a few slides ahead.)

---

# Streaming DataFrames

The stream is read into a Spark SQL DataFrame

- DataFrames can be used to represent both static data and streaming data

Differences:

- Streaming DataFrames are unbounded, and the schema is only checked at runtime
- Rows are added incrementally

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#img/structured-streaming-stream-as-a-table.png" alt="https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html" width="500px" />
<p class="caption">https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html</p>
</div>

---

# Streaming DataFrames

<img src="data:image/png;base64,#img/structured-streaming-model.png" width="500px" style="display: block; margin: auto;" />

- When the query starts, Spark checks for new data (at a specified interval of time)
- If there is new data, Spark runs an “incremental” query that combines the previous running counts with the new data to compute updated counts

---

# Streaming DataFrames

<img src="data:image/png;base64,#img/structured-streaming-model.png" width="500px" style="display: block; margin: auto;" />

> Note that Structured Streaming does not materialize the entire table. It reads the latest available data from the streaming data source, processes it incrementally to update the result, and then discards the source data. It only keeps the minimal intermediate state data as required to update the result (e.g. intermediate counts).
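
---

# Recap: The Workflow in Code

A minimal sketch of the four steps, not a complete application. It assumes a local Spark install; the app name, the `csv_files` folder, and the one-column schema are made-up placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

# 1. Create a Spark session
spark = SparkSession.builder.appName("streaming_recap").getOrCreate()

# 2. Read in a stream (here, CSV files appearing in a folder)
myschema = StructType().add("value", "string")
df = spark.readStream.schema(myschema).csv("csv_files")

# 3. Transformations/aggregations (SQL-type functions) would go here

# 4. Set up the write and start the query; it listens until stopped
query = df.writeStream.outputMode("append").format("console").start()

# ...later, terminate the query
query.stop()
```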
---

# Reading a Stream

Streams are read in using the [`DataStreamReader` interface](https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/index.html) (`SparkSession.readStream`)

- `readStream` has different methods to customize/set up how to read the stream

---

# Reading a Stream

Streams are read in using the [`DataStreamReader` interface](https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/index.html) (`SparkSession.readStream`)

- `readStream` has different methods to customize/set up how to read the stream
    + `.format()` - (generic) specifies the input source
    + `.schema()` - sets up what Spark should expect
    + `.option(key, value)` - allows an input option on a file source
    + `.load()` - loads a data stream and returns a DataFrame

---

# Reading Data from a Kafka Stream

- Common syntax for reading in data

```python
df = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "localhost:9092") \
  .option("subscribe", "topic_name") \
  .load()
```

---

# Reading in Testing Data

- `rate` format generates timestamp data at a specified interval of time

```python
df = spark \
  .readStream \
  .format("rate") \
  .option("rowsPerSecond", 1) \
  .load()
```

---

# Reading Data From a CSV

- Common syntax for reading in data

```python
from pyspark.sql.types import StructType

myschema = StructType().add("value", "string")

df = spark \
  .readStream \
  .schema(myschema) \
  .csv("csv_files") # automatically 'loads'
```

---

# Quick Example

Let's jump into `pyspark` and use the "rate" format

- Will need to write the stream to see it (covered in more detail shortly)

---

# Starting Streaming Queries

Notice that the process doesn't evaluate things until we use `.start()`

<img src="data:image/png;base64,#img/structured-streaming-model.png" width="600px" style="display: block; margin: auto;" />

---

# Starting Streaming Queries

Uses the `DataStreamWriter` [interface](https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/index.html) (`df_with_transforms_etc.writeStream`)

- `writeStream` has different methods to customize the output mode and location

---

# Starting Streaming Queries

Uses the `DataStreamWriter` [interface](https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/index.html) (`df_with_transforms_etc.writeStream`)

- `writeStream` has different methods to customize the output mode and location
- Output mode:
    + Complete - the entire Result Table is output at each update
        - Only supported for aggregation queries
    + Append (default) - only new rows added to the Result Table are output
        - Only applicable if rows, once added, can never change (say from late data)
    + Update - similar to append, but only rows updated in the Result Table since the last trigger are output

---

# Starting Streaming Queries

Uses the `DataStreamWriter` [interface](https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/index.html) (`df_with_transforms_etc.writeStream`)

- `writeStream` has different methods to customize the output mode and location
- Output mode:
    + Complete - the entire Result Table is output at each update
        - Only supported for aggregation queries
    + Append (default) - only new rows added to the Result Table are output
        - Only applicable if rows, once added, can never change (say from late data)
    + Update - similar to append, but only rows updated in the Result Table since the last trigger are output
- Nice table in the [guide](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#creating-streaming-dataframes-and-streaming-datasets) to help out!
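
---

# Output Modes in Code

A minimal sketch contrasting append and complete mode, using the `rate` source and the console sink. It assumes an existing `SparkSession` named `spark`; the even/odd grouping is only there to give complete mode an aggregation to work with:

```python
# Testing source: generates (timestamp, value) rows
df = spark.readStream.format("rate").option("rowsPerSecond", 1).load()

# Append mode: only newly added rows are written at each trigger
append_query = df.writeStream \
    .outputMode("append") \
    .format("console") \
    .start()

# Complete mode requires an aggregation; count rows by even/odd value
counts = df.groupBy((df.value % 2).alias("parity")).count()
complete_query = counts.writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()

# Let them run for a bit, then stop both
append_query.stop()
complete_query.stop()
```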
---

# Starting Streaming Queries

Uses the `DataStreamWriter` [interface](https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/index.html) (`df_with_transforms_etc.writeStream`)

- `writeStream` has different methods to customize the output mode and location
- Output sinks (location):
    + Console sink for debugging
        - `df.writeStream.outputMode("append").format("console")`

---

# Starting Streaming Queries

Uses the `DataStreamWriter` [interface](https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/index.html) (`df_with_transforms_etc.writeStream`)

- `writeStream` has different methods to customize the output mode and location
- Output sinks (location):
    + Console sink for debugging
        - `df.writeStream.outputMode("append").format("console")`
    + Memory - stores output in an in-memory table that you can investigate
        - `df.writeStream.format("memory").queryName("tableName")`
    + File sink (csv, json, parquet, etc.; also needs a `checkpointLocation` option)
        - `df.writeStream.outputMode("append").format("csv").option("path", "path_to_file")`

---

# Starting Streaming Queries

Uses the `DataStreamWriter` [interface](https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/index.html) (`df_with_transforms_etc.writeStream`)

- `writeStream` has different methods to customize the output mode and location
- Output sinks (location):
    + Console sink for debugging
        - `df.writeStream.outputMode("append").format("console")`
    + Memory - stores output in an in-memory table that you can investigate
        - `df.writeStream.format("memory").queryName("tableName")`
    + File sink (csv, json, parquet, etc.; also needs a `checkpointLocation` option)
        - `df.writeStream.outputMode("append").format("csv").option("path", "path_to_file")`
    + Kafka sink
- Nice table in the [guide](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#creating-streaming-dataframes-and-streaming-datasets) to help out!

---

# Starting Streaming Queries

- Updates based on **trigger** settings
    + Default uses micro-batches, which are generated as soon as the previous micro-batch has completed processing

---

# Starting Streaming Queries

- Updates based on **trigger** settings
    + Default uses micro-batches, which are generated as soon as the previous micro-batch has completed processing
    + Fixed-interval micro-batches (see guide for more info)
        - `writeStream....trigger(processingTime = "2 seconds")....`
    + One-time micro-batch - executes once and shuts itself down (essentially a quick update since you last ran the query)
        - `writeStream....trigger(once = True)....`
    + Continuous - experimental

---

# Multiple Queries and Stopping Queries

- Can run multiple queries at once, and they share resources
    + `spark.streams.active` gives a list of all active streaming queries
- Stop a query with `query.stop()` (where `query` is the name of the query)
- Spark has a GUI to help monitor! Doesn't work easily within our JupyterHub though
    + http://localhost:4040/

---

# Quick Example

Let's write to a table in memory! (A code sketch is given on the last slide.)

---

# Recap

- Read in streams with `readStream`
- Write queries with `writeStream`
- Must `.start()` the query
- Can run multiple queries at once
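
---

# Quick Example: A Sketch

One possible version of the in-memory table example, again assuming a session named `spark` and the `rate` source; the table name `rates` and the 5-second trigger interval are made up for illustration:

```python
# Testing source
df = spark.readStream.format("rate").option("rowsPerSecond", 1).load()

# Memory sink: results accumulate in an in-memory table we can query with SQL
query = df.writeStream \
    .format("memory") \
    .queryName("rates") \
    .outputMode("append") \
    .trigger(processingTime="5 seconds") \
    .start()

# Inspect the table while the stream is running
spark.sql("SELECT * FROM rates").show()

# List all active streaming queries, then stop this one
print(spark.streams.active)
query.stop()
```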