Union
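The union function in PySpark combines two DataFrames that have the same number of columns. Columns are matched by position, not by name, so in practice both DataFrames should share the same schema. It returns a new DataFrame that contains all the rows from both input DataFrames.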
Syntax
The syntax for using the union function is as follows:
union(other)
Where:
- other: The DataFrame to be combined with the current DataFrame.
Example
Let's consider an example to understand how the union function works:
# Importing the necessary libraries
from pyspark.sql import SparkSession
# Creating a SparkSession
spark = SparkSession.builder.getOrCreate()
# Creating two DataFrames with the same schema
df1 = spark.createDataFrame([(1, "John"), (2, "Alice")], ["id", "name"])
df2 = spark.createDataFrame([(3, "Bob"), (4, "Eve")], ["id", "name"])
# Combining the DataFrames using union
combined_df = df1.union(df2)
# Displaying the combined DataFrame
combined_df.show()
Output:
+---+-----+
| id| name|
+---+-----+
|  1| John|
|  2|Alice|
|  3|  Bob|
|  4|  Eve|
+---+-----+
In the above example, we create two DataFrames, df1 and df2, with the same schema. Then, we use the union function to combine both DataFrames into a new DataFrame called combined_df. Finally, we display the contents of combined_df using the show function.
Notes
- The union function matches columns by position, not by name, and requires both DataFrames to have the same number of columns. If the DataFrames have the same columns in a different order, use the unionByName function, which matches columns by name. In Spark 3.1 and later, unionByName also accepts allowMissingColumns=True to fill columns missing from one side with nulls.
- The union function does not remove duplicate rows. If you want to remove duplicates, you can use the distinct function after performing the union.
- The union function is a transformation, so it is lazily evaluated. To trigger the execution of the union, you can use an action like show or count.