The unionByName function in PySpark combines two DataFrames by appending the rows of one to the rows of the other, matching columns by name rather than by position. It is particularly useful when the two DataFrames list the same columns in a different order or, with allowMissingColumns=True (available since Spark 3.1), when one DataFrame lacks columns that the other has.
Syntax
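The syntax for the unionByName function is as follows:
unionByName(other, allowMissingColumns=False)
- other: The DataFrame to be merged with the current DataFrame.
- allowMissingColumns: Optional boolean, available since Spark 3.1. When True, columns present in only one of the DataFrames are filled with null.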
Parameters
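The unionByName function takes two parameters:
- other: The DataFrame to be merged with the current DataFrame. By default, other must have the same set of column names as the current DataFrame, although the columns may appear in any order. Columns matched by name should also have compatible data types.
- allowMissingColumns: Optional, defaults to False; available since Spark 3.1. When set to True, the two DataFrames may have different sets of columns, and any column missing from one side is filled with null in the rows that come from that side.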
Return Value
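The unionByName function returns a new DataFrame that contains the rows of both the current DataFrame and the other DataFrame. The number of rows in the result is the sum of the row counts of the two inputs; duplicate rows are kept, as with SQL UNION ALL, so call distinct() on the result if you need them removed. The columns follow the order of the current DataFrame, and when allowMissingColumns=True, columns that exist only in other are added at the end of the schema.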
Example
Let's consider an example to understand how unionByName works:
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.getOrCreate()
# Create the first DataFrame
data1 = [("Alice", 25, "New York"), ("Bob", 30, "San Francisco")]
df1 = spark.createDataFrame(data1, ["name", "age", "city"])
# Create the second DataFrame, which has no "age" column
data2 = [("Charlie", "Chicago"), ("David", "Boston")]
df2 = spark.createDataFrame(data2, ["name", "city"])
# Merge the DataFrames using unionByName
# allowMissingColumns=True (Spark 3.1+) is required because df2 lacks "age"
merged_df = df1.unionByName(df2, allowMissingColumns=True)
# Show the merged DataFrame
merged_df.show()
Output:
+-------+----+-------------+
|   name| age|         city|
+-------+----+-------------+
|  Alice|  25|     New York|
|    Bob|  30|San Francisco|
|Charlie|null|      Chicago|
|  David|null|       Boston|
+-------+----+-------------+
In the above example, we have two DataFrames, df1 and df2. The df1 DataFrame has three columns: "name", "age", and "city", while the df2 DataFrame has only two: "name" and "city". Because the column sets differ, we pass allowMissingColumns=True; without it, unionByName raises an AnalysisException. The resulting DataFrame, merged_df, contains all four rows and all the columns from both DataFrames, and the missing age values for the rows from df2 are filled with null (displayed as null or NULL depending on the Spark version).
Conclusion
The unionByName function in PySpark allows you to merge two DataFrames based on column names. It is a convenient way to combine DataFrames whose columns are in a different order or, with allowMissingColumns=True, whose column sets differ. By understanding the syntax, parameters, and return value of unionByName, you can use this function effectively in your PySpark applications.