If you have a situation where you can easily get the result with SQL, or the SQL already exists, you can register the DataFrame as a temporary view and run the query on top of it. Register the DataFrame as a table as below:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

myDF = spark.createDataFrame(
    [("Tom", 400, 50, "Teacher", "IND"),
     ("Jack", 420, 60, "Finance", "USA"),
     ("Brack", 500, 10, "Teacher", "IND"),
     ("Jim", 700, 80, "Finance", "JAPAN")],
    ("name", "salary", "cnt", "department", "country"))

# Register the DataFrame as a temporary view so SQL can query it.
# createOrReplaceTempView replaces the deprecated registerTempTable.
myDF.createOrReplaceTempView("sql_df")

tot_salary = spark.sql("select department, sum(salary) as total_salary from sql_df group by department")
tot_salary.show(30, False)

+----------+------------+
|department|total_salary|
+----------+------------+
|Teacher   |900         |
|Finance   |1120        |
+----------+------------+
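The same aggregation can also be done without SQL, staying entirely in the DataFrame API. A minimal sketch of the equivalent groupBy/agg call (tot_salary_df is just an illustrative name):

from pyspark.sql import functions as F

# Equivalent of: select department, sum(salary) as total_salary from sql_df group by department
tot_salary_df = myDF.groupBy("department").agg(F.sum("salary").alias("total_salary"))
tot_salary_df.show(30, False)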
You can also try the below to get all the columns from the DataFrame:
tot_salary.selectExpr('*').show()
tot_salary.select('*').show()
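With '*' the two calls return the same result; selectExpr only pays off when you pass actual SQL expressions as strings. A minimal sketch against the myDF created above (adjusted_salary and dept are just illustrative aliases):

# selectExpr accepts SQL expression strings, not just column names
myDF.selectExpr("name",
                "salary + cnt as adjusted_salary",
                "upper(department) as dept").show()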