SparkSQL defines operations in separate classes under the package org.apache.spark.sql.execution, and uses the class names as operation names. Therefore, we identified the operations with the following Linux commands:

    git clone https://github.com/apache/spark
    cd spark
    git checkout f4e53fa9
    cd sql/core/src/main/scala/org/apache/spark/sql/execution
    find . -type f -name "*Exec.scala"| sed -n 's|.*/\([^/]*\)Exec.scala|\1|p'

We then manually examined the identified operations and removed three abstract base classes, BaseAggregate, BaseJoin, and BaseScriptTransformation, which serve as bases for other operations and do not appear as operation names in query plans.
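
The extraction and the manual filtering step described above can be combined into one pipeline. The following is a minimal sketch, not the exact procedure used for this list: `extract_ops` is a hypothetical helper name, and the sketch assumes the three base classes are defined in files named `BaseAggregateExec.scala`, `BaseJoinExec.scala`, and `BaseScriptTransformationExec.scala`.

```shell
# Hypothetical helper (not part of Spark): reads file paths on stdin and
# prints the operation names, one per line, sorted and de-duplicated.
extract_ops() {
  # Keep only paths ending in "Exec.scala" and strip the directory and suffix.
  sed -n 's|.*/\([^/]*\)Exec\.scala$|\1|p' |
    # Drop the three abstract base classes (exact, whole-line matches only).
    grep -v -x -E 'BaseAggregate|BaseJoin|BaseScriptTransformation' |
    sort -u
}
```

Run from the `execution` directory, it could be used as `find . -type f -name "*Exec.scala" | extract_ops`. The resulting list should still be reviewed by hand, since the filter only knows about the three base classes named above.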
Operation | Category | Reference | Description |
---|---|---|---|
AdaptiveSparkPlan | Executor | Link | This operation appears at the root of query plans, and all of its child nodes are executed adaptively. |
AddPartition | Executor | Link | Adds partitions to a table. |
AggregateInPandas | Folder | Link | Aggregation with group aggregate Pandas UDF. |
AlterNamespaceSetProperties | Consumer | Link | Sets properties of a namespace. |
AlterTable | Consumer | Link | Alters a table. |
AQEShuffleRead | Executor | Link | A wrapper of a shuffle query stage. |
ArrowEvalPython | Executor | Link | A physical plan that evaluates a `PythonUDF`. |
AttachDistributedSequence | Consumer | Link | Adds a new long column with `sequenceAttr` that increases one by one. |
BatchEvalPython | Producer | Link | A physical plan that evaluates a `PythonUDF`. |
BatchScan | Producer | Link | Scanning a batch of data from a data source v2. |
BroadcastExchange | Executor | Link | A BroadcastExchangeExec collects, transforms, and finally broadcasts the result of a transformed SparkPlan. |
BroadcastHashJoin | Executor | Link | Performs an inner hash join of two child relations. When the output RDD of this operator is being constructed, a Spark job is asynchronously started to calculate the values for the broadcast relation. This data is then placed in a Spark broadcast variable. The streamed relation is not shuffled. |
BroadcastNestedLoopJoin | Executor | Link | Performs a nested loop join of two child relations. |
CacheTable | Executor | Link | Caches a table in memory. |
CartesianProduct | Folder | Link | Computes the Cartesian product of two children. |
CollectMetrics | Executor | Link | Collects arbitrary (named) metrics from a `SparkPlan`. |
CommandResult | Executor | Link | Physical plan node for holding data from a command. |
ContinuousScan | Producer | Link | Physical plan node for scanning data from a streaming data source with continuous mode. |
CreateIndex | Consumer | Link | Physical plan node for creating an index. |
CreateNamespace | Consumer | Link | Physical plan node for creating a namespace. |
CreateTable | Consumer | Link | Physical plan node for creating a table. |
DataSourceScan | Producer | Link | Read data from any data source. |
DeleteFromTable | Consumer | Link | Delete data from a table. |
DescribeColumn | Executor | Link | Obtain information about a column. |
DescribeNamespace | Executor | Link | Obtain information about a namespace. |
DescribeTable | Executor | Link | Obtain information about a table. |
DropIndex | Consumer | Link | Delete an index. |
DropNamespace | Consumer | Link | Remove a namespace. |
DropPartition | Consumer | Link | Remove a partition. |
DropTable | Consumer | Link | Delete a table. |
EvalPython | Executor | Link | A physical plan that evaluates a PythonUDF, one partition of tuples at a time. |
EventTimeWatermark | Executor | Link | Used to mark a column as containing the event time for a given record. |
Expand | Executor | Link | Apply all of the GroupExpressions to every input row. |
FlatMapCoGroupsInPandas | Executor | Link | Physical node for `org.apache.spark.sql.catalyst.plans.logical.FlatMapCoGroupsInPandas`. |
FlatMapGroupsInPandas | Executor | Link | Physical node for `org.apache.spark.sql.catalyst.plans.logical.FlatMapGroupsInPandas`. |
FlatMapGroupsInPandasWithState | Executor | Link | Physical operator for executing `org.apache.spark.sql.catalyst.plans.logical.FlatMapGroupsInPandasWithState`. |
FlatMapGroupsWithState | Executor | Link | Physical operator for executing `FlatMapGroupsWithState`. |
Generate | Executor | Link | Applies a `Generator` to a stream of input rows, combining the output of each into a new stream of rows. |
HashAggregate | Folder | Link | Aggregates rows for a GROUP BY operation using a hash table. |
InMemoryTableScan | Producer | Link | Scan the tables in memory. |
LocalTableScan | Producer | Link | Physical plan node for scanning data from a local collection. |
MapInBatch | Executor | Link | A relation produced by applying a function that takes an iterator of batches, such as pandas DataFrame or PyArrow’s record batches, and outputs an iterator of them. |
MapInPandas | Executor | Link | A relation produced by applying a function that takes an iterator of pandas DataFrames and outputs an iterator of pandas DataFrames. |
MergingSessions | Executor | Link | This node is a variant of SortAggregateExec that merges session windows, relying on the child node to provide inputs sorted by group keys and the start time of the session window. |
MicroBatchScan | Producer | Link | Physical plan node for scanning a micro-batch of data from a data source. |
ObjectHashAggregate | Folder | Link | A hash-based aggregate operator that supports `TypedImperativeAggregate` functions that may use arbitrary JVM objects as aggregation states. |
PythonMapInArrow | Executor | Link | A relation produced by applying a function that takes an iterator of PyArrow’s record batches and outputs an iterator of PyArrow’s record batches. |
QueryStage | Executor | Link | A query stage is an independent subgraph of the query plan. |
RefreshTable | Executor | Link | Refresh all caches referencing the given table. |
RenamePartition | Consumer | Link | Physical plan node for renaming a table partition. |
RenameTable | Consumer | Link | Physical plan node for renaming a table. |
ReplaceTable | Consumer | Link | Physical plan node for replacing a table. |
SetCatalogAndNamespace | Consumer | Link | Physical plan node for setting the current catalog and/or namespace. |
ShowCreateTable | Executor | Link | Physical plan node for showing the statement that creates a table. |
ShowFunctions | Executor | Link | Physical plan node for showing functions. |
ShowNamespaces | Executor | Link | Physical plan node for showing namespaces. |
ShowPartitions | Executor | Link | Physical plan node for showing partitions. |
ShowTableProperties | Executor | Link | Physical plan node for showing table properties. |
ShowTables | Executor | Link | Physical plan node for showing tables. |
ShuffledHashJoin | Executor | Link | Performs a hash join of two child relations by first shuffling the data using the join keys. |
ShuffleExchange | Executor | Link | Performs a shuffle that will result in the desired partitioning. |
Sort | Combinator | Link | Performs (external) sorting. |
SortAggregate | Folder | Link | Sort-based aggregate operator. |
SortMergeJoin | Join | Link | Performs a sort-merge join of two child relations. |
SparkScriptTransformation | Executor | Link | Transforms the input by forking and running the specified script. |
StreamingSymmetricHashJoin | Join | Link | Performs stream-stream join using symmetric hash join algorithm. |
SubqueryAdaptiveBroadcast | Executor | Link | Similar to `SubqueryBroadcastExec`, this node stores the initial physical plan of dynamic partition pruning (DPP) subquery filters when both adaptive query execution (AQE) and DPP are enabled. It is an intermediate physical plan and is not executable. |
SubqueryBroadcast | Executor | Link | Physical plan for a custom subquery that collects and transforms the broadcast key values. |
TruncatePartition | Consumer | Link | Physical plan node for table partition truncation. |
TruncateTable | Consumer | Link | Physical plan node for table truncation. |
UpdatingSessions | Consumer | Link | This node updates the session window spec of each input row by analyzing neighbor rows and determining rows belonging to the same session window. The number of input rows remains the same. |
V2Command | Executor | Link | A physical operator that executes run() and saves the result to prevent multiple executions. |
WholeStageCodegen | Executor | Link | WholeStageCodegen compiles a subtree of plans that support codegen together into a single Java function. |
Window | Folder | Link | This class calculates and outputs (windowed) aggregates over the rows in a single (sorted) partition. |
WindowInPandas | Executor | Link | This operation calculates and outputs windowed aggregates over the rows in a single partition. |
WriteToContinuousDataSource | Executor | Link | The physical plan for writing data into a continuous processing `StreamingWrite`. |
WriteToDataSourceV2 | Executor | Link | The physical plan for writing data into a data source. |