pyspark.sql.functions.tuple_intersection_integer

pyspark.sql.functions.tuple_intersection_integer(col1, col2, mode=None)

Returns the intersection of two DataSketches TupleSketch objects with integer summaries.

New in version 4.2.0.

Parameters
col1 : Column or column name

The first TupleSketch column.

col2 : Column or column name

The second TupleSketch column.

mode : Column or str, optional

The summary mode: "sum" (default), "min", "max", or "alwaysone". It determines how the integer summaries of keys present in both sketches are combined.

Returns
Column

The binary representation of the intersected TupleSketch.

Examples

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame(
...     [(1, 10, 2, 20), (2, 20, 3, 30), (3, 30, 4, 40)],
...     ["key1", "v1", "key2", "v2"],
... )
>>> df = df.agg(
...     sf.tuple_sketch_agg_integer("key1", "v1").alias("sketch1"),
...     sf.tuple_sketch_agg_integer("key2", "v2").alias("sketch2")
... )
>>> df.select(
...     sf.tuple_sketch_estimate_integer(
...         sf.tuple_intersection_integer(df.sketch1, "sketch2")
...     )
... ).show()
+--------------------------------------------------------------------------------+
|tuple_sketch_estimate_integer(tuple_intersection_integer(sketch1, sketch2, sum))|
+--------------------------------------------------------------------------------+
|                                                                             2.0|
+--------------------------------------------------------------------------------+
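
The effect of mode can be easier to see on exact data. The following is a plain-Python sketch, not the Spark or DataSketches implementation. It assumes that mode controls how the integer summaries of keys present in both inputs are combined, and it models each sketch as an exact dict from key to summary, using the same keys and values as the example above. The names COMBINERS and intersect_summaries are hypothetical helpers introduced only for illustration.

```python
from typing import Callable, Dict

# Hypothetical table mapping each documented mode name to a combiner.
# Assumption: the mode is applied to the two summaries of a shared key.
COMBINERS: Dict[str, Callable[[int, int], int]] = {
    "sum": lambda a, b: a + b,
    "min": min,
    "max": max,
    "alwaysone": lambda a, b: 1,
}


def intersect_summaries(
    left: Dict[int, int], right: Dict[int, int], mode: str = "sum"
) -> Dict[int, int]:
    """Keep only keys found in both inputs and combine their summaries."""
    combine = COMBINERS[mode]
    return {k: combine(left[k], right[k]) for k in left.keys() & right.keys()}


# Exact stand-ins for sketch1 (key1, v1) and sketch2 (key2, v2).
sketch1 = {1: 10, 2: 20, 3: 30}
sketch2 = {2: 20, 3: 30, 4: 40}

for mode in ("sum", "min", "max", "alwaysone"):
    result = intersect_summaries(sketch1, sketch2, mode)
    # len(result) plays the role of the distinct-key estimate.
    print(mode, len(result), sorted(result.items()))
```

In this model the number of retained keys is 2 for every mode, matching the estimate of 2.0 shown above; only the per-key summaries differ between modes. A real TupleSketch is probabilistic, so its estimate is approximate for large inputs.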