pyspark.pandas.DataFrame.aggregate

DataFrame.aggregate(func)

Aggregate using one or more operations over the specified axis.

Parameters
func : dict or a list

A dict mapping column names (strings) to aggregate functions (lists of strings). If a list is given, the aggregation is performed against all columns; both forms are sketched below.

Returns
DataFrame
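For instance, a minimal sketch of the two accepted forms (assuming a numeric DataFrame df with columns 'A' and 'B', such as the one built under Examples):

>>> df.agg(['sum', 'min'])                       # list: apply both functions to every column
>>> df.agg({'A': ['sum'], 'B': ['min', 'max']})  # dict: choose functions per column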

See also

DataFrame.apply

Invoke function on DataFrame.

DataFrame.transform

Only perform transforming-type operations.

DataFrame.groupby

Perform operations over groups.

Series.aggregate

The equivalent function for Series.

Notes

agg is an alias for aggregate. Use the alias.
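Since agg and aggregate share one implementation, either spelling returns the same result; a quick illustration (reusing the df built under Examples):

>>> df.agg(['sum'])        # preferred alias
>>> df.aggregate(['sum'])  # identical result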

Examples

>>> import numpy as np
>>> import pandas as pd
>>> import pyspark.pandas as ps
>>> df = ps.DataFrame([[1, 2, 3],
...                    [4, 5, 6],
...                    [7, 8, 9],
...                    [np.nan, np.nan, np.nan]],
...                   columns=['A', 'B', 'C'])
>>> df
     A    B    C
0  1.0  2.0  3.0
1  4.0  5.0  6.0
2  7.0  8.0  9.0
3  NaN  NaN  NaN

Aggregate these functions over the rows. (Throughout these examples, the explicit column selection and sort_index() calls make the output deterministic, since pandas-on-Spark does not guarantee row order.)

>>> df.agg(['sum', 'min'])[['A', 'B', 'C']].sort_index()
        A     B     C
min   1.0   2.0   3.0
sum  12.0  15.0  18.0

Different aggregations per column.

>>> df.agg({'A' : ['sum', 'min'], 'B' : ['min', 'max']})[['A', 'B']].sort_index()
        A    B
max   NaN  8.0
min   1.0  2.0
sum  12.0  NaN

For multi-index columns:

>>> df.columns = pd.MultiIndex.from_tuples([("X", "A"), ("X", "B"), ("Y", "C")])
>>> df.agg(['sum', 'min'])[[("X", "A"), ("X", "B"), ("Y", "C")]].sort_index()
        X           Y
        A     B     C
min   1.0   2.0   3.0
sum  12.0  15.0  18.0
>>> aggregated = df.agg({("X", "A") : ['sum', 'min'], ("X", "B") : ['min', 'max']})
>>> aggregated[[("X", "A"), ("X", "B")]].sort_index()
        X
        A    B
max   NaN  8.0
min   1.0  2.0
sum  12.0  NaN