Dataframe Analytics Reference Book


AAAS24

August 19, 2022

Coding

Project Description

The intent of this post is to document basic functions for analyzing data with pandas DataFrames.

TABLE OF CONTENTS

- Data Exploration
- Operations with rows and columns
- Filtering Data Operations
- Date Formatting Data Operations
- Chaining examples
- Dictionary Operations



DataFrame Operations

Data Exploration

| Command | Syntax |
| --- | --- |
| Describe | df.describe() |
| Number of rows and columns | df.shape, where df.shape[0] = rows & df.shape[1] = columns |
| Column names | df.columns.values |
| Column data types | df.dtypes |
| Min | df['COLUMN_NAME'].min() |
| Max | df['COLUMN_NAME'].max() |
| Number of unique values | len(df['COLUMN_NAME'].unique()) |
| Sorted list of unique values | sorted(df['COLUMN_NAME'].unique()) |
| Identify the type of a value in a column | type(df['COLUMN_NAME'][0]) |
| Count per value in a column | df['COLUMN_NAME'].value_counts(dropna=False) |
| Transpose | df.head().T |
| View selected columns for rows with a missing value | df.loc[df.COLUMN_NAME.isna(), ['COLUMN_NAME_1', 'COLUMN_NAME_2', 'COLUMN_NAME_3']] |

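As a minimal sketch, several of these commands can be combined into a quick first pass over a freshly loaded DataFrame; the file name data.csv and the column 'Age' below are hypothetical placeholders:

    import pandas as pd

    # hypothetical input file and column name, for illustration only
    df = pd.read_csv('data.csv')

    print(df.shape[0], 'rows,', df.shape[1], 'columns')
    print(df.dtypes)                              # data type of each column
    print(df['Age'].min(), df['Age'].max())       # range of a numeric column
    print(len(df['Age'].unique()))                # number of distinct values
    print(df['Age'].value_counts(dropna=False))   # counts per value, including NaN
    print(df.head().T)                            # transposed preview of the first rows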

Operations with rows and columns

| Command | Syntax |
| --- | --- |
| Dropping a column | df=df.drop(['COLUMN_NAME'], axis=1) |
| Adding a column (here, a row-wise sum) | df['NEW_COLUMN']=df.sum(axis=1) |
| Sum values per row | df['NEW_COLUMN']=df.sum(axis=1) |
| Sum values per column | df['COLUMN_NAME'].sum(axis=0) |
| Cut a dataframe from columns A to D | new_df=df.loc[:,'A':'D'] |
| Concatenate multiple dataframes | df=pd.concat([df1,df2,df3,df4,...,dfn], axis=1) |
| Renaming columns | 1) get the current names with column_names=df.columns.values, 2) modify the name values in the variable, 3) df.columns=column_names |
| Reordering columns | 1) get the current names with column_names_reordered=df.columns.values, 2) modify the order of the values in the variable, 3) df=df[column_names_reordered] |
| Convert the values in a column to timestamp format. Ex: a DATE column with strings such as 07/15/2018 | pd.to_datetime(df['Date'], format='%m/%d/%Y'); for the conventions of the format parameter, see the ‘strftime’ documentation |
| Reclassify the values in a column | In a column with 4 distinct values, where we wish to combine 2, 3 & 4 as 'B' and relabel 1 as 'A': df['COLUMN_NAME']=df['COLUMN_NAME'].map({1:'A',2:'B',3:'B',4:'B'}) |
| Slice data with iloc | df.iloc[START_ROW:END_ROW,START_COLUMN:END_COLUMN]. Ex: df.iloc[:,:-1], which is all rows and all columns but the last one |
| Where | df['COLUMN_NAME']=np.where(df['COLUMN_NAME']>CONDITION,TRUE_VALUE,FALSE_VALUE) |
| Filter data | df[df['COLUMN_NAME']=='VALUE'] |
| Temporarily display more rows and columns | with pd.option_context('display.max_columns', 22, 'display.min_rows', 10): display(df) |

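As a small sketch tying several of these operations together on a hypothetical DataFrame (the columns 'Date', 'Score' and 'Group' are placeholders, not from the original examples):

    import pandas as pd
    import numpy as np

    # hypothetical DataFrame, for illustration only
    df = pd.DataFrame({
        'Date': ['07/15/2018', '08/01/2018'],
        'Score': [10, 25],
        'Group': [1, 3],
    })

    df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')       # strings to timestamps
    df['Group'] = df['Group'].map({1: 'A', 2: 'B', 3: 'B', 4: 'B'})  # reclassify values
    df['Flag'] = np.where(df['Score'] > 20, 'high', 'low')           # conditional column
    df['Total'] = df[['Score']].sum(axis=1)                          # sum values per row
    df = df.drop(['Flag'], axis=1)                                   # drop a column
    df.columns = ['date', 'score', 'group', 'total']                 # rename all columns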

Useful code snippets

**Find Missing Data**

For a DataFrame called ‘df’, this shows the percentage of missing values per column, restricted to the columns where the ‘.isna()’ function actually detects missing values.

    ((df
    .isna()                            # True where a value is missing
    .mean()                            # fraction of missing values per column
    *100)                              # convert to a percentage
    .pipe(lambda ser: ser[ser > 0])    # keep only columns with missing values
    )

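For instance, applied to a small hypothetical DataFrame (the columns 'a', 'b' and 'c' are made up for illustration), the snippet keeps only the columns that actually contain missing values:

    import numpy as np
    import pandas as pd

    # hypothetical DataFrame with missing values, for illustration only
    df = pd.DataFrame({
        'a': [1, 2, np.nan, 4],       # 25% missing
        'b': [np.nan, np.nan, 3, 4],  # 50% missing
        'c': [1, 2, 3, 4],            # complete, so it is filtered out
    })

    ((df
    .isna()
    .mean()
    *100)
    .pipe(lambda ser: ser[ser > 0])
    )
    # expected output:
    # a    25.0
    # b    50.0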


Filtering Data Operations

| Operation | Example Code |
| --- | --- |
| With an AND condition | df[(df.state=='Texas') & (df.year==1990)] |
| With an OR condition | df[(df.state=='Texas') \| (df.year==1990)] |
| With either column-name nomenclature | df[df.state=='Texas'] or df[df['state']=='Texas'] |
| With a list of values | data_filtered=df[df.column_Name.isin(list_of_values)] |
| Using ‘query’ | df.query('column_Name>=5 and column_Name<=10') |

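A short sketch of these filters in action, assuming a hypothetical df with 'state', 'year' and 'score' columns (all placeholder names):

    import pandas as pd

    # hypothetical DataFrame, for illustration only
    df = pd.DataFrame({
        'state': ['Texas', 'Texas', 'Ohio'],
        'year': [1990, 2000, 1990],
        'score': [3, 7, 12],
    })

    and_filter = df[(df.state == 'Texas') & (df.year == 1990)]   # AND condition
    or_filter = df[(df.state == 'Texas') | (df.year == 1990)]    # OR condition
    in_list = df[df.state.isin(['Texas', 'Ohio'])]               # membership in a list
    mid_scores = df.query('score >= 5 and score <= 10')          # 'query' syntax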


Date Formatting Data Operations

| Operation | Example Code |
| --- | --- |
| Converting a timestamp string into a date string | data['dt'] = data['ts'].map(lambda x: datetime.strptime(x, '%Y-%m-%dT%H:%M:%SZ').strftime("%Y-%m-%d")) |
| Loop to convert several columns to datetime | for column in columns_to_date: df[column] = pd.to_datetime(df[column], format='%B %d, %Y', errors='coerce') |

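A minimal sketch of the conversion loop, assuming hypothetical columns 'start_date' and 'end_date' that hold strings like 'August 19, 2022':

    import pandas as pd

    # hypothetical DataFrame, for illustration only
    df = pd.DataFrame({
        'start_date': ['August 19, 2022', 'July 4, 2021'],
        'end_date': ['September 1, 2022', 'not a date'],
    })

    columns_to_date = ['start_date', 'end_date']
    for column in columns_to_date:
        # invalid strings become NaT instead of raising an error
        df[column] = pd.to_datetime(df[column], format='%B %d, %Y', errors='coerce')

    print(df.dtypes)   # both columns are now datetime64[ns]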

Chaining Examples

**Groupby & Aggregations: Controlling Column Names with pd.NamedAgg**

See a code example in this workbook:

    data=(df2
        [df2.date.dt.year==2017]
        .groupby(df2.state)
        .agg(
            tot_victims=pd.NamedAgg(column="victims", aggfunc="sum"),
            num_incidents=pd.NamedAgg(column="incident_id", aggfunc="count")
        )
        .sort_values(by='tot_victims', ascending=False)
        .reset_index()
    )

**Renaming Columns with .set_axis()**

    dict_sub_region=(
        data.loc[:,['state','subRegion']]
        .set_index('state')
        .set_axis(['sub_region'], axis=1)   # renaming columns
        .to_dict()
    )

**Catching the Current State of a DataFrame during Method Chaining**

    def catchstate(df, var_name: str) -> 'pd.DataFrame':
        """
        Helper function that captures an intermediate DataFrame mid-chain.
        It creates a new variable called var_name in the global namespace
        and sets it to the DataFrame.
        """
        globals()[var_name] = df
        return df

    data=(df2_incidents
        .assign(year=df2_incidents.date.dt.year, month=df2_incidents.date.dt.month)
        .assign(sub_region=df2_incidents.state.map(dict_sub_region))
        .pipe(catchstate, var_name="df2")
        [(df2.sub_region!='Mountain')&(df2.year!=2017)]   # filtering the Las Vegas shooting outlier
        .groupby(by=['year','sub_region'])
        .agg(
            tot_victims=pd.NamedAgg(column="tot_victims", aggfunc="sum"),
            num_incidents=pd.NamedAgg(column="incident_id", aggfunc="count")
        ).reset_index()
    )
    data

**Adding time values based on a date-related column 'time_col'. Ex: time_col='date_of_birth'**

    df=df.assign(
        month=lambda df_: df_[time_col].dt.month,
        qtr=lambda df_: df_[time_col].dt.quarter,
        year=lambda df_: df_[time_col].dt.year,
        day=lambda df_: df_[time_col].dt.day_name()
    ).rename(columns={'month': ('').join([time_col, '_month']),
                      'qtr': ('').join([time_col, '_qtr']),
                      'year': ('').join([time_col, '_year']),
                      'day': ('').join([time_col, '_day'])
                      }
             )
    df.head(5)

Dictionary Operations

**Removing the Outer Layer of a Dictionary**
            
    dict_sub_region=(
        data.loc[:,['state','subregion']]
        .set_index('state')
        .set_axis(['sub_region'], axis=1)   # renaming columns
        .to_dict()
    )

This prints a dictionary like this:

    {'sub_region': {'AK': 'Pacific',
     'AL': 'East South Central',
     'AR': 'West South Central',
     'AZ': 'Mountain',
     'CA': 'Pacific',
     'CO': 'Mountain',
     (...)}}

To remove the outer 'sub_region' level, we use:

    from collections import ChainMap

    res = dict(ChainMap(*dict_sub_region.values()))   # removes the top level of the dictionary
    print(res)
