KPMG Data Analytics Virtual Internship

Task 1: Data Quality Assessment

Contents

  • Transactions table
  • NewCustomerList table
  • CustomerDemographic table
  • CustomerAddress table
In [1]:
import pandas as pd
import numpy as np
import datetime as dt
In [2]:
df = pd.read_excel('KPMG_VI_New_raw_data_update_final.xlsx', sheet_name = [1,2,3,4])

1. Transactions table¶

First Look At The Data¶

In [3]:
df[1]
Out[3]:
Note: The data and information in this document is reflective of a hypothetical situation and client. This document is to be used for KPMG Virtual Internship purposes only. Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10 Unnamed: 11 Unnamed: 12
0 transaction_id product_id customer_id transaction_date online_order order_status brand product_line product_class product_size list_price standard_cost product_first_sold_date
1 1 2 2950 2017-02-25 00:00:00 False Approved Solex Standard medium medium 71.49 53.62 41245
2 2 3 3120 2017-05-21 00:00:00 True Approved Trek Bicycles Standard medium large 2091.47 388.92 41701
3 3 37 402 2017-10-16 00:00:00 False Approved OHM Cycles Standard low medium 1793.43 248.82 36361
4 4 88 3135 2017-08-31 00:00:00 False Approved Norco Bicycles Standard medium medium 1198.46 381.1 36145
... ... ... ... ... ... ... ... ... ... ... ... ... ...
19996 19996 51 1018 2017-06-24 00:00:00 True Approved OHM Cycles Standard high medium 2005.66 1203.4 37823
19997 19997 41 127 2017-11-09 00:00:00 True Approved Solex Road medium medium 416.98 312.74 35560
19998 19998 87 2284 2017-04-14 00:00:00 True Approved OHM Cycles Standard medium medium 1636.9 44.71 40410
19999 19999 6 2764 2017-07-03 00:00:00 False Approved OHM Cycles Standard high medium 227.88 136.73 38216
20000 20000 11 1144 2017-09-22 00:00:00 True Approved Trek Bicycles Standard medium small 1775.81 1580.47 36334

20001 rows × 13 columns

Comment:¶

As the first row of the sheet holds the real column names (the imported header is just the disclaimer note), we will promote the first row to be the header.

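Equivalently, the disclaimer row can be skipped at read time so that every sheet arrives with the correct header (a sketch, assuming the disclaimer occupies only the first row of each sheet):

df = pd.read_excel('KPMG_VI_New_raw_data_update_final.xlsx', sheet_name=[1, 2, 3, 4], header=1)
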
Change the header¶

In [4]:
new_header_1 = df[1].iloc[0] 
transactions = df[1][1:]
transactions.columns = new_header_1
transactions = transactions.rename_axis(None, axis = 1)
transactions.head()
Out[4]:
transaction_id product_id customer_id transaction_date online_order order_status brand product_line product_class product_size list_price standard_cost product_first_sold_date
1 1 2 2950 2017-02-25 00:00:00 False Approved Solex Standard medium medium 71.49 53.62 41245
2 2 3 3120 2017-05-21 00:00:00 True Approved Trek Bicycles Standard medium large 2091.47 388.92 41701
3 3 37 402 2017-10-16 00:00:00 False Approved OHM Cycles Standard low medium 1793.43 248.82 36361
4 4 88 3135 2017-08-31 00:00:00 False Approved Norco Bicycles Standard medium medium 1198.46 381.1 36145
5 5 78 787 2017-10-01 00:00:00 True Approved Giant Bicycles Standard medium large 1765.3 709.48 42226

Data types¶

In [5]:
transactions.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 1 to 20000
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   transaction_id           20000 non-null  object
 1   product_id               20000 non-null  object
 2   customer_id              20000 non-null  object
 3   transaction_date         20000 non-null  object
 4   online_order             19640 non-null  object
 5   order_status             20000 non-null  object
 6   brand                    19803 non-null  object
 7   product_line             19803 non-null  object
 8   product_class            19803 non-null  object
 9   product_size             19803 non-null  object
 10  list_price               20000 non-null  object
 11  standard_cost            19803 non-null  object
 12  product_first_sold_date  19803 non-null  object
dtypes: object(13)
memory usage: 2.0+ MB
In [6]:
#change the currency variables to float
transactions['list_price'] = pd.to_numeric(transactions['list_price'], errors='coerce')
transactions['standard_cost'] = pd.to_numeric(transactions['standard_cost'], errors='coerce')
#change the date variable to datetime
transactions['transaction_date'] =  pd.to_datetime(transactions['transaction_date'], format='%Y-%m-%d')
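
Because errors='coerce' silently turns unparseable entries into NaN, it is worth confirming that the conversion created no missing values beyond those already present (a quick sketch; the counts should match the completeness table below):

transactions[['list_price', 'standard_cost']].isna().sum()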

Completeness¶

In [7]:
transactions.isna().sum()
Out[7]:
transaction_id               0
product_id                   0
customer_id                  0
transaction_date             0
online_order               360
order_status                 0
brand                      197
product_line               197
product_class              197
product_size               197
list_price                   0
standard_cost              197
product_first_sold_date    197
dtype: int64
Comment:¶

We found 360 missing values in the “online_order” column and 197 in each of six other columns (“brand”, “product_line”, “product_class”, “product_size”, “standard_cost”, “product_first_sold_date”).

  • As the 197 rows with missing values lack almost half of their fields, we will remove those rows to improve the completeness of the table.
  • In the “online_order” column, the proportions of True and False values are approximately equal, so we will forward-fill the missing values with the last valid observation (the split is re-checked after the imputation below).
In [8]:
#remove the 197 rows with missing product information
transactions.dropna(subset=['brand'], inplace=True)
transactions.reset_index(drop=True, inplace=True)
In [9]:
transactions['online_order'].value_counts()
Out[9]:
True     9739
False    9706
Name: online_order, dtype: int64
In [10]:
#forward-fill missing values with the last valid observation
transactions['online_order'] = transactions['online_order'].fillna(method='ffill')
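
To confirm that the imputation preserved the roughly even True/False split, the proportions can be re-checked (a quick sketch):

transactions['online_order'].value_counts(normalize=True)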

Relevancy¶

Comment:¶

We also see that the values in the “product_first_sold_date” column are stored as 5-digit Excel date serial numbers rather than as dates. Therefore, we will convert them into datetime format.

In [11]:
#convert the Excel serial numbers in "product_first_sold_date" into dates
#Excel's 1900 date system counts days from an origin of 1899-12-30
lst = []
for x in np.array(transactions['product_first_sold_date']):
    lst.append(dt.datetime(1899, 12, 30) + dt.timedelta(days=x))
transactions['product_first_sold_date'] = np.array(lst)
In [12]:
#ensure the column dtype is datetime64
transactions['product_first_sold_date'] =  pd.to_datetime(transactions['product_first_sold_date'], format='%Y-%m-%d')
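
As a spot-check of the origin arithmetic, the first serial number in the table (41245) should map to a sensible date:

dt.datetime(1899, 12, 30) + dt.timedelta(days=41245)
#expected: datetime.datetime(2012, 12, 2, 0, 0)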

2. NewCustomerList table¶

First Look At The Data¶

In [13]:
df[2]
Out[13]:
Note: The data and information in this document is reflective of a hypothetical situation and client. This document is to be used for KPMG Virtual Internship purposes only. Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 ... Unnamed: 13 Unnamed: 14 Unnamed: 15 Unnamed: 16 Unnamed: 17 Unnamed: 18 Unnamed: 19 Unnamed: 20 Unnamed: 21 Unnamed: 22
0 first_name last_name gender past_3_years_bike_related_purchases DOB job_title job_industry_category wealth_segment deceased_indicator owns_car ... state country property_valuation NaN NaN NaN NaN NaN Rank Value
1 Chickie Brister Male 86 1957-07-12 General Manager Manufacturing Mass Customer N Yes ... QLD Australia 6 0.56 0.7000 0.875000 0.743750 1.0 1 1.71875
2 Morly Genery Male 69 1970-03-22 Structural Engineer Property Mass Customer N No ... NSW Australia 11 0.89 0.8900 1.112500 0.945625 1.0 1 1.71875
3 Ardelis Forrester Female 10 1974-08-28 00:00:00 Senior Cost Accountant Financial Services Affluent Customer N No ... VIC Australia 5 1.01 1.0100 1.010000 1.010000 1.0 1 1.71875
4 Lucine Stutt Female 64 1979-01-28 Account Representative III Manufacturing Affluent Customer N Yes ... QLD Australia 1 0.87 1.0875 1.087500 1.087500 4.0 4 1.703125
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
996 Ferdinand Romanetti Male 60 1959-10-07 Paralegal Financial Services Affluent Customer N No ... NSW Australia 7 0.79 0.7900 0.790000 0.790000 996.0 996 0.374
997 Burk Wortley Male 22 2001-10-17 Senior Sales Associate Health Mass Customer N No ... NSW Australia 10 0.76 0.7600 0.950000 0.807500 997.0 997 0.357
998 Melloney Temby Female 17 1954-10-05 Budget/Accounting Analyst IV Financial Services Affluent Customer N Yes ... QLD Australia 2 0.85 1.0625 1.062500 1.062500 997.0 997 0.357
999 Dickie Cubbini Male 30 1952-12-17 Financial Advisor Financial Services Mass Customer N Yes ... QLD Australia 2 1.09 1.3625 1.362500 1.158125 997.0 997 0.357
1000 Sylas Duffill Male 56 1955-10-02 Staff Accountant IV Property Mass Customer N Yes ... NSW Australia 9 0.47 0.5875 0.734375 0.624219 1000.0 1000 0.34

1001 rows × 23 columns

Change the header¶

In [14]:
new_header_2 = df[2].iloc[0] 
newcust = df[2][1:]
newcust.columns = new_header_2
newcust = newcust.rename_axis(None, axis = 1)
newcust.head()
Out[14]:
first_name last_name gender past_3_years_bike_related_purchases DOB job_title job_industry_category wealth_segment deceased_indicator owns_car ... state country property_valuation NaN NaN NaN NaN NaN Rank Value
1 Chickie Brister Male 86 1957-07-12 General Manager Manufacturing Mass Customer N Yes ... QLD Australia 6 0.56 0.7000 0.8750 0.743750 1.0 1 1.71875
2 Morly Genery Male 69 1970-03-22 Structural Engineer Property Mass Customer N No ... NSW Australia 11 0.89 0.8900 1.1125 0.945625 1.0 1 1.71875
3 Ardelis Forrester Female 10 1974-08-28 00:00:00 Senior Cost Accountant Financial Services Affluent Customer N No ... VIC Australia 5 1.01 1.0100 1.0100 1.010000 1.0 1 1.71875
4 Lucine Stutt Female 64 1979-01-28 Account Representative III Manufacturing Affluent Customer N Yes ... QLD Australia 1 0.87 1.0875 1.0875 1.087500 4.0 4 1.703125
5 Melinda Hadlee Female 34 1965-09-21 Financial Analyst Financial Services Affluent Customer N No ... NSW Australia 9 0.52 0.5200 0.6500 0.650000 4.0 4 1.703125

5 rows × 23 columns

Data types¶

In [15]:
newcust.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 1 to 1000
Data columns (total 23 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   first_name                           1000 non-null   object 
 1   last_name                            971 non-null    object 
 2   gender                               1000 non-null   object 
 3   past_3_years_bike_related_purchases  1000 non-null   object 
 4   DOB                                  983 non-null    object 
 5   job_title                            894 non-null    object 
 6   job_industry_category                835 non-null    object 
 7   wealth_segment                       1000 non-null   object 
 8   deceased_indicator                   1000 non-null   object 
 9   owns_car                             1000 non-null   object 
 10  tenure                               1000 non-null   object 
 11  address                              1000 non-null   object 
 12  postcode                             1000 non-null   object 
 13  state                                1000 non-null   object 
 14  country                              1000 non-null   object 
 15  property_valuation                   1000 non-null   object 
 16  nan                                  1000 non-null   float64
 17  nan                                  1000 non-null   float64
 18  nan                                  1000 non-null   float64
 19  nan                                  1000 non-null   float64
 20  nan                                  1000 non-null   float64
 21  Rank                                 1000 non-null   object 
 22  Value                                1000 non-null   object 
dtypes: float64(5), object(18)
memory usage: 179.8+ KB
In [16]:
#drop unnecessary columns
newcust.drop(newcust.columns[[16, 17, 18, 19, 20]], inplace = True, axis = 1)
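
Because those five columns carry NaN headers after the promotion, an equivalent label-based drop (a sketch) avoids hard-coding the positions:

newcust = newcust.loc[:, newcust.columns.notna()]
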
In [17]:
#change the numerical variables to float
newcust['tenure'] = pd.to_numeric(newcust['tenure'], errors='coerce')
newcust['Rank'] = pd.to_numeric(newcust['Rank'], errors='coerce')
newcust['Value'] = pd.to_numeric(newcust['Value'], errors='coerce')
#change the date variable to datetime
newcust['DOB'] =  pd.to_datetime(newcust['DOB'], format='%Y-%m-%d')

Consistency¶

In [18]:
newcust['gender'].value_counts()
Out[18]:
Female    513
Male      470
U          17
Name: gender, dtype: int64
Comment:¶

In the “gender” column, there are 17 values labelled “U”, which do not correspond to any gender category. We will discard these rows later (see the “DOB” step below).

Completeness¶

In [19]:
newcust.isna().sum()
Out[19]:
first_name                               0
last_name                               29
gender                                   0
past_3_years_bike_related_purchases      0
DOB                                     17
job_title                              106
job_industry_category                  165
wealth_segment                           0
deceased_indicator                       0
owns_car                                 0
tenure                                   0
address                                  0
postcode                                 0
state                                    0
country                                  0
property_valuation                       0
Rank                                     0
Value                                    0
dtype: int64
Comment:¶

There are 29 missing values in the “last_name” column, 17 in “DOB”, 106 in “job_title” and 165 in “job_industry_category”.

  • To mitigate the missing values in the “last_name” variable, we will create a new column called “name” that combines first and last name. This new column replaces the “first_name” and “last_name” variables, so we will remove those two old columns (and check the result just below).
In [20]:
#create a name variable that contains both first name and last name
#strip() removes the trailing space left when last_name is missing
newcust['name'] = (newcust['first_name'] + ' ' + newcust['last_name'].fillna('')).str.strip()
newcust.drop(['first_name', 'last_name'], inplace = True, axis = 1)
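
Since “first_name” has no missing values, the new column should be fully populated; a quick check (sketch):

newcust['name'].isna().sum()
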
  • We also found that the 17 missing values in “DOB” fall exactly on the rows where gender is “U”. Removing the rows with a missing “DOB” therefore resolves both the completeness and the consistency issue; the overlap is verified below.
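
The overlap can be verified before dropping anything (a quick sketch; all 17 rows with a missing “DOB” should report gender “U”):

newcust.loc[newcust['DOB'].isna(), 'gender'].value_counts()
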
In [21]:
#Remove missing values in DOB 
newcust.dropna(subset=['DOB'], inplace=True)
newcust.reset_index(drop=True, inplace=True)
  • We deal with the blanks in “job_title” and “job_industry_category” by forward-filling them with the last valid observation; completeness is re-checked after both fills below.
In [22]:
newcust['job_title'].value_counts()
Out[22]:
Environmental Tech          14
Software Consultant         14
Associate Professor         14
Chief Design Engineer       13
Cost Accountant             12
                            ..
Web Developer II             1
Statistician III             1
Systems Administrator IV     1
Accounting Assistant IV      1
Web Developer I              1
Name: job_title, Length: 184, dtype: int64
In [23]:
#forward-fill missing values with the last valid observation
newcust['job_title'] = newcust['job_title'].fillna(method='ffill')
In [24]:
newcust['job_industry_category'].value_counts()
Out[24]:
Financial Services    202
Manufacturing         199
Health                152
Retail                 78
Property               64
Entertainment          36
IT                     36
Argiculture            26
Telecommunications     25
Name: job_industry_category, dtype: int64
In [25]:
#forward-fill missing values with the last valid observation
newcust['job_industry_category'] = newcust['job_industry_category'].fillna(method='ffill')
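
With both columns filled, a final completeness check (a sketch) should now report no missing values left anywhere in the table:

newcust.isna().sum().sum()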

3. CustomerDemographic table¶

First Look At The Data¶

In [26]:
df[3]
Out[26]:
Note: The data and information in this document is reflective of a hypothetical situation and client. This document is to be used for KPMG Virtual Internship purposes only. Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10 Unnamed: 11 Unnamed: 12
0 customer_id first_name last_name gender past_3_years_bike_related_purchases DOB job_title job_industry_category wealth_segment deceased_indicator default owns_car tenure
1 1 Laraine Medendorp F 93 1953-10-12 00:00:00 Executive Secretary Health Mass Customer N "' Yes 11
2 2 Eli Bockman Male 81 1980-12-16 00:00:00 Administrative Officer Financial Services Mass Customer N <script>alert('hi')</script> Yes 16
3 3 Arlin Dearle Male 61 1954-01-20 00:00:00 Recruiting Manager Property Mass Customer N 2018-02-01 00:00:00 Yes 15
4 4 Talbot NaN Male 33 1961-10-03 00:00:00 NaN IT Mass Customer N () { _; } >_[$($())] { touch /tmp/blns.shellsh... No 7
... ... ... ... ... ... ... ... ... ... ... ... ... ...
3996 3996 Rosalia Halgarth Female 8 1975-08-09 00:00:00 VP Product Management Health Mass Customer N -100 No 19
3997 3997 Blanch Nisuis Female 87 2001-07-13 00:00:00 Statistician II Manufacturing High Net Worth N â¦testâ§ Yes 1
3998 3998 Sarene Woolley U 60 NaN Assistant Manager IT High Net Worth N NaN No NaN
3999 3999 Patrizius NaN Male 11 1973-10-24 00:00:00 NaN Manufacturing Affluent Customer N ¡â¢Â£Â¢â§¶â¢ÂªÂºââ Yes 10
4000 4000 Kippy Oldland Male 76 1991-11-05 00:00:00 Software Engineer IV NaN Affluent Customer N 0/0 No 11

4001 rows × 13 columns

Change the header¶

In [27]:
new_header_3 = df[3].iloc[0] 
custdemo = df[3][1:]
custdemo.columns = new_header_3
custdemo = custdemo.rename_axis(None, axis = 1)
custdemo.head()
Out[27]:
customer_id first_name last_name gender past_3_years_bike_related_purchases DOB job_title job_industry_category wealth_segment deceased_indicator default owns_car tenure
1 1 Laraine Medendorp F 93 1953-10-12 00:00:00 Executive Secretary Health Mass Customer N "' Yes 11
2 2 Eli Bockman Male 81 1980-12-16 00:00:00 Administrative Officer Financial Services Mass Customer N <script>alert('hi')</script> Yes 16
3 3 Arlin Dearle Male 61 1954-01-20 00:00:00 Recruiting Manager Property Mass Customer N 2018-02-01 00:00:00 Yes 15
4 4 Talbot NaN Male 33 1961-10-03 00:00:00 NaN IT Mass Customer N () { _; } >_[$($())] { touch /tmp/blns.shellsh... No 7
5 5 Sheila-kathryn Calton Female 56 1977-05-13 00:00:00 Senior Editor NaN Affluent Customer N NIL Yes 8

Data types¶

In [28]:
custdemo.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 1 to 4000
Data columns (total 13 columns):
 #   Column                               Non-Null Count  Dtype 
---  ------                               --------------  ----- 
 0   customer_id                          4000 non-null   object
 1   first_name                           4000 non-null   object
 2   last_name                            3875 non-null   object
 3   gender                               4000 non-null   object
 4   past_3_years_bike_related_purchases  4000 non-null   object
 5   DOB                                  3913 non-null   object
 6   job_title                            3494 non-null   object
 7   job_industry_category                3344 non-null   object
 8   wealth_segment                       4000 non-null   object
 9   deceased_indicator                   4000 non-null   object
 10  default                              3698 non-null   object
 11  owns_car                             4000 non-null   object
 12  tenure                               3913 non-null   object
dtypes: object(13)
memory usage: 406.4+ KB
In [29]:
#change the numerical variable to float
custdemo['tenure'] = pd.to_numeric(custdemo['tenure'], errors='coerce')
#change the date variable to datetime
custdemo['DOB'] =  pd.to_datetime(custdemo['DOB'], format='%Y-%m-%d')

Completeness¶

In [30]:
custdemo.isna().sum()
Out[30]:
customer_id                              0
first_name                               0
last_name                              125
gender                                   0
past_3_years_bike_related_purchases      0
DOB                                     87
job_title                              506
job_industry_category                  656
wealth_segment                           0
deceased_indicator                       0
default                                302
owns_car                                 0
tenure                                  87
dtype: int64
Comment:¶

We found 125 missing values in the “last_name” column, 87 in both “DOB” and “tenure”, 302 in “default”, and 506 and 656 in “job_title” and “job_industry_category” respectively.

  • We handle the missing values in “last_name” the same way as in the NewCustomerList table above.
  • The 87 rows with blanks in “DOB” are the same rows that have null “tenure” values, so we will remove those 87 rows (see the check after this list).
  • We deal with the blanks in “job_title” and “job_industry_category” by forward-filling them with the last valid observation.
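
The claimed overlap can be verified before dropping (a quick sketch; the count should come back as 87):

(custdemo['DOB'].isna() & custdemo['tenure'].isna()).sum()
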
In [31]:
#create a name variable that contains both first name and last name
#strip() removes the trailing space left when last_name is missing
custdemo['name'] = (custdemo['first_name'] + ' ' + custdemo['last_name'].fillna('')).str.strip()
custdemo.drop(['first_name', 'last_name'], inplace = True, axis = 1)
In [32]:
#Remove missing values in DOB 
custdemo.dropna(subset=['DOB'], inplace=True)
custdemo.reset_index(drop=True, inplace=True)
In [33]:
#forward-fill missing values with the last valid observation
custdemo['job_title'] = custdemo['job_title'].fillna(method='ffill')
In [34]:
#forward-fill missing values with the last valid observation
custdemo['job_industry_category'] = custdemo['job_industry_category'].fillna(method='ffill')

Accuracy¶

In [35]:
custdemo['DOB'].sort_values().head()
Out[35]:
33     1843-12-21
704    1931-10-23
1065   1935-08-22
3334   1940-09-22
2361   1943-08-11
Name: DOB, dtype: datetime64[ns]
Comment:¶

In the “DOB” column, we noticed that one customer was born on 1843-12-21, which would make them about 175 years old given the dataset’s 2017–2018 time frame. Knowing that this cannot be correct, we will delete this row.

In [36]:
#remove the outlier (the 1843 date of birth) from the DOB column
custdemo.drop(custdemo[custdemo['DOB'] == '1843-12-21'].index, inplace = True)
custdemo = custdemo.reset_index(drop= True)

Consistency¶

In [37]:
custdemo['gender'].value_counts()
Out[37]:
Female    2037
Male      1872
F            1
Femal        1
M            1
Name: gender, dtype: int64
Comment:¶

We found some inconsistencies, along with misspellings, in the “gender” column: “F”, “M” and “Femal” should read “Female”, “Male” and “Female” respectively.

In [38]:
#rename value in gender column
custdemo['gender'].replace('F', 'Female',inplace=True)
custdemo['gender'].replace('M', 'Male',inplace=True)
custdemo['gender'].replace('Femal', 'Female',inplace=True)
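
The three replacements can equally be expressed as one mapping, which scales better if more variants turn up (a sketch of the same clean-up):

custdemo['gender'] = custdemo['gender'].replace({'F': 'Female', 'M': 'Male', 'Femal': 'Female'})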

Validity¶

Comment:¶

As we only want to analyse data from customers who are still alive, we will remove the rows with the value “Y” in “deceased_indicator”.

In [39]:
#remove "Y" values from the deceased_indicator column
custdemo.drop(custdemo[custdemo['deceased_indicator'] == 'Y'].index, inplace = True)
custdemo = custdemo.reset_index(drop= True)

Relevancy¶

Comment:¶

As the “default” column contains arbitrary junk strings (escape sequences, scripts, stray tokens) with no apparent relationship to the rest of the dataset, we will delete this column from the table.

In [40]:
#drop default variable
custdemo.drop(['default'], inplace = True, axis = 1)

4. CustomerAddress table¶

First Look At The Data¶

In [41]:
df[4]
Out[41]:
Note: The data and information in this document is reflective of a hypothetical situation and client. This document is to be used for KPMG Virtual Internship purposes only. Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5
0 customer_id address postcode state country property_valuation
1 1 060 Morning Avenue 2016 New South Wales Australia 10
2 2 6 Meadow Vale Court 2153 New South Wales Australia 10
3 4 0 Holy Cross Court 4211 QLD Australia 9
4 5 17979 Del Mar Point 2448 New South Wales Australia 4
... ... ... ... ... ... ...
3995 3999 1482 Hauk Trail 3064 VIC Australia 3
3996 4000 57042 Village Green Point 4511 QLD Australia 6
3997 4001 87 Crescent Oaks Alley 2756 NSW Australia 10
3998 4002 8194 Lien Street 4032 QLD Australia 7
3999 4003 320 Acker Drive 2251 NSW Australia 7

4000 rows × 6 columns

Change the header¶

In [42]:
new_header_4 = df[4].iloc[0] 
address = df[4][1:]
address.columns = new_header_4
address = address.rename_axis(None, axis = 1)
address.head()
Out[42]:
customer_id address postcode state country property_valuation
1 1 060 Morning Avenue 2016 New South Wales Australia 10
2 2 6 Meadow Vale Court 2153 New South Wales Australia 10
3 4 0 Holy Cross Court 4211 QLD Australia 9
4 5 17979 Del Mar Point 2448 New South Wales Australia 4
5 6 9 Oakridge Court 3216 VIC Australia 9

Data types¶

In [43]:
address.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3999 entries, 1 to 3999
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   customer_id         3999 non-null   object
 1   address             3999 non-null   object
 2   postcode            3999 non-null   object
 3   state               3999 non-null   object
 4   country             3999 non-null   object
 5   property_valuation  3999 non-null   object
dtypes: object(6)
memory usage: 187.6+ KB

Completeness¶

In [45]:
address.isna().sum()
Out[45]:
customer_id           0
address               0
postcode              0
state                 0
country               0
property_valuation    0
dtype: int64
Comment:¶

There are no missing values in this dataframe.

Consistency¶

In [46]:
address['state'].value_counts()
Out[46]:
NSW                2054
VIC                 939
QLD                 838
New South Wales      86
Victoria             82
Name: state, dtype: int64
Comment:¶

There are some inconsistencies in the “state” column: full state names appear alongside abbreviations. Therefore, we will change “New South Wales” to “NSW” and “Victoria” to “VIC”.

In [47]:
#rename value in state column
address['state'].replace('New South Wales', 'NSW',inplace=True)
address['state'].replace('Victoria', 'VIC',inplace=True)

Additional values¶

Comment:¶

There are some customer_ids that appear in the Transactions and CustomerAddress tables but not in the CustomerDemographic table. Therefore, we will remove those extra rows from both tables; the offending IDs are identified below.

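The orphan IDs can be found with a set difference against the raw CustomerDemographic ID range (a sketch; it assumes, as the sheet suggests, that demographic customer_ids run from 1 to 4000):

set(transactions['customer_id']) - set(range(1, 4001))
#expected: {5034}
set(address['customer_id']) - set(range(1, 4001))
#expected: {4001, 4002, 4003}
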
In [48]:
transactions.drop(transactions[transactions['customer_id'] == 5034].index, inplace = True)
transactions = transactions.reset_index(drop= True)
address.drop(address[(address['customer_id'] == 4001) | (address['customer_id'] == 4002) | (address['customer_id'] == 4003)].index, inplace = True)
address = address.reset_index(drop= True)

Save file¶

In [49]:
#write the four cleaned tables to one Excel workbook; the context manager saves and closes the file on exit
with pd.ExcelWriter('KPMG_task1.xlsx') as writer:
    transactions.to_excel(writer, index=False, sheet_name='Transactions')
    newcust.to_excel(writer, index=False, sheet_name='NewCustomerList')
    custdemo.to_excel(writer, index=False, sheet_name='CustomerDemographic')
    address.to_excel(writer, index=False, sheet_name='CustomerAddress')