Beginner Data Analysis Process — part one

Renee LIN
Mar 5, 2021

We are surrounded by data: digital songs, videos, photos, and so on that we create and share on various social media platforms. However, before working part-time as a data analyst, I rarely viewed the things around me through a data analysis lens.

When you use a service as a customer, you just use it; that's it. From the company's perspective, though, the numbers matter: how many customers there are, how much they spend, which products are popular and to what degree, and what insight can be extracted from the data…

In fact, I started playing Final Fantasy XIV in January, and sometimes I wonder which places are the most popular, what kinds of events attract the largest number of players, how much time players spend in the game, and how they spend it.

You might have noticed that I often slip into descriptive data analysis mode, which is the easier part of data analysis. The more challenging part is finding cause and effect in that descriptive data and turning it into guidance or suggested actions.

As a junior data analyst, what I have done for the past seven months is all descriptive data analysis. There are several steps from raw data to the final report:

  1. Understand what your boss or another department needs; it could be simple monthly revenue figures or an analysis report on a particular promotion event.
  2. Get the raw data, either from colleagues or pulled directly from some sort of ERP system. No SQL is needed in our company, since there is no database we communicate with directly.
  3. Get to know all the attributes or features available in the Excel/CSV file.
  4. Data cleansing: get rid of null cells, unreliable data, etc.
  5. Data wrangling, or data aggregation. Raw data usually come on a daily basis, so we aggregate them by month or year, or by store or group.
  6. Data visualization: building the storyline or final report for your supervisor.

Below are some technical reflections only, nothing conceptual about figuring out the purpose of an analysis or how to tell a story with the data.

SQL, Python

I think I will need to pick up SQL and practice it again. Although I don't need it right now, it seems like a must for working as a data analyst.

Pandas and NumPy do most of the work; in practically every step you need to select rows, columns, or cells by some condition (a short boolean-mask sketch follows after the two examples below).

  1. Basic selection using loc, iloc
# loc  : label-based selection
# iloc : integer location-based selection
>>> import pandas as pd
>>> s = pd.Series(list("abcdef"), index=[49, 48, 47, 0, 1, 2])
>>> s
49    a
48    b
47    c
0     d
1     e
2     f
dtype: object

>>> s.loc[0]   # value at index label 0
'd'

>>> s.iloc[0]  # value at index location 0
'a'

>>> s.loc[0:1]   # rows with index labels from 0 to 1 (inclusive)
0    d
1    e
dtype: object

>>> s.iloc[0:1]  # rows at index locations 0 up to 1 (exclusive)
49    a
dtype: object

2. Not In condition

import pandas as pd

dates_drop = ['2021-03-05' , '2021-03-06']

df = pd.DataFrame(data={'date':['2021-03-01','2021-03-02','2021-03-03','2021-03-04','2021-03-05','2021-03-06']})

print(df)
#          date
# 0  2021-03-01
# 1  2021-03-02
# 2  2021-03-03
# 3  2021-03-04
# 4  2021-03-05
# 5  2021-03-06

df = df[~df['date'].isin(dates_drop)]
print(df)
#          date
# 0  2021-03-01
# 1  2021-03-02
# 2  2021-03-03
# 3  2021-03-04
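
Since most day-to-day selection is really filtering rows by a condition, here is a minimal boolean-mask sketch; the store/amount table is made up purely for illustration:

import pandas as pd

# hypothetical sales table, just for illustration
df = pd.DataFrame({'store':  ['A', 'A', 'B', 'B'],
                   'amount': [120, 80, 200, 50]})

print(df[df['amount'] > 100])                            # single condition
print(df[(df['store'] == 'A') & (df['amount'] > 100)])   # conditions combined with & / |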

Data Cleansing

When there are null values, replace them with zero, delete them, or estimate them.
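
A minimal sketch of those three options in pandas (fillna, dropna, interpolate); the revenue column is made up for illustration:

import numpy as np
import pandas as pd

df = pd.DataFrame({'revenue': [100, np.nan, 300, np.nan]})

print(df['revenue'].fillna(0))        # replace missing values with zero
print(df.dropna(subset=['revenue']))  # drop the rows that contain missing values
print(df['revenue'].interpolate())    # estimate missing values from their neighbours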

Delete some rows or columns, since the raw table can include tons of attributes.

# numpy.delete(a, index) removes the elements at the given positions
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
index = [2, 3, 6]

new_a = np.delete(a, index)

print(new_a)  # prints [1 2 5 6 8 9]

# the same thing with enumerate and a list comprehension
new_a = [val for i, val in enumerate(a) if i not in index]
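
When the data is already in a DataFrame rather than a NumPy array, the drop method does the same job on rows or columns directly; a small sketch with made-up attribute names:

import pandas as pd

df = pd.DataFrame({'att1': [1, 2, 3], 'att2': [4, 5, 6], 'att3': [7, 8, 9]})

df = df.drop(columns=['att3'])   # drop an unneeded attribute column
df = df.drop(index=[1])          # drop a row by its index label
print(df)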

Data Aggregation

The group operation is often used

df.groupby(['att1', 'att2']).sum()
df.groupby(['att1', 'att2'], as_index=False).sum()
df.groupby(['att1', 'att2'])['att3'].sum()
df.groupby(['att1', 'att2']).agg({'att3': 'sum', 'att4': 'mean'})  # different aggregations per column
df.groupby(['att1', 'att2']).size()                                # row count per group
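
As a concrete sketch of the daily-to-monthly case, with an invented store/revenue table, grouping on a month column derived from the date:

import pandas as pd

daily = pd.DataFrame({
    'date':    pd.to_datetime(['2021-03-01', '2021-03-02', '2021-04-01']),
    'store':   ['A', 'A', 'A'],
    'revenue': [100, 120, 80],
})

daily['month'] = daily['date'].dt.to_period('M')
monthly = daily.groupby(['store', 'month'], as_index=False)['revenue'].sum()
print(monthly)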

Sometimes you need to merge before doing aggregation

merge_data = df_a.merge(df_b, how='inner', left_on=['A', 'B'], right_on=['A', 'B_new'])
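
For example, attaching store information before grouping by region; the tables and column names below are invented for illustration:

import pandas as pd

sales  = pd.DataFrame({'store_id': [1, 1, 2], 'revenue': [100, 50, 80]})
stores = pd.DataFrame({'id': [1, 2], 'region': ['North', 'South']})

merged = sales.merge(stores, how='inner', left_on='store_id', right_on='id')
print(merged.groupby('region', as_index=False)['revenue'].sum())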

Sometimes a full merge is more than you need, and you just want to look up a single value from another DataFrame:

df_b[df_b.attr1 == 'A'].attr2.item()            # attr2 value in the row of df_b where attr1 == 'A'
df_b.loc[df_b.attr1 == 'A', 'attr2'].values[0]  # another way to get the same scalar

Finally, save the result and import it into a visualization tool to create the report:

df1.to_excel("output.xlsx", sheet_name='Sheet_name_1', index=False)
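
If several tables need to go into one workbook, pd.ExcelWriter can write them to separate sheets; df2 and the sheet names here are just placeholders:

import pandas as pd

df1 = pd.DataFrame({'month': ['2021-01', '2021-02'], 'revenue': [100, 120]})  # placeholder tables
df2 = pd.DataFrame({'store': ['A', 'B'], 'revenue': [60, 160]})

with pd.ExcelWriter("output.xlsx") as writer:
    df1.to_excel(writer, sheet_name='monthly', index=False)
    df2.to_excel(writer, sheet_name='by_store', index=False)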

Data Visualization

Currently I am using Tableau, which is powerful and elegant. I'd like to write about it separately next time.
