
Basic Skills in Data Analysis

This blog has been migrated to Microsoft Azure and is generated automatically by an Azure DevOps Pipeline. Due to some problems with Node.js, some inline MathJax may render incorrectly. The author is working on it, sorry.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In this section we use the housing dataset to practice some skills in data analysis.

HOUSING_PATH = "housing/datasets/handson-ml/datasets/housing/housing.csv"
housing = pd.read_csv(HOUSING_PATH)

Split of Data

Not all data are used for training. In fact, we usually split our dataset into:

  • Training dataset + Test dataset
  • Training dataset + Validation dataset + Test dataset

Sometimes we use a stratified split to maintain high representativeness.

Basic Split of Data

A Method for a Random Train-Test (TT) Split

For the first way, we can define a function like this:

def split_data(data, train_ratio):
    # Shuffle the row indices so the split is random.
    shuffled_indices = np.random.permutation(len(data))
    train_size = int(len(data) * train_ratio)

    train_indices = shuffled_indices[:train_size]
    test_indices = shuffled_indices[train_size:]

    # Return the training set first, then the test set.
    return data.iloc[train_indices], data.iloc[test_indices]

We can test it using our data.

train_housing, test_housing = split_data(housing, 0.7)

print(train_housing.shape)
print(test_housing.shape)
(14447, 10)
(6193, 10)

Note that we shuffle the indices every time we split. If we re-split before each training run, every run sees a different partition, so over multiple runs the model effectively gets to see the whole dataset. That’s what we should avoid!

Random but Fixed: Two Methods for a Stable Random TT Split

One possible solution is to save the two sets after the first training.

Another is to use a fixed seed when shuffling the indices (set it before calling numpy.random.permutation()) so that the same split is generated every time. However, both solutions break down as soon as the dataset is updated.
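
A minimal sketch of the fixed-seed approach, reusing the split_data function defined above (42 is an arbitrary seed):

# Fixing NumPy's global seed makes np.random.permutation() produce the
# same shuffle, and therefore the same split, on every run.
np.random.seed(42)
train_housing, test_housing = split_data(housing, 0.7)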

Identifiers: A Method for a Stable, Random but also Sustainable TT Split

We give each datum a unique identifier and decide which set it belongs to according to that identifier.

E.g. we hash the identifier and take the last byte of the digest. Since that byte is roughly uniformly distributed over $[0, 256)$, comparing it against the threshold $256 \times 70\% \approx 179$ puts about 70% of the rows on one side, implementing the 70 - 30 split.

import hashlib

def test_set_check(identifier, test_ratio, hash):
    # True when the last byte of the hashed identifier falls below the
    # threshold, i.e. for roughly `test_ratio` of all identifiers.
    return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio

def split_by_id(data, test_ratio, id_column, hash=hashlib.md5):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio, hash))
    # The rows selected by test_set_check are returned first.
    return data.loc[in_test_set], data.loc[~in_test_set]

N.B. The apply call with the lambda builds a boolean Series in_test_set, which is True for every id_ for which test_set_check returns True. data.loc[~in_test_set] then selects the rows for which it is False.

N.B. The identifier MUST BE UNIQUE. Here we use the row index as the identifier. However, this breaks as soon as any in-between row is removed (or new rows are not appended strictly at the end). Features that are relatively fixed, such as longitude and latitude, are good building blocks for identifiers.
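
For instance, a minimal sketch (the id formula below is illustrative, not from the original post): combine the coordinates into an identifier column and split on it.

# Hypothetical example: build an identifier from the relatively stable
# coordinates. Longitude is scaled so nearby districts still get distinct
# ids; collisions are possible, so check uniqueness in practice.
housing_with_id = housing.copy()
housing_with_id['id'] = housing_with_id['longitude'] * 1000 + housing_with_id['latitude']
train_set, test_set = split_by_id(housing_with_id, 0.7, 'id')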

Now we test the functions we’ve just defined. Note that calling split_by_id with 0.7 puts roughly 70% of the rows into the first returned set, which we use as the training set here.

train_housing, test_housing = split_by_id(housing.reset_index(), 0.7, 'index')

print(train_housing.shape)
print(test_housing.shape)
(14587, 11)
(6053, 11)

N.B. DataFrame.reset_index() returns a DataFrame with one extra column, index, holding the old row indices.
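
A tiny illustration on a made-up mini frame (not from the housing data):

# reset_index() moves the old row labels into a new 'index' column
# and renumbers the rows from 0.
df = pd.DataFrame({'a': [10, 20]}, index=[5, 7])
print(df.reset_index())
#    index   a
# 0      5  10
# 1      7  20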

Tools from Scikit-Learn

Scikit-Learn provides us with useful functions to split data.

sklearn.model_selection.train_test_split(
    *arrays,
    test_size=None,
    train_size=None,
    random_state=None,
    shuffle=True,
    stratify=None
)

N.B. When only one of test_size and train_size is provided, the function derives the other one from it. When neither is provided, the function splits using a 75 - 25 ratio.

from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(housing, train_size=0.7, shuffle=True)

print(train_data.shape)
print(test_data.shape)
(14447, 10)
(6193, 10)
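
Tying this back to the stability discussion above: passing random_state makes the split reproducible (though still not update-proof). A minimal sketch, with 42 as an arbitrary seed:

# random_state fixes the shuffle, so repeated runs yield the same split.
train_data, test_data = train_test_split(housing, train_size=0.7, shuffle=True, random_state=42)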

Stratified Split

If we want to maintain high representativeness of the data in every set, or to avoid the serious bias that purely random sampling may introduce, we use a stratified split.

First we need to stratify our data. We take median_income as an example.

housing['income_cate'] = np.ceil(housing['median_income'] / 1.5)             # So there won't be too many categories.
housing['income_cate'].where(housing['income_cate'] < 5, 5.0, inplace=True)  # Merge everything above 5 into category 5.
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing['income_cate']):
    strat_train_data = housing.loc[train_index]
    strat_test_data = housing.loc[test_index]

Its usage is a bit involved.
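
A simpler alternative is to pass the stratify argument to train_test_split (visible in its signature above). A minimal sketch; the result is equivalent in spirit, though not row-for-row identical to the StratifiedShuffleSplit output:

# Stratified split via train_test_split's stratify parameter.
strat_train_alt, strat_test_alt = train_test_split(
    housing, test_size=0.2, stratify=housing['income_cate'], random_state=42)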

Anyway, first let’s look at the category distribution in the whole dataset.

housing['income_cate'].value_counts() / len(housing)
3.0    0.350581
2.0    0.318847
4.0    0.176308
5.0    0.114438
1.0    0.039826
Name: income_cate, dtype: float64
strat_train_data['income_cate'].value_counts() / len(strat_train_data)
3.0    0.350594
2.0    0.318859
4.0    0.176296
5.0    0.114402
1.0    0.039850
Name: income_cate, dtype: float64
strat_test_data['income_cate'].value_counts() / len(strat_test_data)
3.0    0.350533
2.0    0.318798
4.0    0.176357
5.0    0.114583
1.0    0.039729
Name: income_cate, dtype: float64

Nice! To keep the data tidy we delete the income_cate column.

for set_ in (strat_test_data, strat_train_data):
    set_.drop(['income_cate'], axis=1, inplace=True)

From now on, we should set the test set aside and focus only on the training set.

housing = strat_train_data.copy()

Display of Data

It’s a good idea to take a look at the data before we use them. To see them is to display them.

Raw Data

housing

longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
17606 -121.89 37.29 38.0 1568.0 351.0 710.0 339.0 2.7042 286600.0 <1H OCEAN
18632 -121.93 37.05 14.0 679.0 108.0 306.0 113.0 6.4214 340600.0 <1H OCEAN
14650 -117.20 32.77 31.0 1952.0 471.0 936.0 462.0 2.8621 196900.0 NEAR OCEAN
3230 -119.61 36.31 25.0 1847.0 371.0 1460.0 353.0 1.8839 46300.0 INLAND
3555 -118.59 34.23 17.0 6592.0 1525.0 4459.0 1463.0 3.0347 254500.0 <1H OCEAN
... ... ... ... ... ... ... ... ... ... ...
6563 -118.13 34.20 46.0 1271.0 236.0 573.0 210.0 4.9312 240200.0 INLAND
12053 -117.56 33.88 40.0 1196.0 294.0 1052.0 258.0 2.0682 113000.0 INLAND
13908 -116.40 34.09 9.0 4855.0 872.0 2098.0 765.0 3.2723 97800.0 INLAND
11159 -118.01 33.82 31.0 1960.0 380.0 1356.0 356.0 4.0625 225900.0 <1H OCEAN
15775 -122.45 37.77 52.0 3095.0 682.0 1269.0 639.0 3.5750 500001.0 NEAR BAY

16512 rows × 10 columns

Abstract of Data

Using DataFrame.describe() we’re able to see an abstract (summary statistics) of our data.

housing.describe()

longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
count 16512.000000 16512.000000 16512.000000 16512.000000 16354.000000 16512.000000 16512.000000 16512.000000 16512.000000
mean -119.575834 35.639577 28.653101 2622.728319 534.973890 1419.790819 497.060380 3.875589 206990.920724
std 2.001860 2.138058 12.574726 2138.458419 412.699041 1115.686241 375.720845 1.904950 115703.014830
min -124.350000 32.540000 1.000000 6.000000 2.000000 3.000000 2.000000 0.499900 14999.000000
25% -121.800000 33.940000 18.000000 1443.000000 295.000000 784.000000 279.000000 2.566775 119800.000000
50% -118.510000 34.260000 29.000000 2119.500000 433.000000 1164.000000 408.000000 3.540900 179500.000000
75% -118.010000 37.720000 37.000000 3141.000000 644.000000 1719.250000 602.000000 4.744475 263900.000000
max -114.310000 41.950000 52.000000 39320.000000 6210.000000 35682.000000 5358.000000 15.000100 500001.000000

Histograms for Quantitative Data

For quantitative data we can plot histograms simply using DataFrame.hist().

The API for DataFrame.hist() is like:

DataFrame.hist(
    column=None,
    by=None,
    grid=True,
    xlabelsize=None,
    xrot=None,
    ylabelsize=None,
    yrot=None,
    ax=None,
    sharex=False,
    sharey=False,
    figsize=None,  # The size in inches of the figure to create.
    layout=None,
    bins=10,       # Number of histogram bins to be used.
    backend=None,
    legend=False,
    **kwargs
)

For more information, please refer to the Pandas Documentation.

housing.hist(
    bins=50,
    figsize=(15, 12)
)
plt.show()

Observe that median_income does not seem to be measured in USD. Indeed, the values were scaled down by some factor. Such an operation, which is quite common in ML, is called “data preprocessing”. It’s vital to know whether and how the data were preprocessed.

Find Relations

Using DataFrame.corr() we can directly see the correlation coefficients (Pearson product-moment correlation coefficient, usually denoted as $r$, see Wikipedia) among data.

The closer $r$ is to $1$, the stronger the positive correlation between the two features. The closer $r$ is to $-1$, the stronger the negative correlation between them.

corr_matrix = housing.corr()
corr_matrix

longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
longitude 1.000000 -0.924478 -0.105848 0.048871 0.076598 0.108030 0.063070 -0.019583 -0.047432
latitude -0.924478 1.000000 0.005766 -0.039184 -0.072419 -0.115222 -0.077647 -0.075205 -0.142724
housing_median_age -0.105848 0.005766 1.000000 -0.364509 -0.325047 -0.298710 -0.306428 -0.111360 0.114110
total_rooms 0.048871 -0.039184 -0.364509 1.000000 0.929379 0.855109 0.918392 0.200087 0.135097
total_bedrooms 0.076598 -0.072419 -0.325047 0.929379 1.000000 0.876320 0.980170 -0.009740 0.047689
population 0.108030 -0.115222 -0.298710 0.855109 0.876320 1.000000 0.904637 0.002380 -0.026920
households 0.063070 -0.077647 -0.306428 0.918392 0.980170 0.904637 1.000000 0.010781 0.064506
median_income -0.019583 -0.075205 -0.111360 0.200087 -0.009740 0.002380 0.010781 1.000000 0.687160
median_house_value -0.047432 -0.142724 0.114110 0.135097 0.047689 -0.026920 0.064506 0.687160 1.000000

We can select one column (row) and sort it to focus on a certain feature.

print(corr_matrix['median_house_value'].sort_values(ascending=False))
median_house_value    1.000000
median_income         0.687160
total_rooms           0.135097
housing_median_age    0.114110
households            0.064506
total_bedrooms        0.047689
population           -0.026920
longitude            -0.047432
latitude             -0.142724
Name: median_house_value, dtype: float64

However, $r$ only captures linear correlation. Using pandas.plotting.scatter_matrix we can see more interesting relationships.

from pandas.plotting import scatter_matrix

attrs = ['median_house_value', 'median_income', 'latitude']
scatter_matrix(housing[attrs], figsize = (15, 12))
plt.show()

Trick: Geographical Figures for Data with Geographic Features

For data that come with geographic features, we can plot them as a geographical scatter plot.

housing.plot(kind='scatter', x='longitude', y='latitude', label="population")
plt.legend()
plt.title("Geographical Distribution")
Text(0.5, 1.0, 'Geographical Distribution')

The distribution is not even. To show this, we can make the points more transparent by lowering the alpha value.

housing.plot(kind='scatter', x='longitude', y='latitude', label="population", alpha=0.5)
plt.legend()
plt.title("Geographical Distribution")
Text(0.5, 1.0, 'Geographical Distribution')

Still not that intuitive, right? Use a heatmap!

housing.plot(kind='scatter', x='longitude', y='latitude', alpha=0.5,
             s=housing['population']/100, label="population",
             c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True)
plt.legend()
plt.title('Geographical Distribution Heatmap')
plt.show()

From these figures we can tell that:

  • Median income is the feature most positively correlated with house prices, which aligns with our intuition;
  • There is a slight negative correlation between price and latitude: as the location moves north, prices tend to decrease.

Cleaning of Data

Some flaws in the data may cause trouble, so data cleaning is necessary before we use them.

For example, we notice that some data in total_bedrooms are missing.

housing['total_bedrooms']
17606     351.0
18632     108.0
14650     471.0
3230      371.0
3555     1525.0
          ...  
6563      236.0
12053     294.0
13908     872.0
11159     380.0
15775     682.0
Name: total_bedrooms, Length: 16512, dtype: float64
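
The preview above does not show the missing entries directly; a quick check (a one-liner sketch) makes them explicit:

# Count the missing values in total_bedrooms.
# 158 in this training set (16512 - 16354, cf. describe() above).
print(housing['total_bedrooms'].isna().sum())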

We generally have the following options:

  • Ignore every row (one record) with missing data;
  • Ignore every column (one feature) with missing data;
  • Fill the missing data using 0, the mean, the median, the mode, etc.

If we want to ignore every row (one record) with missing data, we use DataFrame.dropna().

housing_1 = housing.copy()
housing_1.dropna(subset=[
    'total_bedrooms'
])

longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
17606 -121.89 37.29 38.0 1568.0 351.0 710.0 339.0 2.7042 286600.0 <1H OCEAN
18632 -121.93 37.05 14.0 679.0 108.0 306.0 113.0 6.4214 340600.0 <1H OCEAN
14650 -117.20 32.77 31.0 1952.0 471.0 936.0 462.0 2.8621 196900.0 NEAR OCEAN
3230 -119.61 36.31 25.0 1847.0 371.0 1460.0 353.0 1.8839 46300.0 INLAND
3555 -118.59 34.23 17.0 6592.0 1525.0 4459.0 1463.0 3.0347 254500.0 <1H OCEAN
... ... ... ... ... ... ... ... ... ... ...
6563 -118.13 34.20 46.0 1271.0 236.0 573.0 210.0 4.9312 240200.0 INLAND
12053 -117.56 33.88 40.0 1196.0 294.0 1052.0 258.0 2.0682 113000.0 INLAND
13908 -116.40 34.09 9.0 4855.0 872.0 2098.0 765.0 3.2723 97800.0 INLAND
11159 -118.01 33.82 31.0 1960.0 380.0 1356.0 356.0 4.0625 225900.0 <1H OCEAN
15775 -122.45 37.77 52.0 3095.0 682.0 1269.0 639.0 3.5750 500001.0 NEAR BAY

16354 rows × 10 columns

If we want to ignore every column (one feature) with missing data, we use DataFrame.drop().

housing_2 = housing.copy()
housing_2.drop([
    'total_bedrooms'
], axis=1)

longitude latitude housing_median_age total_rooms population households median_income median_house_value ocean_proximity
17606 -121.89 37.29 38.0 1568.0 710.0 339.0 2.7042 286600.0 <1H OCEAN
18632 -121.93 37.05 14.0 679.0 306.0 113.0 6.4214 340600.0 <1H OCEAN
14650 -117.20 32.77 31.0 1952.0 936.0 462.0 2.8621 196900.0 NEAR OCEAN
3230 -119.61 36.31 25.0 1847.0 1460.0 353.0 1.8839 46300.0 INLAND
3555 -118.59 34.23 17.0 6592.0 4459.0 1463.0 3.0347 254500.0 <1H OCEAN
... ... ... ... ... ... ... ... ... ...
6563 -118.13 34.20 46.0 1271.0 573.0 210.0 4.9312 240200.0 INLAND
12053 -117.56 33.88 40.0 1196.0 1052.0 258.0 2.0682 113000.0 INLAND
13908 -116.40 34.09 9.0 4855.0 2098.0 765.0 3.2723 97800.0 INLAND
11159 -118.01 33.82 31.0 1960.0 1356.0 356.0 4.0625 225900.0 <1H OCEAN
15775 -122.45 37.77 52.0 3095.0 1269.0 639.0 3.5750 500001.0 NEAR BAY

16512 rows × 9 columns

If we want to fill the missing data, we first compute the value to fill with and then fill in the blanks. For filling we use DataFrame.fillna().

housing_3 = housing.copy()
housing_3.fillna(0)

longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
17606 -121.89 37.29 38.0 1568.0 351.0 710.0 339.0 2.7042 286600.0 <1H OCEAN
18632 -121.93 37.05 14.0 679.0 108.0 306.0 113.0 6.4214 340600.0 <1H OCEAN
14650 -117.20 32.77 31.0 1952.0 471.0 936.0 462.0 2.8621 196900.0 NEAR OCEAN
3230 -119.61 36.31 25.0 1847.0 371.0 1460.0 353.0 1.8839 46300.0 INLAND
3555 -118.59 34.23 17.0 6592.0 1525.0 4459.0 1463.0 3.0347 254500.0 <1H OCEAN
... ... ... ... ... ... ... ... ... ... ...
6563 -118.13 34.20 46.0 1271.0 236.0 573.0 210.0 4.9312 240200.0 INLAND
12053 -117.56 33.88 40.0 1196.0 294.0 1052.0 258.0 2.0682 113000.0 INLAND
13908 -116.40 34.09 9.0 4855.0 872.0 2098.0 765.0 3.2723 97800.0 INLAND
11159 -118.01 33.82 31.0 1960.0 380.0 1356.0 356.0 4.0625 225900.0 <1H OCEAN
15775 -122.45 37.77 52.0 3095.0 682.0 1269.0 639.0 3.5750 500001.0 NEAR BAY

16512 rows × 10 columns
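
Filling with a constant 0, as above, is rarely ideal; filling with the column median is usually more sensible. A minimal sketch (the median should be computed on the training set and reused for any later set):

housing_4 = housing.copy()
# Compute the median on the training data, then fill the blanks with it.
median_bedrooms = housing_4['total_bedrooms'].median()
housing_4['total_bedrooms'].fillna(median_bedrooms, inplace=True)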