
Basic Skills in Data Analysis

This blog has been migrated to Microsoft Azure and is generated automatically by an Azure DevOps Pipeline. Due to some problems with Node.js, some inline MathJax may render incorrectly. The author is working on it, sorry.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In this section we use the housing dataset to practice some skills in data analysis.

HOUSING_PATH = "housing/datasets/handson-ml/datasets/housing/housing.csv"
housing = pd.read_csv(HOUSING_PATH)

Split of Data

Not all data are used for training. In fact, we usually split our dataset into:

  • Training dataset + Test dataset
  • Training dataset + Validation dataset + Test dataset

Sometimes we use a stratified split to maintain high representativeness.

Basic Split of Data

A Method for a Random Train-Test (TT) Split

For the first way, we can define a function like this:

def split_data(data, train_ratio):
    # Shuffle the row indices so the split is random.
    shuffled_indices = np.random.permutation(len(data))
    train_size = int(len(data) * train_ratio)

    train_indices = shuffled_indices[:train_size]
    test_indices = shuffled_indices[train_size:]

    # Return the training set first, then the test set.
    return data.iloc[train_indices], data.iloc[test_indices]

We can test it using our data.

train_housing, test_housing = split_data(housing, 0.7)

print(train_housing.shape)
print(test_housing.shape)
(14447, 10)
(6193, 10)

Note that we shuffle the indices every time we split. If we re-split before each training run, every run sees a different partition, so over multiple runs the model effectively gets to see the whole dataset. That’s what we should avoid!

Random but Fixed: Two Methods for a Stable Random TT Split

One possible solution is to save the two sets after the first training.

Another is to use a fixed seed when shuffling the indices (set it before calling numpy.random.permutation()) so that the same split is generated every time. However, both solutions break down as soon as the dataset is updated.
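
A minimal sketch of the fixed-seed approach, reusing the split_data function defined above (42 is an arbitrary seed):

# Fixing NumPy's global seed makes np.random.permutation() produce the
# same shuffle, and therefore the same split, on every run.
np.random.seed(42)
train_housing, test_housing = split_data(housing, 0.7)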

Identifiers: A Method for a Stable, Random but also Sustainable TT Split

We give each datum a unique identifier and decide which set it belongs to according to that identifier.

E.g. we hash the identifier and take the last byte of the digest. Since that byte is roughly uniformly distributed over $[0, 256)$, comparing it against the threshold $256 \times 70\% \approx 179$ puts about 70% of the rows on one side, implementing the 70 - 30 split.

import hashlib

def test_set_check(identifier, test_ratio, hash):
    # True when the last byte of the hashed identifier falls below the
    # threshold, i.e. for roughly `test_ratio` of all identifiers.
    return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio

def split_by_id(data, test_ratio, id_column, hash=hashlib.md5):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio, hash))
    # The rows selected by test_set_check are returned first.
    return data.loc[in_test_set], data.loc[~in_test_set]

N.B. The apply call with the lambda builds a boolean Series in_test_set, which is True for every id_ for which test_set_check returns True. data.loc[~in_test_set] then selects the rows for which it is False.

N.B. The identifier MUST BE UNIQUE. Here we use the row index as the identifier. However, this breaks as soon as any in-between row is removed (or new rows are not appended strictly at the end). Features that are relatively fixed, such as longitude and latitude, are good building blocks for identifiers.
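
For instance, a minimal sketch (the id formula below is illustrative, not from the original post): combine the coordinates into an identifier column and split on it.

# Hypothetical example: build an identifier from the relatively stable
# coordinates. Longitude is scaled so nearby districts still get distinct
# ids; collisions are possible, so check uniqueness in practice.
housing_with_id = housing.copy()
housing_with_id['id'] = housing_with_id['longitude'] * 1000 + housing_with_id['latitude']
train_set, test_set = split_by_id(housing_with_id, 0.7, 'id')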

Now we test the functions we’ve just defined. Note that calling split_by_id with 0.7 puts roughly 70% of the rows into the first returned set, which we use as the training set here.

train_housing, test_housing = split_by_id(housing.reset_index(), 0.7, 'index')

print(train_housing.shape)
print(test_housing.shape)
(14587, 11)
(6053, 11)

N.B. DataFrame.reset_index() returns a DataFrame with one extra column, index, holding the old row indices.
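
A tiny illustration on a made-up mini frame (not from the housing data):

# reset_index() moves the old row labels into a new 'index' column
# and renumbers the rows from 0.
df = pd.DataFrame({'a': [10, 20]}, index=[5, 7])
print(df.reset_index())
#    index   a
# 0      5  10
# 1      7  20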

Tools from Scikit-Learn

Scikit-Learn provides us with useful functions to split data.

sklearn.model_selection.train_test_split(
    *arrays,
    test_size=None,
    train_size=None,
    random_state=None,
    shuffle=True,
    stratify=None
)

N.B. When only one of test_size and train_size is provided, the function derives the other one from it. When neither is provided, the function splits using a 75 - 25 ratio.

from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(housing, train_size=0.7, shuffle=True)

print(train_data.shape)
print(test_data.shape)
(14447, 10)
(6193, 10)
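
Tying this back to the stability discussion above: passing random_state makes the split reproducible (though still not update-proof). A minimal sketch, with 42 as an arbitrary seed:

# random_state fixes the shuffle, so repeated runs yield the same split.
train_data, test_data = train_test_split(housing, train_size=0.7, shuffle=True, random_state=42)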

Stratified Split

If we want to maintain high representativeness of the data in every set, or to avoid the serious bias that purely random sampling may introduce, we use a stratified split.

First we need to stratify our data. We take median_income as an example.

housing['income_cate'] = np.ceil(housing['median_income'] / 1.5)             # So there won't be too many categories.
housing['income_cate'].where(housing['income_cate'] < 5, 5.0, inplace=True)  # Merge everything above 5 into category 5.
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing['income_cate']):
    strat_train_data = housing.loc[train_index]
    strat_test_data = housing.loc[test_index]

Its usage is a bit involved.
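
A simpler alternative is to pass the stratify argument to train_test_split (visible in its signature above). A minimal sketch; the result is equivalent in spirit, though not row-for-row identical to the StratifiedShuffleSplit output:

# Stratified split via train_test_split's stratify parameter.
strat_train_alt, strat_test_alt = train_test_split(
    housing, test_size=0.2, stratify=housing['income_cate'], random_state=42)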

Anyway, first let’s look at the category distribution in the whole dataset.

housing['income_cate'].value_counts() / len(housing)
3.0    0.350581
2.0    0.318847
4.0    0.176308
5.0    0.114438
1.0    0.039826
Name: income_cate, dtype: float64
strat_train_data['income_cate'].value_counts() / len(strat_train_data)
3.0    0.350594
2.0    0.318859
4.0    0.176296
5.0    0.114402
1.0    0.039850
Name: income_cate, dtype: float64
strat_test_data['income_cate'].value_counts() / len(strat_test_data)
3.0    0.350533
2.0    0.318798
4.0    0.176357
5.0    0.114583
1.0    0.039729
Name: income_cate, dtype: float64

Nice! To keep the data tidy we delete the income_cate column.

for set_ in (strat_test_data, strat_train_data):
    set_.drop(['income_cate'], axis=1, inplace=True)

From now on, we should set the test set aside and focus only on the training set.

housing = strat_train_data.copy()

Display of Data

It’s a good idea to take a look at the data before we use them. To see them is to display them.

Raw Data

housing

longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
17606 -121.89 37.29 38.0 1568.0 351.0 710.0 339.0 2.7042 286600.0 <1H OCEAN
18632 -121.93 37.05 14.0 679.0 108.0 306.0 113.0 6.4214 340600.0 <1H OCEAN
14650 -117.20 32.77 31.0 1952.0 471.0 936.0 462.0 2.8621 196900.0 NEAR OCEAN
3230 -119.61 36.31 25.0 1847.0 371.0 1460.0 353.0 1.8839 46300.0 INLAND
3555 -118.59 34.23 17.0 6592.0 1525.0 4459.0 1463.0 3.0347 254500.0 <1H OCEAN
... ... ... ... ... ... ... ... ... ... ...
6563 -118.13 34.20 46.0 1271.0 236.0 573.0 210.0 4.9312 240200.0 INLAND
12053 -117.56 33.88 40.0 1196.0 294.0 1052.0 258.0 2.0682 113000.0 INLAND
13908 -116.40 34.09 9.0 4855.0 872.0 2098.0 765.0 3.2723 97800.0 INLAND
11159 -118.01 33.82 31.0 1960.0 380.0 1356.0 356.0 4.0625 225900.0 <1H OCEAN
15775 -122.45 37.77 52.0 3095.0 682.0 1269.0 639.0 3.5750 500001.0 NEAR BAY

16512 rows × 10 columns

Abstract of Data

Using DataFrame.describe() we’re able to see an abstract (summary statistics) of our data.

housing.describe()

longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
count 16512.000000 16512.000000 16512.000000 16512.000000 16354.000000 16512.000000 16512.000000 16512.000000 16512.000000
mean -119.575834 35.639577 28.653101 2622.728319 534.973890 1419.790819 497.060380 3.875589 206990.920724
std 2.001860 2.138058 12.574726 2138.458419 412.699041 1115.686241 375.720845 1.904950 115703.014830
min -124.350000 32.540000 1.000000 6.000000 2.000000 3.000000 2.000000 0.499900 14999.000000
25% -121.800000 33.940000 18.000000 1443.000000 295.000000 784.000000 279.000000 2.566775 119800.000000
50% -118.510000 34.260000 29.000000 2119.500000 433.000000 1164.000000 408.000000 3.540900 179500.000000
75% -118.010000 37.720000 37.000000 3141.000000 644.000000 1719.250000 602.000000 4.744475 263900.000000
max -114.310000 41.950000 52.000000 39320.000000 6210.000000 35682.000000 5358.000000 15.000100 500001.000000

Histograms for Quantitative Data

For quantitative data we can plot histograms simply using DataFrame.hist().

The API for DataFrame.hist() is like:

DataFrame.hist(
    column=None,
    by=None,
    grid=True,
    xlabelsize=None,
    xrot=None,
    ylabelsize=None,
    yrot=None,
    ax=None,
    sharex=False,
    sharey=False,
    figsize=None,  # The size in inches of the figure to create.
    layout=None,
    bins=10,       # Number of histogram bins to be used.
    backend=None,
    legend=False,
    **kwargs
)

For more information, please refer to the Pandas Documentation.

housing.hist(
    bins=50,
    figsize=(15, 12)
)
plt.show()

Observe that median_income does not seem to be measured in USD. Indeed, the values were scaled down by some factor. Such an operation, which is quite common in ML, is called “data preprocessing”. It’s vital to know whether and how the data were preprocessed.

Find Relations

Using DataFrame.corr() we can directly see the correlation coefficients (Pearson product-moment correlation coefficient, usually denoted as $r$, see Wikipedia) among data.

The closer $r$ is to $1$, the stronger the positive correlation between the two features. The closer $r$ is to $-1$, the stronger the negative correlation between them.

corr_matrix = housing.corr()
corr_matrix

longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
longitude 1.000000 -0.924478 -0.105848 0.048871 0.076598 0.108030 0.063070 -0.019583 -0.047432
latitude -0.924478 1.000000 0.005766 -0.039184 -0.072419 -0.115222 -0.077647 -0.075205 -0.142724
housing_median_age -0.105848 0.005766 1.000000 -0.364509 -0.325047 -0.298710 -0.306428 -0.111360 0.114110
total_rooms 0.048871 -0.039184 -0.364509 1.000000 0.929379 0.855109 0.918392 0.200087 0.135097
total_bedrooms 0.076598 -0.072419 -0.325047 0.929379 1.000000 0.876320 0.980170 -0.009740 0.047689
population 0.108030 -0.115222 -0.298710 0.855109 0.876320 1.000000 0.904637 0.002380 -0.026920
households 0.063070 -0.077647 -0.306428 0.918392 0.980170 0.904637 1.000000 0.010781 0.064506
median_income -0.019583 -0.075205 -0.111360 0.200087 -0.009740 0.002380 0.010781 1.000000 0.687160
median_house_value -0.047432 -0.142724 0.114110 0.135097 0.047689 -0.026920 0.064506 0.687160 1.000000

We can select one column (row) and sort it to focus on a certain feature.

print(corr_matrix['median_house_value'].sort_values(ascending=False))
median_house_value    1.000000
median_income         0.687160
total_rooms           0.135097
housing_median_age    0.114110
households            0.064506
total_bedrooms        0.047689
population           -0.026920
longitude            -0.047432
latitude             -0.142724
Name: median_house_value, dtype: float64

However, $r$ only captures linear correlation. Using pandas.plotting.scatter_matrix we can see more interesting relationships.

from pandas.plotting import scatter_matrix

attrs = ['median_house_value', 'median_income', 'latitude']
scatter_matrix(housing[attrs], figsize = (15, 12))
plt.show()

Trick: Geographical Figures for Data with Geographic Features

For data that come with geographic features, we can plot them as a geographical scatter plot.

housing.plot(kind='scatter', x='longitude', y='latitude', label="population")
plt.legend()
plt.title("Geographical Distribution")
Text(0.5, 1.0, 'Geographical Distribution')

The distribution is not even. To show this, we can make the points more transparent by lowering the alpha value.

housing.plot(kind='scatter', x='longitude', y='latitude', label="population", alpha=0.5)
plt.legend()
plt.title("Geographical Distribution")
Text(0.5, 1.0, 'Geographical Distribution')

Still not that intuitive, right? Use a heatmap!

housing.plot(kind='scatter', x='longitude', y='latitude', alpha=0.5,
             s=housing['population']/100, label="population",
             c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True)
plt.legend()
plt.title('Geographical Distribution Heatmap')
plt.show()

From these figures we can tell that:

  • Median income is the feature most positively correlated with house prices, which aligns with our intuition;
  • There is a slight negative correlation between price and latitude: as the location moves north, prices tend to decrease.

Cleaning of Data

Some flaws in the data may cause trouble, so data cleaning is necessary before we use them.

For example, we notice that some data in total_bedrooms are missing.

housing['total_bedrooms']
17606     351.0
18632     108.0
14650     471.0
3230      371.0
3555     1525.0
          ...  
6563      236.0
12053     294.0
13908     872.0
11159     380.0
15775     682.0
Name: total_bedrooms, Length: 16512, dtype: float64
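
The preview above does not show the missing entries directly; a quick check (a one-liner sketch) makes them explicit:

# Count the missing values in total_bedrooms.
# 158 in this training set (16512 - 16354, cf. describe() above).
print(housing['total_bedrooms'].isna().sum())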

We generally have the following options:

  • Ignore every row (one record) with missing data;
  • Ignore every column (one feature) with missing data;
  • Fill the missing data using 0, the mean, the median, the mode, etc.

If we want to ignore every row (one record) with missing data, we use DataFrame.dropna().

housing_1 = housing.copy()
housing_1.dropna(subset=[
    'total_bedrooms'
])

longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
17606 -121.89 37.29 38.0 1568.0 351.0 710.0 339.0 2.7042 286600.0 <1H OCEAN
18632 -121.93 37.05 14.0 679.0 108.0 306.0 113.0 6.4214 340600.0 <1H OCEAN
14650 -117.20 32.77 31.0 1952.0 471.0 936.0 462.0 2.8621 196900.0 NEAR OCEAN
3230 -119.61 36.31 25.0 1847.0 371.0 1460.0 353.0 1.8839 46300.0 INLAND
3555 -118.59 34.23 17.0 6592.0 1525.0 4459.0 1463.0 3.0347 254500.0 <1H OCEAN
... ... ... ... ... ... ... ... ... ... ...
6563 -118.13 34.20 46.0 1271.0 236.0 573.0 210.0 4.9312 240200.0 INLAND
12053 -117.56 33.88 40.0 1196.0 294.0 1052.0 258.0 2.0682 113000.0 INLAND
13908 -116.40 34.09 9.0 4855.0 872.0 2098.0 765.0 3.2723 97800.0 INLAND
11159 -118.01 33.82 31.0 1960.0 380.0 1356.0 356.0 4.0625 225900.0 <1H OCEAN
15775 -122.45 37.77 52.0 3095.0 682.0 1269.0 639.0 3.5750 500001.0 NEAR BAY

16354 rows × 10 columns

If we want to ignore every column (one feature) with missing data, we use DataFrame.drop().

housing_2 = housing.copy()
housing_2.drop([
    'total_bedrooms'
], axis=1)

longitude latitude housing_median_age total_rooms population households median_income median_house_value ocean_proximity
17606 -121.89 37.29 38.0 1568.0 710.0 339.0 2.7042 286600.0 <1H OCEAN
18632 -121.93 37.05 14.0 679.0 306.0 113.0 6.4214 340600.0 <1H OCEAN
14650 -117.20 32.77 31.0 1952.0 936.0 462.0 2.8621 196900.0 NEAR OCEAN
3230 -119.61 36.31 25.0 1847.0 1460.0 353.0 1.8839 46300.0 INLAND
3555 -118.59 34.23 17.0 6592.0 4459.0 1463.0 3.0347 254500.0 <1H OCEAN
... ... ... ... ... ... ... ... ... ...
6563 -118.13 34.20 46.0 1271.0 573.0 210.0 4.9312 240200.0 INLAND
12053 -117.56 33.88 40.0 1196.0 1052.0 258.0 2.0682 113000.0 INLAND
13908 -116.40 34.09 9.0 4855.0 2098.0 765.0 3.2723 97800.0 INLAND
11159 -118.01 33.82 31.0 1960.0 1356.0 356.0 4.0625 225900.0 <1H OCEAN
15775 -122.45 37.77 52.0 3095.0 1269.0 639.0 3.5750 500001.0 NEAR BAY

16512 rows × 9 columns

If we want to fill the missing data, we first compute the value to fill with and then fill in the blanks. For filling we use DataFrame.fillna().

housing_3 = housing.copy()
housing_3.fillna(0)

longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
17606 -121.89 37.29 38.0 1568.0 351.0 710.0 339.0 2.7042 286600.0 <1H OCEAN
18632 -121.93 37.05 14.0 679.0 108.0 306.0 113.0 6.4214 340600.0 <1H OCEAN
14650 -117.20 32.77 31.0 1952.0 471.0 936.0 462.0 2.8621 196900.0 NEAR OCEAN
3230 -119.61 36.31 25.0 1847.0 371.0 1460.0 353.0 1.8839 46300.0 INLAND
3555 -118.59 34.23 17.0 6592.0 1525.0 4459.0 1463.0 3.0347 254500.0 <1H OCEAN
... ... ... ... ... ... ... ... ... ... ...
6563 -118.13 34.20 46.0 1271.0 236.0 573.0 210.0 4.9312 240200.0 INLAND
12053 -117.56 33.88 40.0 1196.0 294.0 1052.0 258.0 2.0682 113000.0 INLAND
13908 -116.40 34.09 9.0 4855.0 872.0 2098.0 765.0 3.2723 97800.0 INLAND
11159 -118.01 33.82 31.0 1960.0 380.0 1356.0 356.0 4.0625 225900.0 <1H OCEAN
15775 -122.45 37.77 52.0 3095.0 682.0 1269.0 639.0 3.5750 500001.0 NEAR BAY

16512 rows × 10 columns
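
Filling with a constant 0, as above, is rarely ideal; filling with the column median is usually more sensible. A minimal sketch (the median should be computed on the training set and reused for any later set):

housing_4 = housing.copy()
# Compute the median on the training data, then fill the blanks with it.
median_bedrooms = housing_4['total_bedrooms'].median()
housing_4['total_bedrooms'].fillna(median_bedrooms, inplace=True)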