Basic Skills in Data Analysis
```python
import pandas as pd
import numpy as np  # used later for np.ceil and np.random
```
In this section we use the housing dataset to practice some skills in data analysis.
```python
HOUSING_PATH = "housing/datasets/handson-ml/datasets/housing/housing.csv"
housing = pd.read_csv(HOUSING_PATH)  # load the dataset used throughout this section
```
Split of Data
Not all data are used for training. In fact, we usually split our dataset into:
- Training dataset + Test dataset
- Training dataset + Validation dataset + Test dataset
Sometimes we use a stratified split to maintain high representativeness.
Basic Split of Data
A Method for Random Train-Test (TT) Split
For the first way, we can define a function like this:
```python
def split_data(data, train_ratio):
    # NOTE: only the signature of this function is original; the body below is
    # a reconstruction that matches the shapes printed further down.
    shuffled_indices = np.random.permutation(len(data))  # shuffle all row positions
    train_set_size = int(len(data) * train_ratio)
    train_indices = shuffled_indices[:train_set_size]
    test_indices = shuffled_indices[train_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]
```
We can test it using our data.
```python
train_housing, test_housing = split_data(housing, 0.7)
print(train_housing.shape)
print(test_housing.shape)
```

```
(14447, 10)
(6193, 10)
```
Note that we shuffle the indices on every call. If we run training more than once, a fresh split is generated each time, so across runs the model eventually gets to see the whole dataset. That is exactly what a held-out test set is meant to prevent!
Random but Fixed: Two Methods for a Stable Random TT Split
One possible solution is to save the two sets to disk after the first split. Another is to use a fixed seed when shuffling the indices (set it before calling numpy.random.permutation()
) so that the same split is generated every time. However, both solutions break down when the dataset is updated.
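For instance, the fixed-seed variant is a one-line change before the split (a minimal sketch; the seed value 42 is an arbitrary choice, not from this post):

```python
np.random.seed(42)  # any fixed seed makes np.random.permutation() reproducible
train_housing, test_housing = split_data(housing, 0.7)  # now yields the same split on every run
```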
Identifiers: A Method for a Stable, Random, but also Sustainable TT Split
We give each datum a unique identifier and use it to decide which set the datum belongs to.
E.g., we compute a hash of the identifier and look at its last byte. Since the test set should hold 30% of the data, we use $256 \times 30\% \approx 77$ as the threshold: rows whose hash byte falls below it go into the test set, giving the 70 - 30 split.
```python
import hashlib
```
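The two helpers referenced below can be sketched as follows (a reconstruction consistent with the notes that follow; the md5 choice and the exact bodies are assumptions):

```python
def test_set_check(identifier, test_ratio):
    # A row goes to the test set iff the last byte of the md5 hash of its
    # identifier falls below 256 * test_ratio (e.g. 256 * 30% ~= 77).
    return hashlib.md5(np.int64(identifier)).digest()[-1] < 256 * test_ratio

def split_by_id(data, train_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, 1 - train_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]
```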
N.B. The `lambda` expression here collects every `id_` for which `test_set_check` returns `True` into `in_test_set`. `data.loc[~in_test_set]` then selects the rows that are not in the test set.
N.B. The identifier MUST BE UNIQUE. Here we use `id`, which is the row index. However, this breaks as soon as any in-between row is removed, because the indices of all later rows shift. Values such as longitude and latitude are relatively stable, so they make good building blocks for an identifier.
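For instance, a stable identifier can be derived from the coordinates (a sketch; the scaling formula below is a common convention, not taken from this post):

```python
# Combine longitude and latitude into one number that survives row deletions.
housing_with_id = housing.assign(id=housing['longitude'] * 1000 + housing['latitude'])
train_set, test_set = split_by_id(housing_with_id, 0.7, 'id')
```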
Now we test the functions we’ve just defined.
```python
train_housing, test_housing = split_by_id(housing.reset_index(), 0.7, 'index')
print(train_housing.shape)
print(test_housing.shape)
```

```
(14587, 11)
(6053, 11)
```
N.B. `DataFrame.reset_index()` returns a `DataFrame` with one extra column, `index`, holding the old row indices, which is why the shapes above show 11 columns instead of 10.
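A tiny self-contained illustration of this behaviour:

```python
df = pd.DataFrame({'a': [10, 20]}, index=[5, 7])
print(df.reset_index())
#    index   a
# 0      5  10
# 1      7  20
```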
Tools from Scikit-Learn
Scikit-Learn provides us with useful functions to split data.
```python
sklearn.model_selection.train_test_split(
    *arrays,
    test_size=None,
    train_size=None,
    random_state=None,
    shuffle=True,
    stratify=None,
)
```
N.B. When only one of `test_size` and `train_size` is provided, the function derives the other from it. When neither is provided, it splits with a 75 - 25 ratio.
```python
from sklearn.model_selection import train_test_split

# NOTE: the call below is reconstructed; train_size=0.7 matches the shapes printed.
train_housing, test_housing = train_test_split(housing, train_size=0.7)
print(train_housing.shape)
print(test_housing.shape)
```

```
(14447, 10)
(6193, 10)
```
Stratified Split
If we want every subset to stay highly representative of the whole dataset, or we want to avoid the serious bias that purely random sampling may introduce, we use a stratified split.
First we need to stratify our data. We take `median_income` as an example.
```python
housing['income_cate'] = np.ceil(housing['median_income'] / 1.5)  # so there won't be too many categories
housing['income_cate'].where(housing['income_cate'] < 5, 5.0, inplace=True)  # cap at 5; the counts below show categories 1-5 only
```
```python
from sklearn.model_selection import StratifiedShuffleSplit
```
Its usage may be a little bit complicated.
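The split that produces `strat_train_data` and `strat_test_data` is not shown in full; a minimal sketch consistent with the 16512-row training set seen later (the `random_state` value is an assumption) is:

```python
# One stratified 80-20 split, keeping income_cate proportions in both parts.
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing['income_cate']):
    strat_train_data = housing.loc[train_index]
    strat_test_data = housing.loc[test_index]
```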
Anyway, first let's look at the category distribution in the whole dataset.
```python
housing['income_cate'].value_counts() / len(housing)
```

```
3.0    0.350581
2.0    0.318847
4.0    0.176308
5.0    0.114438
1.0    0.039826
Name: income_cate, dtype: float64
```
```python
strat_train_data['income_cate'].value_counts() / len(strat_train_data)
```

```
3.0    0.350594
2.0    0.318859
4.0    0.176296
5.0    0.114402
1.0    0.039850
Name: income_cate, dtype: float64
```
```python
strat_test_data['income_cate'].value_counts() / len(strat_test_data)
```

```
3.0    0.350533
2.0    0.318798
4.0    0.176357
5.0    0.114583
1.0    0.039729
Name: income_cate, dtype: float64
```
Nice! The proportions match closely across all three. To keep the data tidy, we now delete the `income_cate` column.
```python
for subset in (strat_test_data, strat_train_data):
    subset.drop('income_cate', axis=1, inplace=True)  # remove the helper column from both sets
```
From now on, we set the test set aside and focus only on the training set.
```python
housing = strat_train_data.copy()
```
Display of Data
It's a good idea to have a look at the data before we use them. To see them, we display them.
Raw Data
```python
housing
```
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
17606 | -121.89 | 37.29 | 38.0 | 1568.0 | 351.0 | 710.0 | 339.0 | 2.7042 | 286600.0 | <1H OCEAN |
18632 | -121.93 | 37.05 | 14.0 | 679.0 | 108.0 | 306.0 | 113.0 | 6.4214 | 340600.0 | <1H OCEAN |
14650 | -117.20 | 32.77 | 31.0 | 1952.0 | 471.0 | 936.0 | 462.0 | 2.8621 | 196900.0 | NEAR OCEAN |
3230 | -119.61 | 36.31 | 25.0 | 1847.0 | 371.0 | 1460.0 | 353.0 | 1.8839 | 46300.0 | INLAND |
3555 | -118.59 | 34.23 | 17.0 | 6592.0 | 1525.0 | 4459.0 | 1463.0 | 3.0347 | 254500.0 | <1H OCEAN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6563 | -118.13 | 34.20 | 46.0 | 1271.0 | 236.0 | 573.0 | 210.0 | 4.9312 | 240200.0 | INLAND |
12053 | -117.56 | 33.88 | 40.0 | 1196.0 | 294.0 | 1052.0 | 258.0 | 2.0682 | 113000.0 | INLAND |
13908 | -116.40 | 34.09 | 9.0 | 4855.0 | 872.0 | 2098.0 | 765.0 | 3.2723 | 97800.0 | INLAND |
11159 | -118.01 | 33.82 | 31.0 | 1960.0 | 380.0 | 1356.0 | 356.0 | 4.0625 | 225900.0 | <1H OCEAN |
15775 | -122.45 | 37.77 | 52.0 | 3095.0 | 682.0 | 1269.0 | 639.0 | 3.5750 | 500001.0 | NEAR BAY |
16512 rows × 10 columns
Summary of Data
Using `DataFrame.describe()` we can see a statistical summary of our data.
```python
housing.describe()
```
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
count | 16512.000000 | 16512.000000 | 16512.000000 | 16512.000000 | 16354.000000 | 16512.000000 | 16512.000000 | 16512.000000 | 16512.000000 |
mean | -119.575834 | 35.639577 | 28.653101 | 2622.728319 | 534.973890 | 1419.790819 | 497.060380 | 3.875589 | 206990.920724 |
std | 2.001860 | 2.138058 | 12.574726 | 2138.458419 | 412.699041 | 1115.686241 | 375.720845 | 1.904950 | 115703.014830 |
min | -124.350000 | 32.540000 | 1.000000 | 6.000000 | 2.000000 | 3.000000 | 2.000000 | 0.499900 | 14999.000000 |
25% | -121.800000 | 33.940000 | 18.000000 | 1443.000000 | 295.000000 | 784.000000 | 279.000000 | 2.566775 | 119800.000000 |
50% | -118.510000 | 34.260000 | 29.000000 | 2119.500000 | 433.000000 | 1164.000000 | 408.000000 | 3.540900 | 179500.000000 |
75% | -118.010000 | 37.720000 | 37.000000 | 3141.000000 | 644.000000 | 1719.250000 | 602.000000 | 4.744475 | 263900.000000 |
max | -114.310000 | 41.950000 | 52.000000 | 39320.000000 | 6210.000000 | 35682.000000 | 5358.000000 | 15.000100 | 500001.000000 |
Histograms for Quantitative Data
For quantitative data we can plot histograms simply using `DataFrame.hist()`.
The API for `DataFrame.hist()` looks like:

```python
DataFrame.hist(
    column=None,
    by=None,
    grid=True,
    xlabelsize=None,
    xrot=None,
    ylabelsize=None,
    yrot=None,
    ax=None,
    sharex=False,
    sharey=False,
    figsize=None,
    layout=None,
    bins=10,
    **kwargs,
)
```
For more information, please refer to the Pandas Documentation.
```python
import matplotlib.pyplot as plt

housing.hist(bins=50, figsize=(20, 15))  # NOTE: these argument values are typical choices, not the original ones
plt.show()
```
Observe that `median_income` does not seem to be measured in USD. Indeed, the values were scaled down by some factor. Such an operation, quite common in ML, is part of "data preprocessing". It's vital to know whether and how the data were preprocessed.
Find Relations
Using `DataFrame.corr()` we can directly see the correlation coefficients (the Pearson product-moment correlation coefficient, usually denoted $r$; see Wikipedia) among the features.
The closer $r$ is to $1$, the stronger the positive correlation between the two features; the closer $r$ is to $-1$, the stronger the negative correlation.
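To get a feel for what $r$ does and does not capture, here is a tiny self-contained check (illustrative values, unrelated to the housing data):

```python
x = np.arange(10)
print(np.corrcoef(x, 2 * x + 1)[0, 1])  #  1.0: perfect positive linear relation
print(np.corrcoef(x, -3 * x)[0, 1])     # -1.0: perfect negative linear relation
print(np.corrcoef(x, x ** 2)[0, 1])     # high but below 1: monotone yet nonlinear
```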
```python
corr_matrix = housing.corr()
corr_matrix
```
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
longitude | 1.000000 | -0.924478 | -0.105848 | 0.048871 | 0.076598 | 0.108030 | 0.063070 | -0.019583 | -0.047432 |
latitude | -0.924478 | 1.000000 | 0.005766 | -0.039184 | -0.072419 | -0.115222 | -0.077647 | -0.075205 | -0.142724 |
housing_median_age | -0.105848 | 0.005766 | 1.000000 | -0.364509 | -0.325047 | -0.298710 | -0.306428 | -0.111360 | 0.114110 |
total_rooms | 0.048871 | -0.039184 | -0.364509 | 1.000000 | 0.929379 | 0.855109 | 0.918392 | 0.200087 | 0.135097 |
total_bedrooms | 0.076598 | -0.072419 | -0.325047 | 0.929379 | 1.000000 | 0.876320 | 0.980170 | -0.009740 | 0.047689 |
population | 0.108030 | -0.115222 | -0.298710 | 0.855109 | 0.876320 | 1.000000 | 0.904637 | 0.002380 | -0.026920 |
households | 0.063070 | -0.077647 | -0.306428 | 0.918392 | 0.980170 | 0.904637 | 1.000000 | 0.010781 | 0.064506 |
median_income | -0.019583 | -0.075205 | -0.111360 | 0.200087 | -0.009740 | 0.002380 | 0.010781 | 1.000000 | 0.687160 |
median_house_value | -0.047432 | -0.142724 | 0.114110 | 0.135097 | 0.047689 | -0.026920 | 0.064506 | 0.687160 | 1.000000 |
We can select one column (or row) and sort it to focus on a particular feature.
```python
print(corr_matrix['median_house_value'].sort_values(ascending=False))
```

```
median_house_value    1.000000
median_income         0.687160
total_rooms           0.135097
housing_median_age    0.114110
households            0.064506
total_bedrooms        0.047689
population           -0.026920
longitude            -0.047432
latitude             -0.142724
Name: median_house_value, dtype: float64
```
However, $r$ only captures linear correlation. Using `pandas.plotting.scatter_matrix` we can spot more interesting relationships.
```python
from pandas.plotting import scatter_matrix
```
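As an example, we can focus the matrix on a few promising features (this particular selection is an illustrative choice):

```python
attributes = ['median_house_value', 'median_income', 'total_rooms', 'housing_median_age']
scatter_matrix(housing[attributes], figsize=(12, 8))  # histograms on the diagonal, pairwise scatters elsewhere
```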
Trick: Geographic Plots for Geotagged Data
For data that come with geographic coordinates, we can plot them as geographic scatter plots.
1 | housing.plot(kind='scatter', x='longitude', y='latitude', label="population") |
Text(0.5, 1.0, 'Geographical Distribution')
The distribution is uneven. To bring this out, we can make the points more transparent by lowering the `alpha` value.
```python
housing.plot(kind='scatter', x='longitude', y='latitude', label="population", alpha=0.5)
plt.title('Geographical Distribution')  # reconstructed from the Text output below
```

```
Text(0.5, 1.0, 'Geographical Distribution')
```
Still not that intuitive, right? Use a heatmap!
```python
# NOTE: the continuation of this call is reconstructed: point size (s) tracks
# population and color (c) tracks median house value, a common visualization.
housing.plot(kind='scatter', x='longitude', y='latitude', alpha=0.5,
             s=housing['population'] / 100, label='population',
             c='median_house_value', cmap='jet', colorbar=True)
plt.legend()
```
From these figures we can tell that:
- Median income is the feature most positively correlated with house prices, which aligns with our intuition;
- There is a slight negative correlation between price and latitude: moving north, prices tend to decrease;
- …
Cleaning of Data
Some defects in the data may cause trouble, so data cleaning is necessary before we use them.
For example, we notice that some data in `total_bedrooms` are missing.
```python
housing['total_bedrooms']
```

```
17606     351.0
18632     108.0
14650     471.0
3230      371.0
3555     1525.0
          ...
6563      236.0
12053     294.0
13908     872.0
11159     380.0
15775     682.0
Name: total_bedrooms, Length: 16512, dtype: float64
```
We generally have the following options:
- Ignore every row (i.e., one sample) with missing data;
- Ignore every column (i.e., one feature) with missing data;
- Fill the missing data using 0, average, median, mode, etc.
If we want to ignore every row (one sample) with missing data, we use `DataFrame.dropna()`.
```python
housing_1 = housing.copy()
housing_1.dropna(subset=['total_bedrooms'])  # drop rows where total_bedrooms is missing
```
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
17606 | -121.89 | 37.29 | 38.0 | 1568.0 | 351.0 | 710.0 | 339.0 | 2.7042 | 286600.0 | <1H OCEAN |
18632 | -121.93 | 37.05 | 14.0 | 679.0 | 108.0 | 306.0 | 113.0 | 6.4214 | 340600.0 | <1H OCEAN |
14650 | -117.20 | 32.77 | 31.0 | 1952.0 | 471.0 | 936.0 | 462.0 | 2.8621 | 196900.0 | NEAR OCEAN |
3230 | -119.61 | 36.31 | 25.0 | 1847.0 | 371.0 | 1460.0 | 353.0 | 1.8839 | 46300.0 | INLAND |
3555 | -118.59 | 34.23 | 17.0 | 6592.0 | 1525.0 | 4459.0 | 1463.0 | 3.0347 | 254500.0 | <1H OCEAN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6563 | -118.13 | 34.20 | 46.0 | 1271.0 | 236.0 | 573.0 | 210.0 | 4.9312 | 240200.0 | INLAND |
12053 | -117.56 | 33.88 | 40.0 | 1196.0 | 294.0 | 1052.0 | 258.0 | 2.0682 | 113000.0 | INLAND |
13908 | -116.40 | 34.09 | 9.0 | 4855.0 | 872.0 | 2098.0 | 765.0 | 3.2723 | 97800.0 | INLAND |
11159 | -118.01 | 33.82 | 31.0 | 1960.0 | 380.0 | 1356.0 | 356.0 | 4.0625 | 225900.0 | <1H OCEAN |
15775 | -122.45 | 37.77 | 52.0 | 3095.0 | 682.0 | 1269.0 | 639.0 | 3.5750 | 500001.0 | NEAR BAY |
16354 rows × 10 columns
If we want to ignore every column (one feature) with missing data, we use `DataFrame.drop()`.
```python
housing_2 = housing.copy()
housing_2.drop('total_bedrooms', axis=1)  # drop the whole total_bedrooms column
```
| | longitude | latitude | housing_median_age | total_rooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|
17606 | -121.89 | 37.29 | 38.0 | 1568.0 | 710.0 | 339.0 | 2.7042 | 286600.0 | <1H OCEAN |
18632 | -121.93 | 37.05 | 14.0 | 679.0 | 306.0 | 113.0 | 6.4214 | 340600.0 | <1H OCEAN |
14650 | -117.20 | 32.77 | 31.0 | 1952.0 | 936.0 | 462.0 | 2.8621 | 196900.0 | NEAR OCEAN |
3230 | -119.61 | 36.31 | 25.0 | 1847.0 | 1460.0 | 353.0 | 1.8839 | 46300.0 | INLAND |
3555 | -118.59 | 34.23 | 17.0 | 6592.0 | 4459.0 | 1463.0 | 3.0347 | 254500.0 | <1H OCEAN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6563 | -118.13 | 34.20 | 46.0 | 1271.0 | 573.0 | 210.0 | 4.9312 | 240200.0 | INLAND |
12053 | -117.56 | 33.88 | 40.0 | 1196.0 | 1052.0 | 258.0 | 2.0682 | 113000.0 | INLAND |
13908 | -116.40 | 34.09 | 9.0 | 4855.0 | 2098.0 | 765.0 | 3.2723 | 97800.0 | INLAND |
11159 | -118.01 | 33.82 | 31.0 | 1960.0 | 1356.0 | 356.0 | 4.0625 | 225900.0 | <1H OCEAN |
15775 | -122.45 | 37.77 | 52.0 | 3095.0 | 1269.0 | 639.0 | 3.5750 | 500001.0 | NEAR BAY |
16512 rows × 9 columns
If we want to fill the missing data, we first compute the fill value (e.g., the median) and then fill in the blanks with `DataFrame.fillna()`.
```python
housing_3 = housing.copy()
housing_3['total_bedrooms'].fillna(housing_3['total_bedrooms'].median(), inplace=True)  # fill blanks with the median
housing_3
```
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
17606 | -121.89 | 37.29 | 38.0 | 1568.0 | 351.0 | 710.0 | 339.0 | 2.7042 | 286600.0 | <1H OCEAN |
18632 | -121.93 | 37.05 | 14.0 | 679.0 | 108.0 | 306.0 | 113.0 | 6.4214 | 340600.0 | <1H OCEAN |
14650 | -117.20 | 32.77 | 31.0 | 1952.0 | 471.0 | 936.0 | 462.0 | 2.8621 | 196900.0 | NEAR OCEAN |
3230 | -119.61 | 36.31 | 25.0 | 1847.0 | 371.0 | 1460.0 | 353.0 | 1.8839 | 46300.0 | INLAND |
3555 | -118.59 | 34.23 | 17.0 | 6592.0 | 1525.0 | 4459.0 | 1463.0 | 3.0347 | 254500.0 | <1H OCEAN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6563 | -118.13 | 34.20 | 46.0 | 1271.0 | 236.0 | 573.0 | 210.0 | 4.9312 | 240200.0 | INLAND |
12053 | -117.56 | 33.88 | 40.0 | 1196.0 | 294.0 | 1052.0 | 258.0 | 2.0682 | 113000.0 | INLAND |
13908 | -116.40 | 34.09 | 9.0 | 4855.0 | 872.0 | 2098.0 | 765.0 | 3.2723 | 97800.0 | INLAND |
11159 | -118.01 | 33.82 | 31.0 | 1960.0 | 380.0 | 1356.0 | 356.0 | 4.0625 | 225900.0 | <1H OCEAN |
15775 | -122.45 | 37.77 | 52.0 | 3095.0 | 682.0 | 1269.0 | 639.0 | 3.5750 | 500001.0 | NEAR BAY |
16512 rows × 10 columns
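As a side note, Scikit-Learn wraps this median-filling strategy in a reusable transformer, which is convenient because it remembers the learned medians and can apply them to the test set later. A minimal sketch (assuming scikit-learn >= 0.20, where it is named `SimpleImputer`):

```python
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')
housing_num = housing.drop('ocean_proximity', axis=1)  # the imputer needs numeric columns only
housing_filled = pd.DataFrame(imputer.fit_transform(housing_num),
                              columns=housing_num.columns, index=housing_num.index)
```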