An approach to predicted class balancing
February 21, 2018
7 mins read
Postprocessing of predictions
It was observed that the predicted class distribution in the Kaggle IEEE's Signal Processing Society - Camera Model Identification challenge is close to uniform (see the post for details on the challenge). Hence, one may assume that the test set contains an equal number of images from each class. The altered images were treated as separate classes, giving 20 classes to assign images to; with 2640 test samples, one may expect 132 images per class.
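The slot arithmetic behind that expectation, spelled out:

```python
n_test = 2640        # Test images in the competition
n_classes = 10 * 2   # 10 camera models, unaltered and manipulated counted separately
slots_per_class = n_test // n_classes
print(slots_per_class)  # 132
```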
Given this assumption, the predictions were balanced.
The predictions can be represented as a pandas dataframe like so:
| | frame | HTC-1-M7 | iPhone-6 | ... | Motorola-Nexus-6 | Samsung-Galaxy-Note3 | Sony-NEX-7 |
|---|---|---|---|---|---|---|---|
| 0 | img_0002a04_manip.tif | 5.083291e-04 | 1.730842e-03 | ... | 6.565361e-02 | 7.312844e-04 | 5.617269e-04 |
| 1 | img_001e31c_unalt.tif | 1.333470e-03 | 1.378153e-03 | ... | 1.011069e-03 | 1.311640e-04 | 5.842330e-04 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2638 | img_ff56ac2_unalt.tif | 1.707251e-04 | 2.834913e-04 | ... | 4.762203e-04 | 8.150683e-05 | 9.958064e-01 |
| 2639 | img_ffaeda7_unalt.tif | 6.653962e-11 | 3.320259e-11 | ... | 4.800383e-11 | 8.671302e-12 | 4.138669e-13 |
A sample .csv file with predicted probabilities is available here.
A two-stage algorithm was introduced for class balancing:
- For each class in turn, sort the samples by the predicted probability of that class and examine them in descending order. Assign samples to the class while it has free slots and the predicted probability exceeds a threshold, chosen here as 0.5. Since the probabilities sum to one, a class with probability above 0.5 is necessarily the most probable class for that sample. Exclude assigned samples from further consideration.
- Find the highest probability in the remaining subset and try to assign the corresponding sample to that class. If the class already has enough samples, set the examined probability to zero; on a successful assignment, exclude the sample from further consideration.
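The two stages can be sketched on a toy example. This is a simplified re-implementation for illustration, with hypothetical camera names, not the competition code:

```python
import pandas as pd

def balance(df, thres=0.5):
    # Simplified two-stage balancing: 'frame' column first, one column per class
    names = df.columns.tolist()[1:]
    slots = len(df) // len(names)
    res = {n: [] for n in names}
    # Stage 1: confident assignments above the threshold, class by class
    for nam in names:
        df = df.sort_values(by=nam, ascending=False).reset_index(drop=True)
        while len(df) > 0 and df[nam].iloc[0] > thres and len(res[nam]) < slots:
            res[nam].append(df['frame'].iloc[0])
            df = df.drop(0).reset_index(drop=True)
    # Stage 2: greedily assign remaining samples by the global maximum probability
    while len(df) > 0:
        probs = df.drop(columns='frame')
        row = probs.max(axis=1).idxmax()  # Row holding the global maximum
        nam = probs.max(axis=0).idxmax()  # Column holding the global maximum
        if len(res[nam]) < slots:
            res[nam].append(df['frame'].iloc[row])
            df = df.drop(row).reset_index(drop=True)
        else:  # Class is full: zero out this probability and retry
            df.at[row, nam] = 0.0
    return res

toy = pd.DataFrame({
    'frame': ['a', 'b', 'c', 'd'],
    'cam1':  [0.90, 0.60, 0.45, 0.40],
    'cam2':  [0.10, 0.40, 0.55, 0.60],
})
print(balance(toy))  # {'cam1': ['a', 'b'], 'cam2': ['d', 'c']}
```

With two slots per class, `a` and `b` are claimed confidently by `cam1` in stage 1, and `d` and `c` go to `cam2`.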
The following function implements the proposed algorithm; it can be applied to manipulated and unaltered images separately:
```python
import numpy as np
import pandas as pd


def class_balancer(df):
    names = df.columns.tolist()[1:]   # Class names (every column except 'frame')
    n_per_cl = len(df) // len(names)  # Expected number of images per class
    res = {}                          # A dict with the results
    for i, nam in enumerate(names):   # Stage 1: confident assignments, class by class
        res[nam] = []
        df = df.sort_values(by=[nam], ascending=False).reset_index(drop=True)
        while len(df) > 0 and float(df.iloc[0, i + 1]) > THRES and len(res[nam]) < n_per_cl:
            # While the prediction is confident enough and the class
            # has free slots, add the sample to it
            res[nam].append(df.iloc[0, 0])
            df = df.drop([0]).reset_index(drop=True)
    while len(df) > 0:  # Stage 2: fill the remaining class slots
        df_l = df.loc[:, df.columns != 'frame']
        row = df_l.max(axis=1).idxmax()  # Sample with the max probability in the remaining set
        nam = df_l.max(axis=0).idxmax()  # Class of that max probability
        if len(res[nam]) < n_per_cl:
            # The most probable class still has free slots
            res[nam].append(df.iloc[row, 0])
            df = df.drop([row]).reset_index(drop=True)
        else:  # The class is full, so zero out the probability
            df.at[row, nam] = 0.0
    return res
```
The function was used together with the following code, which reads the data, blends the predictions by mean square, prepares dataframes for unaltered and manipulated images, runs the balancing, and generates the final submission file.
```python
import os

THRES = 0.5                  # Minimal probability to consider a prediction confident
results_path = './results/'  # Path to the directory with results in .csv format
names_list = ['frame', 'HTC-1-M7', 'iPhone-6', 'Motorola-Droid-Maxx', 'Motorola-X',
              'Samsung-Galaxy-S4', 'iPhone-4s', 'LG-Nexus-5x', 'Motorola-Nexus-6',
              'Samsung-Galaxy-Note3', 'Sony-NEX-7']


def make_sub(name, dic):  # Write a submission file
    with open(name, 'w') as f:
        f.write('fname,camera\n')
        for r in dic.keys():
            for p in dic[r]:
                f.write(p + ',' + r + '\n')


# Read all files with predictions in the results_path directory
dfs = []  # A list of dataframes with predicted probabilities
for fname in os.listdir(results_path):
    dfs.append(pd.read_csv(os.path.join(results_path, fname),
                           skiprows=1, names=names_list))

# Blending by mean square: sum the squared probabilities, then take the root
df_fin = None
for df in dfs:
    if df_fin is None:
        df_fin = df
        df_fin.loc[:, df_fin.columns != 'frame'] = \
            df_fin.loc[:, df_fin.columns != 'frame'].pow(2)
    else:
        df2 = df.loc[:, df.columns != 'frame'].pow(2)
        df2['frame'] = df['frame']
        df_fin.loc[:, df_fin.columns != 'frame'] = df2.loc[:, df2.columns != 'frame']\
            .add(df_fin.loc[:, df_fin.columns != 'frame'])
df_fin.loc[:, df_fin.columns != 'frame'] = np.sqrt(df_fin.loc[:, df_fin.columns != 'frame'])

# Balance manipulated and unaltered images separately
df_manip = df_fin.loc[df_fin.frame.str.contains('_manip')]
df_unalt = df_fin.loc[df_fin.frame.str.contains('_unalt')]
res_manip = class_balancer(df_manip)
res_unalt = class_balancer(df_unalt)
for i in res_unalt.keys():  # Merge res_unalt into res_manip
    for j in res_unalt[i]:
        res_manip[i].append(j)
make_sub('results.csv', res_manip)
```
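The mean-square blending step, in isolation, amounts to the following numpy sketch (toy probability vectors, not competition data):

```python
import numpy as np

# Predicted probabilities for one sample from two hypothetical models
p1 = np.array([0.7, 0.2, 0.1])
p2 = np.array([0.3, 0.6, 0.1])

# Element-wise square, sum across models, square root
blend = np.sqrt(p1 ** 2 + p2 ** 2)
print(blend)
```

The result is not renormalized to sum to one; the balancing relies mainly on the relative ordering of the values.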
This is the actual competition code, written in the last two hours of the competition. Sorry, PEP 8 :)
Applying this approach increased the leaderboard accuracy by roughly 1.2-1.6%, which makes a huge difference in the final standings.