Doctoral thesis: Person re-identification in a surveillance camera network
...ed when feature vectors are extracted on the HSV color space. In comparison with the other color spaces, the rank-1 matching rates obtained on the HSV color space show the strongest improvement, by 3.91% and 4.35% over the worst results in the cases of using four key frames and a walking cycle, respectively. However, this conclusion does not hold for the results obtained on the iLIDS-VID dataset.

Table 2.4: The matching rates (%) when applying different pooling methods on different color spaces in the case of using four key frames on the iLIDS-VID dataset. The two best results for each case are in bold.

Color space | Rank | AVG-AVG | AVG-MAX | AVG-MIN | MAX-AVG | MAX-MAX | MAX-MIN | MIN-AVG | MIN-MAX | MIN-MIN
RGB | r=1 | 38.13 | 32.93 | 33.49 | 33.12 | 34.63 | 27.59 | 32.83 | 27.35 | 34.89
RGB | r=5 | 66.71 | 59.91 | 60.41 | 60.07 | 62.72 | 52.52 | 59.64 | 52.72 | 62.92
RGB | r=10 | 78.61 | 72.52 | 72.98 | 72.54 | 75.03 | 65.52 | 72.00 | 65.67 | 75.11
RGB | r=20 | 88.60 | 84.37 | 84.35 | 84.29 | 86.59 | 77.99 | 83.87 | 78.98 | 86.07
Lab | r=1 | 38.21 | 33.24 | 32.30 | 32.94 | 35.26 | 26.91 | 32.83 | 28.18 | 34.52
Lab | r=5 | 66.43 | 60.47 | 59.29 | 59.77 | 63.66 | 51.88 | 59.34 | 53.06 | 61.68
Lab | r=10 | 78.70 | 72.91 | 72.13 | 72.54 | 75.83 | 64.52 | 71.75 | 65.83 | 74.29
Lab | r=20 | 88.69 | 84.67 | 83.95 | 84.33 | 87.14 | 77.95 | 83.87 | 78.92 | 85.88
HSV | r=1 | 34.75 | 31.20 | 29.34 | 30.21 | 32.29 | 24.25 | 31.40 | 26.60 | 31.85
HSV | r=5 | 62.52 | 58.73 | 56.90 | 57.57 | 60.87 | 50.05 | 58.76 | 52.34 | 59.67
HSV | r=10 | 75.38 | 71.41 | 69.50 | 70.65 | 73.69 | 63.30 | 71.19 | 65.66 | 72.43
HSV | r=20 | 86.31 | 83.05 | 82.19 | 83.01 | 85.19 | 77.28 | 83.42 | 78.59 | 84.33
nRnG | r=1 | 37.77 | 32.30 | 32.11 | 32.75 | 34.41 | 26.77 | 33.89 | 27.78 | 35.16
nRnG | r=5 | 65.74 | 59.02 | 58.67 | 59.30 | 62.17 | 51.34 | 60.05 | 52.93 | 62.75
nRnG | r=10 | 77.47 | 71.49 | 71.06 | 71.46 | 74.38 | 63.75 | 71.85 | 64.90 | 74.57
nRnG | r=20 | 87.48 | 82.87 | 82.87 | 83.15 | 85.34 | 76.23 | 83.54 | 77.55 | 85.67
Fusion | r=1 | 41.09 | 36.63 | 35.99 | 36.11 | 38.31 | 30.13 | 36.23 | 31.35 | 38.20
Fusion | r=5 | 69.51 | 64.49 | 63.41 | 63.68 | 66.94 | 56.06 | 63.52 | 57.59 | 66.28
Fusion | r=10 | 80.94 | 76.44 | 75.91 | 75.58 | 78.89 | 68.46 | 75.56 | 70.01 | 78.11
Fusion | r=20 | 90.43 | 86.79 | 86.69 | 86.27 | 88.82 | 81.21 | 86.43 | 82.51 | 88.38

Table 2.5: The matching rates (%) when applying different pooling methods on different color spaces in the case of using frames within a walking cycle on the iLIDS-VID dataset. The two best results for each case are in bold.

Color space | Rank | AVG-AVG | AVG-MAX | AVG-MIN | MAX-AVG | MAX-MAX | MAX-MIN | MIN-AVG | MIN-MAX | MIN-MIN
RGB | r=1 | 41.09 | 32.98 | 33.37 | 34.38 | 37.83 | 26.93 | 33.73 | 27.37 | 38.03
RGB | r=5 | 68.75 | 58.95 | 58.63 | 60.01 | 66.15 | 50.75 | 59.51 | 51.63 | 65.53
RGB | r=10 | 80.21 | 70.71 | 70.27 | 71.87 | 77.54 | 62.41 | 71.57 | 64.55 | 77.05
RGB | r=20 | 89.39 | 82.15 | 81.73 | 83.05 | 87.93 | 75.27 | 82.57 | 77.27 | 87.50
Lab | r=1 | 41.20 | 34.33 | 32.01 | 34.09 | 38.90 | 26.43 | 33.85 | 28.39 | 37.19
Lab | r=5 | 69.27 | 59.35 | 57.02 | 59.41 | 66.51 | 49.53 | 59.23 | 52.55 | 64.47
Lab | r=10 | 80.21 | 71.26 | 69.08 | 71.60 | 78.49 | 61.58 | 71.32 | 64.59 | 76.21
Lab | r=20 | 89.55 | 82.47 | 80.91 | 83.26 | 88.33 | 74.50 | 82.75 | 77.37 | 87.15
HSV | r=1 | 37.01 | 30.93 | 28.82 | 31.26 | 35.70 | 23.36 | 31.88 | 25.75 | 34.64
HSV | r=5 | 64.73 | 57.39 | 54.47 | 58.08 | 63.57 | 47.66 | 59.43 | 50.85 | 62.30
HSV | r=10 | 76.19 | 69.41 | 67.07 | 70.55 | 75.80 | 60.64 | 71.65 | 63.95 | 74.59
HSV | r=20 | 86.80 | 80.99 | 79.37 | 82.51 | 86.61 | 74.51 | 82.89 | 76.55 | 85.49
nRnG | r=1 | 41.51 | 32.90 | 31.00 | 34.33 | 38.36 | 26.02 | 34.86 | 28.21 | 38.89
nRnG | r=5 | 68.75 | 58.84 | 56.23 | 59.33 | 66.03 | 48.75 | 60.56 | 52.55 | 65.48
nRnG | r=10 | 79.55 | 70.15 | 68.14 | 71.36 | 77.24 | 60.82 | 72.00 | 64.61 | 77.28
nRnG | r=20 | 88.91 | 81.50 | 79.23 | 82.35 | 87.55 | 73.33 | 82.68 | 77.03 | 87.33
Fusion | r=1 | 44.14 | 36.89 | 35.49 | 37.19 | 41.37 | 29.69 | 37.29 | 31.68 | 41.19
Fusion | r=5 | 71.71 | 63.18 | 61.15 | 63.52 | 69.77 | 54.00 | 63.87 | 56.74 | 68.64
Fusion | r=10 | 82.40 | 74.35 | 72.97 | 75.10 | 80.62 | 66.51 | 75.03 | 68.87 | 79.97
Fusion | r=20 | 90.63 | 84.82 | 84.10 | 85.49 | 89.71 | 78.53 | 85.45 | 80.71 | 89.19
Table 2.6: The matching rates (%) when applying different pooling methods on different color spaces in the case of using all frames on the iLIDS-VID dataset. The two best results for each case are in bold.

Color space | Rank | AVG-AVG | AVG-MAX | AVG-MIN | MAX-AVG | MAX-MAX | MAX-MIN | MIN-AVG | MIN-MAX | MIN-MIN
RGB | r=1 | 58.20 | 47.93 | 48.87 | 46.80 | 50.87 | 23.80 | 45.47 | 24.73 | 52.80
RGB | r=5 | 80.13 | 71.93 | 70.73 | 71.13 | 77.73 | 46.40 | 72.07 | 46.20 | 75.80
RGB | r=10 | 87.47 | 82.07 | 79.40 | 81.13 | 85.73 | 57.73 | 82.13 | 57.73 | 83.93
RGB | r=20 | 92.27 | 89.80 | 87.20 | 89.47 | 91.47 | 69.00 | 89.27 | 73.73 | 89.80
Lab | r=1 | 58.80 | 51.27 | 46.93 | 45.87 | 54.47 | 22.33 | 46.33 | 26.40 | 50.00
Lab | r=5 | 81.00 | 75.13 | 71.67 | 73.07 | 79.73 | 42.07 | 71.87 | 47.53 | 76.00
Lab | r=10 | 87.47 | 83.93 | 80.00 | 82.13 | 87.60 | 53.73 | 81.73 | 61.13 | 84.07
Lab | r=20 | 92.93 | 90.87 | 88.60 | 89.73 | 92.13 | 66.73 | 88.53 | 73.80 | 89.53
HSV | r=1 | 54.67 | 46.40 | 44.27 | 43.87 | 50.47 | 21.00 | 44.27 | 19.47 | 49.53
HSV | r=5 | 78.73 | 72.73 | 71.20 | 70.60 | 77.87 | 46.40 | 71.00 | 44.13 | 76.80
HSV | r=10 | 84.93 | 80.73 | 80.80 | 80.87 | 85.93 | 60.13 | 81.53 | 56.80 | 84.60
HSV | r=20 | 92.07 | 89.47 | 88.73 | 88.87 | 92.13 | 72.73 | 88.73 | 69.80 | 90.47
nRnG | r=1 | 58.93 | 50.13 | 49.13 | 46.13 | 50.87 | 25.67 | 47.00 | 28.53 | 50.80
nRnG | r=5 | 79.93 | 73.33 | 72.27 | 72.40 | 76.13 | 46.13 | 72.33 | 51.53 | 75.87
nRnG | r=10 | 86.73 | 82.53 | 79.73 | 81.40 | 84.73 | 58.73 | 81.80 | 62.60 | 84.60
nRnG | r=20 | 92.40 | 89.73 | 86.93 | 89.53 | 90.80 | 69.67 | 89.33 | 73.07 | 89.93
Fusion | r=1 | 70.13 | 55.40 | 52.93 | 49.27 | 56.33 | 26.27 | 49.93 | 29.13 | 56.20
Fusion | r=5 | 92.73 | 77.47 | 77.27 | 75.33 | 81.47 | 49.20 | 76.40 | 51.60 | 80.40
Fusion | r=10 | 97.33 | 86.07 | 84.67 | 84.67 | 88.40 | 62.13 | 84.33 | 63.53 | 86.80
Fusion | r=20 | 99.07 | 91.53 | 90.80 | 91.67 | 93.07 | 74.20 | 91.13 | 76.60 | 91.87

Figure 2.14: Evaluation of the performance of GOG features on (a) the PRID 2011 dataset and (b) the iLIDS-VID dataset with three different representative frame selection schemes. [CMC curves of matching rate (%) versus rank; rank-1 rates on PRID 2011: 90.56% (all frames), 79.10% (walking cycle), 77.19% (four key frames); on iLIDS-VID: 70.13% (all frames), 44.14% (walking cycle), 41.09% (four key frames).]

By contrast, for the iLIDS-VID dataset, the rank-1 matching rates obtained on the HSV color space are the worst; they are the most drastically decreased, by 4.15% and 4.55% compared to the best results. From these analyses, we can see that each color model has its own effectiveness depending on the evaluated dataset, and that combining features from different color spaces brings higher performance for this problem. From Tables 2.1-2.3, by applying the fusion strategy, the matching rates at rank-1 on PRID-2011 are improved by 6.99%, 6.62%, and 14.50% compared to the worst ones in the cases of using four key frames, a walking cycle, and all frames, respectively. For iLIDS-VID, the highest improvements are 6.65%, 7.13%, and 7.47% when using four key frames, a walking cycle, and all frames, respectively. These results show more clearly the effectiveness of the fusion scheme in achieving better performance for the person ReID problem.
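To make the pooling and fusion operations discussed above concrete, the following minimal Python sketch shows how frame-level feature vectors could be pooled over the selected frames (average, max, or min) and how the pooled descriptors from the four color spaces could be fused by concatenation into one person signature. The function names and the random stand-in features are illustrative assumptions; the sketch does not reproduce the exact two-stage pooling variants (e.g. AVG-MAX) evaluated in Tables 2.4-2.6.

```python
import numpy as np

def pool_frames(frame_features, mode="avg"):
    """Pool a stack of frame-level feature vectors (n_frames x dim) into one vector."""
    ops = {"avg": np.mean, "max": np.max, "min": np.min}
    return ops[mode](frame_features, axis=0)

def person_signature(features_per_color_space, mode="avg"):
    """Fuse pooled descriptors from several color spaces by concatenation.

    features_per_color_space: dict mapping a color-space name (e.g. 'RGB',
    'Lab', 'HSV', 'nRnG') to an (n_frames x dim) array of frame descriptors.
    """
    pooled = [pool_frames(f, mode) for f in features_per_color_space.values()]
    return np.concatenate(pooled)

# Example with random stand-in features: 4 key frames, 100-D descriptors per color space.
rng = np.random.default_rng(0)
feats = {cs: rng.random((4, 100)) for cs in ["RGB", "Lab", "HSV", "nRnG"]}
signature = person_signature(feats, mode="avg")   # 400-D fused signature
```

With average pooling and the four color spaces used in this chapter, the fused signature is simply the concatenation of four averaged per-color-space descriptors.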
Figure 2.15: Matching rates when selecting 4 key frames or 4 random frames for person representation on (a) PRID-2011 and (b) iLIDS-VID. [Rank-1 rates on PRID-2011: 77.19% (4 key frames), 76.53% (4 random frames); on iLIDS-VID: 41.01% (4 key frames), 39.03% (4 random frames).]

In order to prove that four key frames are the best choice for balancing ReID accuracy and computation time, an extra scenario is performed, in which four key frames are chosen randomly within a walking cycle. Note that in these experiments, average pooling is still used for generating the final signature of each person, and the results are obtained in the case of combining all feature vectors from the four color spaces (RGB, Lab, HSV, nRnG). Figure 2.15 shows the matching rates in the two considered scenarios, four key frames and four random frames, on both evaluated datasets, PRID-2011 and iLIDS-VID. As can be observed, the CMC curves corresponding to four random frames lie below those corresponding to four key frames on both PRID-2011 and iLIDS-VID. This means that by choosing four key frames, the proposed framework achieves a higher performance, and it is also the reason why four key frames are considered in this chapter.

Table 2.7 shows the matching rates at several important ranks (1, 5, 10, 20) in the two scenarios considered above. Additionally, these results are also compared to the one reported in the study of Nguyen et al. [58], in which one frame of each person in PRID-2011 is chosen randomly and the multi-shot problem is turned into a single-shot one. It is obvious that the matching rates in the case of using four key frames are much higher than those in the case of choosing only one frame for each person.

Table 2.7: Matching rates (%) at several important ranks when using four key frames, four random frames, and one random frame on the PRID-2011 and iLIDS-VID datasets.

Methods | PRID-2011 (r=1 / r=5 / r=10 / r=20) | iLIDS-VID (r=1 / r=5 / r=10 / r=20)
Four Key Frames | 77.19 / 94.70 / 97.93 / 99.37 | 41.09 / 69.51 / 80.94 / 90.43
Four Random Frames | 76.53 / 93.83 / 97.47 / 99.29 | 39.03 / 67.09 / 78.27 / 88.18
One Random Frame [58] | 53.60 / 79.76 / 88.93 / 95.79 | - / - / - / -
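The matching rates reported in Table 2.7 and throughout this chapter follow the standard Cumulative Matching Characteristic (CMC) protocol: for every probe signature, the gallery is ranked by distance, and the rate at rank r is the fraction of probes whose correct match appears within the top r candidates. The sketch below is a minimal single-shot version of this evaluation; the Euclidean distance is used as a stand-in for the learned metric, and all variable names are assumptions.

```python
import numpy as np

def cmc_curve(probe_feats, probe_ids, gallery_feats, gallery_ids, max_rank=20):
    """Compute CMC matching rates (%) for ranks 1..max_rank.

    probe_feats: (n_probe x dim), gallery_feats: (n_gallery x dim) signatures.
    Euclidean distance is used here as a stand-in for the learned metric.
    """
    # Pairwise distances between every probe and every gallery signature.
    dists = np.linalg.norm(probe_feats[:, None, :] - gallery_feats[None, :, :], axis=2)
    hits = np.zeros(max_rank)
    for i, pid in enumerate(probe_ids):
        ranking = np.argsort(dists[i])                       # gallery sorted by distance
        match_rank = np.where(gallery_ids[ranking] == pid)[0][0]
        if match_rank < max_rank:
            hits[match_rank:] += 1                           # counted for every rank >= match
    return 100.0 * hits / len(probe_ids)

# cmc = cmc_curve(probe, probe_ids, gallery, gallery_ids)
# cmc[[0, 4, 9, 19]] would give the rank-1, rank-5, rank-10, and rank-20 rates.
```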
2.3.2. Quantitative evaluation of the trade-off between the accuracy and computational time

As mentioned in the previous section, using all frames of a person achieves the best ReID performance. However, this scheme requires high computational time and memory storage. In this section, a quantitative evaluation of the schemes is given to point out the advantages and disadvantages of the proposed representative frame selection schemes. Table 2.8 shows the comparison of the three schemes in terms of person ReID accuracy, computational time, and memory requirement on the PRID 2011 dataset.

Table 2.8: Comparison of the three representative frame selection schemes in terms of accuracy at rank-1, computational time (per query person, in seconds), and memory requirement on the PRID 2011 dataset.

Methods | Accuracy at rank-1 (%) | Frame selection (s) | Feature extraction (s) | Feature pooling (s) | Person matching (s) | Total time (s) | Memory
Four key frames | 77.19 | 7.500 | 3.960 | 0.024 | 0.004 | 11.488 | 96 KB
Walking cycle | 79.10 | 7.500 | 12.868 | 0.084 | 0.004 | 20.452 | 312 KB
All frames | 90.56 | 0.000 | 98.988 | 1.931 | 0.004 | 100.919 | 2,400 KB

The computational time for each query person includes the time for representative frame selection, feature extraction, feature pooling, and person matching. It is worth noting that the computational time for person matching depends on the size of the dataset (i.e. the number of individuals). The values reported in Table 2.8 are computed for a random split of the PRID 2011 dataset. This dataset contains 178 pedestrians, of which 89 individuals are used for the training phase and the rest are used for the testing phase. The average number of images for each person in both camera A and camera B is about 100, and each walking cycle contains approximately 13 images on average.

Figure 2.16: The distribution of frames for each person in the PRID 2011 dataset for (a) the camera A view and (b) the camera B view.

The experiments are conducted on a computer with an Intel(R) Core(TM) i5-4440 CPU @ 3.10GHz and 16GB RAM. As shown in Table 2.8, using all frames does not require frame selection, so the computational time for frame selection in this scheme is zero. The two other schemes spend 7.5 s on extracting walking cycles and selecting four key frames. Concerning the computational time of the feature extraction step, for each image with a resolution of 128 × 64, the implementation takes about 0.99 s to extract feature vectors on all color spaces (RGB/Lab/HSV/nRnG) and concatenate them together. The feature extraction times are therefore 3.96 s, 12.87 s, and 98.99 s in the cases of using four key frames, a walking cycle, and all frames, respectively. The computational time spent on pooling the data is 0.024 s, 0.084 s, and 1.931 s for four key frames, a walking cycle, and all frames, respectively. In the person matching step, the training phase is performed offline, so the execution time is calculated for the testing phase only. After the data pooling layer, each individual is represented by a single feature vector, so the computational time of this step is a constant value for all cases, namely 0.004 s. The total computational time for each query person is 11.488 s, 20.452 s, and 100.919 s for four key frames, one walking cycle, and all frames, respectively.

Concerning the memory requirement, as an image of 128 × 64 pixels with a 24-bit color depth takes 24 KB (128 × 64 × 24 = 196,608 bits = 24 KB), the required memory for the four key frames, one walking cycle, and all frames schemes is 96 KB (24 × 4), 312 KB (24 × 13), and 2,400 KB (24 × 100), respectively.
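The totals and memory figures in Table 2.8 follow directly from the per-step times and from the 24 KB-per-image estimate above. The short sketch below simply reproduces this bookkeeping; the scheme names and per-step values are copied from Table 2.8, and small differences against the reported totals are due to rounding of the per-step values.

```python
# Bookkeeping behind Table 2.8 (per-step times and image counts taken from the text above).
KB_PER_IMAGE = 128 * 64 * 24 / 8 / 1024   # a 128 x 64, 24-bit image occupies 24 KB

schemes = {
    # name: (number of images, frame selection s, feature extraction s, pooling s, matching s)
    "four key frames": (4,   7.500,  3.960, 0.024, 0.004),
    "walking cycle":   (13,  7.500, 12.868, 0.084, 0.004),
    "all frames":      (100, 0.000, 98.988, 1.931, 0.004),
}

for name, (n_images, *step_times) in schemes.items():
    total_time = sum(step_times)            # close to the totals reported in Table 2.8
    memory_kb = n_images * KB_PER_IMAGE     # 96 KB, 312 KB, and 2,400 KB respectively
    print(f"{name}: {total_time:.3f} s per query person, {memory_kb:.0f} KB of images")
```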
It can be seen from the experimental results that using all frames in a walking cycle increases the matching rate at rank-1 by approximately 2%, while the computational time is nearly twice that of the four key frames case. Among the three schemes, using all frames achieves impressive results (the accuracy at rank-1 is 90.56%); however, it requires higher computational time (8.79 times higher than that of the four key frames scheme) and memory (25 times higher than that of the four key frames scheme). To see the requirement of this scheme in detail, the distribution of the frames for each person in the PRID 2011 dataset (see Fig. 2.16) is computed. As seen from the figure, in the PRID 2011 dataset, 78.09% and 41.01% of all people have more than 100 frames in camera A and camera B, respectively. This means that in surveillance scenarios, the number of available frames for a person is high in the majority of cases. Therefore, using all of these frames costs computational time and memory, while its performance is superior to those of four key frames and one walking cycle mainly at rank-1. For higher ranks, the difference between these accuracies is not significant. For example, the accuracies at rank-5 of using four key frames, one walking cycle, and all frames are 94.70%, 94.99%, and 97.98% respectively, while those at rank-10 are 97.93%, 97.92%, and 99.55%.

The evaluation results allow us to make recommendations for the use of these representative frame selection schemes. If the dataset of the target application is challenging and the application requires relevant results to be returned within the first few ranks, the all frames scheme for person representation is recommended. This scheme is also a suitable choice when powerful computational resources are available. Otherwise, the four key frames and one walking cycle schemes should be considered.

2.3.3. Comparison with state-of-the-art methods

This section compares the proposed method with state-of-the-art works for person ReID on the two benchmark datasets PRID 2011 and iLIDS-VID. The methods are arranged into two groups: unsupervised learning-based methods and supervised learning-based methods. This study belongs to the supervised learning group. The comparisons are shown in Table 2.9, with the two best results in bold. The comparison leads to three observations.

Firstly, in comparison with unsupervised methods, the proposed method outperforms almost all unsupervised learning-based methods on both the PRID 2011 and iLIDS-VID datasets. Remarkably, the results on PRID 2011 for the three cases, four key frames, all frames in a walking cycle, and all frames, are 77.2%, 79.1%, and 90.5% at rank-1, respectively. They are much higher than the state-of-the-art proposals TAUD (49.4%) [133] and UTAL (54.7%) [134]. The same holds for iLIDS-VID, with 41.1%, 44.1%, and 70.1% for the proposed method compared to 26.7% [133] and 35.1% [134]. Among the considered unsupervised methods, the method SMP [24], which uses each training tracklet as a query and performs retrieval in the cross-camera training set, provides a slightly higher result than the proposed method in the walking cycle case for PRID (80.9% compared to 79.1%) but a lower one for iLIDS-VID (41.7% compared to 44.1%) at rank-1. However, in the case of all frames, the rank-1 matching rates of the proposed method are higher on both datasets in comparison with the SMP method (90.5% against 80.9% and 61.7% against 41.7%). The performance gains of the proposed method are due to the robustness of the feature used for person representation and the metric learning XQDA.
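XQDA learns, from cross-view training pairs, a low-dimensional subspace together with a metric derived from the covariances of intra-person and inter-person difference vectors, and distances between probe and gallery signatures are then measured in that subspace. The sketch below is a heavily simplified, assumption-laden illustration of this idea (brute-force difference sets, fixed subspace dimension, no efficient covariance computation); it is not the implementation used in this thesis, for which the reader is referred to the original XQDA formulation.

```python
import numpy as np
from scipy.linalg import eigh, inv

def xqda_like(train_a, train_b, labels_a, labels_b, dim=64, reg=1e-3):
    """Learn a projection W and metric M from cross-view training features.

    train_a, train_b: (n x d) feature matrices from camera A and camera B.
    The distance between two signatures x and y is then
    d(x, y) = (x - y) W M W^T (x - y)^T.
    """
    d = train_a.shape[1]
    # Difference vectors of cross-view pairs with the same / different identity.
    intra = [a - b for a, la in zip(train_a, labels_a)
                   for b, lb in zip(train_b, labels_b) if la == lb]
    extra = [a - b for a, la in zip(train_a, labels_a)
                   for b, lb in zip(train_b, labels_b) if la != lb]
    cov_i = np.cov(np.asarray(intra), rowvar=False) + reg * np.eye(d)
    cov_e = np.cov(np.asarray(extra), rowvar=False) + reg * np.eye(d)

    # Directions that maximise inter-person over intra-person variance.
    eigvals, eigvecs = eigh(cov_e, cov_i)            # generalised eigenproblem
    W = eigvecs[:, np.argsort(eigvals)[::-1][:dim]]

    # Metric in the projected subspace: inv(Sigma_I') - inv(Sigma_E').
    M = inv(W.T @ cov_i @ W) - inv(W.T @ cov_e @ W)
    return W, M

def xqda_distance(x, y, W, M):
    diff = (x - y) @ W
    return float(diff @ M @ diff)
```

In the pipeline of this chapter, the pooled and fused person signatures described earlier would play the role of the cross-view training features train_a and train_b.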
Secondly, in comparison with the methods in the same group (supervised learning-based methods), the proposed method outperforms all of the 11 state-of-the-art methods at rank-1 on the PRID 2011 dataset. Interestingly, the proposed method brings a significant ReID accuracy gain in comparison with the deep learning-based methods on PRID 2011. The rank-1 margins of the proposed method are 11.8% (90.5% - 78.7%) and 20.5% (90.5% - 70.0%) over the methods using deep recurrent neural networks [135], [57], respectively. When working with the iLIDS-VID dataset, which suffers from strong occlusions, the proposed method still obtains good results at rank-1. However, its performance is inferior to two methods in the list [135], [136]. Among the state-of-the-art methods, the method in [136] obtains impressive results on both PRID 2011 and iLIDS-VID. However, this method requires a lot of computation, as it exploits both appearance and motion features extracted through a two-stream convolutional network architecture.

Finally, a close-up comparison is given with the state-of-the-art methods [102, 100, 101] that are most closely related to the proposed work. These works all extract walking cycles, based on motion energy, for representative frame selection before the feature extraction step. For the PRID 2011 dataset, the proposed framework clearly outperforms HOG3D [102] and STFV3D [100]. The matching rate at rank-1 of CAR [101] is higher than that of the proposed framework; nevertheless, at higher ranks (e.g. rank-5, rank-20), the proposed method achieves better performance. For the iLIDS-VID dataset, the proposed method still outperforms HOG3D [102] and obtains competitive results in comparison with STFV3D [100].

In order to demonstrate more clearly the effectiveness of the framework proposed in this chapter compared with related existing studies, we analyze and discuss three studies reported in [103, 104, 105]. All of these studies perform representative frame selection based on a clustering algorithm. In [103], representative frames are chosen by computing the similarity between two consecutive images using a Histogram of Oriented Gradients (HOG) descriptor extracted on each frame. A curve of consecutive image similarities is then plotted, and the local minima of the curve, corresponding to high variations in appearance, are found. Via these minima, the trajectory is divided into several segments, each modeling a posture variation. The key frame of each cluster is chosen by taking the center image. It is clear that the representative frames are chosen based on the similarity between two consecutive frames, which strongly depends on the human pose as the person moves through the camera network. Additionally, the number of images in each fragment differs, which becomes more apparent across different camera fields of view. In contrast, in the framework proposed in this chapter, motion energy is computed only on the lower human body, which is more stable and robust than computing the HOG descriptor on the whole human body as in [103]. In [104, 105], the authors proposed to exploit the Covariance descriptor for image representation and the Mean-shift algorithm for clustering person images into several groups. However, groups containing fewer images than the average number of images over all clusters are removed. Representative frames are then chosen as the frames that have the minimum distance to the center of each cluster. The research in [105] is an extended version of [104] in which three descriptors, including HOG, Covariance, and multi-