Thesis: Researching on the development of hardware implementation solution for the context-adaptive binary arithmetic coder in the HEVC standard


PDF, 122 pages. Uploaded by nguyenduy on 22/07/2024.

and proposing solutions to 
improve performance efficiency for the residual syntax element generation 
module, binarization module and Binary Arithmetic Encoding (BAE) module. 
2.1. Proposed functional block diagram of CABAC encoder architecture
Based on the research directions identified in Chapter 1, the hardware architecture of the CABAC encoder proposed in this thesis is shown in Figure 2.1. The thesis focuses on effective design solutions for the major functional modules: the syntax element generation module, the binarization module and the binary arithmetic encoding (BAE) module. In the syntax element generation module, the “one scan for multiple syntax element generation” technique is proposed to reduce the number of memory accesses and thereby save dynamic power. This module accesses memory and scans the coefficient matrices to generate the residual syntax elements. Memory accesses consume dynamic power, and this consumption increases in proportion to the number of accesses. The proposed architecture performs a single scan to determine several syntax elements instead of multiple successive scans of the coefficient matrix. This reduces the number of memory accesses, and therefore the dynamic power consumption, compared with the multi-pass procedure described in the standard and with other published designs.
Figure 2.1. Proposed hardware architecture of CABAC encoder 
For the Binarizer, the thesis proposes a “combined binarization” solution for the “last significant coefficient position” SE. Typically, the x and y coordinates of the last significant coefficient position are determined simultaneously at the output of the Syntax Element Generation module and are binarized by two separate Truncated Rice modules. In the proposed “combined binarization” architecture, by contrast, the two coordinates are binarized on the same datapath by the same Truncated Rice hardware. This reduces the amount of logic needed to implement the binarization module.
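The idea of sharing one Truncated Rice unit between the two coordinates can be sketched in software as follows. This is only an illustration under stated assumptions, not the thesis hardware: the function names are hypothetical, the Rice parameter is assumed to be 0, and cMax is assumed to be 3 for a 4×4 block.

```python
def truncated_rice_bins(value, c_max, rice_param=0):
    """Truncated Rice binarization of one value (assumed parameters, see lead-in)."""
    prefix_val = value >> rice_param
    prefix_max = c_max >> rice_param
    if prefix_val < prefix_max:
        bins = [1] * prefix_val + [0]   # unary prefix terminated by a 0
    else:
        bins = [1] * prefix_max         # truncated prefix: no terminating 0
    if c_max > value and rice_param > 0:
        suffix = value - (prefix_val << rice_param)
        # fixed-length suffix of rice_param bits, most significant bit first
        bins += [(suffix >> i) & 1 for i in reversed(range(rice_param))]
    return bins


def binarize_last_position(x, y, c_max=3):
    """Pass x, then y, through the same routine (one shared 'datapath')."""
    return [truncated_rice_bins(coord, c_max) for coord in (x, y)]


if __name__ == "__main__":
    print(binarize_last_position(3, 0))
```

In hardware, the two calls would correspond to two consecutive uses of a single Truncated Rice block rather than two blocks working side by side.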
In the BAE module, the thesis proposes the Multiple Bypass Bin Processing architecture, which processes multiple bypass bins simultaneously to improve throughput. In this architecture, the “pre-multiplication” and “Unify datapath” techniques are applied to minimize the hardware resources used by the BAE module and, consequently, by the CABAC encoder.
[Figure 2.1 block labels: Coefficients; Residual SE Generation; SE; FIFO Buffer; Binarizer; M-ary SE; Binary SE; bin; Bypass/regular; regular; bypass; Context Models; model; Update_model; State_LUT; rLPS_LUT; rLPS; BAE; Range Update; Low Update; Byte_out; Encoded bits; CABAC]
2.2. Binarizer 
2.2.1. Data statistics of Binarizer 
In the CABAC architecture, the Binarizer converts SEs into bin strings that feed the BAE module. Figure 2.2 shows the block diagram of the HEVC encoder, including the data streams from the other functional components of the HEVC encoder to the input of the Binarizer.
Figure 2.2. Data inputs of CABAC encoder 
The input data of the CABAC encoder comprise residual data (Transform Coefficients), General Control Data, Intra Prediction Data, Motion Data and Filter Control Data. Each of these is characterized by a set of SEs specified by the standard, which gives an abstract representation of the information to be transmitted [22]. Depending on the characteristics, occurrence frequency and share of each type of information, the corresponding SEs are encoded in different ways. As mentioned in Chapter 1 (Figure 1.14), the HEVC frame structure contains stable information that occurs infrequently and accounts for a small percentage of the total bitstream. This information is usually located in the header and conveys parameters such as general control, configuration, resolution and frame rate. Encoding this information has little effect on overall compression; therefore, the HEVC standard specifies basic encoding methods (FLC and VLC) for it to simplify implementation. In contrast, the residual data encapsulated in the CTUs account for the majority (75% on average) of the total input data [49]. Table 2.1 shows the percentage of the main syntax elements at the CABAC encoder input. The residual data (Transform Coefficients), comprising Transform Luma and Transform Chroma, occupy a significant portion of the encoder input data. In addition, the residual data fluctuate with the characteristics of each video stream. For these reasons, improving the overall compression efficiency requires a highly efficient encoding method for this video data.
Table 2.1. Statistics of input data types of CABAC [49]
Data type                     Frame type
                              I      B0     B1     B2     B3
Transform Luma                66%    67%    69%    69%    75%
Transform Chroma              17%    15%    13%    13%    6%
Intra/Inter Prediction data   8%     7%     7%     8%     9%
Figure 2.3. Illustration of CTU structure 
The method of dividing a frame into CTUs is shown in Figure 2.3. In the HEVC standard, the CTU is the largest coded block, from which the smaller CUs are generated. Each CU is simultaneously divided in two ways, into PUs and TUs, which are used in the prediction and transform processes, respectively. Therefore, to fully represent the information needed for successful decoding, the bitstream must contain, in addition to the TU and PU blocks, syntax elements specifying this division level (CTU, CU). These syntax elements are grouped into CTU/CU bins, which accompany the TU and PU bins in the bitstream.
Table 2.2 shows the percentage of each type of bin under different coding configurations: AI (All Intra), LD-P (Low Delay P-frame), LD-B (Low Delay B-frame) and RA (Random Access) [49]. In all test configurations, the set of SEs representing image data (TU data) occupies a large portion of the CABAC data, from 63.7% to 94%.
Table 2.2. Major bin contributors among the HEVC data hierarchy [49]
Hierarchy level   Common Test Condition                              Worst-case
                  AI       Low Delay   Low Delay   Random Access
                           P frame     B frame
CTU/CU bin        5.4%     15.8%       16.7%       11.7%            1.4%
PU bin            9.2%     20.6%       19.5%       18.8%            5.0%
TU bin            85.4%    63.7%       63.8%       69.4%            94.0%
2.2.2. The structure of residual syntax elements 
a) Scanning method for TU data 
By dividing frames into CTUs, the HEVC standard allows the residual data to be encoded as TUs (Transform Units) of size N×N (4×4, 8×8, 16×16 and 32×32). In the transformation step, TUs are transformed into matrices of Transform Coefficients whose size corresponds to that of each input TU. While the H.264/AVC standard applies the Zigzag Scan Pattern method, HEVC applies the Diagonal Scan Pattern method to TBs (Transform Blocks) to convert 2-D matrices into 1-D arrays of Transform Coefficients. In diagonal scanning, the scan starts at the bottom-right corner and traverses diagonally up to the top-left corner of the TB. Figure 2.4 shows the difference between the scanning methods of the two standards.
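As a minimal sketch of the diagonal scan order described above, the following Python fragment enumerates the positions of an n×n block along its anti-diagonals and then numbers them in the order they are visited when the scan starts at the bottom-right corner. The function names are hypothetical and introduced only for illustration.

```python
def diagonal_scan_positions(n=4):
    """(row, col) positions of an n x n block in forward up-right diagonal order."""
    positions = []
    for d in range(2 * n - 1):        # d = row + col identifies one anti-diagonal
        for col in range(n):          # walk the anti-diagonal from bottom-left to top-right
            row = d - col
            if 0 <= row < n:
                positions.append((row, col))
    return positions


def reverse_scan_index_map(n=4):
    """Entry [row][col] is the step at which that position is visited
    when the scan starts at the bottom-right corner."""
    order = [[0] * n for _ in range(n)]
    for step, (row, col) in enumerate(reversed(diagonal_scan_positions(n))):
        order[row][col] = step
    return order


if __name__ == "__main__":
    for row in reverse_scan_index_map(4):
        print(" ".join(f"{v:2d}" for v in row))
```

For n = 4 this gives the rows 15 13 10 6, 14 11 7 3, 12 8 4 1 and 9 5 2 0, which is the diagonal-scan numbering of Figure 2.4.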
Figure 2.4. Comparison of Zigzag scan and diagonal scan in HEVC 
The Diagonal Scan Pattern method is applied at different levels within TBs, dividing large TBs into 4×4 TBs (sub-TBs) before a set of residual syntax elements is generated for each sub-TB. First, the Diagonal Scan Pattern method is applied to divide large TBs into sub-TBs [32], [54]. Then, the method is applied again to scan each sub-TB, converting its 4×4 Transform Coefficient matrix into a 1-D array of 16 consecutive transform coefficients. This 16-coefficient array is called the CG (Coefficient Group), from which the residual SEs are generated. This procedure is illustrated in Figure 2.5 [34], where a 16×16 TB is scanned to form 16 sub-TBs and every sub-TB is scanned to generate residual syntax elements.
Figure 2.5. Application of diagonal scan [29]. 
[Figure 2.4 data, scan order of a 4×4 block (numbers give the scan order):
Zigzag scan:
 0  1  5  6
 2  4  7 12
 3  8 11 13
 9 10 14 15
Diagonal scan:
15 13 10  6
14 11  7  3
12  8  4  1
 9  5  2  0
Figure 2.5 panels: (a) TB 16×16 (16 samples per side); (b) sub-TB 4×4 (4 samples per side)]
b) Forming the group of residual syntax elements at the CABAC input
As discussed, CABAC is mainly used to encode the residual video data before they are merged into the output bitstream for transmission. Once the CGs are formed, the HEVC standard specifies a set of residual syntax elements for each CG that gives an abstract representation of the image data before encoding. For each CG, this group of residual syntax elements is determined through several scan passes using different algorithms. Table 2.3 describes the set of syntax elements for the data in each CG.
Table 2.3. Set of syntax elements for a 4×4 TU
Syntax Element                  Description
last_sig_coeff_x                X coordinate of the first non-zero coefficient in scanning order within CG
last_sig_coeff_y                Y coordinate of the first non-zero coefficient in scanning order within CG
sig_coeff_flag                  Flags indicating the significance of a coefficient (zero/non-zero)
coeff_abs_level_greater1_flag   Flags indicating whether the absolute value of a coefficient level is greater than 1
coeff_abs_level_greater2_flag   Flags indicating whether the absolute value of a coefficient level is greater than 2
coeff_sign_flag                 Flags indicating the sign of a significant coefficient (0: positive; 1: negative)
coeff_abs_level_remaining       Remaining value for the absolute value of a coefficient level
Figure 2.6. Diagonal scanning of transform coefficients
[Figure 2.6 data: the 4×4 sub-TB
 9  3  0 -1
-6  0  0  0
 0  1  0  0
 0  0  0  0
is diagonally scanned into the coefficient group 9 -6 3 0 0 0 0 1 0 -1 0 0 0 0 0 0, on which the scan passes produce the residual SEs]
Figure 2.6 describes the diagonal scanning of each sub-TB to form a CG, on which six scan passes are then performed to determine the set of residual syntax elements.
Figure 2.7 illustrates the six scan passes as described in the HEVC standard [25]. The first scan pass determines the position of the last significant coefficient in the CG, which is called last_sig_coeff_post. This position is specified by the x and y coordinates in the 4×4 matrix and is represented by two syntax elements named last_significant_coeff_x and last_significant_coeff_y (listed as last_sig_coeff_x and last_sig_coeff_y in Table 2.3). The last_sig_coeff_post is also the entry point for the next five scan passes, which define the remaining five syntax elements in Table 2.3. Thus, defining the set of residual syntax elements requires six consecutive scan passes on each CG.
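The formation of the CG and the first scan pass can be illustrated with the sub-TB of Figure 2.6. The sketch below is a software illustration only; the function names are hypothetical, and the block values are those of the worked example in the figures.

```python
# 4x4 sub-TB of the worked example (rows top to bottom, columns left to right)
BLOCK = [
    [9, 3, 0, -1],
    [-6, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]


def diagonal_positions(n=4):
    """(row, col) positions in forward diagonal order, starting at DC (0, 0)."""
    return [(d - col, col)
            for d in range(2 * n - 1)
            for col in range(n)
            if 0 <= d - col < n]


def to_coefficient_group(block):
    """Flatten the block into the 1-D coefficient group (CG)."""
    return [block[row][col] for row, col in diagonal_positions(len(block))]


def last_significant_position(block):
    """First scan pass: (x, y) of the last non-zero coefficient in forward order."""
    last = None
    for row, col in diagonal_positions(len(block)):
        if block[row][col] != 0:
            last = (col, row)   # x is the column index, y is the row index
    return last


if __name__ == "__main__":
    print(to_coefficient_group(BLOCK))
    print(last_significant_position(BLOCK))
```

For this block the CG is 9 -6 3 0 0 0 0 1 0 -1 followed by six zeros, and the last significant position is x = 3, y = 0, so the remaining passes only need to visit ten coefficients.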
Figure 2.7. Illustration of Syntax Element generation for 4×4 TB
[Figure 2.7 panels: the 4×4 coefficient block (rows 9 3 0 -1; -6 0 0 0; 0 1 0 0; 0 0 0 0) with last_sig_coeff_x (3) and last_sig_coeff_y (0), followed by per-position maps for sig_coeff_flag, coeff_abs_level_greater1_flag, coeff_abs_level_greater2_flag, coeff_sign_flag and coeff_abs_level_remaining]
Figure 2.8 shows the results of the scans on the CG, giving the values of the syntax elements and their order at the input of the CABAC encoder's Binarizer. In the HEVC standard, syntax elements of the same type are organized in separate groups. In addition, regular-coded bins are separated from bypass bins. This algorithmic improvement of HEVC over H.264/AVC makes it possible to apply parallel and pipelined solutions in the hardware architecture of the CABAC encoder, as presented in detail in the following sections of the thesis.
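The multi-pass procedure can be sketched as follows for the worked example, with one pass per syntax element type. This is a simplified illustration, not the normative HEVC process: the function names are hypothetical, the per-CG limit on the number of greater1 flags is ignored, and the significance flag of the last position is listed explicitly, as in Figure 2.8.

```python
# Coefficient group of the worked example, in forward diagonal order
CG = [9, -6, 3, 0, 0, 0, 0, 1, 0, -1, 0, 0, 0, 0, 0, 0]


def diagonal_positions(n=4):
    """(row, col) positions in forward diagonal order."""
    return [(d - c, c) for d in range(2 * n - 1) for c in range(n) if 0 <= d - c < n]


def residual_syntax_elements(cg):
    # Pass 1: last significant coefficient, the entry point of the other passes
    last = max(i for i, c in enumerate(cg) if c != 0)
    y, x = diagonal_positions(4)[last]
    scanned = cg[last::-1]                    # reverse scan, back to DC
    nonzero = [c for c in scanned if c != 0]
    # Pass 2: significance flags
    sig = [int(c != 0) for c in scanned]
    # Pass 3: greater-than-1 flags of the significant coefficients
    gt1 = [int(abs(c) > 1) for c in nonzero]
    # Pass 4: a single greater-than-2 flag, for the first coefficient with |c| > 1
    first_big = next((c for c in nonzero if abs(c) > 1), None)
    gt2 = [] if first_big is None else [int(abs(first_big) > 2)]
    # Pass 5: sign flags (1 means negative)
    sign = [int(c < 0) for c in nonzero]
    # Pass 6: remaining absolute levels
    remaining, seen_big = [], False
    for c in nonzero:
        if abs(c) > 1:
            if not seen_big:                  # this coefficient carries the greater2 flag
                seen_big = True
                if abs(c) > 2:
                    remaining.append(abs(c) - 3)
            else:                             # later coefficients have no greater2 flag
                remaining.append(abs(c) - 2)
    return [x, y], sig, gt1, gt2, sign, remaining


if __name__ == "__main__":
    groups = residual_syntax_elements(CG)
    print([value for group in groups for value in group])
```

Concatenating the groups in this order reproduces the output sequence of Figure 2.8 for this example, with the flag groups ahead of the sign and remaining-level values.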
Figure 2.8. Generated Syntax Elements and order of output sequence
[Figure 2.8 data:
Syntax element                  Values
last_sig_coeff_x                3
last_sig_coeff_y                0
sig_coeff_flag                  1 0 1 0 0 0 0 1 1 1
coeff_abs_level_greater1_flag   0 0 1 1 1
coeff_abs_level_greater2_flag   1
coeff_sign_flag                 1 0 0 1 0
coeff_abs_level_remaining       0 4 7
Output order: 3 0 1 0 1 0 0 0 0 1 1 1 0 0 1 1 1 1 1 0 0 1 0 0 4 7
The output sequence is labelled with a Regular mode segment and a Bypass mode segment.]
2.2.3. The drawbacks of multi-core syntax element generation architecture
As stated in Chapter 1, CABAC is the main throughput bottleneck in the HEVC architecture because of the high correlation within the input data sequences and the sequential bin-to-bin encoding principle. Since the standard was published, research has focused on improving the performance of the CABAC encoder through efficient architectural solutions. Among them, various highly efficient design solutions have been adopted for the Binarization and BAE modules. As a result, the proposed CABAC encoders can process 4K/8K video streams in real time. In recent years, once the intrinsic problems of CABAC had been addressed, attention has turned to the preprocessing of the CABAC input data, i.e. the syntax element sequences. Residual data are of particular concern because they account for a large percentage of the CABAC-encoded data, as discussed in the previous section. Once a high-throughput CABAC encoder is available, its data provider, i.e. the residual syntax element generation module, has to be a high-throughput design as well.
The Sergio Bampi research group is prominent in this trend; they intensively analyze the statistical characteristics of the residual data stream to design high-speed hardware architectures for the residual syntax element generation module [49], [51]. In [49], the authors proposed the Multiple Residual Syntax Element Treatment (MRSET) solution in designing the Four-Core Multiple Residual Syntax Element Generation architecture. Figure 2.9 shows the proposed MRSET architecture [51]. The 4-core MRSET solution is applied to each 4×4 TB to generate the residual syntax elements. In each clock cycle, the 4-core MRSET architecture can process 4 transform coefficients simultaneously, which speeds up the syntax element generation module.
Figure 2.9. Architecture of the four-core MRSET
[Figure 2.9 labels: Transform coefficients; Core 0, Core 1, Core 2, Core 3; 1st, 2nd and 3rd cycles; Output syntax elements]
The four-core MRSET architecture can provide the input data throughput that CABAC needs to encode 4K/8K video streams. However, applying the parallel four-core architecture to every 4×4 TB is inefficient in terms of hardware resource usage. The TB data fluctuate over time and depend on the visual characteristics of each video stream. At the 4×4 TB level, the number of samples that need to be scanned may vary from 1 to 16 (depending on the position of the last significant coefficient). Therefore, the architecture uses its hardware efficiently only when the number of samples is large enough to keep the 4 cores working in parallel for at least three cycles. In contrast, when the number of samples is 4 or fewer, processing is faster than the 8K video format requires, yet all 4 cores are still running in parallel. Some TB data samples of the 4K/8K video stream are shown in Figure 2.10.
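The imbalance described above can be made concrete with a small model. It assumes, purely for illustration, that each core handles one coefficient per clock cycle and that a core without a coefficient sits idle; the utilization figure and the function names are assumptions introduced here, not measurements from [49] or [51].

```python
import math


def mrset_cycles(num_samples, cores=4):
    """Clock cycles needed when `cores` coefficients are processed per cycle."""
    return math.ceil(num_samples / cores)


def core_utilization(num_samples, cores=4):
    """Fraction of core-cycles that actually process a coefficient."""
    return num_samples / (mrset_cycles(num_samples, cores) * cores)


if __name__ == "__main__":
    for n in (1, 4, 9, 11, 16):
        print(n, mrset_cycles(n), round(core_utilization(n), 2))
```

Under this model a sub-TB with a single DC coefficient occupies all four cores for one cycle at 25% utilization, whereas a sub-TB with 11 coefficients to scan takes three cycles and keeps the cores busy for 11 of the 12 available core-cycles.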
Figure 2.10. Typical pattern of transform coefficients in HEVC standard
[Figure 2.10 panels: (a) sub-TBs a-1, a-2, a-3, a-4; (b) sub-TBs b-1, b-2, b-3, b-4]
Figure 2.10a and Figure 2.10b show several samples of transform coefficients. They show that in each 4×4 TB the number of significant (non-zero) samples is modest and that these samples are concentrated around the DC position (0, 0). Therefore, the number of samples that need to be scanned for syntax element generation is much smaller than 16. For example, in Figure 2.10a, an 8×8 TB is divided into 4 sub-TBs, of which only 3 significant (non-zero) sub-TBs (a-2, a-3 and a-4) need to be scanned. Moreover, only sub-TB a-4 contains a relatively large number of significant coefficients (11), while the other two contain only the DC element. Similarly, in Figure 2.10b, only sub-TB b-3 contains 12 coefficients that need to be scanned, while the remaining blocks contain fewer than 10. Applying the four-core parallel MRSET architecture to these TBs is therefore inefficient, and there is an imbalance between the throughput requirement and the hardware complexity.
2.2.4. The “one scan for multiple syntax element generation” technique 
Based on the analysis of the image data statistics, the methods of generating the residual syntax elements and the related state-of-the-art results, the thesis proposes a combined scanning solution for the generation of several syntax elements. By evaluating the characteristics of the syntax element types (Table 2.3), the scanning algorithms [34] and the accompanying binarization methods, the proposed scanning technique performs one memory access to determine several syntax elements of each coefficient simultaneously. The proposed solution improves the dynamic power efficiency of the residual syntax element generation module, thanks to the reduced number of memory accesses.
Figure 2.11. Functional block diagram of residual syntax element generation and binarization modules
[Figure 2.11 labels: Residual Coefficients (0 0 0 0 0 0 -1 0 1 0 0 0 0 3 6 9); Residual syntax element generation module; Last_sig_coeff_x, Last_sig_coeff_y, Sig_coeff_flag, Coeff_abs_level_greater1_flag, Coeff_abs_level_greater2_flag, Coeff_sign_flag, Coeff_abs_level_remaining; signal widths shown: 2, 2, 16, 16, 16, 4; Binarizer module; Bin string (0 1 0 1 0 1)]
Figure 2.11 shows the functional block diagram of residual syntax element generation and binarization. The set of residual syntax elements representing a 4×4 TB is determined and then sent to the binarization module. Each residual syntax element type (Table 2.3) is converted to a bin string by its corresponding method. The bin string is then appended to the output bin sequence in the order shown in Figure 2.8.
From the generated residual syntax elements in Figure 2.7, the following observations can be made:
- The syntax elements last_sig_coeff_x, last_sig_coeff_y and coeff_abs_level_remaining are decimal values.
- The remaining syntax elements, sig_coeff_flag, coeff_abs_level_greater1_flag, coeff_abs_level_greater2_flag and coeff_sign_flag, are binary flags and are referred to as flagged_SE. Each of these flagged_SE types forms a vector: sig_coeff_flag_vector, coeff_abs_level_greater1_flag_vector and coeff_sign_flag_vector. The exception is coeff_abs_level_greater2_flag, which is only one bit per CG; it belongs to the first coefficient in scanning order whose absolute value is greater than 1 and indicates whether that absolute value is also greater than 2 [25].
Furthermore, in the HEVC standard, these flagged_SEs use the same binarization method, Fixed Length Binarization. Based on these observations, the thesis proposes the “one scan for multiple syntax element generation” technique (one-time memory access) to process these flagged_SEs. When this technique is applied in the hardware implementation of the syntax element generation module, the number of memory access passes is reduced by 3 compared with the traditional method (one scan instead of four). Fewer memory accesses reduce both the dynamic power consumption and the processing latency caused by memory access. The hardware architecture of the proposed solution is depicted in Figure 2.12. Instead of applying four scans to determine the flagged_SEs, the proposed architecture performs a single scan to generate the four types of syntax elements. This group of syntax elements is then binarized on the same Fixed Length binarization datapath in the binarization architecture.
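The saving in memory accesses can be sketched with a coefficient memory that counts its reads. The code below is a behavioural illustration under simplifying assumptions, not the architecture of Figure 2.12: the class and function names are hypothetical, every scan is assumed to read each coefficient from the last significant position back to DC, and the per-CG limit on greater1 flags is ignored.

```python
class CountingMemory:
    """Coefficient memory that counts every read access."""

    def __init__(self, coefficients):
        self._data = list(coefficients)
        self.reads = 0

    def read(self, index):
        self.reads += 1
        return self._data[index]


def four_scan_flags(mem, last):
    """Conventional approach: one full scan of the memory per flag type."""
    def scan():
        return [mem.read(i) for i in range(last, -1, -1)]

    sig = [int(c != 0) for c in scan()]
    gt1 = [int(abs(c) > 1) for c in scan() if c != 0]
    first_big = next((c for c in scan() if abs(c) > 1), None)
    gt2 = [] if first_big is None else [int(abs(first_big) > 2)]
    sign = [int(c < 0) for c in scan() if c != 0]
    return sig, gt1, gt2, sign


def one_scan_flags(mem, last):
    """Single scan: each coefficient is read once and feeds all four flag types."""
    sig, gt1, gt2, sign = [], [], [], []
    for i in range(last, -1, -1):
        c = mem.read(i)               # the only read of this coefficient
        sig.append(int(c != 0))
        if c == 0:
            continue
        gt1.append(int(abs(c) > 1))
        sign.append(int(c < 0))
        if abs(c) > 1 and not gt2:    # only the first such coefficient gets the flag
            gt2.append(int(abs(c) > 2))
    return sig, gt1, gt2, sign


if __name__ == "__main__":
    cg = [9, -6, 3, 0, 0, 0, 0, 1, 0, -1, 0, 0, 0, 0, 0, 0]
    multi, single = CountingMemory(cg), CountingMemory(cg)
    assert four_scan_flags(multi, 9) == one_scan_flags(single, 9)
    print(multi.reads, single.reads)
```

For the worked example, whose last significant coefficient sits at index 9 of the CG, both functions return the same four flag vectors, while the read counter shows 40 accesses for the four-scan version and 10 for the single scan.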
Figure 2

Attached files:

  • pdfluan_an_researching_on_the_development_of_hardware_implement.pdf
  • docThongTin KetLuanMoi LuanAn NCS TranDinhLam.doc
  • pdfTomTat LuanAn NCS TranDinhLam_English.pdf
  • pdfTomTat LuanAn NCS TranDinhLam_TiengViet.pdf
  • docTrichYeu LuanAn NCS TranDinhLam.doc