Thesis: Researching on the development of hardware implementation solution for the context-adaptive binary arithmetic coder in the HEVC standard
and proposing solutions to improve performance efficiency for the residual syntax element generation module, binarization module and Binary Arithmetic Encoding (BAE) module.

2.1. Proposed functional block diagram of CABAC encoder architecture

Based on the research orientations determined in Chapter 1, the proposed hardware architecture of the CABAC encoder is described in Figure 2.1. In this architecture, the thesis focuses on proposing effective design solutions for the major functional modules: the syntax element generation module, the binarization module and the binary arithmetic encoding (BAE) module.

In the syntax element generation module, the "one scan for multiple syntax element generation" technique is proposed to reduce the number of memory accesses, which saves dynamic power. This module accesses memory and scans the coefficient matrices to generate residual syntax elements. Memory accesses consume dynamic power, and this consumption increases in proportion to the number of accesses. The proposed architecture performs one scan to determine several syntax elements instead of multiple successive scans of the coefficient matrix. This reduces the number of memory accesses compared with conventional methods. Therefore, the dynamic power consumption is also reduced compared with the standard coding principle and with other research.

Figure 2.1. Proposed hardware architecture of CABAC encoder

With the Binarizer, the thesis proposes the “combined binarization” solution for binarizing the “last significant coefficient position” SE. Typically, the x and y coordinates of the “last significant coefficient position” are determined simultaneously at the output of the Syntax Element Generation module and converted to binary by two Truncated Rice modules.
In contrast, in the proposed “combined binarization” architecture, the two coordinates are binarized on the same datapath by the same Truncated Rice hardware. The proposed solution reduces the amount of logic needed to implement the binarization module.

In the BAE module, the thesis proposes the Multiple Bypass Bin Processing architecture, which processes multiple bypass bins simultaneously to improve throughput. In the proposed architecture, the “pre-multiplication” and “Unify datapath” techniques are applied to minimize the hardware resources used by the BAE module and, therefore, by the CABAC encoder.

[Figure 2.1 block labels: Coefficients, Residual SE Generation, SE FIFO, Buffer, Binarizer, Context Models, State_LUT, rLPS_LUT, BAE, Range Update, Low Update, Byte_out, Update_model, Encoded bits.]

2.2. Binarizer

2.2.1. Data statistics of Binarizer

In the CABAC architecture, the Binarizer converts SEs into bin strings that feed the BAE module. Figure 2.2 shows the block diagram of the HEVC encoder, which describes the data streams from the other functional components of the HEVC encoder to the input of the Binarizer.

Figure 2.2. Data inputs of CABAC encoder

The input data of the CABAC encoder includes residual data (Transform Coefficients), General Control Data, Intra Prediction data, Motion Data and Filter Control Data. Each of these is characterized by a set of SEs specified by the standard, allowing an abstract representation of the information to be transmitted [22]. Depending on the characteristics, occurrence frequency and percentage of each type of information, the corresponding SEs are encoded in different ways. As mentioned in Chapter 1 (Figure 1.14), the HEVC frame structure contains stable information whose occurrence frequency is low and which accounts for a small percentage of the total bitstream.
This information is usually located in the header to convey parameters such as general control, configuration, resolution and frame rate. Encoding this information has little effect on overall compression; therefore, the HEVC standard specifies basic encoding methods (FLC and VLC) to simplify implementation. In contrast, the residual data encapsulated by the CTUs accounts for the majority (75% on average) of the total input data [49]. Table 2.1 shows the percentage of the main syntax elements in the CABAC encoder input. It can be seen that the residual data (Transform Coefficients), including Transform Luma and Transform Chroma, occupies a significant portion of the encoder input data. In addition, the residual data fluctuates adaptively according to the characteristics of each video stream. Based on these features, in order to improve the overall compression efficiency, it is necessary to apply a highly efficient encoding method to the video data.

Table 2.1. Statistics of input data type of CABAC [49]

Frame type                    I      B0     B1     B2     B3
Transform Luma                66%    67%    69%    69%    75%
Transform Chroma              17%    15%    13%    13%    6%
Intra/Inter Prediction data   8%     7%     7%     8%     9%

Figure 2.3. Illustration of CTU structure

The method of dividing the frame into CTUs is shown in Figure 2.3. In the HEVC standard, the CTU is the largest coded block, from which the smaller CUs are generated. Each CU is simultaneously divided in two ways, into PUs and TUs, which are used for the prediction and transformation processes, respectively. Therefore, in order to fully represent the information needed for successful decoding, in addition to the TU and PU blocks, there must be syntax elements specifying these dividing levels (CTU, CU). These syntax elements are grouped into CTU/CU bins that accompany the TU and PU bins in the bitstream.
Table 2.2 shows the percentage of each type of bin under different coding configurations: AI (All Intra), LD-P (Low Delay P-frame), LD-B (Low Delay B-frame) and RA (Random Access) [49]. In all the test configurations, the set of SEs representing image data (TU data) occupies a large portion of the CABAC data, from 63.7% to 94.0%.

Table 2.2. Major bins contributors among HEVC data hierarchy [49]

Hierarchy level   AI      Low Delay P frame   Low Delay B frame   Random Access   Worst-case
CTU/CU bin        5.4%    15.8%               16.7%               11.7%           1.4%
PU bin            9.2%    20.6%               19.5%               18.8%           5.0%
TU bin            85.4%   63.7%               63.8%               69.4%           94.0%

(The columns are the common test conditions.)

2.2.2. The structure of residual syntax elements

a) Scanning method for TU data

By dividing frames into CTUs, the HEVC standard allows the residual data to be encoded as TUs (Transform Units) of size N×N (4×4, 8×8, 16×16 and 32×32). In the transformation step, TUs are transformed into matrices of Transform Coefficients, whose size corresponds to that of each input TU. While the H.264/AVC standard applies the Zigzag Scan Pattern method, HEVC applies the Diagonal Scan Pattern method to TBs (Transform Blocks) to convert 2-D matrices into 1-D arrays of Transform Coefficients. In diagonal scanning, the scan starts at the bottom-right corner and traverses diagonally up to the top-left corner of the TB. Figure 2.4 shows the differences in scanning methods between the two standards.

Figure 2.4. Comparison of Zigzag scan and diagonal scan in HEVC

The Diagonal Scan Pattern method is applied at different levels of a TB, with the aim of dividing large TBs into 4×4 TBs (sub-TBs) before generating a set of residual syntax elements for each sub-TB. First, the Diagonal Scan Pattern method is applied to divide large TBs into sub-TBs [32], [54]. Then, the method is applied again to scan each sub-TB, converting the 4×4 Transform Coefficient matrix into a 1-D array of 16 consecutive transform coefficients.
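The diagonal traversal of a 4×4 sub-TB can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the thesis's hardware: the function names are hypothetical, the position labels follow the diagonal-scan numbering drawn in Figure 2.4 (0 at the bottom-right corner, 15 at the top-left), and the sample block is the 4×4 example the thesis uses in Figure 2.7. The flattened coefficient group is listed with the DC coefficient first, which is how the group appears in Figure 2.6.

```python
# Example 4x4 block as printed in Figure 2.7 (rows from top to bottom).
EXAMPLE_BLOCK = [
    [9, 3, 0, -1],
    [-6, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]


def diagonal_scan_positions(n=4):
    """Return (row, col) positions in visiting order: bottom-right first, top-left last."""
    order = []
    # diag = row + col identifies one anti-diagonal; start at the far corner.
    for diag in range(2 * (n - 1), -1, -1):
        # Within a diagonal, visit the upper-right end first.
        for row in range(n):
            col = diag - row
            if 0 <= col < n:
                order.append((row, col))
    return order


def scan_index_matrix(n=4):
    """Label every position of an n x n block with its visiting index."""
    labels = [[0] * n for _ in range(n)]
    for index, (row, col) in enumerate(diagonal_scan_positions(n)):
        labels[row][col] = index
    return labels


def to_coefficient_group(block):
    """Flatten a square block into a 1-D list with the DC coefficient first."""
    positions = diagonal_scan_positions(len(block))
    return [block[row][col] for row, col in reversed(positions)]


if __name__ == "__main__":
    for line in scan_index_matrix():
        print(line)
    print(to_coefficient_group(EXAMPLE_BLOCK))
```

For the 4×4 case the label matrix reproduces the diagonal-scan numbering of Figure 2.4, and the example block flattens to a group whose last non-zero entry sits at index 9, that is, at x = 3, y = 0.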
The resulting array of 16 transform coefficients is called the CG (Coefficient Group), from which the residual SEs are generated. The above procedure of the Diagonal Scan Pattern method is illustrated in Figure 2.5 [34], where a 16×16 TB is scanned to form 16 sub-TBs and every sub-TB is scanned to generate residual syntax elements.

[Figure 2.4 position labels, row by row. Zigzag scan: 0 1 5 6 / 2 4 7 12 / 3 8 11 13 / 9 10 14 15. Diagonal scan: 15 13 10 6 / 14 11 7 3 / 12 8 4 1 / 9 5 2 0.]

Figure 2.5. Application of diagonal scan [29]. Panels: (a) TB 16×16; (b) sub-TB 4×4.

b) Forming the group of residual syntax elements at the CABAC input

As discussed, CABAC is mainly used to encode the residual video data before it is merged into the output bitstream for transmission. Once CGs are formed, the HEVC standard specifies a set of residual syntax elements for each CG to represent the image data abstractly before encoding. For each CG, this group of residual syntax elements is determined through different scan passes by different algorithms. Table 2.3 describes the set of syntax elements for the data in each CG.

Table 2.3. Set of Syntax Element for 4×4 TU

Syntax Element                   Description
last_sig_coeff_x                 X coordinate of the first non-zero coefficient in scanning order within CG
last_sig_coeff_y                 Y coordinate of the first non-zero coefficient in scanning order within CG
sig_coeff_flag                   Flags indicating the significance of a coefficient (zero/non-zero)
coeff_abs_level_greater1_flag    Flags indicating whether the absolute value of a coefficient level is greater than 1
coeff_abs_level_greater2_flag    Flags indicating whether the absolute value of a coefficient level is greater than 2
coeff_sign_flag                  Flags indicating the sign of a significant coefficient (0: positive; 1: negative)
coeff_abs_level_remaining        Remaining value for the absolute value of a coefficient level

Figure 2.6. Diagonal scanning of transform coefficients

[Figure 2.6 content: diagonal scanning of the example sub-TB gives the coefficient group 9 -6 3 0 0 0 0 1 0 -1 0 0 0 0 0 0, on which the scan passes produce the residual SEs.]

Figure 2.6 describes the diagonal scanning of each sub-TB to form a CG, after which six scan passes are performed on the CG to determine the set of residual syntax elements. Figure 2.7 illustrates the six scan passes as described in the HEVC standard [25]. The first scan pass determines the position of the last significant coefficient in the CG, which is called last_sig_coeff_post. This position is specified by the x and y coordinates within the 4×4 matrix and is represented by two syntax elements named last_sig_coeff_x and last_sig_coeff_y. The last_sig_coeff_post is also the entry point for the next five scan passes, which define the remaining five syntax elements in Table 2.3. Thus, to define the set of residual syntax elements, six consecutive scan passes must be performed on each CG.

Figure 2.7. Illustration of Syntax Element generation for 4×4 TB

[Figure 2.7 content: example 4×4 TB, row by row: 9 3 0 -1 / -6 0 0 0 / 0 1 0 0 / 0 0 0 0, with last_sig_coeff_x (3) and last_sig_coeff_y (0), followed by one grid per pass for sig_coeff_flag, coeff_abs_level_greater1_flag, coeff_abs_level_greater2_flag, coeff_sign_flag and coeff_abs_level_remaining.]

Figure 2.8 shows the results of the scans on the CG, that is, the extracted values of the syntax elements and their order at the input of the CABAC encoder's Binarizer. It can be seen that, in the HEVC standard, syntax elements of the same type are organized in separate groups. In addition, regular-mode bins are separated from bypass bins. This algorithmic improvement of HEVC over H.264/AVC allows parallel and pipelined solutions to be applied in the hardware architecture of the CABAC encoder. This is presented in detail in the following sections of the thesis.

Figure 2.8. Generated Syntax Elements and order of output sequence

Syntax element                   Values
last_sig_coeff_x                 3
last_sig_coeff_y                 0
sig_coeff_flag                   1 0 1 0 0 0 0 1 1 1
coeff_abs_level_greater1_flag    0 0 1 1 1
coeff_abs_level_greater2_flag    1
coeff_sign_flag                  1 0 0 1 0
coeff_abs_level_remaining        0 4 7

Output order: 3 0 1 0 1 0 0 0 0 1 1 1 0 0 1 1 1 1 1 0 0 1 0 0 4 7
(The figure labels the sequence as regular mode followed by bypass mode.)

2.2.3. The drawbacks of multi-core syntax element generation architecture

As stated in Chapter 1, CABAC is the main throughput bottleneck in the HEVC architecture because of the high correlation of the input data sequences and the bin-to-bin sequential encoding principle. Since the standard was published, research has focused on improving the performance of the CABAC encoder through the most effective architectural solutions. Among them, various highly efficient design solutions have been adopted for the Binarization and BAE modules. As a result, the proposed CABAC encoders can process 4K/8K real-time video streams. In recent years, with the intrinsic problems of CABAC solved, attention has turned to the preprocessing of the CABAC input data, i.e. the syntax element sequences. Residual data is of particular concern because of its importance: it accounts for a large percentage of CABAC encoded data, as discussed in the previous section. Once a high-throughput CABAC encoder has been proposed, its data provider, i.e. the residual syntax element generation, has to be a high-throughput design as well.

The Sergio Bampi research group is prominent in this trend. They intensively analyze the statistical characteristics of the residual data stream to design high-speed hardware architectures for the residual syntax element generation module [49], [51]. In the work [49], the authors proposed the Multiple Residual Syntax Element Treatment (MRSET) solution in designing the Four-Core Multiple Residual Syntax Element Generation architecture.
Figure 2.9 shows the proposed MRSET architecture [51]. The 4-core MRSET solution is applied to each 4×4 TB to generate the residual syntax elements. In each clock cycle, the 4-core MRSET architecture can process 4 transform coefficients simultaneously, which speeds up the syntax element generation module.

Figure 2.9. Architecture of the four-core MRSET

[Figure 2.9 block labels: Transform coefficients, Core 0, Core 1, Core 2, Core 3, 1st cycle, 2nd cycle, 3rd cycle, Output syntax elements.]

The four-core MRSET architecture can provide enough input data throughput for CABAC to encode 4K/8K video streams. However, applying the parallel four-core architecture to every 4×4 TB is inefficient in hardware resource usage. The TB data fluctuates over time and adapts to the visual characteristics of each video stream. At the 4×4 TB division level, the number of samples that need to be scanned may vary from 1 to 16 (depending on the position of the last significant coefficient). Therefore, the architecture is efficient in terms of hardware usage only when the number of samples is large enough to keep 4 cores working in parallel for at least three cycles. In contrast, when the number of samples is 4 or fewer, the processing speed exceeds what the 8K video format requires, yet all 4 cores still run in parallel. Some TB data samples of the 4K/8K video stream are shown in Figure 2.10.

Figure 2.10. Typical pattern of transform coefficients in HEVC standard

Figure 2.10a and Figure 2.10b show the statistics of several samples of transform coefficients. The statistics show that in each 4×4 TB the number of significant (non-zero) samples is modest and mainly concentrated around DC (0, 0). Therefore, the number of samples that need to be scanned for syntax element generation is much less than 16.
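The utilization argument can be made concrete with a rough calculation. The sketch below is not taken from the thesis; it assumes, as the text states, that the four cores together consume 4 coefficients per clock cycle, and it additionally assumes that a partly filled cycle still occupies all four cores. The function name `four_core_utilization` is hypothetical.

```python
import math


def four_core_utilization(num_samples, cores=4):
    """Return (cycles, utilization) for scanning num_samples coefficients.

    Assumption: each core handles one coefficient per cycle, and a partly
    filled cycle still keeps every core powered and running.
    """
    if not 1 <= num_samples <= 16:
        raise ValueError("a 4x4 TB has between 1 and 16 samples to scan")
    cycles = math.ceil(num_samples / cores)
    utilization = num_samples / (cycles * cores)
    return cycles, utilization


if __name__ == "__main__":
    for n in (1, 4, 11, 12, 16):
        cycles, util = four_core_utilization(n)
        print(f"{n:2d} samples: {cycles} cycle(s), {util:.0%} of core slots used")
```

Under these assumptions a sub-TB with a single DC coefficient finishes in one cycle while using only a quarter of the available core slots, whereas a sub-TB with 12 samples keeps all four cores busy for three cycles.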
For example, in Figure 2.10a, an 8×8 TB is divided into 4 sub-TBs, of which only 3 significant (non-zero) sub-TBs (a-2, a-3 and a-4) need to be scanned. Moreover, only the a-4 sub-TB contains a relatively large number of significant coefficients (11), while the other two have only 1 DC element. Similarly, in Figure 2.10b, only the b-3 sub-TB contains 12 coefficients that need to be scanned, while the remaining blocks contain fewer than 10. Therefore, it is less efficient to apply the four-core parallel MRSET architecture to these TBs, and there is an imbalance between the throughput requirement and the hardware complexity.

[Figure 2.10 panels: (a) sub-TBs a-1, a-2, a-3, a-4; (b) sub-TBs b-1, b-2, b-3, b-4.]

2.2.4. The “one scan for multiple syntax element generation” technique

Based on the analysis of image data statistics, the methods of generating the residual syntax elements and the related state-of-the-art results, the thesis proposes a combined scanning solution for the generation of several syntax elements. By evaluating the characteristics of the syntax element types (Table 2.3), the scanning algorithms [34] and the accompanying binarization methods, the proposed scanning technique performs one memory access to determine several syntax elements of every coefficient simultaneously. The proposed solution improves the dynamic power efficiency of the residual syntax element generation module, thanks to the reduced number of memory accesses.

Figure 2.11. Functional block diagram of residual syntax element generation and binarization modules

Figure 2.11 shows the functional block diagram of residual syntax element generation and binarization. The set of residual syntax elements representing a 4×4 TB is determined and then sent to the binarization module. Each residual syntax element type (Table 2.3) is converted to a bin string by the corresponding method. The bin string is then appended to the output bin sequence in the order shown in Figure 2.8.
[Figure 2.11 block labels: Residual Coefficients, Residual syntax element generation module, Last_sig_coeff_x, Last_sig_coeff_y, Sig_coeff_flag, Coeff_abs_level_greater1_flag, Coeff_abs_level_greater2_flag, Coeff_sign_flag, Coeff_abs_level_remaining, Binarizer module, Bin string.]

Observing the generated residual syntax elements in Figure 2.7, several points can be noted:

- The syntax elements last_sig_coeff_x, last_sig_coeff_y and coeff_abs_level_remaining are decimal values.
- The remaining syntax elements, sig_coeff_flag, coeff_abs_level_greater1_flag, coeff_abs_level_greater2_flag and coeff_sign_flag, are binary flags and are referred to here as flagged_SE. Each of these flagged_SE types forms a vector: sig_coeff_flag_vector, coeff_abs_level_greater1_flag_vector and coeff_sign_flag_vector. The exception is coeff_abs_level_greater2_flag, which is only one bit per CG: it is attached to the first coefficient encountered during the scan whose absolute level is greater than 1, and indicates whether that level is also greater than 2 [25].

Furthermore, in the HEVC standard, these flagged_SEs use the same binarization method, Fixed Length Binarization. Based on the above observations, the thesis proposes the "one scan for multiple syntax element generation" technique (one-time memory access) to process these flagged_SEs. In the hardware implementation of the syntax element generation module, when this technique is applied, the number of memory accesses is reduced by 3 in comparison with the traditional method (one scan instead of four). Reducing the number of memory accesses effectively reduces dynamic power consumption and the processing latency caused by memory access. The hardware architecture of the proposed solution is depicted in Figure 2.12. As depicted in Figure 2.12, instead of applying four scans to determine the flagged_SEs, the proposed architecture performs a single scan to generate the four types of syntax elements.
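As a software analogy of the single-scan idea, the following Python sketch reads each coefficient of a CG once and derives all four flagged_SE types from that single read. It is a simplified illustration, not the thesis's hardware design of Figure 2.12: the function name `one_scan_flagged_se` and the returned dictionary are hypothetical, the last significant position is assumed to be known from the first scan pass, and the additional rules of the HEVC standard (for example flags that are inferred rather than coded) are left out. The input is the example coefficient group of Figure 2.6, listed with the DC coefficient first.

```python
# Example coefficient group from Figure 2.6 (DC coefficient first).
EXAMPLE_CG = [9, -6, 3, 0, 0, 0, 0, 1, 0, -1, 0, 0, 0, 0, 0, 0]


def one_scan_flagged_se(cg, last_pos):
    """Derive the four flagged_SE types in a single pass over the CG.

    cg       : coefficients of one CG, DC coefficient first
    last_pos : index of the last significant coefficient (assumed to be
               known from the first scan pass)
    """
    sig_vector, greater1_vector, sign_vector = [], [], []
    greater2_flag = None  # one bit per CG, set at the first level above 1

    # Walk from the last significant coefficient back to DC, reading each
    # coefficient exactly once.
    for index in range(last_pos, -1, -1):
        coeff = cg[index]
        sig_vector.append(1 if coeff != 0 else 0)
        if coeff == 0:
            continue
        level = abs(coeff)
        greater1_vector.append(1 if level > 1 else 0)
        sign_vector.append(1 if coeff < 0 else 0)
        if level > 1 and greater2_flag is None:
            greater2_flag = 1 if level > 2 else 0

    return {
        "sig_coeff_flag": sig_vector,
        "coeff_abs_level_greater1_flag": greater1_vector,
        "coeff_abs_level_greater2_flag": greater2_flag,
        "coeff_sign_flag": sign_vector,
    }


if __name__ == "__main__":
    for name, value in one_scan_flagged_se(EXAMPLE_CG, last_pos=9).items():
        print(name, value)
```

For the example group the four results agree with the flag values listed in Figure 2.8. A four-pass version would read the same coefficients four times, which is the memory traffic the proposed technique avoids.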
The group of flagged_SEs produced by the single scan is then binarized on the same Fixed Length binarization datapath in the binarization architecture. Figure 2