
INT8 Inference of Quantization-Aware trained models using ONNX-TensorRT
Dheeraj Peri, a deep learning software engineer at NVIDIA, explains the basics of quantization and how TensorRT supports quantized networks through various layer fusions. The talk focuses on models trained with the TensorFlow 2 framework and on how to perform both post-training quantization (PTQ) and quantization-aware training (QAT). It then walks through deploying a model trained with the NVIDIA TF2 quantization toolkit via ONNX-TensorRT, and presents accuracy and latency results for several ResNet models. Overall, the end-to-end QAT workflow from TensorFlow training to TensorRT deployment via ONNX is demonstrated.
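A minimal sketch of that QAT-to-TensorRT path, assuming the NVIDIA TF2 quantization toolkit's quantize_model entry point (treat the exact names and the brief fine-tuning step as assumptions); the tf2onnx export and trtexec invocation are standard.

```python
import tensorflow as tf
import tf2onnx
from tensorflow_quantization import quantize_model  # NVIDIA TF2 quantization toolkit (assumed API)

# Start from a pretrained float model, e.g. ResNet-50.
model = tf.keras.applications.ResNet50(weights="imagenet")

# Insert fake-quantization (Q/DQ) nodes so the model can be fine-tuned
# quantization-aware before export.
q_model = quantize_model(model)
q_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# q_model.fit(train_ds, epochs=2)  # short fine-tuning typically recovers accuracy

# Export the QAT model to ONNX; the Q/DQ nodes are preserved in the graph.
spec = (tf.TensorSpec((1, 224, 224, 3), tf.float32, name="input"),)
tf2onnx.convert.from_keras(q_model, input_signature=spec, opset=13,
                           output_path="resnet50_qat.onnx")

# Build an INT8 TensorRT engine from the Q/DQ ONNX graph (shell command):
#   trtexec --onnx=resnet50_qat.onnx --int8 --saveEngine=resnet50_qat.engine
```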
Practical Post Training Quantization of an ONNX Model
The video shows how to convert a TensorFlow model to ONNX and apply quantization to shrink it. The quantized ONNX model is significantly smaller and runs faster on a CPU. The author provides code snippets and instructions for applying dynamic quantization and for measuring CPU inference speed.
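A short sketch of the dynamic-quantization step, using onnxruntime's quantization utilities; the file names are placeholders.

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Weights are converted to INT8 offline; activations are quantized
# dynamically at runtime, so no calibration dataset is needed.
quantize_dynamic(
    model_input="model_fp32.onnx",
    model_output="model_int8.onnx",
    weight_type=QuantType.QInt8,
)
```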
The video walks through quantizing a machine learning model to make it faster and lighter, while acknowledging that quantization can cause a drop in accuracy. The original ONNX and TensorFlow models are compared with the quantized model, which is found to be both faster and smaller; however, the quantized model does not benefit from GPU execution as much as the other models do. The accuracy of the quantized model is then evaluated and shows only a slight drop. The video also covers visualizing ONNX models with Lutz Roeder's Netron app. Overall, the process reduces the model size from one gigabyte to 83 megabytes with minimal loss in accuracy.
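A hedged sketch of the kind of CPU latency comparison shown in the video, run on the float and quantized ONNX files; the input name and shape are assumptions that depend on the exported model.

```python
import time
import numpy as np
import onnxruntime as ort

def mean_latency(path, feed, runs=50):
    """Average per-inference wall-clock time on the CPU execution provider."""
    sess = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    sess.run(None, feed)                      # warm-up run
    start = time.perf_counter()
    for _ in range(runs):
        sess.run(None, feed)
    return (time.perf_counter() - start) / runs

x = np.random.rand(1, 3, 224, 224).astype(np.float32)   # assumed input shape
feed = {"input": x}                                      # assumed input name
print("fp32:", mean_latency("model_fp32.onnx", feed))
print("int8:", mean_latency("model_int8.onnx", feed))

# To inspect either graph visually, Netron can be launched from the shell:
#   netron model_int8.onnx
```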
QONNX: A proposal for representing arbitrary-precision quantized NNs in ONNX
The speaker discusses low-precision quantization, with an example of its application in wireless communication. They propose QONNX, a dialect for representing arbitrary-precision quantized neural networks in ONNX. QONNX simplifies the quantization representation, extends it to a wider set of scenarios, and offers options for different rounding modes and binary quantization. It is already used for deployment on FPGAs and is integrated into the Brevitas Python quantization library, with NQCDQ support planned for the next release.
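A minimal sketch of defining a low-precision network with Brevitas and exporting it in the QONNX style; the export_qonnx helper and its argument names are assumptions that may differ across Brevitas versions.

```python
import torch
import torch.nn as nn
import brevitas.nn as qnn

# 4-bit weights and activations: precisions that ONNX's standard 8-bit
# QuantizeLinear/DequantizeLinear operators cannot express directly,
# which is the gap QONNX's quantization representation is meant to fill.
model = nn.Sequential(
    qnn.QuantConv2d(3, 16, kernel_size=3, weight_bit_width=4),
    qnn.QuantReLU(bit_width=4),
    qnn.QuantConv2d(16, 32, kernel_size=3, weight_bit_width=4),
)

dummy = torch.randn(1, 3, 32, 32)

# Assumed QONNX export entry point; older Brevitas releases expose the
# equivalent functionality under a different export manager.
from brevitas.export import export_qonnx
export_qonnx(model, args=dummy, export_path="lowbit_model.onnx")
```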
GRCon20 - Deep learning inference in GNU Radio with ONNX
The video discusses using ONNX as an open format for integrating deep learning as a flexible, open-source solution in the radio frequency domain. The speaker presents their new GNU Radio module, GR DNN, which uses the Python interfaces of both GNU Radio and ONNX, and demonstrates it with an example of automatic modulation classification using a deep convolutional neural network trained on simulated data generated by GNU Radio. They also discuss the requirements and challenges of running deep learning classification on SDR data with the VGG16 model and suggest using hardware acceleration, such as a GPU, to speed up inference and achieve real-time results. The project is open source and collaboration is encouraged.
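A hedged sketch of how ONNX inference can be embedded in a GNU Radio flowgraph via a Python sink block, in the spirit of the module described in the talk; the model path, input layout, and frame length are illustrative assumptions, not the talk's actual implementation.

```python
import numpy as np
import onnxruntime as ort
from gnuradio import gr

class onnx_classifier(gr.sync_block):
    """Collect complex IQ samples into fixed-size frames and classify them."""

    def __init__(self, model_path="amc_model.onnx", frame_len=1024):
        gr.sync_block.__init__(self,
                               name="onnx_classifier",
                               in_sig=[np.complex64],
                               out_sig=None)
        self.frame_len = frame_len
        self.buf = np.empty(0, dtype=np.complex64)
        self.sess = ort.InferenceSession(model_path,
                                         providers=["CPUExecutionProvider"])
        self.input_name = self.sess.get_inputs()[0].name

    def work(self, input_items, output_items):
        # Buffer incoming IQ samples until a full frame is available.
        self.buf = np.concatenate([self.buf, input_items[0]])
        while len(self.buf) >= self.frame_len:
            frame, self.buf = self.buf[:self.frame_len], self.buf[self.frame_len:]
            # Split I/Q into two real channels, as many AMC models expect.
            x = np.stack([frame.real, frame.imag]).astype(np.float32)
            x = x.reshape(1, 2, self.frame_len)          # assumed input layout
            logits = self.sess.run(None, {self.input_name: x})[0]
            print("predicted class:", int(np.argmax(logits)))
        return len(input_items[0])
```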