Learning ONNX for trading - page 14

 

Import, Train, and Optimize ONNX Models with NVIDIA TAO Toolkit




The video showcases how to use the NVIDIA TAO Toolkit to import, train, and optimize ONNX models. It starts by downloading a pre-trained ResNet18 model and fine-tuning it with TAO on the Pascal VOC dataset, then walks through importing the model and visualizing the ONNX graph. Training progress can be monitored with TensorBoard, and custom layers can be used in case of ONNX conversion errors. The video also explains how to evaluate the model's performance by observing the decreasing training and validation loss and by analyzing weights and biases. Users can assess the model's accuracy on the test dataset and on sample images, and continue with pruning and optimization to improve it further.

  • 00:00:00 In this section, the video explains how to import, train, and optimize ONNX models using the NVIDIA TAO Toolkit. The video starts by downloading a pre-trained ResNet18 model, which is then fine-tuned on the Pascal VOC dataset using TAO. The steps for importing the ONNX model and visualizing the ONNX graph are also covered. Additionally, the video discusses how to monitor the progress of the training job using TensorBoard visualization. Lastly, the video mentions that TAO can handle custom layers and provides guidance on how to use them to import models that fail to convert.

  • 00:05:00 In this section, the speaker discusses how to evaluate the performance of the trained model. Users can look at the decreasing loss to ensure that the model is improving. Additionally, validation loss can help identify overfitting. More advanced users can view graphs and histograms to understand the model's weights and biases. The speaker demonstrates how to check the overall accuracy of the model on the test dataset and how to assess the model's performance on sample images. The model has room for improvement, and users can continue with model pruning and optimization to enhance the accuracy further.
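
To make the import and graph-visualization step above concrete, here is a minimal sketch using the onnx Python package; the file name is a placeholder for the ResNet18 model from the video, and a tool like Netron gives a richer graphical view.

```python
import onnx

# Placeholder path for the pre-trained / exported ResNet18 ONNX file from the video.
model = onnx.load("resnet18.onnx")

# Validate the model against the ONNX spec before handing it to a training tool.
onnx.checker.check_model(model)

# Print a readable text dump of the graph (Netron offers a full visualization).
print(onnx.helper.printable_graph(model.graph))
```
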
  • 2022.05.19
  • www.youtube.com
The #NVIDIATAO Toolkit, built on TensorFlow and PyTorch, is a low-code AI solution that lets developers create custom AI models using the power of transfer l...
 

NVAITC Webinar: Deploying Models with TensorRT




In this section of the NVAITC webinar, solutions architect Nikki Loppie introduces TensorRT, NVIDIA's software development kit for high-performance deep learning inference. TensorRT provides an inference optimizer and runtime for low latency and high throughput inference across a range of platforms, from embedded devices to data centers. Loppie explains the five technologies that TensorRT uses to optimize inference performance, including kernel fusion and precision calibration. Developers can use TensorRT's Python and C++ APIs to incorporate these optimizations into their own applications, and converter libraries like trtorch can be used to optimize PyTorch models for inference. Loppie demonstrates how to save TensorRT optimized models using the trtorch library and benchmarks the optimized models against unoptimized models for image classification, showing significant speed-ups with half precision.

  • 00:00:00 In this section of the webinar, solutions architect Nikki Loppie discusses the importance of efficiency in inference and the need for platform portability. She introduces TensorRT, a software development kit by NVIDIA for high-performance deep learning inference that addresses these two challenges. TensorRT includes an inference optimizer and runtime for low latency and high throughput inference across a wide range of platforms, from embedded devices to data centers. It is also compatible with all major deep learning frameworks. Loppie then explains the five technologies that TensorRT implements to optimize inference performance, including kernel fusion, precision calibration, and kernel auto-tuning.

  • 00:05:00 In this section, the webinar introduces the abilities of TensorRT to optimize kernel execution time, reduce memory footprint, and support parallel inference using multi-stream execution. Developers can use TensorRT's Python and C++ APIs to incorporate these optimizations into their own applications. The webinar also explains how to use converter libraries like trtorch to optimize a PyTorch model for inference. The steps involve saving the model, loading it, initializing a ResNet model, compiling it using TorchScript, and finally converting it to the TensorRT format. The optimized model can then be deployed on the target platform.

  • 00:10:00 In this section of the webinar, the speaker demonstrates how to save TensorRT-optimized models for later use or for deployment on other platforms using the trtorch library. The speaker uses an image classification example with ResNet-18 and ResNet-50 models running on the ImageNet dataset. The trtorch-optimized models show significant speed-ups with half precision compared to the unoptimized models, with a speed-up factor of 5.5x for ResNet-18 and 6.4x for ResNet-50. The speaker also highlights the importance of unbiased benchmarking and provides instructions on how to get started with trtorch.
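
As a rough, hedged sketch of the trtorch flow summarized above: the library (later renamed Torch-TensorRT) compiles a TorchScript module down to a TensorRT engine. The compile-spec keys below follow its early API and may differ between releases, and the model and file names are placeholders rather than the webinar's exact code.

```python
import torch
import torchvision
import trtorch  # converter library mentioned in the webinar (later renamed Torch-TensorRT)

# Load a pre-trained ResNet-18 and script it, as in the webinar's workflow.
model = torchvision.models.resnet18(pretrained=True).eval().cuda()
scripted = torch.jit.script(model)

# Hypothetical compile spec: one 1x3x224x224 input, FP16 precision
# (key names follow trtorch's early releases and may vary by version).
spec = {
    "input_shapes": [(1, 3, 224, 224)],
    "op_precision": torch.half,
}
trt_model = trtorch.compile(scripted, spec)

# Save the optimized TorchScript module for later deployment.
torch.jit.save(trt_model, "resnet18_trt_fp16.ts")
```
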
 

ESP tutorial - How to: design an accelerator in Keras/Pytorch/ONNX




The tutorial introduces hls4ml, a tool that can automatically generate an accelerator from a Keras/Pytorch/ONNX model, and then demonstrates how to integrate the generated accelerator into ESP (Embedded Scalable Platforms). The speaker shows how to design an accelerator in Keras/Pytorch/ONNX and goes through the steps of importing the accelerator, adding a test bench, generating RTL, and creating two versions of the accelerator. The video also covers compiling Linux and creating a Linux user-space application for the accelerator. Finally, the tutorial ends with resources for further learning.

  • 00:00:00 In this section of the tutorial, the presenter introduces hls4ml, a tool that can automatically generate an accelerator from a Keras/Pytorch/ONNX model. The flow is demonstrated by using hls4ml to generate an accelerator from a pre-built Keras model provided within the ESP GitHub repository. The generated accelerator is then integrated into ESP and tested using an interactive script. The presenter emphasizes that users should go through the prerequisite guides and set up their environment before attempting to follow the tutorial. The tutorial also offers pre-built materials that users can use to experiment without having to go through all the steps.

  • 00:05:00 In this section of the tutorial, the instructor explains how to integrate the accelerator designed in the previous steps into ESP. A three-digit hexadecimal ID is assigned to the accelerator, keeping in mind that the number should not be greater than 1024 in decimal. The data bit width of the accelerator is then defined, which is 32 bits in the current use case, and the input and output sizes are determined. Finally, the instructor demonstrates running high-level synthesis for the MLP three-layer accelerator and shows how to run HLS with ESP. All the steps are the same as in the other guides for SystemC or C++ accelerators, and the hls4ml MLP project folder is added to ESP with all the necessary files for wrapping and interfacing the accelerator with the rest of the ESP system.

  • 00:10:00 In this section of the video, the speaker demonstrates the steps for designing an accelerator in Keras/Pytorch/ONNX. They first show how to import the accelerator and add a test bench that automatically tests the simulation. They then go through the HLS step, which generates a project targeting the selected FPGA technology. The generated RTL is mapped to that FPGA technology, and two versions of the accelerator are created, one with a 32-bit and the other with a 64-bit data width. The speaker configures an SoC with the esp-xconfig command and shows how to compile a bare-metal application that has been generated automatically. To simulate the bare-metal test of an accelerator, the test program needs to be specified. Once the validation has passed, an FPGA bitstream can be generated.

  • 00:15:00 In this section, the video tutorial walks through compiling Linux, which builds not only the Linux image itself but also the user-space test applications for the accelerators. Once Linux is done, an executable for the accelerator is created, which is the Linux user-space application that will run on the FPGA. The tutorial then proceeds to program the FPGA and run the bare-metal test using the make fpga-run command. To run the accelerator's bare-metal unit test, the test program generated earlier is specified. Afterwards, Linux is booted and the unit-test application is executed, which successfully finds the accelerator, and the test passes validation. The tutorial ends with some resources for further learning.
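
The accelerator-generation step in this tutorial is driven by hls4ml; a minimal, hedged sketch of that conversion is shown below. The Keras architecture and project folder name are made up for illustration, and the ESP-specific steps (ID assignment, data width, SoC configuration, Linux build, and FPGA run) are handled afterwards by ESP's own interactive scripts and make targets rather than by this code.

```python
import hls4ml
from tensorflow import keras

# A small Keras MLP standing in for the pre-built model used in the tutorial (hypothetical).
model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(16,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(5, activation="softmax"),
])

# Derive a default hls4ml configuration and convert the model to an HLS project.
config = hls4ml.utils.config_from_keras_model(model, granularity="model")
hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    output_dir="hls4ml_mlp_prj",   # project folder later imported into ESP
)

hls_model.compile()  # C simulation / emulation of the generated accelerator
# hls_model.build()  # would launch the actual high-level synthesis run
```
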
 

Optimal Inferencing on Flexible Hardware with ONNX Runtime




This tutorial covers the deployment of models on CPU, GPU, and OpenVINO using ONNX Runtime. The speaker demonstrates the use of different execution providers, including OpenVINO, for inferencing on flexible hardware. The code for inferencing is primarily the same across all environments, with the main difference being the execution provider. ONNX Runtime performs inferencing faster than PyTorch on CPU and GPU, and a separate ONNX Runtime library exists for OpenVINO. Overall, the tutorial provides an overview of how to deploy models on various hardware options using ONNX Runtime.

  • 00:00:00 In this section, the speaker goes through the process of setting up virtual environments and using ONNX Runtime to perform inferencing on a ResNet50 model on CPU, GPU, and OpenVINO. The speaker notes that ONNX Runtime will use GPU if it detects a compatible accelerator, otherwise it will default to CPU. The code for inferencing is primarily the same across all three environments, with the main difference being changing the execution provider. The speaker demonstrates that ONNX Runtime can perform inferencing faster than PyTorch on CPU and GPU, and notes that there is a separate ONNX Runtime library for OpenVINO.

  • 00:05:00 In this section, the speaker demonstrates the use of another execution provider, OpenVINO on CPU, for inferencing on flexible hardware with ONNX Runtime. By setting the execution provider to OpenVINO, the same code delivers an inference time of about 30 ms with CPU utilization of around 0.78. The tutorial gives an overview of how to deploy models on CPU, GPU, and OpenVINO using ONNX Runtime.
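
A small sketch of the pattern described above, where the inference code stays the same and only the execution provider changes; "resnet50.onnx" is a placeholder path, and the CUDA and OpenVINO providers require the corresponding onnxruntime builds.

```python
import numpy as np
import onnxruntime as ort

# Pick the execution provider; the rest of the inference code is unchanged.
providers = ["CPUExecutionProvider"]          # default CPU
# providers = ["CUDAExecutionProvider"]       # requires the onnxruntime-gpu build
# providers = ["OpenVINOExecutionProvider"]   # requires the onnxruntime-openvino build

session = ort.InferenceSession("resnet50.onnx", providers=providers)

# Dummy ResNet-50 input: batch of one 3x224x224 image.
x = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {session.get_inputs()[0].name: x})
print(outputs[0].shape)
```
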
  • 2022.08.22
  • www.youtube.com
Check out this video and blog on how to inference ResNet with CPU, GPU or OpenVINO by our intern Kevin Huang! Blog: https://onnxruntime.ai/docs/tutorials/acce...
 

Machine Learning Inference in Flink with ONNX




The video discusses the benefits and implementation of using ONNX in machine learning inference and deploying it in the distributed computing framework, Flink. The separation of concerns between model training and production inference, the ability to define specifications for inputs and outputs, and language independence make ONNX a valuable tool for data scientists. The video demonstrates how to load an ONNX model into Flink, providing key components of the rich map function and explaining how to bundle the models together with the code using a jar file. The speaker also addresses considerations such as memory management, batch optimization, and hardware acceleration with ONNX, and emphasizes its benefits for real-time machine learning inference in Flink.

  • 00:00:00 In this section of the video, Colin Germain explains the importance of real-time machine learning inference by using a toy example from the cyber attack detection domain. He illustrates how waiting to detect an incident can lead to missing important data exfiltration. Additionally, he explains why machine learning is vital for capturing the variation of different types of techniques and exploits used by external adversaries. Finally, he emphasizes the importance of scaling the solution by using distributed computing in Flink. Colin introduces the ONNX model, which plays a critical role in the solution, and explains how they can use a machine learning model in PyTorch and serialize it with ONNX to deploy it in Flink.

  • 00:05:00 In this section, the speaker explains the benefits of using ONNX, which stands for Open Neural Network Exchange, in a machine learning pipeline. ONNX allows for the separation of concerns between the model training phase and the production inference phase, making it easy for data scientists to develop models in Python and then use the ONNX model with different tools for inference. ONNX provides a contract that defines the computation of a directed acyclic graph to be used for machine learning. Each operator in the graph has a version, allowing for backwards compatibility and continuous serving in the future. The speaker also notes the benefits of packaging ONNX models with the streaming framework Apache Flink for easier deployment.

  • 00:10:00 In this section of the video, the speaker discusses the benefits of using ONNX for machine learning inference, including the ability to define specifications for inputs and outputs, and the support for all versions of models in the ONNX Runtime library. The language independence of ONNX and the availability of converters for most ML frameworks make it easy to get models into ONNX, and the speaker suggests using Netron for diagnostic purposes. Finally, the speaker presents a simple example of using a PyTorch model with ONNX for end-to-end processing without training.

  • 00:15:00 In this section, the speaker discusses the forward method used to define the computation and how it is used by PyTorch to manage backpropagation of gradients during training. A basic example of using the add-offset class is demonstrated, which offsets tensors by a defined value. The speaker then moves on to discuss exporting to ONNX and the importance of providing the correct inputs for dynamic models. The ONNX model is loaded into memory using the ONNX Runtime environment and session from Scala, which allows inference to be performed on the model. A class is created to hold the loaded model for use in inference.

  • 00:20:00 In this section, the speaker explains how the ONNX model can be loaded into Flink using a byte array and a load method. They also demonstrate how the rich map function can be used to hold onto the loaded model and perform inference in a clean and organized way. The speaker goes through the key components of the rich map function, including setting up the ONNX tensor, defining the inputs and outputs, and running the model session to get the results. They note that the code can be modified to support multiple outputs, making it a flexible solution for different types of models. Finally, the speaker touches on the packaging aspect of ONNX models, explaining how they can be bundled together with the jar file containing the code, eliminating the need to connect to external endpoints or download files from different sources.

  • 00:25:00 In this section, the speaker discusses an example of machine learning inference in Flink using the classic problem of classifying handwritten digits, known as MNIST. They show the PyTorch model used to classify the 28x28 pixel arrays and how it can be converted to an ONNX graph for use in Flink, using the same batch sizing approach as before. The speaker then discusses another example of machine learning using transformers in NLP, specifically a smaller version of BERT, that has been pre-trained on a vocabulary of words. The transformer model is used for predicting sentiment, translation, and other word tasks, and can be further trained for new prediction tasks in Flink.

  • 00:30:00 In this section, the presenter showcases the Hugging Face Transformers library, which allows for easy import of pre-trained models and transfer learning in Python. While these models tend to be large, the library includes optimization and quantization features to improve performance. However, it is important to note that pre-processing stages, such as tokenization, are not yet part of the graph and may not be accessible in Scala. The presenter also highlights the benefits of leveraging the full capabilities of the Scala language in Flink, while separating training dependencies from production inference and decoupling the two pieces effectively. Overall, while there are some cons, such as model size and pre-processing challenges, the method offers advantages in terms of leveraging Flink's capabilities and deploying in a JAR file.

  • 00:35:00 In this section, the speaker notes that when using machine learning inference in Flink with ONNX, there are some important considerations to keep in mind. One is careful memory management when dealing with large models, such as transformer models that can take up hundreds of megabytes. Additionally, batch optimization and hardware acceleration, such as the use of GPUs, can impact performance. Pre-processing can be done with custom ONNX graphs, but this requires extra work and is not as easy as what was shown earlier. The speaker emphasizes that ONNX enables real-time machine learning in Flink and nicely separates Python training code from Scala production code, which can be a win for data scientists. The speaker also addresses questions about using ONNX with TensorFlow-based models and why ONNX was chosen over the Java APIs of PyTorch or TensorFlow.

  • 00:40:00 In this section, the speaker talks about ONNX as a framework-agnostic schema for the graph that is language-agnostic as well. The speaker mentioned that one of the interesting properties of using ONNX is that if one is using PyTorch and wants to switch to TensorFlow, they can use ONNX as a "vehicle" to go between the two frameworks, which shows the flexibility of the framework. The audience asked if the speaker had experimented with TensorFlow models with TensorFlow's Scala project, to which he responded negatively. Lastly, the speaker invites the audience to visit the ONNX repository, create an issue on GitHub, or reach out to him on LinkedIn for questions, especially on hiring.
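
The export side of this talk happens in Python with PyTorch before the serialized model is loaded from Scala inside Flink. A hedged sketch of such an export with a dynamic batch dimension is shown below; the AddOffset module and file name are illustrative stand-ins for the talk's add-offset example.

```python
import torch

class AddOffset(torch.nn.Module):
    """Toy module in the spirit of the talk's add-offset example (names are illustrative)."""
    def __init__(self, offset: float = 1.0):
        super().__init__()
        self.offset = offset

    def forward(self, x):
        return x + self.offset

model = AddOffset(offset=2.0).eval()
dummy = torch.zeros(1, 4)  # example input that defines the exported graph

# Export with a dynamic batch dimension so the same ONNX graph serves any batch size.
torch.onnx.export(
    model,
    dummy,
    "add_offset.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
)
```
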
  • 2021.05.20
  • www.youtube.com
What is the best way to run inference on a machine learning model in your streaming application? We will unpack this question, and explore the ways to levera...
 

Improving the online shopping experience with ONNX




This video discusses how e-commerce companies are using AI to create impactful insights that differentiate winning and losing in the online retail space. The speaker provides an example of Bazaarvoice, the largest network of brands and retailers, which hosts over 8 billion total reviews, and how they use product matching to share reviews. The speaker then describes how they developed a machine learning model in Python, exported it to ONNX format, and deployed it to a serverless function using a Node environment to run inference on ONNX Runtime. This solution allows for high-speed matching of hundreds of millions of products across thousands of client catalogs while maintaining low costs, resulting in significant cost savings and millions of extra reviews for brands and retailers. The speaker concludes by inviting viewers to explore more ways of using the capabilities of ONNX and to share their use cases for future technological advancements.

  • 00:00:00 In this section, we learn that the digitization of commerce has given e-commerce companies data at enormous scale, and that using AI to create impactful insights from it has become the differentiator between winning and losing in online retail. One example is Bazaarvoice, the world's largest network of brands and retailers, which serves about a billion shoppers a month and hosts over 8 billion total reviews; sharing those reviews across the network is powered by product matching. Product matching, the core function of the machine learning model built here, is performed by comparing unique identifiers, yet over a million product matches are still performed manually every month. The solution is a scikit-learn model built in Python and exported to ONNX format, chosen as the most lightweight, cost-efficient option that still maintains performance.

  • 00:05:00 In this section, the speaker discusses various options for implementing machine learning models for an online shopping experience and concludes that serverless functions are the best option due to their low cost and easy implementation. They then explain how they developed a model in Python, exported it to ONNX format, and deployed it to a serverless function using a node environment to run inference on an ONNX runtime. The modularity of this solution allows it to be easily plugged into any service, and by using metrics such as memory used and execution time, they were able to find the optimal memory size to ensure the best performance while keeping costs low. While deployment size limits and working within timeout limits are considerations, the power of ONNX and ONNX runtime in combination with serverless functions allowed for high-speed matching of hundreds of millions of products across thousands of client catalogs, resulting in significant cost savings and 15 million extra reviews for brands and retailers.

  • 00:10:00 In this section, the speaker concludes the video by inviting viewers to explore more ways of using the capabilities of ONNX and sharing their use cases. As someone actively working in this space, the speaker is intrigued and compelled by where these technologies may take us in the future.
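
As a hedged sketch of the export path described above: the actual Bazaarvoice features, estimator, and serverless wiring are not shown in the video, so the data and model here are synthetic; the resulting .onnx file is what a Node-based serverless function would then load with ONNX Runtime.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

# Synthetic stand-in for the product-matching classifier (real features are not shown).
X = np.random.rand(200, 8).astype(np.float32)
y = (X.sum(axis=1) > 4).astype(int)
clf = LogisticRegression(max_iter=200).fit(X, y)

# Export to ONNX so a lightweight runtime (e.g. ONNX Runtime inside a Node-based
# serverless function) can run inference without the Python training stack.
onx = convert_sklearn(clf, initial_types=[("input", FloatTensorType([None, 8]))])
with open("product_matcher.onnx", "wb") as f:
    f.write(onx.SerializeToString())
```
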
 

DSS online #4 : End-to-End Deep Learning Deployment with ONNX



This video discusses the challenges of end-to-end deep learning deployment, including managing different languages, frameworks, dependencies, and performance variability, as well as friction between teams and proprietary format lock-ins. The Open Neural Network Exchange (ONNX) is introduced as a protocol buffer-based format for deep learning serialization. It supports major deep learning frameworks and provides a self-contained artifact for running the model. ONNX ML is also discussed as a part of the ONNX specification that provides support for traditional machine learning pre-processing. The limitations of ONNX are acknowledged, but it is seen as a rapidly growing project with strong support from large organizations that offers true portability across different dimensions of languages, frameworks, runtimes, and versions.

  • 00:00:00 In this section, Nick Pentreath, a principal engineer at IBM, introduces end-to-end deep learning deployment with the Open Neural Network Exchange. There are many steps in the machine learning workflow, including analyzing data, preprocessing it for models, training the model, deploying it in real-world applications, and maintaining and monitoring it. Pentreath discusses how the workflow spans different teams and tools, making it essential to have infrastructure serving the model.

  • 00:05:00 In this section, the speaker discusses the three questions that are crucial for machine learning deployment: what are we deploying, where are we deploying, and how are we deploying. The deployment of a machine learning model involves the incorporation of the entire set of steps that precede the trained model, including transformations, feature extraction, and pre-processing. It is imperative to apply the same pre-processing steps in the live environment as that used during training, as any differences can lead to data skew that can result in catastrophic outcomes. The speaker notes that even deep learning still needs careful pre-processing and feature engineering, and highlights the challenges that come with the standardization of pre-processing. These challenges include different data layouts and color modes among different frameworks that can have subtle, but significant impacts on the predictions.

  • 00:10:00 In this section, the video discusses the challenges when it comes to deploying an end-to-end deep learning pipeline, including managing and bridging different languages, frameworks, dependencies, and performance variability, as well as friction between teams, the lack of standardization among open source frameworks, and proprietary format lock-ins. While container-based deployment does bring significant benefits, it still requires some sort of serving framework on top and does not solve the issue of standardization. That's why the video suggests using open standards to export models from different frameworks to a standardized format, which provides a separation of concerns between the model producer and the model consumer, allowing them to focus on their respective tasks without worrying about deployment issues or where the model came from.

  • 00:15:00 In this section, the speaker discusses the importance of open source and open standards in deep learning deployment. They explain the benefits of having a single stack and a standardized set of tools for analysis and visualization, and highlight the critical role of open governance in providing visibility and avoiding concentration of control. The speaker then introduces the open neural network exchange (ONNX), a protocol buffer-based format for defining serialization of machine learning models with a focus on deep learning. ONNX supports major deep learning frameworks such as PyTorch, Caffe, TensorFlow, Keras, Apple Core ML, and MXNet, and provides a self-contained artifact for running the model.

  • 00:20:00 In this section, the speaker discusses how ONNX ML (Machine Learning) is a part of the ONNX specification that provides support for traditional machine learning pre-processing, along with additional types such as sequences and maps. ONNX encompasses a wide community and ecosystem of exporters that are written for various traditional machine learning frameworks as well as models such as linear models, tree ensembles, and gradient boosting. To represent all of this, ONNX acts as a standard that sits between the model producers and consumers. The ONNX model zoo contains a set of widely used and standard models across different domains, including image analysis, classification, segmentation, and natural language processing, all represented in ONNX formats. The ONNX runtime, an open-source project by Microsoft, is a fully compliant runtime that supports both core deep learning and ONNX ML operators.

  • 00:25:00 In this section, the speaker discusses the limitations of ONNX, particularly in terms of certain missing features such as image processing, advanced string processing, and hashing clustering models. In addition, there are challenges when it comes to exporting hybrid pipelines from frameworks such as Spark ML, and this requires a bit of custom code. However, ONNX is an active project that is rapidly growing and it has strong support from large organizations. It offers true portability across different dimensions of languages, frameworks, runtimes, and versions, which solves a significant pain point for the deployment of deep learning pipelines in an open and portable manner. ONNX is open source and open governance, so anyone can get involved.
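
To make the ONNX vs. ONNX-ML distinction concrete, here is a short sketch (the file name is a placeholder) that inspects which operator-set domains and versions an exported model uses and which operators appear in its graph.

```python
import onnx

# "pipeline.onnx" is a placeholder for any exported model.
model = onnx.load("pipeline.onnx")

# The core deep-learning ops live in the default "ai.onnx" domain; the traditional-ML
# ops described in the talk live in "ai.onnx.ml". Each domain carries its own opset version.
for opset in model.opset_import:
    print(f"domain={opset.domain or 'ai.onnx'}  version={opset.version}")

# List the operator types appearing in the graph (e.g. Conv, Gemm, TreeEnsembleClassifier).
print(sorted({node.op_type for node in model.graph.node}))
```
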
  • 2020.10.29
  • www.youtube.com
End-to-End Deep Learning Deployment with ONNX. A deep learning model is often viewed as fully self-contained, freeing practitioners from the burden of data pro...
 

ONNX and ONNX Runtime with Microsoft's Vinitra Swamy and Pranav Sharma




The video discusses the Open Neural Network Exchange (ONNX) format, created to make models interoperable and efficient in serialization and versioning. ONNX consists of an intermediate representation layer, operator specs, and supports different types of data. The ONNX Runtime, implemented in C++ and assembler, offers backward compatibility and is extensible through execution providers, custom operators, and graph optimizers. The API offers support for platforms, programming languages, and execution providers. Users can create sessions, optimize models, and serialize them for future use. The speakers provide a demonstration of ONNX Runtime's versatility and efficiency, with the ability to run on Android devices.

  • 00:00:00 In this section, Vinitra Swamy from the ONNX engineering team introduces ONNX, the Open Neural Network Exchange, which is an interoperable standard for AI models. She explains that Microsoft has integrated machine learning into almost every aspect of its product suite, from HoloLens to Xbox to Skype, which has led to various deployment challenges at scale. ONNX was created to optimize for efficient inference by standardizing the model deployment process for different frameworks and deployment targets. The goal is to support models from many frameworks by implementing one standard and to provide a consistent experience for all users, whether they are data scientists, hardware vendors, service authors, or ML engineers.

  • 00:05:00 In this section, Vinitra Swamy and Pranav Sharma discuss ONNX, a consortium of founding partners that includes Microsoft, Facebook, Amazon, NVIDIA, and Intel, among others. ONNX consists of an intermediate representation layer and a full operator spec that define operators in a standard way despite the different sets of operators each framework has. The code to convert models to ONNX is not lengthy, and conversion could save users a lot in terms of inference and interoperability. Additionally, ONNX has design principles that enable interoperability for both deep learning and machine learning models. Users can get started with ONNX by going to the ONNX model zoo, model creation services or converters.

  • 00:10:00 The section discusses the components and design of ONNX, a model format created to make it interoperable and backwards-compatible while supporting efficient serialization and versioning. The format consists of a model, a computational graph with nodes, and an operator spec. The types of data supported include the standard tensor types and two non-tensor types, sequences and maps. Operator specs feature inputs, outputs, constraints, and examples. An example of an operator spec is given for the relu operator.

  • 00:15:00 In this section of the video, Vinitra Swamy and Pranav Sharma discuss the different versions and operators that are supported in the Open Neural Network Exchange (ONNX) format. They explain that ONNX has over 156 deep learning spec ops and 18 traditional machine learning ops that are interoperable across the different operators. Additionally, users can create custom ops for their models using the ONNX framework. They also highlight the importance of versioning, which is done across three different levels: the intermediate representation layer, opsets, and individual operators. Finally, they discuss ONNX Runtime, which is an open-source, high-performance inferencing engine for ONNX. It is cross-platform and designed to be backwards compatible, making it suitable for deployment in production environments.

  • 00:20:00 In this section, the focus is on the architecture of ONNX Runtime and how a model is run inside it. Backward compatibility and performance were key concerns for ONNX Runtime, which is implemented in C++ and some parts in assembler. ONNX Runtime has support for hardware accelerators by using something called "execution providers." The partitioning algorithm enables the model to run in a hybrid execution stage, and the individual execution providers can optimize the subgraphs even further for better performance. Finally, ONNX Runtime acts as an interpreter going through all the nodes in the graph to execute the model.

  • 00:25:00 In this section, the speakers discuss the modes of execution in ONNX and ONNX Runtime, which are sequential and parallel. Users can control the number of threads they want to configure for each mode of execution, and results are sent out through the API. The speakers note that different devices may not share the same memory, so memory copy nodes are inserted based on optimizations performed. They also talk about the graph partitioning process, where users have to specify a prioritized list of execution providers where the graph should be run. However, in the next release, there will be a new phase called smart partitioning, where ONNX will figure out the best way to place the graph and how to run it efficiently. The speakers also touch on execution providers, which are software abstractions on top of hardware accelerators. Two types of execution providers are kernel-based and runtime-based, and the latter is a black box where the execution provider runs parts of the graph for us.

  • 00:30:00 In this section, the speakers discuss the design principles of ONNX Runtime, emphasizing its extensibility through options like execution providers, custom operators, and graph optimizers. They also provide a matrix of supported platforms, programming languages, and execution providers, including TensorRT, DirectML, and OpenVINO. The speakers explain the high-level constructs of a session and the thread-safe way to create the session object before calling the run function. They also discuss how the time it takes to optimize a model depends on the size of the model and its optimization opportunities.

  • 00:35:00 In this section, the speakers discuss the creation of sessions and the use of run options and session options, with the ability to serialize the optimized model for future use. They also explain the process of registering custom operators, with the option of using Python for those who prefer not to use C#. The ONNX Runtime 1.0 version has been released, ensuring no breaking of APIs going forward, with compatibility going back to CentOS 7.6. The ONNX Go Live Tool, an open-source tool for converting and tuning models for optimal performance, is also discussed. The section concludes with examples of Microsoft services utilizing ONNX, including a 14x performance gain in Office's missing determiner model and a 3x performance gain in the optical character recognition model used in cognitive services.

  • 00:40:00 In this section, the speakers discuss the ONNX runtime API, which is in preview mode and allows for running ONNX runtime on Android devices. They also mention the training support, which is currently exploratory and aims to see if ONNX runtime can be used to tune already-created models. The speakers then give a demonstration of using ONNX runtime on a YOLOv3 object detection model, showing that ONNX runtime is versatile, efficient, and useful for cases that require good performance or need to support a model across different frameworks.

  • 00:45:00 In this section of the video, the presenters demonstrate the ONNX Runtime by identifying images and their respective classes with a large and complicated model. They also showcase a quick demo of the ONNX ecosystem converter, allowing users to upload and convert models from different frameworks in a Jupyter Notebook. They convert a document classification model from Core ML, Apple's machine learning framework, to ONNX, and validate its accuracy. They note that converting a model to ONNX is a one-time cost and an efficient process.

  • 00:50:00 In this section, the speakers summarize what they have covered in the video, including the benefits of using ONNX and ONNX Runtime, the various ways to convert from different frameworks into ONNX, and the increasing adoption of ONNX across their 26 companies. They thank their audience for listening and express their excitement to continue with the Q&A session.
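
A minimal sketch of the session configuration discussed above, using the ONNX Runtime Python API; the model path is a placeholder, and the specific thread count and providers are illustrative choices rather than the talk's exact settings.

```python
import onnxruntime as ort

# Session options illustrating points from the talk: execution mode, threading,
# graph optimization level, and serializing the optimized model for later reuse.
so = ort.SessionOptions()
so.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL      # or ORT_PARALLEL
so.intra_op_num_threads = 4                                # threads used within an operator
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
so.optimized_model_filepath = "model.optimized.onnx"       # save the optimized graph

# Execution providers are tried in priority order during graph partitioning.
session = ort.InferenceSession(
    "model.onnx",            # placeholder model path
    sess_options=so,
    providers=["CPUExecutionProvider"],
)
```
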
  • 2020.07.30
  • www.youtube.com
Microsoft hosted an AI enthusiast’s meetup group in San Francisco in November 2019 focused on accelerating and optimizing machine learning models with ONNX a...
 

Jan-Benedikt Jagusch Christian Bourjau: Making Machine Learning Applications Fast and Simple with ONNX




In this video about machine learning and deployment, the speakers discuss the challenges of putting models into production, particularly the difficulty of pickling and deploying models. They introduce ONNX, a universal file format for exporting machine learning models, and explain how it can help decouple training and inference, making deployment faster and more efficient. They provide a live demo using scikit-learn, explaining how to convert a machine learning pipeline to ONNX format. They also discuss the limitations of Docker containers for deploying machine learning models and highlight the benefits of using ONNX instead. They touch on the topic of encrypting models for additional security and address the usability issue of ONNX, which is still a young ecosystem with some cryptic error messages.

  • 00:00:00 In this section of the video, the presenters discuss the importance of decoupling model training from inference using ONNX. The presenters note that 55% of companies who have started with machine learning have failed to put their models into production, and argue that automating business processes by putting models into production is where the majority of value lies. However, they also note that deploying models is more complicated than it may initially seem, which is why they will be discussing how ONNX can help overcome this challenge. They also walk through the process of how a machine learning project typically starts, develops, and then collides with deployment requirements.

  • 00:05:00 In this section, the speakers discuss the challenges of putting machine learning models into production, specifically focusing on the difficulties of pickling and deploying the model. They explore the issues that arise when a pickle file is used to transfer the model, and how the correct environments and dependencies must be installed to successfully load the model in production. They also address the problem of models being too slow for use in production, leading to changes and optimizations to the model. Finally, they discuss the need for a universal file format to export the model, making it easy to use any runtime for deployment.

  • 00:10:00 In this section, the speakers discuss the concept of decoupling training from prediction time by using the training tools to export a machine learning model to a universal file format, such as ONNX, in order to free up the choice of tools used for deployment. They explain that ONNX is "the standardized way to describe your entire model, including your feature engineering and to store it into a binary format." They also note that ONNX is a good option for those with different types of machine learning models, not just neural networks. However, they emphasize that to use ONNX, a machine learning model must be described as a computational graph with nodes that are operators and edges that are data flowing through the graph, and that ONNX is strongly typed with type and shape information.

  • 00:15:00 In this section, the speakers discuss the specifics of ONNX, which defines a set of operators that must be used to ensure compatibility with the format. At the time of this talk, there were 175 operators, including more complicated ones like linear regressors and tree ensemble regressors. ONNX also specifies the data needed to store each operator, making the entire file self-contained with no other dependencies needed. The speakers stress that anything representable in a directed acyclic graph can be converted to ONNX, not just machine learning models. Additionally, an entire pipeline can be converted to ONNX, as long as each step can be represented as its own directed acyclic graph.

  • 00:20:00 In this section, the speakers demonstrate how to create a simple imputer and regressor using numpy operations, which can easily be defined as a graph of ONNX operators. By replacing every node in the scikit-learn graph with such a graph, a scikit-learn pipeline can be converted to the ONNX format. While established machine learning frameworks such as PyTorch, TensorFlow, LightGBM, and XGBoost already have converters available, custom converters must be written for custom estimators and transformers. The learning curve is steep but feasible, and it is crucial that the custom code fits into a DAG. The speakers also provide a live demo using training data and a pipeline from scikit-learn, which is then converted to ONNX format.

  • 00:25:00 In this section, Jan-Benedikt Jagusch and Christian Bourjau explain that ONNX is strongly typed and requires initial type information for the data that is being provided. To simplify this, they derive the types directly from a pandas data frame by mapping pandas data types to ONNX data types. The ONNX model is then fully self-contained, capturing the prediction logic of the pipeline. The data engineering team only needs to dump this into a file and use ONNX Runtime to run inference on the data, which is the only dependency regardless of whether the model was serialized from TensorFlow, PyTorch, or elsewhere. ONNX Runtime provides Python bindings that improve prediction speed by up to a factor of ten. Single-row prediction speed is also a priority since it is essential in online environments, taking only 170 milliseconds, which is similar to scikit-learn.

  • 00:30:00 In this section, the speakers discuss the benefits of using ONNX to decouple the training environment from the deployment environment. They explain that by exporting models to the universal ONNX file format, users can run their models with a runtime that provides the performance characteristics needed for real-world deployment. The speakers also address a question about using Docker containers, highlighting their limitations in terms of scalability and flexibility. They recommend looking into ONNX for its ability to provide both performance and flexibility, with the potential to archive models in addition to improving deployment.

  • 00:35:00 In this section, the speakers discuss the limitations of using Docker for deploying machine learning models, and highlight the benefits of serializing models to ONNX instead. While Docker may work for providing a REST API and in certain cases, the artifact produced includes many layers, making it difficult to load the mathematical formulation of the model. On the other hand, serializing the model to ONNX provides a pure essence of the model that is human-readable and easy to load. The speakers caution that while ONNX has many benefits, it is not a perfect solution for all use cases and requires some overhead to convert custom estimators and transformers. Additionally, the ecosystem is still relatively new, and users may need to spend time fixing issues or reading through GitHub issues. Finally, the speakers briefly mention the possibility of deploying ONNX models on GPUs, which is technically possible with the default ONNX runtime.

  • 00:40:00 In this section, the speakers discuss the possibility of encrypting ONNX models to protect against unintended use or reverse engineering. They mention that while it is possible to read the coefficients out of a simple model, this becomes difficult for complex models, since ONNX does not preserve the original operator and pipeline information. ONNX provides security by obfuscation to some extent, but it is not encrypted; however, it is possible to compile the file down to machine code for further obfuscation. The speakers also address the issue of pre-processing steps that perform I/O against a database, which would require all the data referenced inside the ONNX graph to be available in the database. Lastly, they discuss the usability of ONNX: the error messages can be cryptic, but they are optimistic that the ecosystem will improve, given its young age and corporate backing.
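
A hedged sketch in the spirit of the talk's imputer-plus-regressor demo follows; the data and pipeline here are synthetic rather than the speakers' actual example, but they show an entire pipeline being converted to a single ONNX graph and served with onnxruntime alone.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
import onnxruntime as ort

# Toy pipeline: imputer + regressor on synthetic data with some missing values.
X = np.random.rand(100, 3).astype(np.float32)
X[::7, 1] = np.nan
y = np.nansum(X, axis=1)
pipe = Pipeline([("impute", SimpleImputer()), ("reg", LinearRegression())]).fit(X, y)

# Convert the *entire* pipeline (feature engineering + model) to one ONNX graph.
onx = convert_sklearn(pipe, initial_types=[("input", FloatTensorType([None, 3]))])

# The deployment side only needs onnxruntime, not scikit-learn.
sess = ort.InferenceSession(onx.SerializeToString(), providers=["CPUExecutionProvider"])
pred = sess.run(None, {"input": X[:5]})[0]
print(pred.ravel())
```
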
  • 2022.05.12
  • www.youtube.com
Speaker: Jan-Benedikt Jagusch, Christian Bourjau. Track: General: Production. Taking trained machine learning models from inside a Jupyter notebook and deploying...
 

ONNX Runtime Azure EP for Hybrid Inferencing on Edge and Cloud




The ONNX Runtime team has released their first step into the hybrid world of enabling developers to use a single API for both edge and cloud computing with the Azure EP, which eliminates device connectivity concerns and allows developers to switch to the cloud model they've optimized, saving costs and reducing latency. This new feature allows developers to update application logic and choose which path to take via the Azure EP, offering more capability and power. The team demonstrates the deployment of Triton servers and object detection models, as well as how to test the endpoint and how simple it is to configure ONNX Runtime Azure. The presenters also discuss the ability to switch between local and remote processing and potential use cases, including lower- vs. higher-performing models. The ONNX Runtime Azure EP can be pre-loaded and configured easily with the necessary packages for deployment, contributing to the ease of use of the software.

  • 00:00:00 In this section, the Azure EP is introduced as the ONNX runtime team's first step into the hybrid world of enabling developers to use a single API for both edge and cloud computing. By doing so, developers will not have to worry about device connectivity and can switch to the cloud model that they've optimized and are using there, saving cost and latency. This new feature allows developers to update application logic and choose which path to take via the Azure EP, giving them more capability and power. Overall, the ONNX runtime team is excited to see what comes from the developer community and how this new feature is implemented.

  • 00:05:00 In this section, Randy Schrey, a contributor to the new ONNX Runtime (ORT) release 1.14, demonstrates some of the cool features that come with the release. First, he shows an online endpoint on Azure Machine Learning, which serves as the server side for the models. He also discusses the Triton server that is used to provide the endpoints, developed by NVIDIA, and its impressive performance and stability. Schrey shows how to deploy a Triton server and gives an overview of what it looks like, including specifying the model's name, version, and location. He also highlights the folder structure that must be followed when deploying a Triton server and shows the configuration file that describes how the model gets its input and output.

  • 00:10:00 In this section, the speaker discusses the structure of their folder for deploying object detection models and explains how the Triton server can find the model for deployment. They also answer a question about consuming models served on Azure and mention the current limitations of the Azure EP, stating that it only supports Triton server on the server side. The speaker then discusses testing the endpoint, how simple it is to install ONNX Runtime Azure, and how ONNX Runtime Azure can be used for hybrid inferencing on edge and cloud to work with the online endpoint from the client side. The speaker provides a script and explains some of the configurations required to load and consume a model using ONNX Runtime Azure.

  • 00:15:00 In this section, the presenters demonstrate how to use the ONNX Runtime Azure EP for hybrid inferencing on edge and cloud. They show how to configure the authentication key and run the inference, with the ability to switch between local and remote processing by changing a single parameter in the code. They discuss potential use cases, such as choosing between lower-performing and higher-performing models, and note that while the current preview release requires the Triton inference server, the plan is to support all types of deployment servers in the future.

  • 00:20:00 In this section, it is explained that the ONNX Runtime Azure EP can be pre-loaded and easily configured with the necessary packages for deployment. This feature contributes to the ease of deployment and usage of the software.
  • 2023.02.14
  • www.youtube.com
0:00 What is Azure EP? 5:00 How to Setup a Triton Inference Server Managed Endpoint in Azure 12:45 Installing the ONNX Runtime Azure EP Package 13:35 Using the ...