YOLO11 Inference Speed Showdown: OpenVINO vs. PyTorch vs. ONNX Runtime

By 大貓咪, filed under Python

2025-05-31 · CC BY-NC 4.0

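Does OpenVINO really improve YOLO11 inference performance? We benchmarked Ultralytics, OpenVINO, and ONNX Runtime on a CPU and an iGPU, and OpenVINO on the iGPU came out faster than both PyTorch and ONNX Runtime! This post walks through the inference speed of each framework.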

I just installed OpenVINO, but is it actually any faster?

Let's find out ~


Preface

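I set up OpenVINO a while back and played around with it a bit, but I had no idea how it performs or whether it is actually faster. This time, let's measure its speed ~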

We will measure YOLO object detection speed under several frameworks, using YOLO11n throughout. The frameworks tested are:

Ultralytics
OpenVINO
ONNX Runtime

Test hardware:

CPU: Intel Core Ultra 7 155H
GPU: the CPU's integrated GPU (iGPU)

Ultralytics

Ultralytics is an open-source YOLO framework that provides a family of object detection and image segmentation models and makes them easy to use from Python.

pip install ultralytics

Inference

from ultralytics import YOLO

model = YOLO("yolo11n.pt")
image_path = "bus.jpg"

# average the inference time
count = 100
inference_time_list = []
for _ in range(count):
    results = model(image_path, imgsz=(640, 640), device="CPU", rect=False)
    inference_time_list.append(results[0].speed["inference"])

print(f"Total: {sum(inference_time_list) / 1000} seconds")
print(f"Average: {sum(inference_time_list) / 1000 / count} seconds")
print(f"{1 / sum(inference_time_list) * count * 1000} FPS")

Under the hood, Ultralytics runs on PyTorch. On the CPU it reaches about 11 FPS.

Device	Time (ms)	FPS
CPU	89.44	11.18
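The FPS column is simply the reciprocal of the average latency. A minimal sketch of that arithmetic, where `summarize_latencies` is a hypothetical helper name (not part of Ultralytics) and the input is a list of per-run latencies in milliseconds like the one the loop above collects:

```python
def summarize_latencies(latencies_ms: list[float]) -> dict[str, float]:
    """Turn per-run latencies (milliseconds) into total, average, and FPS."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    total_ms = sum(latencies_ms)
    avg_ms = total_ms / len(latencies_ms)
    return {
        "total_s": total_ms / 1000,  # total time in seconds
        "avg_ms": avg_ms,            # average latency per inference
        "fps": 1000 / avg_ms,        # inferences per second
    }


# 100 runs at the average latency measured on the CPU
summary = summarize_latencies([89.43509] * 100)
print(f"{summary['avg_ms']:.2f} ms -> {summary['fps']:.2f} FPS")
```

Keeping latency in milliseconds and deriving FPS only at the end avoids mixing units, which is easy to do when some APIs report milliseconds and others seconds.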

OpenVINO

All tests use the YOLO11n model pre-trained by Ultralytics. First, we use Ultralytics to export it to the OpenVINO format, with NMS included in the exported model.

yolo export model=yolo11n.pt format=openvino dynamic=False nms=True

Preprocessing

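YOLO expects a 640 x 640 input, so the image has to be preprocessed: it is scaled so that its longest side is 640 and then padded to fill the rest. Note that this code only scales down; an image smaller than 640 x 640 is padded without being enlarged.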

from PIL import Image, ImageDraw, ImageOps
import numpy as np

def letterbox(
    img: Image.Image,
    new_shape: tuple[int, int] | int = (640, 640),
    color: tuple[int, int, int] = (114, 114, 114),
):
    shape = img.size  # W, H
    if isinstance(new_shape, int):
        new_shape = (new_shape, new_shape)  # W, H

    # scale ratio (new / old)
    r = min(new_shape[0] / shape[0], new_shape[1] / shape[1])
    # only scale down, do not scale up (for better test mAP)
    r = min(r, 1.0)

    # compute padding
    new_unpad = int(round(shape[0] * r)), int(round(shape[1] * r))
    dw, dh = (
        new_shape[0] - new_unpad[0],
        new_shape[1] - new_unpad[1],
    )
    left = int(round(dw / 2))
    right = dw - left
    top = int(round(dh / 2))
    bottom = dh - top

    # resize
    img = img.resize(new_unpad)

    # border
    image_with_border = ImageOps.expand(
        img, border=(left, top, right, bottom), fill=color
    )
    return image_with_border

def preprocess_image(img: Image.Image):
    # Normalization: force 3 channels, then scale pixel values to [0, 1]
    img = np.array(img.convert("RGB")).astype(np.float32)
    img = img / 255

    # Convert HWC to CHW
    img = img.transpose(2, 0, 1)
    img = np.ascontiguousarray(img)

    # add batch dim
    img = np.expand_dims(img, axis=0)
    return img
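To make the letterbox arithmetic concrete, here is a small sketch that computes only the resized size and the padding, without touching any pixels. `letterbox_geometry` is a hypothetical helper that mirrors the math in `letterbox` above, and the 810 x 1080 input size is just an assumed example:

```python
def letterbox_geometry(
    width: int, height: int, target: tuple[int, int] = (640, 640)
) -> tuple[tuple[int, int], tuple[int, int, int, int]]:
    """Return (resized (W, H), padding (left, top, right, bottom))."""
    # scale ratio (new / old), capped at 1.0 so the image is never enlarged
    r = min(target[0] / width, target[1] / height, 1.0)
    new_w, new_h = int(round(width * r)), int(round(height * r))
    # leftover space is split between the two sides
    dw, dh = target[0] - new_w, target[1] - new_h
    left, top = int(round(dw / 2)), int(round(dh / 2))
    return (new_w, new_h), (left, top, dw - left, dh - top)


# Hypothetical 810 x 1080 portrait input
size, padding = letterbox_geometry(810, 1080)
print(size, padding)
```

The same ratio and padding are what you would need later to map detected boxes from the 640 x 640 canvas back to the original image.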

Inference

We point OpenVINO at the model's xml file and choose the device to run inference on. To get a more stable measurement, we run inference 100 times and report the average.

import time

import numpy as np
import openvino as ov
from PIL import Image

def openvino(img_input: np.ndarray, device_name: str):
    model_path = "model/yolo11n_openvino_model/yolo11n.xml"

    core = ov.Core()
    model = core.read_model(model_path)
    compiled_model = core.compile_model(model, device_name=device_name)

    # average the inference time
    count = 100
    start = time.time()
    for _ in range(count):
        outputs = compiled_model(img_input)
    end = time.time()

    print(f"Device Name: {device_name}")
    print(f"Total: {end - start} seconds")
    print(f"Average: {(end - start) / count} seconds")
    print(f"{1 / (end - start) * count} FPS")
    result = outputs[compiled_model.output(0)][0]

    return result

if __name__ == "__main__":

    image_path = "bus.jpg"

    img = Image.open(image_path)
    img_border = letterbox(img)
    img_input = preprocess_image(img_border)

    device_name_list = ["CPU", "GPU", "HETERO:GPU,CPU"]
    for device_name in device_name_list:
        result = openvino(img_input, device_name)
        draw_result(img_border, result)
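`draw_result` is called above but its definition is not shown in the post. A hypothetical sketch of what such a helper could look like, assuming each row of the NMS-enabled output is `[x1, y1, x2, y2, confidence, class_id]` in pixels of the letterboxed image. The row layout, the `conf_threshold` parameter, and the `parse_detections` helper are all assumptions, not the author's code:

```python
from typing import Iterable, Sequence


def parse_detections(
    rows: Iterable[Sequence[float]], conf_threshold: float = 0.25
) -> list[tuple[tuple[float, float, float, float], float, int]]:
    """Keep rows above the confidence threshold as (box, confidence, class_id)."""
    detections = []
    for x1, y1, x2, y2, conf, cls in rows:
        if float(conf) >= conf_threshold:
            box = (float(x1), float(y1), float(x2), float(y2))
            detections.append((box, float(conf), int(cls)))
    return detections


def draw_result(img, result, conf_threshold: float = 0.25):
    """Draw boxes on a copy of a PIL image and return the copy."""
    from PIL import ImageDraw  # Pillow is only needed for the drawing step

    annotated = img.copy()
    draw = ImageDraw.Draw(annotated)
    for box, conf, cls in parse_detections(result, conf_threshold):
        draw.rectangle(box, outline="red", width=2)
        draw.text((box[0] + 2, box[1] + 2), f"{cls}: {conf:.2f}", fill="red")
    return annotated
```

With this version, `draw_result(img_border, result).save("result.jpg")` would write the annotated image to disk; the output file name is only an example.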

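The results show that the iGPU is about three times faster than the CPU (8.98 ms vs. 26.86 ms), while the iGPU and Hybrid modes are nearly the same. The two stayed close throughout testing and both are quite fast, so I recommend simply using the iGPU.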

Device	Time (ms)	FPS
CPU	26.86	37.22
iGPU	8.98	111.34
Hybrid (CPU + iGPU)	8.53	117.20
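The timing loops in this post wrap 100 calls in `time.time()` and include the very first call, which can carry one-off costs. If you want to reproduce the measurements more carefully, one possible approach is a small harness with untimed warm-up runs and `time.perf_counter()`. This is only a sketch: `benchmark` is a hypothetical helper, and the numbers in this post were not measured with it.

```python
import statistics
import time
from typing import Callable


def benchmark(
    fn: Callable[[], object], warmup: int = 10, runs: int = 100
) -> dict[str, float]:
    """Time fn() over `runs` calls after `warmup` untimed calls."""
    for _ in range(warmup):
        fn()  # untimed: lets caches and lazy initialization settle

    latencies_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        latencies_ms.append((time.perf_counter() - start) * 1000)

    mean_ms = statistics.fmean(latencies_ms)
    return {
        "mean_ms": mean_ms,
        "median_ms": statistics.median(latencies_ms),
        # guard against a zero mean on very coarse timers
        "fps": 1000 / mean_ms if mean_ms > 0 else float("inf"),
    }
```

For the OpenVINO case it could be called as `benchmark(lambda: compiled_model(img_input))`. Reporting the median next to the mean makes it easier to spot a few slow outlier runs.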

ONNX Runtime

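ONNX Runtime (ORT) is a cross-platform inference framework built around the ONNX model format. Its main goal is to provide a unified, optimized inference environment that runs efficiently on different hardware. It installs directly with pip.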

pip install onnxruntime

We also need to convert the model to the ONNX format, again with NMS included. The conversion produces a single ONNX model file.

yolo export model=yolo11n.pt format=onnx dynamic=False nms=True

Preprocessing

Preprocessing is the same as for OpenVINO; see OpenVINO - Preprocessing.

Inference

We load the ONNX model file we just converted and use the CPU Execution Provider. As before, we average over repeated runs to get a more stable result.

import time

import numpy as np
import onnxruntime as ort
from PIL import Image

def onnxruntime(img_input: np.ndarray):
    model_path = "model/onnx/yolo11n.onnx"

    session = ort.InferenceSession(
        model_path,
        providers=["CPUExecutionProvider"],
        provider_options=[{}],
    )

    count = 100
    start = time.time()
    for _ in range(count):
        outputs = session.run(None, {"images": img_input})
    end = time.time()

    print(f"Total: {end - start} seconds")
    print(f"Average: {(end - start) / count} seconds")
    print(f"{1 / (end - start) * count} FPS")
    result = outputs[0][0]

    return result

if __name__ == "__main__":

    image_path = "bus.jpg"

    img = Image.open(image_path)
    img_border = letterbox(img)
    img_input = preprocess_image(img_border)

    result = onnxruntime(img_input)
    draw_result(img_border, result)
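The loop above hard-codes the input name `"images"`, which is the name this exported model uses. If you swap in a different model, the name can be read from the session instead. A minimal sketch using `InferenceSession.get_inputs()`; the helper names `describe_inputs` and `first_input_name` are hypothetical:

```python
def describe_inputs(session) -> list[tuple[str, list, str]]:
    """List (name, shape, type) for every input of an ONNX Runtime session."""
    return [(arg.name, list(arg.shape), arg.type) for arg in session.get_inputs()]


def first_input_name(session) -> str:
    """Return the name of the model's first input tensor."""
    inputs = session.get_inputs()
    if not inputs:
        raise ValueError("model has no inputs")
    return inputs[0].name
```

The inference call then becomes `session.run(None, {first_input_name(session): img_input})`, and `describe_inputs(session)` is a quick way to confirm the expected input shape before feeding data.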

Compared with OpenVINO on the CPU, ONNX Runtime is slightly slower (32.06 ms vs. 26.86 ms per inference), so OpenVINO still has the edge.

Device	Time (ms)	FPS
CPU	32.06	31.18

ONNX Runtime + OpenVINO

ONNX Runtime can also use OpenVINO as its backend, so I took the opportunity to check whether that differs from using OpenVINO directly.

pip install onnxruntime-openvino

import onnxruntime as ort

session = ort.InferenceSession(
    model_path,
    providers=["OpenVINOExecutionProvider"],
    provider_options=[{"device_type": "CPU"}],
)
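Requesting `OpenVINOExecutionProvider` only works when the installed ONNX Runtime build actually ships it. A small sketch of checking that up front and falling back in order of preference; `pick_providers` is a hypothetical helper, and the hard-coded `available` list stands in for what `ort.get_available_providers()` would return on your machine:

```python
def pick_providers(available: list[str], preferred: list[str]) -> list[str]:
    """Return the preferred providers that are actually available, in order."""
    chosen = [name for name in preferred if name in available]
    if not chosen:
        raise RuntimeError(f"none of {preferred} found in {available}")
    return chosen


# Example with a hard-coded list; in practice pass ort.get_available_providers()
providers = pick_providers(
    available=["OpenVINOExecutionProvider", "CPUExecutionProvider"],
    preferred=["OpenVINOExecutionProvider", "CPUExecutionProvider"],
)
print(providers)
```

If you also pass `provider_options`, keep one dict per selected provider, in the same order as the providers list.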

Preprocessing

Preprocessing is again the same as for OpenVINO ~ see OpenVINO - Preprocessing.

Inference

Set the provider to OpenVINOExecutionProvider, then specify the target device in provider_options.

import time

import numpy as np
import onnxruntime as ort
from PIL import Image

def onnxruntime(img_input: np.ndarray, device_name: str):
    model_path = "model/onnx/yolo11n.onnx"

    session = ort.InferenceSession(
        model_path,
        providers=["OpenVINOExecutionProvider"],
        provider_options=[{"device_type": device_name}],
    )

    count = 100
    start = time.time()
    for _ in range(count):
        outputs = session.run(None, {"images": img_input})
    end = time.time()

    print(f"Total: {end - start} seconds")
    print(f"Average: {(end - start) / count} seconds")
    print(f"{1 / (end - start) * count} FPS")
    result = outputs[0][0]

    return result

if __name__ == "__main__":

    image_path = "bus.jpg"

    img = Image.open(image_path)
    img_border = letterbox(img)
    img_input = preprocess_image(img_border)

    device_name_list = ["CPU", "GPU", "HETERO:GPU,CPU"]
    for device_name in device_name_list:
        result = onnxruntime(img_input, device_name)
        draw_result(img_border, result)

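As with OpenVINO, we ran three device configurations, but FPS was lower than with native OpenVINO across the board. Running it also produced warnings that some nodes could not be assigned to the requested device.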

Device	Time (ms)	FPS
CPU	45.31	22.07
iGPU	10.10	99.03
Hybrid (CPU + iGPU)	9.37	106.71

2025-05-31 15:54:23.4379916 [W:onnxruntime:, session_state.cc:1263 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2025-05-31 15:54:23.4436529 [W:onnxruntime:, session_state.cc:1265 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.

Conclusion

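In these YOLO11 tests, OpenVINO makes better use of the hardware than PyTorch to speed up inference. I recommend using the iGPU first, since it gives a large speedup, and falling back to the CPU only when no iGPU is available.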

Framework	Device	Time (ms)	FPS
PyTorch	CPU	89.44	11.18
OpenVINO	CPU	26.86	37.22
OpenVINO	iGPU	8.98	111.34
OpenVINO	Hybrid (CPU + iGPU)	8.53	117.20
ONNX Runtime	CPU	32.06	31.18
ONNX Runtime + OpenVINO	CPU	45.31	22.07
ONNX Runtime + OpenVINO	iGPU	10.10	99.03
ONNX Runtime + OpenVINO	Hybrid (CPU + iGPU)	9.37	106.71
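To read the table as speedups rather than raw latencies, each row can be divided into the PyTorch CPU baseline. A short sketch of that calculation, using the latencies from the table rounded to two decimals:

```python
# (framework, device) -> average latency in ms, rounded to two decimals
latency_ms = {
    ("PyTorch", "CPU"): 89.44,
    ("OpenVINO", "CPU"): 26.86,
    ("OpenVINO", "iGPU"): 8.98,
    ("OpenVINO", "Hybrid (CPU + iGPU)"): 8.53,
    ("ONNX Runtime", "CPU"): 32.06,
    ("ONNX Runtime + OpenVINO", "CPU"): 45.31,
    ("ONNX Runtime + OpenVINO", "iGPU"): 10.10,
    ("ONNX Runtime + OpenVINO", "Hybrid (CPU + iGPU)"): 9.37,
}

# speedup relative to PyTorch on the CPU (higher is faster)
baseline = latency_ms[("PyTorch", "CPU")]
speedup = {key: baseline / ms for key, ms in latency_ms.items()}

# print the configurations from fastest to slowest
for (framework, device), factor in sorted(speedup.items(), key=lambda kv: -kv[1]):
    print(f"{framework:<24} {device:<20} {factor:5.2f}x")
```

On these numbers, OpenVINO on the CPU alone is already a bit over three times faster than PyTorch, and the iGPU configurations are roughly ten times faster.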

NPU

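Why didn't any of the tests use the NPU? Because running YOLO11n on the NPU fails with an error. The model may still contain dynamic dimensions even though I already set dynamic=False, and I have not found a way to fix this yet.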

[ERROR] 17:21:10.854 [vpux-compiler] Got Diagnostic at loc(fused<{name = "aten::add/Add_2", type = "Add"}>["aten::add/Add_2"]) : Got non broadcastable dimensions pair : '0' and -9223372036854775808'
loc(fused<{name = "aten::add/Add_2", type = "Add"}>["aten::add/Add_2"]): error: Got non broadcastable dimensions pair : '0' and -9223372036854775808'
LLVM ERROR: Failed to infer result type(s).

