whisper.cpp: a C/C++ port of the OpenAI Whisper model

Co-authored · 2023-09-26 06:41

whisper.cpp is a C/C++ port of OpenAI's Whisper automatic speech recognition (ASR) model.


Features



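  • Plain C/C++ implementation without dependencies

  • Apple silicon as a first-class citizen: optimized via Arm Neon and the Accelerate framework

  • AVX intrinsics support for x86 architectures

  • VSX intrinsics support for POWER architectures

  • Mixed F16 / F32 precision

  • Low memory usage (Flash Attention)

  • Zero memory allocations at runtime

  • Runs on the CPU

  • C-style API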
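
To make the "mixed F16 / F32 precision" item concrete, here is a minimal Python sketch of the general idea: values are stored as 16-bit floats, which halves their storage, while sums are accumulated at 32-bit precision. It is an illustration only and does not reproduce ggml's actual kernels; the helper names `to_f16`, `to_f32`, and `dot_mixed` are hypothetical.

```python
import struct


def to_f16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision (F16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_f32(x: float) -> float:
    """Round x to the nearest IEEE 754 single-precision (F32) value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot_mixed(weights, activations):
    """Dot product with F16-rounded weights and a sum rounded to F32 after each step."""
    acc = 0.0
    for w, a in zip(weights, activations):
        # The weight goes through its 2-byte form; the running sum is kept at F32.
        acc = to_f32(acc + to_f16(w) * to_f32(a))
    return acc


if __name__ == "__main__":
    # Bytes needed per stored value in each format.
    print("F16 bytes:", struct.calcsize("<e"), "F32 bytes:", struct.calcsize("<f"))
    # Rounding error introduced by storing 0.1 as F16.
    print("0.1 as F16:", to_f16(0.1))
    print("dot:", dot_mixed([0.5, 0.25], [2.0, 4.0]))
```

The trade-off shown here is the usual one: F16 storage saves memory at the cost of a small rounding error per value, and the wider accumulator keeps that error from compounding across a long sum.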


Supported platforms:



  •  Mac OS (Intel and Arm)

  •  iOS

  •  Android

  •  Linux / FreeBSD

  •  WebAssembly

  •  Windows (MSVC and MinGW)

  •  Raspberry Pi


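The entire implementation of the model is contained in 2 source files (ggml.c and whisper.cpp, together with their headers; see the implementation details below).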



Such a lightweight implementation makes it easy to integrate OpenAI's Whisper model into different platforms and applications.


Implementation details



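  • The core tensor operations are implemented in C (ggml.h / ggml.c)

  • The transformer model and the high-level C-style API are implemented in C++ (whisper.h / whisper.cpp)

  • Sample usage is demonstrated in main.cpp

  • Real-time audio transcription from the microphone is demonstrated in stream.cpp

  • Various other examples are available in the examples folder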


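The tensor operators are heavily optimized for the CPUs in Apple silicon. Depending on the size of the computation, either Arm Neon SIMD intrinsics or CBLAS routines from the Accelerate framework are used. The latter are especially effective for larger sizes, because the Accelerate framework uses the special-purpose AMX coprocessor available in modern Apple products.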


Quick start


First, download one of the Whisper models converted to ggml format. For example:




bash ./models/download-ggml-model.sh base.en


Then build the main example and transcribe an audio file like this:




# build the main example
make

# transcribe an audio file
./main -f samples/jfk.wav


For a quick demo, simply run make base.en:




$ make base.en

cc -I. -O3 -std=c11 -pthread -DGGML_USE_ACCELERATE -c ggml.c -o ggml.o
c++ -I. -I./examples -O3 -std=c++11 -pthread -c whisper.cpp -o whisper.o
c++ -I. -I./examples -O3 -std=c++11 -pthread examples/main/main.cpp whisper.o ggml.o -o main -framework Accelerate
./main -h

usage: ./main [options] file0.wav file1.wav ...

options:
-h, --help [default] show this help message and exit
-t N, --threads N [4 ] number of threads to use during computation
-p N, --processors N [1 ] number of processors to use during computation
-ot N, --offset-t N [0 ] time offset in milliseconds
-on N, --offset-n N [0 ] segment index offset
-d N, --duration N [0 ] duration of audio to process in milliseconds
-mc N, --max-context N [-1 ] maximum number of text context tokens to store
-ml N, --max-len N [0 ] maximum segment length in characters
-bo N, --best-of N [5 ] number of best candidates to keep
-bs N, --beam-size N [-1 ] beam size for beam search
-wt N, --word-thold N [0.01 ] word timestamp probability threshold
-et N, --entropy-thold N [2.40 ] entropy threshold for decoder fail
-lpt N, --logprob-thold N [-1.00 ] log probability threshold for decoder fail
-su, --speed-up [false ] speed up audio by x2 (reduced accuracy)
-tr, --translate [false ] translate from source language to english
-di, --diarize [false ] stereo audio diarization
-nf, --no-fallback [false ] do not use temperature fallback while decoding
-otxt, --output-txt [false ] output result in a text file
-ovtt, --output-vtt [false ] output result in a vtt file
-osrt, --output-srt [false ] output result in a srt file
-owts, --output-words [false ] output script for generating karaoke video
-ocsv, --output-csv [false ] output result in a CSV file
-of FNAME, --output-file FNAME [ ] output file path (without file extension)
-ps, --print-special [false ] print special tokens
-pc, --print-colors [false ] print colors
-pp, --print-progress [false ] print progress
-nt, --no-timestamps [true ] do not print timestamps
-l LANG, --language LANG [en ] spoken language ('auto' for auto-detect)
--prompt PROMPT [ ] initial prompt
-m FNAME, --model FNAME [models/ggml-base.en.bin] model path
-f FNAME, --file FNAME [ ] input WAV file path


bash ./models/download-ggml-model.sh base.en
Downloading ggml model base.en ...
ggml-base.en.bin 100%[========================>] 141.11M 6.34MB/s in 24s
Done! Model 'base.en' saved in 'models/ggml-base.en.bin'
You can now use it like this:

$ ./main -m models/ggml-base.en.bin -f samples/jfk.wav


===============================================
Running base.en on all samples in ./samples ...
===============================================

----------------------------------------------
[+] Running base.en on samples/jfk.wav ... (run 'ffplay samples/jfk.wav' to listen)
----------------------------------------------

whisper_init_from_file: loading model from 'models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 2
whisper_model_load: mem required = 215.00 MB (+ 6.00 MB per decoder)
whisper_model_load: kv self size = 5.25 MB
whisper_model_load: kv cross size = 17.58 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx = 140.60 MB
whisper_model_load: model size = 140.54 MB

system_info: n_threads = 4 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:11.000] And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.


whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: load time = 113.81 ms
whisper_print_timings: mel time = 15.40 ms
whisper_print_timings: sample time = 11.58 ms / 27 runs ( 0.43 ms per run)
whisper_print_timings: encode time = 266.60 ms / 1 runs ( 266.60 ms per run)
whisper_print_timings: decode time = 66.11 ms / 27 runs ( 2.45 ms per run)
whisper_print_timings: total time = 476.31 ms


The command downloads the base.en model converted to the custom ggml format and runs inference on all .wav samples in the samples folder.
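
The timing lines at the end of the log above can be turned into a single speed figure. The following Python sketch parses the `whisper_print_timings` lines and the `processing` line, then computes a real-time factor (processing time divided by audio length). The regular expressions assume the log format shown above, which may differ between versions; the function names are hypothetical.

```python
import re

# Lines copied from the sample log above.
SAMPLE_LOG = """\
main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: load time = 113.81 ms
whisper_print_timings: mel time = 15.40 ms
whisper_print_timings: sample time = 11.58 ms / 27 runs ( 0.43 ms per run)
whisper_print_timings: encode time = 266.60 ms / 1 runs ( 266.60 ms per run)
whisper_print_timings: decode time = 66.11 ms / 27 runs ( 2.45 ms per run)
whisper_print_timings: total time = 476.31 ms
"""

TIMING_RE = re.compile(r"whisper_print_timings:\s+(\w+) time\s+=\s+([\d.]+) ms")
PROCESSING_RE = re.compile(r"\((\d+) samples, ([\d.]+) sec\)")


def parse_timings(log: str) -> dict:
    """Map each stage name (load, mel, ..., total) to its time in milliseconds."""
    return {name: float(ms) for name, ms in TIMING_RE.findall(log)}


def audio_info(log: str) -> tuple:
    """Return (sample_count, duration_in_seconds) from the 'processing' line."""
    match = PROCESSING_RE.search(log)
    if match is None:
        raise ValueError("no 'processing' line found in log")
    return int(match.group(1)), float(match.group(2))


def real_time_factor(log: str) -> float:
    """Total processing time divided by audio duration; below 1.0 is faster than real time."""
    _, seconds = audio_info(log)
    return parse_timings(log)["total"] / 1000.0 / seconds


if __name__ == "__main__":
    samples, seconds = audio_info(SAMPLE_LOG)
    print("sample rate (Hz):", samples / seconds)
    print("real-time factor:", round(real_time_factor(SAMPLE_LOG), 4))
```

For the sample run, the 11-second clip was processed in under half a second, so the real-time factor is well below 1.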


For detailed usage instructions, run: ./main -h
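
When the tool is driven from a script, it helps to assemble the command line in one place. The sketch below builds an argument list using only flags that appear in the help output above (`-m`, `-f`, `-t`, `-l`, `-tr`, `-osrt`). It assumes the `main` binary has been built in the current directory and that the transcript is written to standard output; `build_main_command` and `transcribe` are hypothetical helper names.

```python
import subprocess


def build_main_command(wav_path, model_path="models/ggml-base.en.bin",
                       threads=4, language="en",
                       translate=False, output_srt=False):
    """Build the argument list for ./main from flags listed in its help text."""
    cmd = ["./main", "-m", model_path, "-f", wav_path,
           "-t", str(threads), "-l", language]
    if translate:
        cmd.append("-tr")    # translate from the source language to English
    if output_srt:
        cmd.append("-osrt")  # also write the result to an .srt file
    return cmd


def transcribe(wav_path, **options) -> str:
    """Run ./main and return what it prints to stdout (assumed to be the transcript)."""
    result = subprocess.run(build_main_command(wav_path, **options),
                            capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Only print the command here; call transcribe() once ./main and a model exist.
    print(" ".join(build_main_command("samples/jfk.wav", output_srt=True)))
```

Passing the arguments as a list, rather than as one shell string, avoids quoting problems with file names that contain spaces.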


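Note that the main example currently runs only with 16-bit WAV files, so make sure to convert your input before running the tool. For example, you can use ffmpeg like this: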




ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav
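
Before running the tool, it can be useful to check whether a file already has the expected format. The sketch below uses Python's standard `wave` module to test for 16-bit samples. The 16 kHz and mono checks are assumptions taken from the `-ar 16000 -ac 1` flags in the ffmpeg command above, not a documented requirement; `check_wav` and `write_silence` are hypothetical helper names.

```python
import os
import tempfile
import wave


def check_wav(path, expected_rate=16000, expected_channels=1):
    """Return a list of format problems; an empty list means the file looks usable."""
    problems = []
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            problems.append(f"sample width is {8 * wf.getsampwidth()} bits, expected 16")
        if wf.getframerate() != expected_rate:
            problems.append(f"sample rate is {wf.getframerate()} Hz, expected {expected_rate}")
        if wf.getnchannels() != expected_channels:
            problems.append(f"{wf.getnchannels()} channels, expected {expected_channels}")
    return problems


def write_silence(path, rate, sample_width, channels, seconds=0.1):
    """Write a short silent WAV file, used here only to exercise check_wav."""
    n_frames = int(rate * seconds)
    with wave.open(path, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)
        wf.setframerate(rate)
        wf.writeframes(b"\x00" * (n_frames * sample_width * channels))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        good = os.path.join(tmp, "good.wav")
        bad = os.path.join(tmp, "bad.wav")
        write_silence(good, rate=16000, sample_width=2, channels=1)
        write_silence(bad, rate=44100, sample_width=1, channels=2)
        print("good.wav:", check_wav(good))
        print("bad.wav:", check_wav(bad))
```

A file that fails any of these checks can be converted with the ffmpeg command shown above.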


Memory usage

Model    Disk     Memory     SHA
tiny     75 MB    ~125 MB    bd577a113a864445d4c299885e0cb97d4ba92b5f
base     142 MB   ~210 MB    465707469ff3a37a2b9b8d8f89f2f99de7299dac
small    466 MB   ~600 MB    55356645c2b361a969dfd0ef2c5a50d530afd8d5
medium   1.5 GB   ~1.7 GB    fd9727b6e1217c2f614f9b698455c4ffd82463b4
large    2.9 GB   ~3.3 GB    0f4c8e34f21cf1a914c59d8b3ce882345ad349d6
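
The table above can be used programmatically, for example to pick the largest model that fits a memory budget and to check a downloaded file against its listed hash. The following Python sketch makes two assumptions: the SHA column holds SHA-1 digests (the values are 40 hex digits long), and 1 GB is treated as 1000 MB. `pick_model`, `sha1_of_file`, and `verify_model` are hypothetical helper names.

```python
import hashlib
import os
import tempfile

# name: (approximate memory in MB, hash from the table above)
MODELS = {
    "tiny":   (125,  "bd577a113a864445d4c299885e0cb97d4ba92b5f"),
    "base":   (210,  "465707469ff3a37a2b9b8d8f89f2f99de7299dac"),
    "small":  (600,  "55356645c2b361a969dfd0ef2c5a50d530afd8d5"),
    "medium": (1700, "fd9727b6e1217c2f614f9b698455c4ffd82463b4"),
    "large":  (3300, "0f4c8e34f21cf1a914c59d8b3ce882345ad349d6"),
}


def pick_model(available_mb):
    """Return the largest model whose approximate memory need fits, or None."""
    fitting = [name for name, (mem, _) in MODELS.items() if mem <= available_mb]
    return max(fitting, key=lambda name: MODELS[name][0]) if fitting else None


def sha1_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model(path, name):
    """Compare a file's SHA-1 with the table entry (assumes the column is SHA-1)."""
    return sha1_of_file(path) == MODELS[name][1]


if __name__ == "__main__":
    print("700 MB budget:", pick_model(700))
    # Demonstrate hashing on a small temporary file rather than a real model.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"abc")
    print("sha1 of demo file:", sha1_of_file(tmp.name))
    os.remove(tmp.name)
```

The memory figures in the table are approximate, so it is safer to leave some headroom than to choose a model that only just fits.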

 
