An ultra-low-power embedded convolutional neural network for acoustic monitoring of forest ecosystems
Files
- Delsart_41661600_2021.pdf (open access PDF, 18.29 MB)

Abstract
In recent years, artificial intelligence has brought significant improvements to the field of audio classification. However, complex audio recognition tasks often require powerful machine learning models that consume a substantial amount of energy to train and operate. One of the most important challenges for these promising techniques is their integration into embedded devices such as microcontrollers and Internet-of-Things edge devices. To this end, TinyML is a fast-emerging area of Deep Learning that performs inference at extremely low power, in the mW range and below, thus enabling the deployment of millions of environmentally friendly smart connected devices. One of its major applications is keyword spotting, which allows always-on speech-based user interaction. Moreover, acoustic monitoring of forest ecosystems has gained popularity because it helps combat deforestation, a significant driver of climate change. Audio monitoring in forests also helps fight animal poaching and benefits wildlife conservation by tracking bird and mammal populations. The focus of this Master thesis is to obtain an embedded ultra-low-power convolutional neural network (CNN) able to classify sounds from a forest environment into the animal, bird, chainsaw, and gunshot classes. The task is to accurately classify samples from a purpose-built real-world dataset on the Apollo3 Blue microcontroller while keeping power consumption below the mW range. The key question of this work is whether an ultra-low-power, high-accuracy convolutional neural network can classify forest sounds. To this end, the proposed dataset, partly based on professional recordings from Rainforest Connection and the Borror Laboratory of Bioacoustics, is resampled to 16 kHz and segmented into 1-second audio slices.
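The segmentation step described above (cutting 16 kHz recordings into 1-second slices) can be sketched as follows. This is an illustrative assumption of how such preprocessing might look, not the thesis code; resampling to 16 kHz is assumed to have been done beforehand with an audio tool.

```python
# Hypothetical sketch of the dataset preprocessing described above.
# Recordings are assumed already resampled to 16 kHz; here we only
# cut a mono waveform into non-overlapping 1-second slices.

SAMPLE_RATE = 16_000  # target sample rate used in the thesis

def segment_1s(waveform):
    """Split a mono waveform (sequence of samples at 16 kHz) into
    non-overlapping 1-second slices, dropping any trailing remainder."""
    return [
        waveform[i:i + SAMPLE_RATE]
        for i in range(0, len(waveform) - SAMPLE_RATE + 1, SAMPLE_RATE)
    ]

# Example: a 2.5-second recording yields two full 1-second slices.
slices = segment_1s([0.0] * (SAMPLE_RATE * 5 // 2))
```

Dropping the trailing remainder is one possible design choice; padding the last slice with silence would be an equally plausible alternative.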
An accuracy of 99.64% is obtained on this crafted dataset using the MobileNetV1 architecture, an optimized Mel-frequency cepstral coefficient feature set, and standard audio augmentation techniques. Given the high computational cost of this model, a shallow network based on depthwise separable convolutions and dense operations is designed. It reaches an accuracy of 93.52% while performing 273× fewer MACC operations than the MobileNetV1 baseline. A real-time implementation of the optimized model, reaching an accuracy of 91.37%, is then achieved with 8-bit fixed-point arithmetic on the Apollo3 Blue microcontroller using the ARM CMSIS-NN library. Thanks to 4-lane parallel SIMD instructions and cache prefetching on Flash accesses, the design achieves an inference time of 4.46 ms. By putting the microcontroller into deep sleep mode after each inference, the overall system consumes as little as 13.21 µW on average while performing one inference per second. It would therefore be possible to deploy thousands of these highly energy-efficient embedded CNNs in forests to detect deforestation and animal poaching and to contribute to wildlife conservation.
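The large MACC reduction reported above comes from replacing standard convolutions with depthwise separable ones. The per-layer cost formulas below are the standard ones for these two layer types; the layer shape used in the example is a made-up illustration, not the thesis architecture.

```python
# MACC counts for one convolutional layer over an H x W output map.

# Standard convolution: every output channel sees every input channel.
def macc_standard(k, c_in, c_out, h, w):
    return k * k * c_in * c_out * h * w

# Depthwise separable convolution: a per-channel k x k depthwise pass
# followed by a 1x1 pointwise pass that mixes channels.
def macc_dw_separable(k, c_in, c_out, h, w):
    return (k * k * c_in + c_in * c_out) * h * w

# Illustrative layer shape (NOT the thesis network): 3x3 kernel,
# 64 -> 64 channels on a 32x32 output map.
std = macc_standard(3, 64, 64, 32, 32)
dws = macc_dw_separable(3, 64, 64, 32, 32)
ratio = std / dws  # roughly 8x fewer MACCs for this single layer
```

The 273× figure in the thesis also reflects a much shallower network, not just the cheaper layer type, so the single-layer ratio here is deliberately smaller.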