Build the speech model
Now that we've created spectrogram images, it's time to build the computer vision model.
If you've been following the modules in this PyTorch learning path,
you should already have a good understanding of how to create a
computer vision model (in particular, see the "Introduction to
Computer Vision with PyTorch" Learn module). You'll use the torchvision
package to build your vision model. The model's convolutional layers
(conv2d) will extract the distinguishing features from the spectrogram
image of each speech command.
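To make the role of a convolutional layer concrete, here is a minimal sketch of a single conv2d layer turning a batch of images into feature maps. It is not the model built in this module: the random tensor stands in for real spectrogram images, and the batch size, image size, and channel counts are assumptions chosen only for illustration.

```python
import torch
from torch import nn

# Hypothetical stand-in for a batch of spectrogram images:
# 4 images, 3 color channels, 64x64 pixels (all sizes are assumed).
batch = torch.randn(4, 3, 64, 64)

# One convolutional layer: 3 input channels -> 16 learned feature maps.
# padding=1 with a 3x3 kernel keeps the height and width unchanged.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

# Max pooling halves the height and width of each feature map.
pool = nn.MaxPool2d(kernel_size=2)

# Convolution, then a ReLU non-linearity, then pooling.
features = pool(torch.relu(conv(batch)))
print(features.shape)  # torch.Size([4, 16, 32, 32])
```

Each of the 16 output channels is a feature map produced by one learned filter. Stacking several such layers is what lets a CNN pick up progressively more specific patterns in an image.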
Let's import the packages we need to build the model.