TechCabal Ewè Audio Translation Challenge

“Can you build a model to recognise Ewè commands from audio samples?” This is the question posed in this TechCabal challenge. TechCabal is “the most important tech publication documenting the human and business impact of technology, and innovation in Africa”[1]. TechCabal’s conference, Moonshot by TechCabal, gathers “audacious innovators, investors, operators, and stakeholders to network, collaborate, share insights, and celebrate the continent’s innovation”(ibid). This year’s edition, themed “Build for the World”, will run from October 9th to 10th, 2024, at the Eko Convention Centre, Lagos, Nigeria. Get the details about Moonshot here.
The challenge of recognizing Ewè commands from audio samples is a multi-class classification problem. There are eight classes, corresponding to eight basic directional commands in Ewè: “up,” “down,” “stop,” “go,” “left,” “right,” “yes,” and “no.” The objective of the challenge is “to create machine learning models that can accurately classify audio recordings of these Ewè commands”.
I tried out the challenge. The dataset contains 5,334 audio files in the train set and 2,946 in the test set. They are recordings of Ewè speakers captured over background noise from crossing roads, rain and thunder, rural and forest environments, and firefighter alarms. As you will see later, this noise is a challenge because it obscures the directional command signal in each audio file.
I loaded the train, test, and sample submission CSV files into Pandas DataFrames. Both the train and test files have an audio_filepath column, which holds the file names of the audio files and can be used to reference them.
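For context, the loading step might look like the sketch below; the exact CSV file names are an assumption based on typical Zindi competition data.

import pandas as pd

# File names are assumptions; adjust to the actual competition files
train_data = pd.read_csv('Train.csv')
test_data = pd.read_csv('Test.csv')
sample_submission = pd.read_csv('SampleSubmission.csv')

print(train_data['audio_filepath'].head())  # file names used to locate each audio recording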
The idea is to take the training audio files, split them into train and validation sets, extract audio features, and train a model (in this case, a CNN) on those features.
Since this is a multi-class classification problem with eight classes, you want a fairly equal representation of all the classes in both the train and validation datasets. To achieve this, I used StratifiedShuffleSplit from Scikit-Learn, which splits a dataset while maintaining the same proportion of each class label in each split. This is very useful for classification problems where preserving the class distribution in both the training and validation (or testing) sets is important.

from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.17, random_state=58)
for train_index, val_index in sss.split(train_data.drop('class', axis=1), train_data['class']):
    train_df, val_df = train_data.iloc[train_index], train_data.iloc[val_index]
I reset the indices of the shuffled train and validation DataFrames and encoded the target classes as integers, since they are stored as strings.
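The post does not say which encoder was used, so the LabelEncoder below, and the 'class' and 'labels' column names, are assumptions; this is only a sketch of the reset-and-encode step.

from sklearn.preprocessing import LabelEncoder

train_df = train_df.reset_index(drop=True)
val_df = val_df.reset_index(drop=True)

# LabelEncoder is an assumption; any consistent string-to-integer mapping works
encoder = LabelEncoder()
train_df['labels'] = encoder.fit_transform(train_df['class'])
val_df['labels'] = encoder.transform(val_df['class'])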
Remember that the directional commands in the files are spoken over background noise. The audio had to be cleaned so that we could separate the true signal from the noise.
I used a bandpass filter to remove irrelevant frequency bands and to “minimize disjointedness at the start and finish of each frame”[2], and then two feature extraction algorithms to extract features from the audio files, which were fed to a PyTorch Convolutional Neural Network (CNN) model for training and validation. The two feature extraction algorithms I used were Mel Frequency Cepstral Coefficients (MFCC) and the Mel Spectrogram, following a paper suggested by Vangelis Oden: “Some Commonly Used Speech Feature Extraction Algorithms” by Sabur Ajibola Alim and Nahrul Khair Alang Rashid. “Feature extraction is accomplished by changing the speech waveform to a form of parametric representation at a relatively minimized data rate for subsequent processing and analysis”[2]. The feature extraction yields a multidimensional feature vector for every speech signal in the audio file.
import numpy as np
import librosa
from scipy.signal import butter, lfilter

# Bandpass filter (removes irrelevant frequency bands)
def butter_bandpass(lowcut, highcut, fs, order=5):
    nyquist = 0.5 * fs
    low = lowcut / nyquist
    high = highcut / nyquist
    b, a = butter(order, [low, high], btype='band')
    return b, a

def bandpass_filter(data, lowcut, highcut, fs, order=5):
    b, a = butter_bandpass(lowcut, highcut, fs, order=order)
    return lfilter(b, a, data)

def extract_mfcc(file_path, n_mfcc=40, max_len=100):
    """Extract MFCC features from an audio file."""
    audio, sample_rate = librosa.load(file_path, res_type='kaiser_fast')
    mfccs = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=n_mfcc)
    # Pad or truncate along the time axis so every file yields the same shape
    if mfccs.shape[1] < max_len:
        mfccs = np.pad(mfccs, ((0, 0), (0, max_len - mfccs.shape[1])), 'constant')
    else:
        mfccs = mfccs[:, :max_len]
    return mfccs

# Extract features (time-domain, frequency-domain, and time-frequency)
def extract_features(file_path, n_mels=40, max_length=100):
    # Load the audio file
    audio, sr = librosa.load(file_path)
    # Apply bandpass filter (e.g., remove frequencies outside 20-4000 Hz)
    filtered_audio = bandpass_filter(audio, lowcut=20, highcut=4000, fs=sr)
    # Time-domain features (computed for inspection; not part of the final feature matrix)
    rms = librosa.feature.rms(y=filtered_audio)  # Root Mean Square energy
    zcr = librosa.feature.zero_crossing_rate(filtered_audio)  # Zero-crossing rate
    # Frequency-domain features (also not part of the final feature matrix)
    spectral_centroid = librosa.feature.spectral_centroid(y=filtered_audio, sr=sr)
    spectral_bandwidth = librosa.feature.spectral_bandwidth(y=filtered_audio, sr=sr)
    spectral_flux = np.diff(spectral_bandwidth, axis=1)
    # Time-frequency representation: Mel spectrogram, converted to decibels
    mel_spectrogram = librosa.feature.melspectrogram(y=filtered_audio, sr=sr, n_mels=n_mels)
    mel_spectrogram_db = librosa.power_to_db(mel_spectrogram, ref=np.max)
    # MFCCs
    mfcc = librosa.feature.mfcc(y=filtered_audio, sr=sr, n_mfcc=n_mels)
    # Ensure all features have the same length (pad/truncate along the time axis)
    def pad_or_truncate(feature, target_length=max_length):
        if feature.shape[1] < target_length:
            padding = target_length - feature.shape[1]
            feature = np.pad(feature, ((0, 0), (0, padding)), 'constant')
        else:
            feature = feature[:, :target_length]
        return feature
    mel_spectrogram_db = pad_or_truncate(mel_spectrogram_db)
    mfcc = pad_or_truncate(mfcc)
    # Average the two (n_mels x max_length) matrices element-wise
    features = (mel_spectrogram_db + mfcc) / 2
    return features
Because the extracted feature matrices can vary in length along the time axis, I passed all of them through a pad-or-truncate step so that every audio file produces a feature matrix of the same shape.
Then I built the Audio Dataset class for serving the processed datasets to the model, and the Speech Classifier class for training, validating the model, and making predictions on the test dataset.
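The post does not include these classes, so the sketch below is only an illustration of what they could look like: the AudioDataset and EweCNN names, the (40, 100) feature shape, and the layer sizes are assumptions, not the architecture actually used.

import torch
import torch.nn as nn
from torch.utils.data import Dataset

class AudioDataset(Dataset):
    """Wraps a DataFrame of audio file paths and integer labels."""
    def __init__(self, df, feature_fn=extract_features):
        self.df = df.reset_index(drop=True)
        self.feature_fn = feature_fn

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        features = self.feature_fn(row['audio_filepath'])  # (40, 100) feature matrix
        x = torch.tensor(features, dtype=torch.float32).unsqueeze(0)  # add channel dim
        y = torch.tensor(row['labels'], dtype=torch.long)
        return x, y

class EweCNN(nn.Module):
    """Small CNN over the (1, 40, 100) feature 'image'; sizes are assumptions."""
    def __init__(self, num_classes=8):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 10 * 25, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        return self.fc(self.conv(x))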
Over ten epochs, the training loss dropped sharply after the first epoch and stayed low:
Epoch [1/10], Loss: 1.6916
Epoch [2/10], Loss: 0.1074
Epoch [3/10], Loss: 0.0401
Epoch [4/10], Loss: 0.0241
Epoch [5/10], Loss: 0.0278
Epoch [6/10], Loss: 0.0235
Epoch [7/10], Loss: 0.0073
Epoch [8/10], Loss: 0.0089
Epoch [9/10], Loss: 0.0111
Epoch [10/10], Loss: 0.0132
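The training loop itself lives inside the Speech Classifier class, which the post does not show. A minimal sketch that would produce a log like the one above, reusing the classes sketched earlier and assuming an Adam optimizer, cross-entropy loss, and a batch size of 32, is:

from torch.utils.data import DataLoader

train_loader = DataLoader(AudioDataset(train_df), batch_size=32, shuffle=True)

model = EweCNN(num_classes=8)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(10):
    running_loss = 0.0
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print(f'Epoch [{epoch + 1}/10], Loss: {running_loss / len(train_loader):.4f}')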
The accuracy is 99.77% and the mean squared error is 0.03. Not bad for a CNN model.
I checked for misclassified samples, and the model got only 2 of the 907 rows in the validation dataset wrong.

from sklearn.metrics import accuracy_score, mean_squared_error

new_df = pd.DataFrame(data={'predictions': predictions, 'labels': val_df['labels']})
new_df['accuracy'] = (new_df['predictions'] == new_df['labels']).astype(int)
print(new_df['accuracy'].sum() / len(new_df))  # fraction of correct predictions

print(f'Accuracy Score: {accuracy_score(new_df.labels, new_df.predictions)}')
print(f'Mean Squared Error: {mean_squared_error(new_df.labels, new_df.predictions)}')
Precision: 0.9977678571428572
Recall: 0.9977477477477478
F1-score: 0.9977477020473591
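The post does not show how these three scores were computed; a likely sketch using scikit-learn, with average='weighted' as an assumption, is:

from sklearn.metrics import precision_score, recall_score, f1_score

# average='weighted' is an assumption; the post does not state which averaging was used
print(f"Precision: {precision_score(new_df.labels, new_df.predictions, average='weighted')}")
print(f"Recall: {recall_score(new_df.labels, new_df.predictions, average='weighted')}")
print(f"F1-score: {f1_score(new_df.labels, new_df.predictions, average='weighted')}")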

The overall metrics are good. So I trained a second model with a slight change to the CNN architecture, combined the two models' predictions, and submitted the result to the Zindi platform for the competition. I finished in the top ten on the leaderboard.
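The post does not say how the two sets of predictions were combined. One common approach, sketched below, is to average the two models' softmax probabilities and take the argmax; model_a, model_b, test_loader, and the submission column names are assumptions.

import torch.nn.functional as F

def predict_probs(model, loader):
    """Softmax probabilities for every batch of feature tensors in the loader.
    Assumes the test loader yields feature tensors only (no labels)."""
    model.eval()
    probs = []
    with torch.no_grad():
        for x in loader:
            probs.append(F.softmax(model(x), dim=1))
    return torch.cat(probs)

# Average the two models' probabilities and take the most likely class
avg_probs = (predict_probs(model_a, test_loader) + predict_probs(model_b, test_loader)) / 2
test_preds = avg_probs.argmax(dim=1).numpy()

# Map integer predictions back to Ewè class names and write the submission file
submission = pd.DataFrame({
    'audio_filepath': test_data['audio_filepath'],
    'class': encoder.inverse_transform(test_preds),
})
submission.to_csv('submission.csv', index=False)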

It was a very good and instructive challenge. To recap, to classify the audio files into different classes, we:
- Used a bandpass filter to remove irrelevant frequency bands.
- Processed the audio files and extracted features using Mel-Frequency Cepstral Coefficients and Mel-Spectrograms.
- Built a CNN model trained with the extracted features.
- Used the model to classify the audio files into eight different classes, as was requested by the challenge.
The code can be found on GitHub.
References:
[1] TechCabal Ewè Audio Translation Challenge, Zindi. https://zindi.africa/competitions/techcabal-ewe-audio-translation-challenge
[2] Sabur Ajibola Alim and Nahrul Khair Alang Rashid, “Some Commonly Used Speech Feature Extraction Algorithms.” http://dx.doi.org/10.5772/intechopen.80419