Cough analyzer

project.Telegram_Chatbot.modulos.analyze_cough.analyze_cough(ogg_path, data)
  • ogg_path – Absolute path where the original .ogg audio file has been downloaded from the Telegram API.

  • data – Metadata extracted from the chatbot user. It is converted to the proper format and used only if the audio contains cough.

Returns bool

True for covid positive, False for covid negative, and None for unrecognized audio.

The function is in charge of calling all the defined functions to keep the script clean and the workflow clear. First, the audio is converted to .wav. Then, the long-term features (based on the mid-term and, therefore, the short-term ones) are extracted from the converted audio. At this point, the audio is analyzed. Only if the audio contains a cough (according to our cough recognition model) is the metadata of the chatbot user extracted and converted to the proper format. Additionally, we predict whether the audio has been recorded by a user who has COVID-19 or not, according to our covid recognition model.

Note that if the first model (cough recognition) does not predict that the audio contains cough, the function returns None immediately (without computing the metadata or the covid prediction).
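The control flow described above can be sketched as follows. The helper callables are injected as parameters here purely so the sketch stays self-contained; the real module calls its own functions directly, and these parameter names are placeholders:

```python
def analyze_cough_flow(ogg_path, data, to_wav, extract_features,
                       is_cough, build_metadata, predict_covid):
    # 1) convert the downloaded .ogg to .wav
    wav_path = to_wav(ogg_path)
    # 2) extract the long-term features from the converted audio
    features = extract_features(wav_path)
    # 3) gatekeeper: no cough detected -> return None without further work
    if not is_cough(features):
        return None
    # 4) only now convert the chatbot metadata and run the covid model
    metadata = build_metadata(data)
    return predict_covid(features, metadata)
```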

project.Telegram_Chatbot.modulos.analyze_cough.butter_lowpass(cutoff, fs, order=5)
  • cutoff – The cutoff frequency. Defines the boundary from which the frequencies will be attenuated.

  • fs – Sampling rate of the signal.

  • order – Order of the filter.

Returns b

Numerator polynomials of the IIR filter (filter coefficients).

Returns a

Denominator polynomials of the IIR filter (filter coefficients).

The function implements a low-pass Butterworth digital filter of order 5. The purpose of the filter is to attenuate the frequencies whose values are higher than the defined cutoff. Order 5 has been chosen because it provides a good compromise between stability and the sharpness of the transition between preserved and attenuated frequencies. Additionally, the cutoff frequency is expressed as a fraction of the Nyquist frequency, which is half the sampling rate of the signal. The filter is applied because iOS devices record audios with greater spectral content (more high-frequency information) than those recorded by Android devices.
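A minimal sketch of such a filter design with SciPy, assuming `scipy.signal.butter` is the underlying routine (which the description of b and a suggests):

```python
from scipy.signal import butter

def butter_lowpass(cutoff, fs, order=5):
    nyquist = 0.5 * fs                # Nyquist frequency: half the sampling rate
    normal_cutoff = cutoff / nyquist  # cutoff expressed as a fraction of Nyquist
    # b: numerator, a: denominator polynomial coefficients of the IIR filter
    b, a = butter(order, normal_cutoff, btype="low", analog=False)
    return b, a
```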

project.Telegram_Chatbot.modulos.analyze_cough.butter_lowpass_filter(data, cutoff, fs, order=5)
  • data – Original signal (numpy array).

  • cutoff – Cutoff frequency.

  • fs – Sampling rate of the signal.

  • order – Order of the Butterworth filter.

Returns y

Filtered signal.

The function extracts the order-5 low-pass filter coefficients and filters the original signal. In this way, the frequencies higher than the cutoff frequency are masked (attenuated).
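A sketch of the filtering step with SciPy; `lfilter` is one plausible choice here, though the module might equally use the zero-phase `filtfilt`:

```python
import numpy as np
from scipy.signal import butter, lfilter

def butter_lowpass_filter(data, cutoff, fs, order=5):
    # design the order-5 Butterworth low-pass filter ...
    b, a = butter(order, cutoff / (0.5 * fs), btype="low")
    # ... and apply it to the original signal
    return lfilter(b, a, data)
```

For example, a 3 kHz tone sampled at 8 kHz is almost completely suppressed by a 500 Hz cutoff.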


  • filepath – Absolute path where the original .ogg file is located.

Returns p

Duration of the audio in seconds (a float, with decimals).

The function quickly verifies the duration of the audio. If the duration is shorter than 1 second or longer than 7 seconds, a new cough recording is requested from the user.
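The acceptance check itself reduces to a simple range test. The 1 s and 7 s bounds come from the description above; the duration value itself could be read with, e.g., `librosa.get_duration`:

```python
def duration_ok(duration_s, min_s=1.0, max_s=7.0):
    # Reject recordings shorter than 1 s or longer than 7 s;
    # the chatbot then asks the user for a new cough recording.
    return min_s <= duration_s <= max_s
```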


  • data – List of data from the user extracted by the chatbot.

Returns metadata_bf

Pandas DataFrame containing the converted metadata.

The function converts the data extracted by the chatbot into input metadata for the covid recognition model. Basically, all the data is boolean except the age. The metadata is first built as key:value pairs (a dictionary) and then transformed into a pandas DataFrame.
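A sketch of the dictionary-to-DataFrame step. The field names below are hypothetical, since the real chatbot schema is not shown here; only the shape of the transformation is illustrated:

```python
import pandas as pd

def build_metadata(data):
    # Hypothetical field layout: age first, then boolean symptom flags.
    record = {
        "age": int(data[0]),
        "fever": bool(data[1]),
        "dry_cough": bool(data[2]),
        "breathing_difficulty": bool(data[3]),
    }
    # one-row DataFrame: the tabular input expected by the covid model
    return pd.DataFrame([record])
```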


  • original_path – The absolute path where the .ogg audio has been downloaded from the Telegram API.

Returns converted_path

The absolute path where the converted .wav audio has been stored.

Conversion of the original audio from .ogg format to .wav. The .ogg format is the default used by Telegram because of its small size. However, in order to analyze the audios properly, a transformation to a higher-quality format such as .wav is needed.
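A sketch of the conversion, assuming pydub (which wraps ffmpeg) as the converter; the pydub import is kept inside the function so the path helper can be used on its own:

```python
from pathlib import PurePosixPath

def wav_path_for(original_path):
    # Same location and name as the .ogg, with the extension swapped.
    return str(PurePosixPath(original_path).with_suffix(".wav"))

def ogg_to_wav(original_path):
    from pydub import AudioSegment  # requires ffmpeg on the system
    converted_path = wav_path_for(original_path)
    AudioSegment.from_ogg(original_path).export(converted_path, format="wav")
    return converted_path
```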

project.Telegram_Chatbot.modulos.analyze_cough.cough_prediction(X_new, opt_thresh=0.5)
  • X_new – Pandas DataFrame which contains a row vector of features from the raw audio.

  • opt_thresh – The decision threshold (the optimal value found for this model is 0.6).

Returns bool

Boolean output depending on whether the audio is classified as cough or no-cough, respectively.

This function loads the cough recognition model and predicts whether the audio is cough based on the long-term averaging of its mid-term (and therefore short-term) features.

It uses a classification model trained with 500 audios divided into two halves: the first half contains cough audios; the second half contains audios of sneezing, throat clearing, and other human sounds. As the dataset has been split into training/testing, we have been able to compute the optimal threshold (0.6) of the model from the ROC curves. The optimal threshold divides the output probability in a way that provides the best compromise between specificity and sensitivity. The model is stored as a pickle (binary data).

Finally, if the probability output by the model is equal to or larger than the optimal threshold, the input audio is classified as a cough. Otherwise, it is classified as a no-cough.

project.Telegram_Chatbot.modulos.analyze_cough.covid_prediction(X_new, metadata, optimal_threshold=0.8)
  • X_new – pandas DataFrame which contains a row vector of long-term features from the raw audio.

  • metadata – Pandas DataFrame which contains the metadata (symptomatology) of the user who has coughed.

  • optimal_threshold – The optimal threshold defined as 0.8 (note that the model is not calibrated).

Returns bool

Boolean output depending on whether the audio is classified as covid cough or no-covid cough, respectively.

The function tries to predict whether a cough audio has been recorded by a user who has COVID-19 or not. The covid recognition model has been trained with the Coswara dataset. Although several approaches have been tried (transfer learning of CNNs based on spectrogram analysis, convolutional autoencoders, etc.), the one that has provided the best results is the model implemented here (an Extremely Randomized Trees classifier).

It uses not only the set of long-term features extracted from an audio in tabular form, but also the metadata of each patient as an input (because the Coswara dataset contains this meta-information). The model is stored in binary as a pickle, too.
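The way the two inputs are combined can be sketched as a column-wise concatenation before calling the classifier; the exact column order expected by the trained model is an assumption here:

```python
import pandas as pd

def combine_inputs(X_new, metadata):
    # Audio features and user metadata side by side in a single row vector,
    # matching the tabular layout the model was trained on.
    return pd.concat(
        [X_new.reset_index(drop=True), metadata.reset_index(drop=True)],
        axis=1,
    )
```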

project.Telegram_Chatbot.modulos.analyze_cough.extract_features_audio(filename, low_pass_filt=False, cutoff_freq=4096, amplification=False, compute_FFT=False)
  • filename – Absolute path where the audio is stored.

  • low_pass_filt – If True, the signal is filtered.

  • cutoff_freq – The cutoff frequency. By default, it is set to 4096 Hz, as we have seen that Android audios reach their maximum peak at this frequency.

  • amplification – If True, the signal is amplified.

  • compute_FFT – If True, then the 1D DFT and the Mel-Spectrogram are computed.

Returns signal

The signal, filtered and amplified (if applicable).

Returns sampling_rate

The sampling rate of the signal.

Returns xf

Frequencies [Hz] of the 1D DFT. Note that the maximum frequency is the Nyquist one.

Returns yf

Magnitude (gain) of the signal in each frequency bin of the 1D DFT.

Returns S_dB

Mel-Spectrogram in decibels.

The function loads the signal and the sampling rate of the audio using the Librosa library. Note that the audio is loaded as a mono signal (not stereo). Then it verifies that the sampling rate is correct. Furthermore, it filters the signal using the low-pass Butterworth filter and amplifies it, too. Lastly, if necessary, it computes the 1D discrete Fourier Transform in order to work with the signal in the frequency domain. As the spectrum of a real signal is symmetric, only the positive half is extracted. The Mel-spectrogram of the signal is also extracted and then transformed to decibels. As we are working with human sounds, the Mel-spectrogram is preferable because it scales the original spectrum according to human hearing.
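The one-sided DFT part can be sketched with NumPy; the module reportedly uses Librosa for loading, so a synthetic tone stands in for a real recording here:

```python
import numpy as np

def spectrum(signal, sampling_rate):
    # Real signal -> symmetric spectrum, so keep only the positive half.
    yf = np.abs(np.fft.rfft(signal))                          # gain per bin
    xf = np.fft.rfftfreq(len(signal), d=1.0 / sampling_rate)  # bin freqs [Hz]
    return xf, yf
```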


  • data – Original signal.

Returns data*factor

Amplified signal.

The function normalizes the amplitude of the signal by increasing its gain. Basically, it multiplies the signal by a gain factor so that the signal amplitude ranges from -1 to +1. This is useful because the audios recorded by iOS devices tend to have greater gains than the ones recorded by Android devices.
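A minimal sketch of the peak normalization described above:

```python
import numpy as np

def amplify(data):
    # Gain factor that maps the largest absolute sample to 1,
    # so the amplified signal spans the full [-1, +1] range.
    factor = 1.0 / np.max(np.abs(data))
    return data * factor
```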


  • wav_file_path – The absolute path where the converted-to-wav audio is located.

Returns final_df

Pandas DataFrame which contains almost 150 features extracted from the raw audio in a tabular way.

This function is the core of the script. It provides a way to go from low-level audio samples to a higher-level representation of the audio content. We are interested in extracting higher-level audio features that are capable of discriminating between different audio classes (cough/no-cough, covid/no-covid). Basically, the function extracts the mid-term features (based on the short-term ones) from the .wav audio and computes the long-term averaging of the mid-term statistics.

The most important concept in audio feature extraction is short-term windowing (or framing): the audio signal is split into short-term windows (or frames). In this case, we have defined both the short-term window and the short-term step as 10 ms. Consequently, there is no overlap between windows (or frames). For each frame (whose length is defined by the short-term window parameter), we extract a set of short-term audio features. These features are extracted directly from the audio sample values (time domain) as well as from the FFT values (frequency domain).

Then, we extract two statistics, namely the mean and the standard deviation of each short-term feature sequence, using the provided mid-term window size. These statistics represent the mid-term features of the audio. Finally, we perform a long-term averaging in order to obtain a single mean feature vector per audio.
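The framing and the two-level statistics can be sketched as follows; a single toy short-term feature (frame energy) stands in for the full feature set the module computes:

```python
import numpy as np

def long_term_stats(signal, fs, win_s=0.010, step_s=0.010):
    # 10 ms window with a 10 ms step -> non-overlapping frames
    win, step = int(win_s * fs), int(step_s * fs)
    frames = [signal[i:i + win] for i in range(0, len(signal) - win + 1, step)]
    # short-term feature per frame (here: energy)
    energy = np.array([np.mean(f ** 2) for f in frames])
    # mid-term statistics (mean, std) followed by long-term averaging
    return np.array([energy.mean(), energy.std()])
```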

Additionally, we use the openSMILE library to extract cepstral features for each audio. Then, we concatenate both vectors. In this way, for each input audio, we extract a set of almost 150 features contained in a row vector. Hopefully, these features have enough discriminative information to classify the audios correctly.