Home MFCC
Post
Cancel

MFCC

最近了解下梅尔倒谱系数在此记录下。

梅尔倒频谱(Mel-Frequency Cepstrum, MFC):一个可用来代表短期音频的频谱,其原理基于用非线性的梅尔刻度(mel scale)表示的对数频谱及其线性余弦转换(linear cosine transform)上。

梅尔频率倒谱系数(Mel-Frequency Cepstral Coefficients, MFCC): 是一组用来创建梅尔倒频谱的关键系数。由音乐信号当中的片段,可以得到一组足以代表此音乐信号之倒频谱(Cepstrum),而梅尔倒频谱系数即是从这个倒频谱中推得的倒频谱(也就是频谱的频谱)。与一般的倒频谱不同 ,梅尔倒频谱最大的特色在于,于梅尔倒频谱上的频带是均匀分布于梅尔刻度上的,也就是说,这样的频带相较于一般所看到、线性的倒频谱表示方法,和人类非线性的听觉系统更为接近。

应用: 是一种广泛用于自动语音和说话人识别的特征,在音频压缩的技术中,便常常使用梅尔倒频谱来处理。例如:可以自动辨认一个人透过电话说的数字。梅尔倒频谱系数通常也可以作为声纹识别(Speaker Recognition),也就是用来识别某段语音频号的发话者是谁。

梅尔倒频谱 Mel-frequency-cepstrum

步骤

梅尔倒谱实现步骤

梅尔倒谱系数通常是用一下方法得到的:

  1. 对音频信号进行预加重,一般都是说为了补偿高频分量的损失
  2. 分帧加窗, 消除各个帧两端可能会造成的信号不连续性
  3. 做FFT,计算功率谱和对应的能量
  4. 计算梅尔滤波器组
  5. 将滤波器组对功率谱进行滤波
  6. 将滤波结果取log
  7. 做dct变换
  8. 保留dct2-13的系数
  9. 做lifting
  10. 附加功率谱的能量

预加重

对语音信息进行预加重通常使用如下方式:
\(\begin{aligned} y(i) = x(i) - \alpha * x(i-1) \end{aligned}\)

  • $\alpha$ 通常为0.97
1
2
3
4
5
6
7
def pre_add_weight(signal, alpha=0.97):
    """
    预加重
    https://zhuanlan.zhihu.com/p/34798429
    """
    signal = np.insert(signal, 0, 0)
    return signal[1:] - alpha * signal[:-1]

分帧加窗

  • 分帧: 对语音信号进行25ms(帧长)分一帧,同时每次滚动10ms(步长),即相邻两帧重叠一部分,对末尾不足正常帧长的补0。
  • 加窗:通常是为了消除各个帧两端可能会造成的信号不连续性。常用的窗函数是汉明窗(hamming) \(\begin{aligned} w(n) = \alpha - (1 - \alpha) * \cos( \dfrac{2 * \pi * n}{N-1}), \qquad(0 \leq n \leq N - 1) \end{aligned}\)
    • 其中$\alpha=0.5386$称做hamming窗,$\alpha=0.5$叫作hann窗,N为窗长度,通常为了简化计算令$\alpha=0.54$
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
def hamming(N):
    """
    生成汉明窗(hamming windows)
    https://zh.wikipedia.org/wiki/%E7%AA%97%E5%87%BD%E6%95%B0#Hamming%E7%AA%97
    http://t.zoukankan.com/dylancao-p-14212421.html
    """
    return 0.54 - 0.46 * np.cos(np.arange(0, N) * np.pi * 2 / (N - 1))

def split_frame(signal, win_len, win_step, win_func=hamming):
    """
    分帧加窗
    """
    hamming_win = win_func(win_len)
    sig_len = len(signal)
    pad_len = 0
    num_frames = 0
    if sig_len < win_len:
        pad_len = win_len - sig_len
        num_frames = 1
    else:
        pad_n = math.ceil((sig_len - win_len) / win_step)
        pad_len = pad_n * win_step + win_len - sig_len
        num_frames = pad_n + 1
    pad_signal = np.zeros(sig_len + pad_len)
    pad_signal[:sig_len] = signal
   
    frames = []
    for i in range(num_frames):
        r = pad_signal[i*win_step:i*win_step+win_len] * hamming_win     
        frames.append(pad_signal[i*win_step:i*win_step+win_len] * hamming_win)
    return np.array(frames)

计算功率谱

计算功率谱,需要先做FFT,然后将结果按照如下公式计算:

\[\begin{aligned} P(k) = \dfrac{1}{N}|S(k)|^2 \end{aligned}\]
1
2
3
4
5
6
7
8
def power_spectral(frames, n_fft=512):
    """
    计算功率图谱
    """
    n_fft = n_fft or frames.shape[1]
    pow_spec = np.fft.rfft(frames, n_fft)  # n_fft / 2 + 1
    pow_spec = np.abs(pow_spec)
    return 1.0 / n_fft * np.square(pow_spec)

生成梅尔滤波器组

将线性频率映射到基于听觉感知的Mel非线性频率中,然后转换到倒谱上。普通频率和Mel频率的转换如下: \(\begin{aligned} Mel(f) &= 2595 * \log_{10}(1 + f/700) \\ Mel^{-1}(m) &= 700 * (10^{m/2595} - 1) \end{aligned}\) 通常步骤如下:

  1. 将最低频率和最高频率转到Mel频率(low_mel, high_mel)
  2. 根据Mel滤波器组的个数(n_filt)生成n_filt + 2 个等距分布low_mel和high_mel之间的点(mel_points)
  3. 将生成的mel_points根据FFT size和采样率(sample_rate)转到频率下得到fft_bins,转换公式如下: \(\begin{aligned} bin = floor((n\_fft + 1) * mel2hz(mel\_points) / sample\_rate) \end{aligned}\)
  4. 根据如下公式生成梅尔滤波器组 \(\begin{aligned} H_j(i) &= 0 \qquad(i < bin(j)) \\ &= \dfrac{i - bin(j)}{bin(j+1) - bin(j)} \qquad(bin(j) \leq i < bin(j+1)) \\ &= \dfrac{bin(j+2) - i}{bin(j+2) - bin(j+1)} \qquad(bin(j+1) \leq i < bin(j+2)) \\ &= 0 \qquad( i \geq bin(j+2)) \end{aligned}\)
    • 其中j下标代表滤波器个数,i下标代表滤波器长度(n_fft // 2 + 1)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
def get_mel_filter_banks(low_freq, high_freq, sample_rate, n_filt=26, n_fft=512):
    low_mel = hz2mel(low_freq)
    high_mel = hz2mel(high_freq)
    mel_points = np.linspace(low_mel, high_mel, n_filt + 2)
    # f(i) = floor((nfft+1)*h(i)/samplerate)
    bin = np.floor((n_fft+1)*mel2hz(mel_points) / sample_rate)
    fbank = np.zeros([n_filt, n_fft//2+1])
    for j in range(0, n_filt):
        for i in range(int(bin[j]), int(bin[j+1])):
            fbank[j, i] = (i - bin[j]) / (bin[j+1]-bin[j])
        for i in range(int(bin[j+1]), int(bin[j+2])):
            fbank[j, i] = (bin[j+2]-i) / (bin[j+2]-bin[j+1])
    return fbank

离散余弦变换(DCT)

一般都是使用DCT-II,这里参考matlab的实现公式。下面有个简单的实现帮助理解,时间复杂度O(nn)不够快,因此实际使用的是Scipy.fftpack.dct 复杂度O(nlong),这个之后再去学习下。表示如下: \(\begin{aligned} y(k) = \sqrt{\dfrac{2}{N}}\sum_{n=0}^{N-1}x_n* \dfrac{1}{\sqrt{1 + \theta_{k0}}}\cos(\dfrac{\pi}{N} * (n + \dfrac{1}{2}) * k) \qquad(k=0,...N-1) \end{aligned}\)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
def matlab_dct2(x):
    """
    https://ww2.mathworks.cn/help/signal/ref/dct.html#responsive_offcanvas
    https://zh.wikipedia.org/wiki/%E7%A6%BB%E6%95%A3%E4%BD%99%E5%BC%A6%E5%8F%98%E6%8D%A2
    系数为 sqrt(2/n) 同时增加克罗内克函数(Kronecker) delta(k,0)
    see also scipy.fftpack.dct(x, type=2, norm='ortho')
    """
    N = len(x)
    y = np.zeros(N)
    for k in range(N):
        t0 = x * np.cos(np.pi * (np.arange(N) + 1/2) * k / N)
        # for Kronecker
        kronecker_delta = k == 0
        t0 = t0 * 1 / np.sqrt(kronecker_delta + 1)
        y[k] = np.sqrt(2 / N) * np.sum(t0)
    return y

lifter对倒谱进行正弦提升

1
2
3
4
5
def lifter(cepstra, L=22):
    nframes, ncoeff = np.shape(cepstra)
    n = np.arange(ncoeff)
    lift = 1 + (L/2.) * np.sin(np.pi*n/L)
    return lift*cepstra

实现

通过以上步骤可以得到实现

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
def mfcc(signal, sample_rate, win_durtation=0.025, win_step=0.010, num_cep=13, n_filter=26, win_func=hamming, pre_weight=0.97):
    
    win_len = round(sample_rate * win_durtation)
    win_step = round(sample_rate * win_step)

    signal = pre_add_weight(signal, pre_weight)
    frames = split_frame(signal, win_len, win_step, win_func=hamming)
    
    n_fft = win_len
    pow_spec = power_spectral(frames, n_fft=win_len)
    energy = np.sum(pow_spec, 1)
    energy = np.where(energy == 0,np.finfo(float).eps,energy) # if energy is zero, we get problems with log

    mel_filter_banks = get_mel_filter_banks(0, sample_rate // 2, sample_rate, n_filt=n_filter, n_fft=n_fft)
    
    feat = np.dot(pow_spec, mel_filter_banks.T)
    feat = np.where(feat == 0,np.finfo(float).eps,feat) # if feat is zero, we get problems with log
    feat = np.log(feat)

    feat = scipy.fftpack.dct(feat, type=2, norm='ortho')
    feat = feat[:, :num_cep]
    feat = lifter(feat)
    feat[:,0] = np.log(energy) # replace first cepstral coefficient with log of frame energy

    return feat

if __name__ == "__main__":
    # test
    import python_speech_features
    import python_speech_features.sigproc
    import scipy.signal
    import matplotlib.pyplot as plt

    samplerate = 1000
    sig = np.linspace(0, 1, samplerate)
    sig = np.sin(sig)

    frames = split_frame(sig, 20, 10)
    
    feat1 = python_speech_features.mfcc(sig,samplerate=samplerate, nfft=25, winfunc=hamming, nfilt=26)
    feat2 = mfcc(sig, samplerate, n_filter=26)

    # print(feat1[:5])
    # print(feat2[:5])
    print(np.allclose(feat1, feat2))

This post is licensed under CC BY 4.0 by the author.