ARPA format

How to generate an arpa file

Given the following text,

a b c d e
d e f a
a b c d e f a

we can convert it to an n-gram LM saved in ARPA format using the script shared/make_kn_lm.py:

cat > test.txt <<EOF
a b c d e
d e f a
a b c d e f a
EOF

wget https://raw.githubusercontent.com/k2-fsa/icefall/master/icefall/shared/make_kn_lm.py

python3 ./make_kn_lm.py -ngram-order 3 -text ./test.txt -lm test.arpa

The content of test.arpa is given below:

\data\
ngram 1=8
ngram 2=10
ngram 3=9

\1-grams:
-99.0000000	<s>	-0.8573325
-0.6989700	a	-0.7481880
-1.0000000	b	-0.8573325
-1.0000000	c	-0.8061800
-0.6989700	d	-1.1583625
-1.0000000	e	-0.7481880
-0.6989700	</s>
-1.0000000	f	-0.8061800

\2-grams:
-0.2041200	<s> a	-0.9542425
-0.5351132	<s> d	0.3010300
-0.3590219	a b	-0.3010300
-0.3590219	a </s>
-0.0579919	b c	-0.3010300
-0.0579919	c d	0.0000000
-0.0280287	d e	-0.1760913
-0.3590219	e </s>
-0.3590219	e f	-0.3010300
-0.0579919	f a	-0.9542425

\3-grams:
-0.0280287	<s> a b
-0.0280287	a b c
-0.0280287	b c d
-0.0280287	c d e
-0.5351132	d e </s>
-0.2041200	d e f
-0.0579919	<s> d e
-0.0280287	e f a
-0.0280287	f a </s>

\end\

How to interpret an arpa file

\data\
ngram 1=8
ngram 2=10
ngram 3=9

An arpa file begins with the literal string \data\ on its first line. The lines that follow give the number of entries for each n-gram order:

ngram 1=8: There are 8 unigram entries.

ngram 2=10: There are 10 bigram entries.

ngram 3=9: There are 9 trigram entries. The highest order in this file is 3.
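The header can be read mechanically. The following is a minimal sketch, not part of make_kn_lm.py; the helper name read_counts is hypothetical and the header text is copied from the excerpt above.

```python
import re

# Header copied from test.arpa (backslashes escaped for Python).
HEADER = "\\data\\\nngram 1=8\nngram 2=10\nngram 3=9\n"


def read_counts(text):
    """Return {order: number_of_entries} from the 'ngram N=M' lines."""
    counts = {}
    for line in text.splitlines():
        m = re.fullmatch(r"ngram (\d+)=(\d+)", line.strip())
        if m:
            counts[int(m.group(1))] = int(m.group(2))
    return counts


counts = read_counts(HEADER)
print(counts)       # {1: 8, 2: 10, 3: 9}
print(max(counts))  # 3, the highest order of the file
```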

\1-grams:
-99.0000000   <s>     -0.8573325
-0.6989700    a       -0.7481880
-1.0000000    b       -0.8573325
-1.0000000    c       -0.8061800
-0.6989700    d       -1.1583625
-1.0000000    e       -0.7481880
-0.6989700    </s>
-1.0000000    f       -0.8061800

\1-grams: marks the start of the unigram entries. Each unigram entry has 2 or 3 columns. The \2-grams: and \3-grams: sections use the same layout, with 2 or 3 words in place of the single word.

Column 0: probability in \(\log_{10}\), i.e., \(\log_{10}(p)\)

Column 1: the word

Column 2: back-off probability in \(\log_{10}\). If this column is absent, it defaults to \(\log_{10}(1) = 0\).
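The column layout above can be illustrated with a small parser. This is only a sketch: the function name parse_entry is hypothetical, and the entries are the unigram lines of test.arpa with the columns separated by single spaces. For this particular file the unigram probabilities add up to 1 within rounding, which makes a convenient sanity check.

```python
UNIGRAMS = """\
-99.0000000 <s> -0.8573325
-0.6989700 a -0.7481880
-1.0000000 b -0.8573325
-1.0000000 c -0.8061800
-0.6989700 d -1.1583625
-1.0000000 e -0.7481880
-0.6989700 </s>
-1.0000000 f -0.8061800
"""


def parse_entry(line, order):
    """Split one entry into (words, log10_prob, log10_backoff)."""
    fields = line.split()
    log_prob = float(fields[0])              # column 0
    words = tuple(fields[1:1 + order])       # the n words of the entry
    # A missing last column means a back-off of log10(1) = 0.
    backoff = float(fields[1 + order]) if len(fields) > 1 + order else 0.0
    return words, log_prob, backoff


entries = [parse_entry(line, order=1) for line in UNIGRAMS.splitlines()]
for words, log_prob, backoff in entries:
    print(words[0], log_prob, backoff)

# Convert the log10 values back to probabilities and add them up.
total = sum(10 ** log_prob for _, log_prob, _ in entries)
print(f"{total:.6f}")
```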

Caution

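All values in an arpa file are base-10 logarithms. To convert to and from the natural logarithm \(\log\), use:

\[\log(p) = \frac{\log_{10}(p)}{\log_{10}\mathrm{e}} = \log_{10}(p) \log(10) = \log_{10}(p) \times 2.302585092994046\]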

\[\log_{10}(p) = \frac{\log(p)}{\log(10)} = \frac{\log(p)}{2.302585092994046} = \log(p) \times 0.4342944819032518\]
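As a quick check of the two conversions, here is a minimal sketch using only the standard library. The helper names log10_to_ln and ln_to_log10 are hypothetical; the sample value is \(\log_{10} p(a)\) from the unigram section.

```python
import math

LN10 = math.log(10)  # 2.302585092994046


def log10_to_ln(x):
    """Convert a base-10 log probability to a natural-log probability."""
    return x * LN10


def ln_to_log10(x):
    """Convert a natural-log probability to a base-10 log probability."""
    return x / LN10


log10_p = -0.6989700            # log10 p(a) from test.arpa
ln_p = log10_to_ln(log10_p)     # about -1.6094, i.e. ln(0.2)
print(ln_p)
print(math.exp(ln_p), 10 ** log10_p)  # both are about 0.2
```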

How to score a sentence using an arpa file

Example 1: p(a | <s>)

\[\log_{10} \mathrm{p}(\mathrm{a}|\lt\!\mathrm{s}\!\gt) = -0.2041200\]

We can read the value of \(\log_{10} \mathrm{p}(\mathrm{a}|\lt\!\mathrm{s}\!\gt)\) directly from the entry <s> a in the \2-grams: section of the arpa file.

Example 2: p(a b | <s>)

\[\begin{split}\log_{10} \mathrm{p}(\mathrm{a} \mathrm{b}|\lt\!\mathrm{s}\!\gt) &= \log_{10} \mathrm{p}(\mathrm{a}|\lt\!\mathrm{s}\!\gt) + \log_{10} \mathrm{p}(\mathrm{b}|\lt\!\mathrm{s}\!\gt \mathrm{a})\\ &= (-0.2041200) + (-0.0280287) \\ &= -0.23214869\end{split}\]

Example 3: p(a b </s> | <s>)

\[\begin{split}\log_{10} \mathrm{p}(\mathrm{a}\; \mathrm{b} \lt\!/\mathrm{s}\!\gt| \lt\!\mathrm{s}\!\gt) &= \log_{10} \mathrm{p}(\mathrm{a}|\lt\!\mathrm{s}\!\gt) + \log_{10} \mathrm{p}(\mathrm{b}|\lt\!\mathrm{s}\!\gt \mathrm{a}) + \log_{10}\mathrm{p}(\lt\!/\mathrm{s}\!\gt|\mathrm{a}\;\mathrm{b})\\ &= (-0.2041200) + (-0.0280287) + \log_{10}\mathrm{p}(\lt\!/\mathrm{s}\!\gt|\mathrm{a}\;\mathrm{b})\\ &= -0.23214869 + \log_{10}\mathrm{p}(\lt\!/\mathrm{s}\!\gt|\mathrm{a}\;\mathrm{b})\\ &= -0.23214869 + \log_{10}p_{\mathrm{backoff}}(\mathrm{a}\;\mathrm{b}) + \log_{10}\mathrm{p}(\lt\!/\mathrm{s}\!\gt|\mathrm{b})\\ &= -0.23214869 + (-0.3010300) + \log_{10}\mathrm{p}(\lt\!/\mathrm{s}\!\gt|\mathrm{b})\\ &= -0.53317869 + \log_{10}p_{\mathrm{backoff}}(\mathrm{b}) + \log_{10}\mathrm{p}(\lt\!/\mathrm{s}\!\gt)\\ &= -0.53317869 + (-0.8573325) + \log_{10}\mathrm{p}(\lt\!/\mathrm{s}\!\gt)\\ &= -1.39051119 + \log_{10}\mathrm{p}(\lt\!/\mathrm{s}\!\gt)\\ &= -1.39051119 + (-0.6989700) \\ &= -2.08948119 \\\end{split}\]

Caution

The arpa file does not contain a b </s>, so when computing \(\log_{10}\mathrm{p}(\lt\!/\mathrm{s}\!\gt|\mathrm{a}\;\mathrm{b})\), we use

\[\log_{10}\mathrm{p}(\lt\!/\mathrm{s}\!\gt|\mathrm{a}\;\mathrm{b}) = \log_{10}p_{\mathrm{backoff}}(\mathrm{a}\;\mathrm{b}) + \log_{10}\mathrm{p}(\lt\!/\mathrm{s}\!\gt|\mathrm{b})\]

Similarly, b </s> does not exist in the arpa file either, so we use:

\[\log_{10}\mathrm{p}(\lt\!/\mathrm{s}\!\gt|\mathrm{b}) = \log_{10}p_{\mathrm{backoff}}(\mathrm{b}) + \log_{10}\mathrm{p}(\lt\!/\mathrm{s}\!\gt)\]
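The two back-off steps above follow one recursive rule: if the full n-gram is missing, add the back-off of the context and retry with the context shortened by its first word. Below is a minimal sketch of that rule. The table LM holds only the entries Example 3 needs, copied by hand from test.arpa, and cond_logprob is a hypothetical helper, not a KenLM function.

```python
# ngram -> (log10 probability, log10 back-off); 0.0 when the column is absent.
LM = {
    ("</s>",): (-0.6989700, 0.0),
    ("b",): (-1.0000000, -0.8573325),
    ("<s>", "a"): (-0.2041200, -0.9542425),
    ("a", "b"): (-0.3590219, -0.3010300),
    ("<s>", "a", "b"): (-0.0280287, 0.0),
}


def cond_logprob(word, context):
    """Return log10 p(word | context), backing off when the n-gram is missing."""
    ngram = context + (word,)
    if ngram in LM:
        return LM[ngram][0]
    if not context:
        raise KeyError(f"{word!r} is not in the table")
    # A context without an entry contributes a back-off of 0.
    backoff = LM.get(context, (0.0, 0.0))[1]
    return backoff + cond_logprob(word, context[1:])


eos = cond_logprob("</s>", ("a", "b"))
print(f"{eos:.7f}")  # -1.8573325 = (-0.3010300) + (-0.8573325) + (-0.6989700)

total = cond_logprob("a", ("<s>",)) + cond_logprob("b", ("<s>", "a")) + eos
print(f"{total:.7f}")  # -2.0894812, the result of Example 3
```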

Example 4: p(b d </s> | <s>)

\[\begin{split}\log_{10} \mathrm{p}(\mathrm{b}\; \mathrm{d} \lt\!/\mathrm{s}\!\gt | \lt\!\mathrm{s}\!\gt) &= \log_{10} \mathrm{p}(\mathrm{b}|\lt\!\mathrm{s}\!\gt) + \log_{10} \mathrm{p}(\mathrm{d}|\lt\!\mathrm{s}\!\gt \mathrm{b}) + \log_{10}\mathrm{p}(\lt\!/\mathrm{s}\!\gt|\mathrm{b}\;\mathrm{d})\\ &= \log_{10}p_{\mathrm{backoff}}(\mathrm{\lt\!\mathrm{s}\!\gt}) +\log_{10} \mathrm{p}(\mathrm{b}) + \log_{10} \mathrm{p}(\mathrm{d}|\lt\!\mathrm{s}\!\gt \mathrm{b}) + \log_{10}\mathrm{p}(\lt\!/\mathrm{s}\!\gt|\mathrm{b}\;\mathrm{d})\\ &= (-0.8573325) + (-1.0000000) + \log_{10} \mathrm{p}(\mathrm{d}|\lt\!\mathrm{s}\!\gt \mathrm{b}) + \log_{10}\mathrm{p}(\lt\!/\mathrm{s}\!\gt|\mathrm{b}\;\mathrm{d})\\ &= (-0.8573325) + (-1.0000000) + \log_{10}p_{\mathrm{backoff}}(\mathrm{\lt\!\mathrm{s}\!\gt\mathrm{b}}) + \log_{10} \mathrm{p}(\mathrm{d}|\mathrm{b}) + \log_{10}\mathrm{p}(\lt\!/\mathrm{s}\!\gt|\mathrm{b}\;\mathrm{d})\\ &= (-0.8573325) + (-1.0000000) + 0 + \log_{10} \mathrm{p}(\mathrm{d}|\mathrm{b}) + \log_{10}\mathrm{p}(\lt\!/\mathrm{s}\!\gt|\mathrm{b}\;\mathrm{d})\\ &= (-0.8573325) + (-1.0000000) + 0 + \log_{10}p_{\mathrm{backoff}}(\mathrm{b}) + \log_{10} \mathrm{p}(\mathrm{d}) + \log_{10}\mathrm{p}(\lt\!/\mathrm{s}\!\gt|\mathrm{b}\;\mathrm{d})\\ &= (-0.8573325) + (-1.0000000) + 0 + (-0.8573325) + (-0.6989700) + \log_{10}\mathrm{p}(\lt\!/\mathrm{s}\!\gt|\mathrm{b}\;\mathrm{d})\\ &= -3.4136350 + \log_{10}\mathrm{p}(\lt\!/\mathrm{s}\!\gt|\mathrm{b}\;\mathrm{d})\\ &= -3.4136350 + \log_{10}p_{\mathrm{backoff}}(\mathrm{b}\;\mathrm{d}) + \log_{10}\mathrm{p}(\lt\!/\mathrm{s}\!\gt|\mathrm{d})\\ &= -3.4136350 + 0 + \log_{10}\mathrm{p}(\lt\!/\mathrm{s}\!\gt|\mathrm{d})\\ &= -3.4136350 + 0 + \log_{10}p_{\mathrm{backoff}}(\mathrm{d}) + \log_{10}\mathrm{p}(\lt\!/\mathrm{s}\!\gt)\\ &= -3.4136350 + 0 + (-1.1583625) + (-0.6989700) \\ &= -5.2709675 \\\end{split}\]

Caution

There is no <s> b in the arpa file, so when computing \(\log_{10} \mathrm{p}(\mathrm{b}|\lt\!\mathrm{s}\!\gt)\), we use

\[\log_{10} \mathrm{p}(\mathrm{b}|\lt\!\mathrm{s}\!\gt) = \log_{10}p_{\mathrm{backoff}}(\mathrm{\lt\!\mathrm{s}\!\gt}) + \log_{10} \mathrm{p}(\mathrm{b})\]

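There is no <s> b d in the arpa file, so when computing \(\log_{10} \mathrm{p}(\mathrm{d}|\lt\!\mathrm{s}\!\gt \mathrm{b})\), we back off to b d. The bigram <s> b has no entry, so its back-off is 0. Since b d is not in the arpa file either, we back off once more to the unigram d: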

\[\begin{split}\log_{10} \mathrm{p}(\mathrm{d}|\lt\!\mathrm{s}\!\gt \mathrm{b}) &= \log_{10}p_{\mathrm{backoff}}(\mathrm{\lt\!\mathrm{s}\!\gt\mathrm{b}}) + \log_{10} \mathrm{p}(\mathrm{d}|\mathrm{b}) \\ &= 0 + \log_{10} \mathrm{p}(\mathrm{d}|\mathrm{b}) \\ &= \log_{10}p_{\mathrm{backoff}}(\mathrm{b}) +\log_{10} \mathrm{p}(\mathrm{d}) \\\end{split}\]
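To check the four examples end to end, the sketch below loads the whole of test.arpa from a string and scores complete sentences. It is an illustration under stated assumptions, not the KenLM implementation: load_arpa and score are hypothetical helpers, columns are separated by spaces instead of tabs, and out-of-vocabulary words are not handled. For Example 4 it takes \(\log_{10} \mathrm{p}(\mathrm{d})\) from the unigram entry of d, which is -0.6989700, and arrives at -5.2709675, the value asserted in the KenLM listing.

```python
ARPA = r"""\data\
ngram 1=8
ngram 2=10
ngram 3=9

\1-grams:
-99.0000000 <s> -0.8573325
-0.6989700 a -0.7481880
-1.0000000 b -0.8573325
-1.0000000 c -0.8061800
-0.6989700 d -1.1583625
-1.0000000 e -0.7481880
-0.6989700 </s>
-1.0000000 f -0.8061800

\2-grams:
-0.2041200 <s> a -0.9542425
-0.5351132 <s> d 0.3010300
-0.3590219 a b -0.3010300
-0.3590219 a </s>
-0.0579919 b c -0.3010300
-0.0579919 c d 0.0000000
-0.0280287 d e -0.1760913
-0.3590219 e </s>
-0.3590219 e f -0.3010300
-0.0579919 f a -0.9542425

\3-grams:
-0.0280287 <s> a b
-0.0280287 a b c
-0.0280287 b c d
-0.0280287 c d e
-0.5351132 d e </s>
-0.2041200 d e f
-0.0579919 <s> d e
-0.0280287 e f a
-0.0280287 f a </s>

\end\
"""


def load_arpa(text):
    """Return ({ngram: (log10 prob, log10 back-off)}, highest order)."""
    table, order = {}, 0
    for line in text.splitlines():
        line = line.strip()
        if line.endswith("-grams:"):
            order = int(line[1:-len("-grams:")])  # "\2-grams:" -> 2
        elif order and line and not line.startswith("\\"):
            fields = line.split()
            backoff = float(fields[1 + order]) if len(fields) > 1 + order else 0.0
            table[tuple(fields[1:1 + order])] = (float(fields[0]), backoff)
    return table, order


def score(table, max_order, sentence, bos=True, eos=True):
    """Sum log10 p(word | history) over the sentence, backing off as needed."""
    history = ["<s>"] if bos else []
    total = 0.0
    for word in sentence.split() + (["</s>"] if eos else []):
        context = tuple(history[-(max_order - 1):]) if max_order > 1 else ()
        while context + (word,) not in table:
            if not context:
                raise KeyError(f"{word!r} is out of vocabulary")
            total += table.get(context, (0.0, 0.0))[1]  # back-off, 0 if absent
            context = context[1:]
        total += table[context + (word,)][0]
        history.append(word)
    return total


table, max_order = load_arpa(ARPA)
print(f"{score(table, max_order, 'a', eos=False):.7f}")    # -0.2041200
print(f"{score(table, max_order, 'a b', eos=False):.7f}")  # -0.2321487
print(f"{score(table, max_order, 'a b'):.7f}")             # -2.0894812
print(f"{score(table, max_order, 'b d'):.7f}")             # -5.2709675
```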

How to use KenLM to compute scores

First, let us install kenlm with the following command:

pip install https://github.com/kpu/kenlm/archive/master.zip

Listing 1 ./code/test-kenlm.py

#!/usr/bin/env python3

import kenlm


def test():
    # Note: When we use p(x), we are actually referring to log10(p(x))
    model = kenlm.LanguageModel("./test.arpa")
    # model.score() returns the probability in log10

    # p("a") in log10
    assert abs(model.score("a", bos=False, eos=False) - (-0.6989700)) < 1e-5

    # p(a|<s>) in log10
    assert abs(model.score("a", eos=False) - (-0.2041200)) < 1e-5

    # p(a b | <s>)
    # = p(a | <s>) + p(b | <s> a)
    # = (-0.204120) + (-0.0280287)
    # = -0.23214869
    #  print(model.score("a b", eos=False, bos=True))
    assert abs(model.score("a b", eos=False, bos=True) - (-0.23214869)) < 1e-5

    # p(a b </s> | <s>)
    # = p(a | <s>) + p(b | <s> a) + p(</s> | a b)
    # = (-0.204120) + (-0.0280287) + backoff(a b) + p(</s> | b)
    # = (-0.204120) + (-0.0280287) + (-0.3010300) + backoff(b) + p(</s>)
    # = (-0.204120) + (-0.0280287) + (-0.3010300) + (-0.8573325) + (-0.6989700)
    # = -2.0894812
    #  print(model.score("a b", eos=True, bos=True))
    assert abs(model.score("a b", eos=True, bos=True) - (-2.0894812)) < 1e-5
    # Pay attention to the computation of p(</s> | a b)
    # p(</s> | a b)
    # = backoff (a b) + p (</s> | b)
    # = backoff (a b) + backoff(b) + p(</s>)
    #
    # Also note that log10 p(</s>) is -0.6989700 and that </s> has no back-off column

    # p(b d </s> | <s>)
    # = p(b | <s>) + p (d | <s> b) + p(</s> | b d)
    # = backoff(<s>) + p(b) + p (d | <s> b) + p(</s> | b d)
    # = backoff(<s>) + p(b) + backoff(<s> b) + p(d | b) + p(</s> | b d)
    # = backoff(<s>) + p(b) + backoff(<s> b) + backoff(b) +  p(d) + p(</s> | b d)
    # = backoff(<s>) + p(b) + backoff(<s> b) + backoff(b) +  p(d) + backoff(b d) + p(</s> | d)
    # = backoff(<s>) + p(b) + backoff(<s> b) + backoff(b) +  p(d) + backoff(b d) + backoff(d) + p(</s>)
    # = (-0.8573325) + (-1.0000000) +  0     + (-0.8573325) + (-0.6989700) + 0   + (-1.1583625) + (-0.6989700)
    # = -5.2709675
    #  print(model.score("b d", eos=True, bos=True))
    assert abs(model.score("b d", eos=True, bos=True) - (-5.2709675)) < 1e-5
    print(model.score("b d", eos=True, bos=True))
    print(list(model.full_scores("b d", eos=True, bos=True)))
    # Note:
    # p(b | <s>) = backoff(<s>) + p(b)
    #
    # p(d | <s> b) = backoff(<s> b) + p (d | b)
    # backoff(<s> b) does not exist in the arpa file, so it is 0
    #
    # p(d | b) = backoff(b) + p(d)
    #
    # p (</s> | b d) = backoff(b d) + p(</s> | d)
    #                = backoff(b d) + backoff(d) + p(</s>)


def test_statefull():
    model = kenlm.LanguageModel("./test.arpa")
    s1 = kenlm.State()
    s2 = kenlm.State()
    model.BeginSentenceWrite(s1)
    accum = model.BaseScore(s1, "a", s2)  # p(a | <s>)
    #  print(accum)  # -0.2041200
    assert abs(accum - model.score("a", bos=True, eos=False)) < 1e-5
    accum += model.BaseScore(s2, "b", s1)  # p(a | <s>) +  p(b | <s> a)
    #  print(accum)  # -0.23214869
    assert abs(accum - model.score("a b", bos=True, eos=False)) < 1e-5

    # reset
    s1 = kenlm.State()
    s2 = kenlm.State()
    model.BeginSentenceWrite(s1)
    accum = model.BaseScore(s1, "b", s2)  # p(b | <s>)
    #  print(accum)  # -1.857332
    assert abs(accum - model.score("b", bos=True, eos=False)) < 1e-5
    # backoff(<s>) + p(b) = -0.8573325 + (-1) = -1.8573325
    accum += model.BaseScore(s2, "c", s1)  # p(b | <s>) + p(c | b)
    #  print(accum)  # -1.91532436
    # p(b | <s>) + p(c | b) = -1.8573325 + (-0.0579919) = -1.9153244
    assert abs(accum - model.score("b c", bos=True, eos=False)) < 1e-5

    accum += model.BaseScore(s1, "d", s2)  # p(b | <s>) + p(c | b) + p(d | b c)
    #  print(accum)
    # p(b | <s>) + p(c | b) + p(d | b c) = -1.9153244 + (-0.0280287) = -1.9433531
    assert abs(accum - model.score("b c d", bos=True, eos=False)) < 1e-5

    # Now score an out-of-vocabulary (OOV) word: "g" does not appear in test.arpa
    # reset
    s1 = kenlm.State()
    s2 = kenlm.State()
    model.BeginSentenceWrite(s1)
    accum = model.BaseScore(s1, "g", s2)  # p(g | <s>)
    # p(g | <s>) = backoff(<s>) + p(<unk>) = -0.8573325 + (-100) = -100.8573325
    print(accum)  # -100.8573325
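    # After an OOV word the state carries no context, so every word that
    # follows is scored with its unigram probability.
    for i in ["a", "b", "c", "d", "e", "f", "</s>"]: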
        assert (
            abs(model.BaseScore(s2, i, s1) - model.score(i, bos=False, eos=False))
            < 1e-5
        )
    #  print(model.BaseScore(s2, "kk", s1)) # -100
    accum += model.BaseScore(s2, "a", s1)  # p(g | <s>) + p(a)
    print(accum)
    assert abs(accum - model.score("g a", bos=True, eos=False)) < 1e-5
    accum += model.BaseScore(s1, "b", s2)  # p(g | <s>) + p(a) + p(b|a)
    print(accum)


def main():
    test()
    test_statefull()


if __name__ == "__main__":
    main()