Model building

CRFsuite

This R package wraps the CRFsuite C/C++ library (https://github.com/chokkan/crfsuite), allowing the following:

  • Fit a Conditional Random Field model (1st-order linear-chain Markov)
  • Use the fitted model to get predictions on new data
  • Build and apply models for Natural Language Processing, the focus of the implementation: named entity recognition, text chunking, part-of-speech tagging, intent recognition, or classification of any category you have in mind.

If you are unfamiliar with Conditional Random Field (CRF) models, this tutorial is an excellent introduction: http://homepages.inf.ed.ac.uk/csutton/publications/crftut-fnt.pdf

Data format

In order to build a CRF model, you need to have

  1. sequences of labels (the hidden state Y) and
  2. attributes of the observations corresponding to the labels (X).

Regarding the label sequence:

Generally the labels follow the IOB scheme and look something like B-ORG, I-ORG, B-YOUROWNLABEL, I-YOUROWNLABEL or O. The prefix indicates the beginning of a certain category (B-), the inside (continuation) of a certain category (I-), or a token outside any category (O).

  • Hence the text "I went to the New York City District on holidays" would e.g. be labelled as O, O, O, O, B-LOCATION, I-LOCATION, I-LOCATION, I-LOCATION, O, O
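
As a small illustration of the labelling above, the following base R sketch builds the IOB labels for that sentence from a single annotated span. The variable names (entity_start, entity_end, entity_type) are made up for this example and are not part of the crfsuite package.

```r
# Tokens of the example sentence, split on whitespace.
tokens <- strsplit("I went to the New York City District on holidays", " ")[[1]]

# Hypothetical annotation: the entity covers tokens 5 to 8 and has type LOCATION.
entity_start <- 5
entity_end   <- 8
entity_type  <- "LOCATION"

# Start with every token outside any entity, then mark the entity span:
# the first token of the span gets B-, the remaining tokens get I-.
labels <- rep("O", length(tokens))
labels[entity_start] <- paste0("B-", entity_type)
if (entity_end > entity_start) {
  labels[(entity_start + 1):entity_end] <- paste0("I-", entity_type)
}

iob <- data.frame(token = tokens, label = labels, stringsAsFactors = FALSE)
print(iob)
```

The result has one row per token, which is the same layout as the example data used further on.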

Regarding the attributes of the label observations:

The attributes of the observations are mostly things like the term itself, the neighbouring terms, the part of speech, the neighbouring parts of speech, or any specific feature you can extract that is relevant to your business domain (e.g. the number of digits in the token, how far the token is from the start or end of the document, whether the token is capitalised, whether it contains an ampersand, …).
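
A minimal sketch of how a few such token-level attributes could be derived in base R is shown below. The token vector and the column names are invented for illustration; which attributes are useful depends on your data.

```r
# Hypothetical tokens of one short document.
tokens <- c("Johnson", "&", "Johnson", "paid", "1,250", "euro", "in", "2018")

features <- data.frame(
  token          = tokens,
  # Does the token start with an upper-case letter?
  is_capitalised = grepl("^[[:upper:]]", tokens),
  # Number of digit characters in the token.
  n_digits       = nchar(gsub("[^0-9]", "", tokens)),
  # Does the token contain an ampersand?
  has_ampersand  = grepl("&", tokens, fixed = TRUE),
  # Distance (in tokens) from the start and from the end of the document.
  from_start     = seq_along(tokens) - 1L,
  from_end       = length(tokens) - seq_along(tokens),
  stringsAsFactors = FALSE
)
print(features)
```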

Example data

As an example, let’s get some Dutch data for Named Entity Recognition, which was distributed as part of the CoNLL-2002 shared task. This dataset contains one row per term and provides the entity label as well as the part-of-speech tag for each term.

library(crfsuite)
x <- ner_download_modeldata("conll2002-nl")
subset(x, doc_id == 100)
          data doc_id sentence_id         token  pos  label
  1: ned.train    100        8882            EK Pron B-MISC
  2: ned.train    100        8882      Magazine    N I-MISC
  3: ned.train    100        8882        Canvas    N  B-ORG
  4: ned.train    100        8882         23.45  Num      O
  5: ned.train    100        8883  Tourjournaal    N B-MISC
 ---                                                       
343: ned.train    100        8916 gepresenteerd    V      O
344: ned.train    100        8916          door Prep      O
345: ned.train    100        8916          Stef    N  B-PER
346: ned.train    100        8916      Wijnants    N  I-PER
347: ned.train    100        8916             . Punc      O

Attributes

As basic feature enrichment, we add the part-of-speech tag of the preceding and the next term, and do the same for the token itself. We will use these attributes later when building the model. The R package data.table has a nice shift function for this.

library(data.table)
x <- as.data.table(x)
x <- x[, pos_previous   := shift(pos, n = 1, type = "lag"), by = list(doc_id)]
x <- x[, pos_next       := shift(pos, n = 1, type = "lead"), by = list(doc_id)]
x <- x[, token_previous := shift(token, n = 1, type = "lag"), by = list(doc_id)]
x <- x[, token_next     := shift(token, n = 1, type = "lead"), by = list(doc_id)]
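
To make clear what the grouped lag and lead produce, here is a base R sketch on a toy data frame. It uses ave instead of data.table::shift, and the helper names lag1 and lead1 are made up for this example. The point to notice is that the first and last term of each document get NA rather than a value borrowed from a neighbouring document.

```r
# Toy data: two documents with three and two terms respectively.
toy <- data.frame(doc_id = c(1, 1, 1, 2, 2),
                  pos    = c("Pron", "V", "N", "Art", "N"),
                  stringsAsFactors = FALSE)

# Hypothetical helpers: shift a vector one position back or forward, padding with NA.
lag1  <- function(v) c(NA, head(v, -1))
lead1 <- function(v) c(tail(v, -1), NA)

# ave() applies the helper separately within each doc_id group.
toy$pos_previous <- ave(toy$pos, toy$doc_id, FUN = lag1)
toy$pos_next     <- ave(toy$pos, toy$doc_id, FUN = lead1)
print(toy)
```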

Note that CRFsuite handles all attributes equivalently. In order to distinguish between the columns, we need to prepend the column name to each value, similar to what is shown at http://www.chokkan.org/software/crfsuite/tutorial.html. This is done using the txt_sprintf function, which is similar to sprintf but handles NA values gracefully.

x <- x[, pos_previous   := txt_sprintf("pos[w-1]=%s", pos_previous), by = list(doc_id)]
x <- x[, pos_next       := txt_sprintf("pos[w+1]=%s", pos_next), by = list(doc_id)]
x <- x[, token_previous := txt_sprintf("token[w-1]=%s", token_previous), by = list(doc_id)]
x <- x[, token_next     := txt_sprintf("token[w+1]=%s", token_next), by = list(doc_id)]
subset(x, doc_id == 100, select = c("doc_id", "token", "token_previous", "token_next"))
     doc_id         token           token_previous              token_next
  1:    100            EK                       NA     token[w+1]=Magazine
  2:    100      Magazine            token[w-1]=EK       token[w+1]=Canvas
  3:    100        Canvas      token[w-1]=Magazine        token[w+1]=23.45
  4:    100         23.45        token[w-1]=Canvas token[w+1]=Tourjournaal
  5:    100  Tourjournaal         token[w-1]=23.45       token[w+1]=Canvas
 ---                                                                      
343:    100 gepresenteerd             token[w-1]=,         token[w+1]=door
344:    100          door token[w-1]=gepresenteerd         token[w+1]=Stef
345:    100          Stef          token[w-1]=door     token[w+1]=Wijnants
346:    100      Wijnants          token[w-1]=Stef            token[w+1]=.
347:    100             .      token[w-1]=Wijnants                      NA
x <- as.data.frame(x)
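
To illustrate why NA handling matters here: plain sprintf would turn a missing neighbour into the literal text NA, which would then look like a real attribute value to the model. The sketch below shows one way such NA-preserving behaviour could be written in base R. The function sprintf_na is a hypothetical stand-in for this example and is not the package's actual txt_sprintf implementation.

```r
# Hypothetical NA-preserving variant of sprintf for a single value vector.
sprintf_na <- function(fmt, value) {
  out <- sprintf(fmt, value)
  # Restore missing values instead of keeping a formatted "NA" string.
  out[is.na(value)] <- NA_character_
  out
}

pos_previous <- c(NA, "Pron", "N")
sprintf_na("pos[w-1]=%s", pos_previous)
```

The first element stays NA, while the others are prefixed with the attribute name.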

Model

Train your own CRF model

Once you have data tagged with your own categories, you can build a CRF model. First, split the data prepared above into a training and a test dataset.

crf_train <- subset(x, data == "ned.train")
crf_test <- subset(x, data == "testa")

And start building your model.

  • By default, the CRF model is trained using L-BFGS with L1/L2 regularization, but other training methods are also available: SGD with L2 regularization, Averaged Perceptron, Passive Aggressive, and Adaptive Regularization of Weights.
  • In the example below we use the default parameters and decrease the number of iterations a bit to have a model ready within 30 seconds.
  • Provide the labels with the categories (y) and the attributes of the observations (x), and indicate the sequence group (in this case, the document identifier).
  • The model will be saved to disk in the file tagger.crfsuite.
model <- crf(y = crf_train$label, 
             x = crf_train[, c("pos", "pos_previous", "pos_next", 
                               "token", "token_previous", "token_next")], 
             group = crf_train$doc_id, 
             method = "lbfgs", file = "tagger.crfsuite",
             options = list(max_iterations = 25, feature.minfreq = 5, c1 = 0, c2 = 1)) 
model
Conditional Random Field saved at C:\Users\Jan\AppData\Local\Temp\RtmpGQdcoA\Rbuild13287e6fc1a\crfsuite\vignettes\tagger.crfsuite
  size of the model in Mb: 0.79
  number of categories: 9
  category labels: O, B-ORG, B-MISC, B-PER, I-PER, B-LOC, I-MISC, I-ORG, I-LOC
To inspect the model in detail, summary(yourmodel, 'modeldetails.txt') and inspect the modeldetails.txt file
stats <- summary(model)
Summary statistics of last iteration: 
Loss: 37014.993192
Feature norm: 30.527576
Error norm: 2230.388754
Active features: 11822
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.348

Dumping summary of the model to file C:\Users\Jan\AppData\Local\Temp\RtmpYzL9PD\crfsuite_a5c7fc92c48.txt
plot(stats$iterations$loss, pch = 20, type = "b", 
     main = "Loss evolution", xlab = "Iteration", ylab = "Loss")
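
For reference, the c1 and c2 values passed in options above are the coefficients of the L1 and L2 penalties in the L-BFGS training objective. As a sketch (the exact scaling convention used internally by CRFsuite is an assumption here), the weights w are chosen to minimise the penalised negative log-likelihood over the training sequences:

```latex
\min_{w} \; -\sum_{i} \log p\left(y^{(i)} \mid x^{(i)}; w\right)
  \;+\; c_1 \lVert w \rVert_1
  \;+\; c_2 \lVert w \rVert_2^2
```

With c1 = 0 and c2 = 1, as in the call above, only the L2 penalty is active. The Loss value reported in the summary and shown in the plot is the value of this kind of objective at each iteration.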