Package: wordpiece
Type: Package
Title: R Implementation of Wordpiece Tokenization
Version: 2.1.3
Description: Apply 'Wordpiece' (<doi:10.48550/arXiv.1609.08144>) tokenization to input text, given an appropriate vocabulary. The 'BERT' (<doi:10.48550/arXiv.1810.04805>) tokenization conventions are used by default.
Encoding: UTF-8
URL: https://github.com/macmillancontentscience/wordpiece
BugReports: https://github.com/macmillancontentscience/wordpiece/issues
Depends: R (≥ 3.3.0)
License: Apache License (≥ 2)
RoxygenNote: 7.1.2
Imports: dlr (≥ 1.0.0), fastmatch (≥ 1.1), memoise (≥ 2.0.0), piecemaker (≥ 1.0.0), rlang, stringi (≥ 1.0), wordpiece.data (≥ 1.0.2)
Suggests: covr, knitr, rmarkdown, testthat (≥ 3.0.0)
VignetteBuilder: knitr
Config/testthat/edition: 3
NeedsCompilation: no
Packaged: 2022-03-03 14:19:39 UTC; jonathan.bratt
Author: Jonathan Bratt [aut, cre], Jon Harmon [aut], Bedford Freeman & Worth Pub Grp LLC DBA Macmillan Learning [cph]
Maintainer: Jonathan Bratt <jonathan.bratt@macmillan.com>
Repository: CRAN
Date/Publication: 2022-03-03 15:10:02 UTC

Determine Casedness of Vocabulary

Description

Determine Casedness of Vocabulary

Usage

.get_casedness(v)

## Default S3 method:
.get_casedness(v)

## S3 method for class 'wordpiece_vocabulary'
.get_casedness(v)

## S3 method for class 'character'
.get_casedness(v)

Arguments

v

An object of class wordpiece_vocabulary, or a character vector.

Value

TRUE if the vocabulary is case-sensitive, FALSE otherwise.


Determine Vocabulary Casedness

Description

Determine whether or not a wordpiece vocabulary is case-sensitive.

Usage

.infer_case_from_vocab(vocab)

Arguments

vocab

The vocabulary as a character vector.

Details

If none of the tokens in the vocabulary start with a capital letter, it will be assumed to be uncased. Note that tokens like "[CLS]" contain uppercase letters, but don't start with uppercase letters.

Value

TRUE if the vocabulary is cased, FALSE if uncased.
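
The rule described above is simple to state in code. A minimal sketch of the inference, simplified to ASCII letters (this is not the package's internal implementation):

# Sketch: a vocabulary counts as cased if any token starts with an
# uppercase letter (ASCII-only here, for simplicity).
infer_case_sketch <- function(vocab) {
  any(grepl("^[A-Z]", vocab))
}

infer_case_sketch(c("[CLS]", "the", "##ing"))  # FALSE: "[CLS]" starts with "["
infer_case_sketch(c("The", "the", "##ing"))    # TRUE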


Constructor for Class wordpiece_vocabulary

Description

Constructor for Class wordpiece_vocabulary

Usage

.new_wordpiece_vocabulary(vocab, is_cased)

Arguments

vocab

Character vector of tokens.

is_cased

Logical; whether the vocabulary is cased.

Value

The vocabulary with is_cased attached as an attribute, and the class wordpiece_vocabulary applied.
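
Per the Value description, the constructor just attaches the attribute and prepends the class; a minimal sketch (not the package's internal implementation):

# Sketch of a low-level constructor of this shape: no validation,
# just attach the casedness flag and the class.
new_vocab_sketch <- function(vocab, is_cased) {
  attr(vocab, "is_cased") <- is_cased
  class(vocab) <- c("wordpiece_vocabulary", class(vocab))
  vocab
}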


Process a Vocabulary for Tokenization

Description

Process a Vocabulary for Tokenization

Usage

.process_vocab(v)

## Default S3 method:
.process_vocab(v)

## S3 method for class 'wordpiece_vocabulary'
.process_vocab(v)

## S3 method for class 'character'
.process_vocab(v)

Arguments

v

An object of class wordpiece_vocabulary or a character vector.

Value

A character vector of tokens for tokenization.


Process a Wordpiece Vocabulary for Tokenization

Description

Process a Wordpiece Vocabulary for Tokenization

Usage

.process_wp_vocab(v)

## Default S3 method:
.process_wp_vocab(v)

## S3 method for class 'wordpiece_vocabulary'
.process_wp_vocab(v)

## S3 method for class 'integer'
.process_wp_vocab(v)

## S3 method for class 'character'
.process_wp_vocab(v)

Arguments

v

An object of class wordpiece_vocabulary (or a character or integer vector, for the corresponding methods).

Value

A character vector of tokens for tokenization.


Validator for Objects of Class wordpiece_vocabulary

Description

Validator for Objects of Class wordpiece_vocabulary

Usage

.validate_wordpiece_vocabulary(vocab)

Arguments

vocab

wordpiece_vocabulary object to validate

Value

vocab if the object passes the checks. Otherwise, aborts with an informative message.
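
The specific checks are not enumerated here; a hypothetical validator with this contract might look like the following (the emptiness and duplicate checks are illustrative assumptions):

# Hypothetical validator sketch; the checks shown are assumptions,
# not the package's documented behavior.
validate_vocab_sketch <- function(vocab) {
  if (length(vocab) == 0) {
    rlang::abort("Vocabulary is empty.")
  }
  if (anyDuplicated(vocab) > 0) {
    rlang::abort("Vocabulary contains duplicate tokens.")
  }
  vocab
}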


Tokenize an Input Word-by-word

Description

Tokenize an Input Word-by-word

Usage

.wp_tokenize_single_string(words, vocab, unk_token, max_chars)

Arguments

words

Character; a vector of words (generated by space-tokenizing a single input).

vocab

Character vector of vocabulary tokens. The tokens are assumed to be in order of index, with the first index taken as zero to be compatible with Python implementations.

unk_token

Token to represent unknown words.

max_chars

Maximum length of word recognized.

Value

A named integer vector of tokenized words.
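
For intuition about the return shape, suppose (purely for illustration) that every word is already a single vocabulary token; the real function first breaks each word into wordpieces:

# Toy illustration of the return format only: a named integer vector,
# with zero-based ids matching each token's position in the vocabulary.
vocab <- c("[UNK]", "i", "like", "tacos")
words <- c("i", "like", "tacos")
ids <- match(words, vocab) - 1L  # zero-based, per the Python convention
names(ids) <- words
ids
#     i  like tacos
#     1     2     3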


Tokenize a Word

Description

Tokenize a single "word" (no whitespace). The word can technically contain punctuation, but in BERT's tokenization, punctuation has been split out by this point.

Usage

.wp_tokenize_word(word, vocab, unk_token = "[UNK]", max_chars = 100)

Arguments

word

Word to tokenize.

vocab

Character vector of vocabulary tokens. The tokens are assumed to be in order of index, with the first index taken as zero to be compatible with Python implementations.

unk_token

Token to represent unknown words.

max_chars

Maximum length of word recognized.

Value

Input word as a list of tokens.
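
BERT-style wordpiece tokenization of a single word is greedy longest-match-first: repeatedly take the longest prefix of the remaining characters that appears in the vocabulary, marking non-initial pieces with the "##" continuation prefix. A minimal sketch under those assumptions (not the package's internal implementation):

# Greedy longest-match-first wordpiece sketch, assuming BERT-style "##"
# continuation prefixes.
tokenize_word_sketch <- function(word, vocab, unk_token = "[UNK]",
                                 max_chars = 100) {
  if (nchar(word) > max_chars) {
    return(unk_token)
  }
  tokens <- character(0)
  start <- 1
  n <- nchar(word)
  while (start <= n) {
    piece_found <- NA_character_
    # Try the longest remaining substring first, shrinking from the right.
    for (end in seq(n, start)) {
      piece <- substr(word, start, end)
      if (start > 1) {
        piece <- paste0("##", piece)  # mark as a continuation piece
      }
      if (piece %in% vocab) {
        piece_found <- piece
        start <- end + 1
        break
      }
    }
    if (is.na(piece_found)) {
      return(unk_token)  # no piece matched: the whole word is unknown
    }
    tokens <- c(tokens, piece_found)
  }
  tokens
}

vocab <- c("[UNK]", "tac", "##os", "un", "##break", "##able")
tokenize_word_sketch("tacos", vocab)        # "tac" "##os"
tokenize_word_sketch("unbreakable", vocab)  # "un" "##break" "##able"
tokenize_word_sketch("pizza", vocab)        # "[UNK]"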


Load a vocabulary file, or retrieve from cache

Description

Load a vocabulary file, or retrieve from cache

Usage

load_or_retrieve_vocab(vocab_file)

Arguments

vocab_file

Path to the vocabulary file. The file is assumed to be a plain-text file with one token per line; the line number corresponds to the index of that token in the vocabulary.

Value

The vocab as a character vector of tokens. The casedness of the vocabulary is inferred and attached as the "is_cased" attribute. The vocabulary indices are taken to be the positions of the tokens, starting at zero for historical consistency.

Note that from the perspective of a neural net, the numeric indices are the tokens, and the mapping from token to index is fixed. If we changed the indexing (the order of the tokens), it would break any pre-trained models.
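
The package imports memoise and dlr to implement this caching. For intuition only, a session-level approximation (a sketch; the package's own cache also persists results on disk under wordpiece_cache_dir()):

# Session-level caching sketch using memoise: repeated calls with the
# same path skip re-reading and re-processing the file.
load_vocab_cached <- memoise::memoise(load_vocab)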


Load a vocabulary file

Description

Load a vocabulary file

Usage

load_vocab(vocab_file)

Arguments

vocab_file

Path to the vocabulary file. The file is assumed to be a plain-text file with one token per line; the line number corresponds to the index of that token in the vocabulary.

Value

The vocab as a character vector of tokens. The casedness of the vocabulary is inferred and attached as the "is_cased" attribute. The vocabulary indices are taken to be the positions of the tokens, starting at zero for historical consistency.

Note that from the perspective of a neural net, the numeric indices are the tokens, and the mapping from token to index is fixed. If we changed the indexing (the order of the tokens), it would break any pre-trained models.

Examples

# Get path to sample vocabulary included with package.
vocab_path <- system.file("extdata", "tiny_vocab.txt", package = "wordpiece")
vocab <- load_vocab(vocab_file = vocab_path)
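
# The inferred casedness is attached as an attribute (see Value above).
attr(vocab, "is_cased")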

Format a Token List as a Vocabulary

Description

We use a character vector with the class wordpiece_vocabulary to provide information about the tokens used in wordpiece_tokenize. This function takes a character vector of tokens and puts it into that format.

Usage

prepare_vocab(token_list)

Arguments

token_list

A character vector of tokens.

Value

The vocab as a character vector of tokens. The casedness of the vocabulary is inferred and attached as the "is_cased" attribute. The vocabulary indices are taken to be the positions of the tokens, starting at zero for historical consistency.

Note that from the perspective of a neural net, the numeric indices are the tokens, and the mapping from token to index is fixed. If we changed the indexing (the order of the tokens), it would break any pre-trained models.

Examples

my_vocab <- prepare_vocab(c("some", "example", "tokens"))
class(my_vocab)
attr(my_vocab, "is_cased")

Objects exported from other packages

Description

These objects are imported from other packages. Follow the links below to see their documentation.

fastmatch

%fin%

rlang

%||%

wordpiece.data

wordpiece_vocab


Set a Cache Directory for wordpiece

Description

Use this function to override the cache path used by wordpiece for the current session. Set the WORDPIECE_CACHE_DIR environment variable for a more permanent change.

Usage

set_wordpiece_cache_dir(cache_dir = NULL)

Arguments

cache_dir

Character scalar; a path to a cache directory.

Value

A normalized path to a cache directory. The directory is created if the user has write access and the directory does not exist.
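
For example (the path here is purely illustrative):

# Redirect the wordpiece cache for this session to a throwaway directory.
set_wordpiece_cache_dir(file.path(tempdir(), "wordpiece_cache"))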


Retrieve Directory for wordpiece Cache

Description

The wordpiece cache directory is a platform- and user-specific path where wordpiece saves caches (such as a downloaded vocabulary). You can override the default location by setting the WORDPIECE_CACHE_DIR environment variable, or, for the current session, by calling set_wordpiece_cache_dir().

Usage

wordpiece_cache_dir()

Value

A character vector with the normalized path to the cache.
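
For example:

# Report where wordpiece will read and write its caches.
wordpiece_cache_dir()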


Tokenize Sequence with Word Pieces

Description

Given a sequence of text and a wordpiece vocabulary, tokenizes the text.

Usage

wordpiece_tokenize(
  text,
  vocab = wordpiece_vocab(),
  unk_token = "[UNK]",
  max_chars = 100
)

Arguments

text

Character; text to tokenize.

vocab

Character vector of vocabulary tokens. The tokens are assumed to be in order of index, with the first index taken as zero to be compatible with Python implementations.

unk_token

Token to represent unknown words.

max_chars

Maximum length of word recognized.

Value

A list of named integer vectors, giving the tokenization of the input sequences. The integer values are the token ids, and the names are the tokens.

Examples

tokens <- wordpiece_tokenize(
  text = c(
    "I love tacos!",
    "I also kinda like apples."
  )
)
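
# Each element is a named integer vector: names are the wordpiece tokens,
# values are their ids (the exact ids depend on the vocabulary used).
str(tokens)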