How to tokenize a string?

#nlp #node #javascript #tutorial

To tokenize a string using winkNLP, read the text using readDoc. Then use the tokens method to extract a collection of tokens from the string. Follow this with the out method to get this collection as a JavaScript array. This is how you can tokenize a string:

// Load wink-nlp package & helpers.
const winkNLP = require( 'wink-nlp' );
// Load "its" helper to extract item properties.
const its = require( 'wink-nlp/src/its.js' );
// Load English language model (light version).
const model = require( 'wink-eng-lite-model' );
// Instantiate winkNLP.
const nlp = winkNLP( model );

// Input string
const text = '#Breaking:D Can’t get over this #Oscars selfie from @TheEllenShow🤩https://pic.twitter.com/C9U5NOtGap';
// Read text
const doc = nlp.readDoc( text );
// Tokenize the string
const tokens = doc.tokens();
console.log( tokens.out() );

This returns an array of tokens:

[
  '#Breaking', ':D', 'Ca', 'n’t', 'get', 'over', 'this', '#Oscars',
  'selfie', 'from', '@TheEllenShow', '🤩',
  'https://pic.twitter.com/C9U5NOtGap'
]

winkNLP has a lossless tokenizer which preserves and reproduces the original text. The tokenizer intelligently handles hyphenation, contractions and abbreviations. It also detects token types like ‘word’, ‘number’, ‘punctuation’, ‘symbol’, etc.
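
To make the idea of "lossless" concrete, here is a toy sketch in plain JavaScript. It is not winkNLP's tokenizer and does not use its API: the function names `splitLossless`, `wordsOnly` and `rebuild` are hypothetical, and the splitting rule (runs of whitespace versus runs of non-whitespace) is an assumption chosen only for illustration. It shows the property described above: because every character of the input ends up in some piece, joining the pieces reproduces the original text exactly, including repeated and trailing spaces.

```javascript
// Toy illustration of lossless tokenization. NOT winkNLP's implementation.
// Hypothetical helper: split text into alternating runs of whitespace and
// non-whitespace, so that no character of the input is discarded.
function splitLossless( input ) {
  const pieces = input.match( /\s+|\S+/g ) || [];
  return pieces.map( ( piece ) => ( {
    value: piece,
    isSpace: /^\s/.test( piece )
  } ) );
}

// Hypothetical helper: keep only the non-whitespace pieces, i.e. the "tokens".
function wordsOnly( pieces ) {
  return pieces.filter( ( p ) => !p.isSpace ).map( ( p ) => p.value );
}

// Hypothetical helper: concatenate every piece to recover the original text.
function rebuild( pieces ) {
  return pieces.map( ( p ) => p.value ).join( '' );
}

const sample = 'Can’t  wait for the #Oscars! ';
const samplePieces = splitLossless( sample );

console.log( wordsOnly( samplePieces ) );
console.log( rebuild( samplePieces ) === sample );
```

Unlike winkNLP, this sketch splits on whitespace only, so it does not separate contractions, punctuation or emojis; it demonstrates the round trip, not the token boundaries.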
