String Tokenizing based on StrExtract()
November 19, 2015
I've been building a number of solutions lately that rely heavily on parsing text. One thing that comes up repeatedly is the need to split strings while making sure that certain string tokens are excluded from the process. For example, a recent Markdown parser I've built for Help Builder needs to first exclude all code snippets, then perform its standard parsing, and then put the code snippets back for custom parsing.
Another scenario is when Help Builder imports .NET classes and has to deal with generic parameters. Typically parameters are separated by commas, but .NET generics can add commas inside their generic parameter lists, which breaks a simple comma split.
Both of those scenarios require that the code first pull a token out of a string and replace it with a placeholder, then perform some other operation, and finally put the original value back.
For me this has become common enough that I decided I could really use a couple of helpers for it. Here are the two functions:
```foxpro
************************************************************************
*  TokenizeString
****************************************
***  Function: Tokenizes a string based on an extraction string and
***            returns the tokens as a collection.
***    Assume: Pass the source string by reference to update it
***            with token delimiters.
***            Extraction is done with case insensitivity
***      Pass: @lcSource   - Source string - pass by reference
***            lcStart     - Extract start string
***            lcEnd       - Extract End String
***            lcDelimiter - Delimiter embedded into string
***                          #@# (default) produces:
***                          #@#<sequence Number>#@#
***    Return: Collection of tokens
************************************************************************
FUNCTION TokenizeString(lcSource,lcStart,lcEnd,lcDelimiter)
LOCAL loTokens, lcExtract, lnX

IF EMPTY(lcDelimiter)
   lcDelimiter = "#@#"
ENDIF

loTokens = CREATEOBJECT("Collection")

lnX = 1
DO WHILE .T.
   lcExtract = STREXTRACT(lcSource,lcStart,lcEnd,1,1+4)
   IF EMPTY(lcExtract)
      EXIT
   ENDIF

   loTokens.Add(lcExtract)
   lcSource = STRTRAN(lcSource,lcExtract,lcDelimiter + TRANSFORM(lnX) + lcDelimiter)
   lnX = lnX + 1
ENDDO

RETURN loTokens
ENDFUNC
*   TokenizeString

************************************************************************
*  DetokenizeString
****************************************
***  Function: Detokenizes an individual value of the string
***    Assume:
***      Pass: lcString    - Value that contains a token
***            loTokens    - Collection of tokens
***            lcDelimiter - Delimiter for token id
***    Return: detokenized string or original value if no token
************************************************************************
FUNCTION DetokenizeString(lcString,loTokens,lcDelimiter)
LOCAL lnId, loTokens as Collection

IF EMPTY(lcDelimiter)
   lcDelimiter = "#@#"
ENDIF

DO WHILE .T.
   lnId = VAL(STREXTRACT(lcString,lcDelimiter,lcDelimiter))
   IF lnId < 1
      EXIT
   ENDIF
   lcString = STRTRAN(lcString,lcDelimiter + TRANSFORM(lnId) + lcDelimiter,loTokens.Item(lnId))
ENDDO

RETURN lcString
ENDFUNC
*   DetokenizeString
```
TokenizeString() picks out anything between one or more pairs of start and end delimiters and returns those values (tokens) as a collection. If you pass the source string in by reference, the source is modified so that each extracted value is replaced with a token placeholder.
You can then use DetokenizeString() to detokenize either individual string values or the entire tokenized string.
This lets you work on the string without the tokenized values in it, which is useful when the tokenized text requires separate processing or would interfere with the processing you want to apply to the rest of the string.
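Here's a minimal sketch of that workflow, based on the Markdown code-snippet scenario from the intro. The sample text, the <code> delimiters and the UPPER() call are just illustrative stand-ins for whatever processing you actually need to do:

```foxpro
*** Illustrative only: protect <code> snippets, process the rest, restore them
lcText = "Intro text " + CHR(13) + ;
         "<code>x = 10 * 2</code>" + CHR(13) + ;
         "more text to process"

*** Pass by reference so the #@#1#@# style placeholders get embedded
loTokens = TokenizeString(@lcText,"<code>","</code>")

*** Work on lcText here without touching the snippets
lcText = UPPER(lcText)    && stand-in for the real processing

*** Put the original snippet text back
lcText = DetokenizeString(lcText,loTokens)
? lcText
```

Because the placeholders are just the delimiter plus a sequence number (#@#1#@# and so on), the in-between processing doesn't disturb them as long as it leaves the delimiters alone.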
An Example – .NET Generic Parameter Parsing
Here's an example with the comma-delimited parameter list I mentioned above. Assume I have a list of .NET parameters that needs to be parsed:
```
IEnumerable<Field,bool> List, Field field, List<Field,int> fieldList
```

Notice that this list contains generic parameters embedded in the < > brackets, so I can't just run ALINES() on it. The following code strips out the generic parameters first, then parses the list, and then puts the original values back in:

```foxpro
DO wwutils
CLEAR

lcParameters = "IEnumerable<Field,bool> List, Field field, List<Field,int> fieldList"
? "Original: "
? lcParameters
?

*** Creates tokens in the lcSource string and returns a collection of the
*** tokens.
loTokens = TokenizeString(@lcParameters,"<",">")

? lcParameters
* IEnumerable#@#1#@# List, Field field, List#@#2#@# fieldList

FOR lnX = 1 TO loTokens.Count
   ? loTokens[lnX]
ENDFOR

?
? "Tokenized string: " + lcParameters
?
? "Parsed parameters:"

*** Now parse the parameters
lnCount = ALINES(laParms,lcParameters,",")
FOR lnX = 1 TO lnCount
   *** Detokenize individual parameters
   laParms[lnX] = DetokenizeString(laParms[lnX],loTokens)
   ? laParms[lnX]
ENDFOR

?
? "Detokenized String (should be same as original):"

*** or you can detokenize the entire string at once
? DetokenizeString(lcParameters,loTokens)
```

The tokenization allows picking out a subset of substrings and replacing them with tokens so additional parsing can be done without the noise of the bracketed generic parameters that would otherwise break the parse logic. This is quite common in text parsing, where you often deal with patterns that you are matching while trying to avoid edge cases where the pattern breaks down. This is where I've found tokenization super useful.
Specialized Use Cases
In Help Builder I have tons of use cases where this applies as documents are rendered:

* Code snippets are parsed out of documents before rendering because the snippets are rendered 'raw', while the rest of the document gets rendered as encoded content.
* Links often require special fixup before being embedded into the document. Tokenization makes it easy to capture the links, adjust the captured values, and write them back out with new values (a rough sketch follows below).
* In Web Connection the various template parsers do something very similar with expressions and code blocks that get pulled out of the document and then injected back in later as expanded values.
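As a rough illustration of the link fixup case, here's a hedged sketch; the HTML sample, the delimiters and the .htm to .html rename are made-up stand-ins, not Help Builder's actual fixup logic:

```foxpro
*** Illustrative only: capture links, adjust them, write them back
lcHtml = [See <a href="page1.htm">Topic 1</a> and <a href="page2.htm">Topic 2</a>.]

*** Capture the full link tags and replace them with placeholders
loTokens = TokenizeString(@lcHtml,[<a href="],[</a>])

*** ... any other processing of lcHtml can happen here ...

*** Write each link back with a fixed up value instead of the original
FOR lnX = 1 TO loTokens.Count
   lcLink = STRTRAN(loTokens[lnX],".htm",".html")   && stand-in for the real fixup
   lcHtml = STRTRAN(lcHtml,"#@#" + TRANSFORM(lnX) + "#@#",lcLink)
ENDFOR
? lcHtml
```

Instead of calling DetokenizeString(), this variation replaces each placeholder directly, so the value that goes back in can differ from the one that was captured (note that it assumes the default #@# delimiter).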
There are lots of variations of how you can use these tokens effectively.
This isn't the sort of thing you run into all the time, but for me it's been surprisingly frequent that I've had to do stuff like this. While it isn't terribly difficult to do manually, it makes for verbose code that's ugly to write inline as part of an application. These two helpers shrink all of that down to a couple of simple function calls.
Maybe some of you will find this useful though…