This EEP contains developed suggestions regarding the module binary_string
first suggested in EEP 9. The module name is now however changed to bstring.
EEP 9 suggests several modules and is partially superseded by later EEPs (i.e. EEP 11 and EEP 31), while still containing valuable suggestions not yet implemented. This last remaining module suggested in EEP 9 therefore appears in this separate EEP. This is done in agreement with the original author of EEP 9.
The module bstring is suggested to contain functions for
convenient manipulation of textual data stored in binaries,
i.e. binary strings. It somewhat resembles the string module
(which is list oriented), but is not to be viewed simply as a
string module for binaries.
The module suggested handles binary character encoding in both the standard character encodings of Erlang, namely ISO-Latin-1 and UTF-8.
Text strings are traditionally represented as lists of integers in Erlang. While this is convenient and more or less built into the syntax of the language (i.e. “ABC” is syntactic sugar for [$A,$B,$C]), a more compact representation is often desired. Also, in some circumstances binaries can be more efficient to manipulate in terms of algorithm complexity than lists are (especially in the fixed character width case of ISO-Latin-1).
More modules have been added to the standard libraries lately to aid
the usage of binaries for text strings, both as representing
ISO-Latin-1 characters and Unicode strings encoded in UTF-8. Most
notably the re library, but also the unicode module are fairly
new additions to stdlib which will make life easier for the
programmer when it comes to manipulating binary encoded strings. Also
a module for fast searching and replacing in byte oriented binaries is
present (the module binary), but no traditional string manipulation module is
yet in the libraries. To ease use of binary encoded strings, such a module is
needed.
The module string for text oriented operations on lists has been
present in the standard libraries for so long that most programmers
don’t remember a time when it wasn’t there. It is said to originally
be a merge of two different string modules, written and designed by
two different programmers with possibly slightly different goals and
definitely slightly different views on function naming. While
sometimes criticized for duplicated functionality and inconsistent
function naming, among other things, the module has remained useful
throughout the entire lifespan of Erlang/OTP. The string
representation used has also withstood the evolution of Unicode.
It is worth noting that the only functions in the string module
that actually are language or region dependent are later additions to
the module. Those functions (like to_upper, to_lower, to_integer and
to_float), or their binary equivalents, are not part of the module
interface I suggest for bstring for the simple reason that they
need language support not yet present in Erlang. A future EEP might
suggest such language support (i.e. some kind of “locale” support), but
that is future work not covered by this EEP.
So, however criticized, the string module is very useful for manipulating lists, and the same functionality for binary strings is desirable. While a lot of the functionality will be similar, there are some major issues to consider when implementing a module for manipulating strings encoded in binaries:
Unicode - Binaries can have different encodings. A character encoded as UTF-8 might take more than one byte (up to four), and the same character can even have different encodings in ISO-Latin-1 and UTF-8 (all codepoints from 128 to 255). The functions need to be informed of the character encoding explicitly, as the encoding information is not present in the binaries.
Mixed character encodings - As characters can be encoded in different ways, two strings in the same program could have different encodings. Supplying the functions with non-homogeneous string encoding data should be consistently solved throughout the module, as should the selection of returned encoding where applicable.
Default character encoding - As functions will take extra arguments to specify encoding, a consistent default might be useful. Choosing the default is not entirely simple, as the tradition states ISO-Latin-1, while the future suggests UTF-8.
Languages - Erlang has no notion of “Locale” or preferred number format. A general string module can assume neither a specific notion of uppercase or lowercase letters nor a specific number encoding format (especially true for floating point numbers).
Word separators - The space character is certainly not the only word separator for textual data (in any language). Assuming that words are separated by spaces restricts the languages the module is useful for.
Left to right or right to left - Notions like left or right to denote the beginning or end of a string are certainly not language independent. While strings in a language have a beginning and an end, that beginning and end may be placed to the left, to the right, or even at the top, bottom or center of the graphical representation. A string manipulation module should not use naming implying a left-to-right script, or any other type of script.
Naming and duplicated functionality - The original string module
has been accused of having somewhat inconsistent naming and
functionality duplicated. In fact the only duplicated functions are
substr and sub_string. Some cleanup of the interface might
be needed.
Byte oriented versus character oriented return values - When dealing
with Unicode data, a character may take more than one byte, so
e.g. counting the number of characters in a string tells you very
little about the actual size of the string in bytes. Furthermore,
later processing of a binary might require byte-oriented
manipulation of a string rather than character-oriented (e.g. you
want to manipulate the string using the binary module or with
bit-syntax), while characters, not bytes, are actually what constitute a
string. You would want both.
New or replaced functionality - New functionality has been suggested from several sources,
most notably EEP 9. For example, the function split suggested in EEP 9 is very similar to
tokens. Should we keep tokens anyway, for example?
I’ll address the different issues below.
The interface has to support both ISO-Latin-1 and UTF-8. The unicode module supports even more encodings, but Erlang/OTP uses UTF-8 for all “internal” interfaces and UTF-8 is the expected encoding of a binary Unicode string. Even though UTF-8 is compatible with ISO-Latin-1 in the 7bit ASCII range, characters with codepoints between 128 and 255 are encoded differently in the “plain” ISO-Latin-1 encoding and in UTF-8. This means that all functions in the bstring module need to have the actual encoding as one or more extra parameters.
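To make the difference concrete, here is a minimal sketch that uses the existing unicode module to encode one codepoint both ways. The module name encoding_demo and the function encodings/1 are hypothetical helpers for illustration only and are not part of the suggested interface.

```erlang
-module(encoding_demo).
-export([encodings/1]).

%% Hypothetical helper (not part of bstring): returns the ISO-Latin-1
%% and the UTF-8 encoding of one codepoint in the range 0..255.
encodings(Codepoint) when is_integer(Codepoint),
                          Codepoint >= 0, Codepoint =< 255 ->
    %% A list of integers is interpreted as Unicode codepoints.
    Latin1 = unicode:characters_to_binary([Codepoint], unicode, latin1),
    Utf8 = unicode:characters_to_binary([Codepoint], unicode, utf8),
    {Latin1, Utf8}.
```

For $a both binaries are the single byte 97, while for codepoint 246 (ö) the ISO-Latin-1 binary is <<246>> and the UTF-8 binary is <<195,182>>.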
One could invent a more abstract binary string format where the data is, for example, represented as a tuple with the string and the encoding packed together. However, no other module supports such a string construct and I don’t think it would really add anything, in either functionality or readability. Consider code like:
bstring:tokens(Bin,latin1,[$ ,$\n])
compared to:
bstring:tokens({Bin,latin1}, [$ ,$\n]).
or even:
bstring:tokens(#bstring{data = Bin, encoding = latin1}, [$ ,$\n]).
In many cases the extra information needs to be added in connection to the call, making the code no more readable or simple to write than with the separate extra argument. Consider if we had a default value for encoding. The code:
f(Data) ->
    bstring:tokens(Data,[$ ,$\n]).
would not in any way indicate if Data was supposed to be a binary with the default encoding or some kind of complex data structure indicating both the actual string and its encoding.
I think the extra argument for the encoding is straightforward and simple, and it makes programming easier when using the binary string in other modules as well (e.g. re, binary, file etc). I think we should simply not have a special string datatype for this module; character encoding should be supplied as a separate argument.
To ease transition between character encodings, I think the interface should accept different encodings for both different parameters and the return value. This makes it possible to convert on the fly and for the functions to decide on the most efficient character conversion path for the supplied arguments and the return value.
The downside of this approach is that some functions will take a lot of parameters telling different character encodings, for example a string concatenation routine could look like:
concat(BString1, Encoding1, BString2, Encoding2, Encoding3) -> BString3
being called like:
US = bstring:concat(SA,latin1, SB, latin1, unicode),
which might look a little awkward to write. On the other hand, conversion is made on the fly and you will not need to explicitly call the unicode module to convert the result.
I think implicit conversion is so useful that it is worth the extra arguments. For example, a concat function would be more or less useless without it, as the bit syntax would be much easier to use if no conversion were allowed.
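As an illustration of the on-the-fly conversion, the following is a minimal sketch of how concat/5 could be expressed with the existing unicode module. It is not the suggested implementation: the module name concat_sketch is hypothetical, and a real implementation would presumably choose a more efficient conversion path than converting each argument separately.

```erlang
-module(concat_sketch).
-export([concat/5]).

%% Sketch of concat/5: convert both inputs to the output encoding,
%% then concatenate the resulting bytes with the bit syntax.
concat(B1, Enc1, B2, Enc2, OutEnc) ->
    <<(convert(B1, Enc1, OutEnc))/binary,
      (convert(B2, Enc2, OutEnc))/binary>>.

%% Faulty input data, or codepoints that cannot be represented in
%% OutEnc, make unicode:characters_to_binary/3 return an error tuple,
%% which is turned into a badarg exception here.
convert(Bin, InEnc, OutEnc) when is_binary(Bin) ->
    case unicode:characters_to_binary(Bin, InEnc, OutEnc) of
        Result when is_binary(Result) -> Result;
        _Error -> erlang:error(badarg)
    end;
convert(_, _, _) ->
    erlang:error(badarg).
```

Calling concat_sketch:concat(<<"sm">>, latin1, <<246,$r>>, latin1, unicode) gives <<115,109,195,182,114>>, i.e. the ISO-Latin-1 byte 246 is re-encoded as the two UTF-8 bytes 195,182.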
Choosing a default character encoding is not obvious. While ISO-Latin-1 is the default in Erlang (i.e. <<"korvsmörgås">> gives an ISO-Latin-1 encoded binary string), UTF-8 usage is expected to grow in the future.
Although it’s tempting to select UTF-8 as the default encoding, I think we should stick to ISO-Latin-1 as the default even for this module. There are several reasons:
We need not, as a rule, impose new standards in every module we add to the standard library. Consistency certainly adds value, and the bit-syntax, the source code encoding and things like the io:format routine all have ISO-Latin-1 as default. Let’s not make this module inconsistent with the others.
The string module is often used to manipulate arbitrary lists
of integers, not always actually representing textual data. In the
same way, bstring can probably be used to manipulate arbitrary
blobs of bytes if the ISO-Latin-1 versions are used. ISO-Latin-1 is
actually the raw bytes uninterpreted, which is why any binary data can be
worked on in an ISO-Latin-1 oriented routine. Using UTF-8 encoding as
default would narrow the use of the default functions to only work
on real text data.
The pure ISO-Latin-1 implementations of the functions will be the
most efficient ones as no data checking at all is needed. Any byte
value is acceptable in any version. Some functions are usable on
UTF-8 strings even though they expect ISO-Latin-1 data, the only
difference between the ISO-Latin-1 version and the UTF-8 version
being input validation. If the data given to, for example,
bstring:concat is already checked for correct UTF-8, the simpler
ISO-Latin-1 version of the function is both more efficient and
guaranteed to give output as correct as the input:
CorrectUtf8_1 = give_me_good_string(),
CorrectUtf8_2 = give_me_another_good_string(),
CorrectUtf8_3 = bstring:concat(CorrectUtf8_1, latin1, CorrectUtf8_2, latin1, latin1),
...
Simply put, ISO-Latin-1 versions of the functions are more generally useful than pure UTF-8 versions and are also more efficient.
A wrapper module providing pure UTF-8 interfaces can easily be written. The overhead of going via a wrapper would be relatively lower for a UTF-8 wrapper than for an ISO-Latin-1 one, as the overhead of character decoding/encoding of UTF-8 strings in the module would be quite high. Simply put, a wrapper would cost very little compared to the cost of checking the data for UTF-8 correctness.
I actually suggest a module ubstring that has the part of the
bstring interface where a default encoding is implied, but with
the difference that UTF-8 is expected. For example, a function
ubstring:tokens/2 would look like this:
tokens(S,L) -> bstring:tokens(S,unicode,L).
Quite simple.
To conclude, I think all functions should exist in a version where no encoding is supplied and ISO-Latin-1 encoded data is expected.
Even though Unicode characters can be used to express text in most
known, living and dead scripts, language and region knowledge is a
completely different thing. String interfaces often impose language
specific properties of the string, like left-to-right writing
direction, the notion of words built up by space separated groups of
characters, ways of representing numbers and decimal points etc. As
Erlang does not (yet) have a way of specifying such language-, or
region-specific properties of a string, the interface should not
contain language-dependent functionality. The string module did not
originally contain such functions (except that character alignment
functions were named left and right), but unfortunately
functions like to_float and to_upper have been added.
I think that having language-dependent functions in the string
module was a mistake and I do not want to make that mistake
again. Hence I have not included such functions or names in
bstring.
I rather suggest “Locale” functionality as a subject of a future
EEP. For those who consider that simple, try to write a correct
to_upper function for just all European languages, make sure it
works on all platforms that can run Erlang… Maybe not rocket science, but a
lot of metadata is required. Data that is not always available in
the underlying OS, but probably needs to be distributed with Erlang/OTP for
consistent functionality. Definitely worth its own EEP.
In connection with language independence, I think we should drop the
notion of words as a group of characters separated by space. The word
“token” is more general and does not in the same way indicate language
constructs. The string module has the ASCII space character as a
default for word separation, which I think should be dropped in
bstring. Whatever should separate tokens should be supplied,
possibly as alternatives. I therefore suggest the functions
bstring:num_tokens and bstring:nth_token to fulfill the
functionality of string:words and string:sub_word.
As in EEP 9, I suggest a new function split to handle the case
of multi-character separators for tokens. A combination of split
and join makes a convenient replace function too.
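The split-then-join idea can be sketched with functions that already exist in stdlib. This is only a sketch under stated assumptions: the module name replace_sketch is hypothetical, all arguments are assumed to be in the same encoding and are treated as raw bytes, every occurrence is replaced (there is no Where parameter), and lists:join/2 requires OTP 19 or later.

```erlang
-module(replace_sketch).
-export([replace/3]).

%% Byte-oriented sketch of "replace = split + join".
%% Separators is a non-empty list of non-empty binaries.
replace(Bin, Separators, Replacement) when is_binary(Replacement) ->
    %% Split at every occurrence of any separator ...
    Parts = binary:split(Bin, Separators, [global]),
    %% ... and join the parts again with the replacement in between.
    iolist_to_binary(lists:join(Replacement, Parts)).
```

For example, replace_sketch:replace(<<"a-b_c">>, [<<"-">>, <<"_">>], <<"+">>) gives <<"a+b+c">>.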
As mentioned earlier, I don’t think direction of the graphical
representation should be implied in the interface, which is why I suggest using
notions like leading and trailing (meaning leading and trailing
characters in the binary) rather than any directional notions. I also
think aligning strings (like in string:right etc.) could be solved
in one function align, taking one of the atoms leading,
trailing or center as a parameter, if it should be implemented
at all.
I definitely do not think we should have all interfaces from
string duplicated to bstring. Especially interfaces that are
aliases should not be carried along to the bstring module. Most
functions in the string module however have short and fairly
describing names, often similar to names found in other languages. I
think using an r prefix for functionality working from the end of
the string towards the beginning is a good choice, as is c for
complement.
Some functions in string, that are certainly useful, return numbers
denoting character positions. The same functions should definitely be
present in the bstring module and the return values should
definitely be character oriented. However byte offsets are definitely
useful, for example if we use a function like span to find the
first character not in a set of characters, we might want the byte
offset of that first character too.
I suggest adding some interfaces returning byte offsets, or part()’s
like the ones used in the binary module and by re, to cope
with the need for byte offsets and lengths in some circumstances. A
b suffix to the function name could denote such functionality, so
that bstring:span returns a character position while
bstring:spanb returns a byte position and bstring:str returns a
character position and bstring:strb returns a part(). Although
this will in the end give rise to more functions in the interface,
having return-type-changing options in an option list is not the way
to go (I know, I have them in re, but it’s still not generally a
good idea…).
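The difference between the two kinds of return value can be sketched for UTF-8 data with the bit syntax. The module name span_sketch is hypothetical, only the UTF-8 case is covered, and the functions take no Encoding argument, so this is an illustration rather than the suggested interface.

```erlang
-module(span_sketch).
-export([span/2, spanb/2]).

%% Number of leading characters of a UTF-8 binary that belong to Chars.
span(Bin, Chars) ->
    {NChars, _NBytes} = scan(Bin, Chars, 0, 0),
    NChars.

%% Number of leading bytes occupied by those same characters.
spanb(Bin, Chars) ->
    {_NChars, NBytes} = scan(Bin, Chars, 0, 0),
    NBytes.

%% Walk the binary one UTF-8 character at a time, counting both
%% characters and bytes until a character outside Chars is found.
scan(<<C/utf8, Rest/binary>> = Bin, Chars, NChars, NBytes) ->
    case lists:member(C, Chars) of
        true ->
            CharBytes = byte_size(Bin) - byte_size(Rest),
            scan(Rest, Chars, NChars + 1, NBytes + CharBytes);
        false ->
            {NChars, NBytes}
    end;
scan(<<>>, _Chars, NChars, NBytes) ->
    {NChars, NBytes};
scan(_Invalid, _Chars, _NChars, _NBytes) ->
    %% The scanned part is not valid UTF-8.
    erlang:error(badarg).
```

With Bin = <<195,165,195,164,$x>> (the UTF-8 encoding of "åäx") and Chars = [16#E5, 16#E4], span/2 returns 2 while spanb/2 returns 4.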
When writing a general string module, there is no end to the new, more
or less esoteric, functionality one could add. I think we, at least
in an initial implementation, should stick to the functionality
outlined in EEP 9, namely extending str and friends to
optionally take a list of alternative strings to search for, add a
function split to take care of multi-character separators (as
opposed to single character separators in the function tokens) and
a substitution function, which I think should be named replace as
in other modules.
The use of pre-compiled matches from the binary module is however
not a good idea, as the binary module has no notion of character
encoding. Search strings need to be given in defined character
encodings, and the encodings of both the “haystack” and the “needles” need to
be known when doing an efficient search. So - no pre-compiled search
expressions.
As made obvious above, I prefer the name bstring for a binary
string module in favor of the more verbose name binary_string
originally suggested. In that module bstring, I suggest the
following interfaces, expressed as in a manual page of OTP.
encoding() = latin1 | unicode | utf8
- The encoding of characters in the binary data, both input and output
bstring()
- Binary with characters encoded either in ISO-Latin-1 or UTF-8
unicode_char() = non_negative_integer()
- An integer representing a valid unicode codepoint
non_negative_integer()
- An integer >= 0
align(BString, Alignment, Number, Char) -> Result
align(BString, Encoding, Alignment, Number, Char) -> Result
Types:
BString = Result = bstring()
Encoding = encoding()
Alignment = leading | trailing | center
Number = non_negative_integer()
Char = unicode_char()
Aligns the characters in BString in a Result of Number characters according to the Alignment parameter. Alignment is done by inserting the character Char in the beginning or end (or both) of the binary string.
The resulting binary string will contain exactly Number characters. The string is truncated if it contains more characters than Number: at the end if Alignment is leading, at the beginning if Alignment is trailing, or at both ends if Alignment is center. If Encoding is unicode, the Result may well contain more bytes than Number, as one character may require several bytes.
Example:
> bstring:align(<<"Hello">>, latin1, center, 10, $.).
<<"..Hello...">>
If the encoding is not given, it is assumed to be latin1, implying that no interpretation is given to the bytes in the binary string.
Raises a badarg exception if BString does not contain characters encoded according to the Encoding parameter, Encoding or Alignment has an invalid value, the character Char cannot be encoded in the character encoding given as Encoding or any of the parameters are of the wrong type.
chr(BString, Character) -> Position
chr(BString, Encoding, Character) -> Position
rchr(BString, Character) -> Position
rchr(BString, Encoding, Character) -> Position
Types:
BString = bstring()
Encoding = encoding()
Character = unicode_char()
Position = integer()
Returns the (zero-based) character position of the first/last occurrence of Character in BString. -1 is returned if Character does not occur.
Note that the character position is not the same as the byte position. Use the chrb and rchrb functions to get the byte positions.
If Character cannot be represented in the encoding, this is not an error; you are simply certain to get -1 as a return value.
If the encoding is not given, it is assumed to be latin1, implying that no interpretation is given to the bytes in the binary string.
Raises a badarg exception if the searched part of BString does not contain characters encoded according to the Encoding parameter, the Encoding has an invalid value, or any of the parameters are of the wrong type.
chrb(BString, Character) -> {BytePosition, ByteLength}
chrb(BString, Encoding, Character) -> {BytePosition, ByteLength}
rchrb(BString, Character) -> {BytePosition, ByteLength}
rchrb(BString, Encoding, Character) -> {BytePosition, ByteLength}
Types:
BString = bstring()
Encoding = encoding()
Character = unicode_char()
BytePosition = integer()
ByteLength = non_negative_integer()
Works as chr and rchr respectively, but returns the byte position and byte length of the character.
If the character is not found, {-1,0} is returned.
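A minimal sketch of chr and chrb for UTF-8 data follows, to show how the character position and the byte-oriented {BytePosition, ByteLength} pair relate. The module name chr_sketch is hypothetical and the Encoding argument is left out (UTF-8 is assumed), so this is not the suggested implementation.

```erlang
-module(chr_sketch).
-export([chr/2, chrb/2]).

%% Zero-based character position of the first occurrence, or -1.
chr(Bin, Char) ->
    {CharPos, _Part} = find(Bin, Char, 0, 0),
    CharPos.

%% {BytePosition, ByteLength} of the first occurrence, or {-1,0}.
chrb(Bin, Char) ->
    {_CharPos, Part} = find(Bin, Char, 0, 0),
    Part.

%% Scan one UTF-8 character at a time, tracking both positions.
find(<<C/utf8, Rest/binary>> = Bin, Char, CharPos, BytePos) ->
    Len = byte_size(Bin) - byte_size(Rest),
    case C =:= Char of
        true -> {CharPos, {BytePos, Len}};
        false -> find(Rest, Char, CharPos + 1, BytePos + Len)
    end;
find(<<>>, _Char, _CharPos, _BytePos) ->
    {-1, {-1, 0}};
find(_Invalid, _Char, _CharPos, _BytePos) ->
    erlang:error(badarg).
```

With Bin = <<195,165,$b,195,164>> ("åbä" in UTF-8), chr(Bin, 16#E4) returns 2 and chrb(Bin, 16#E4) returns {3,2}. When the character is found, the latter can be passed directly to binary:part/2 to extract its bytes.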
concat(BString1, BString2) -> BString3
concat(BString1, Encoding1, BString2, Encoding2, Encoding3) -> BString3
Types:
BString1 = BString2 = BString3 = bstring()
Encoding1 = Encoding2 = Encoding3 = encoding()
Concatenates two binary strings to form a new string. Returns the new binary string in the encoding given by Encoding3.
If the encoding is not given, it is assumed to be latin1, implying that no interpretation is given to the bytes in the binary string.
Raises a badarg exception if BString1 or BString2 does not contain characters encoded according to the Encoding1 and Encoding2 parameters, the encoding parameters have an invalid value, the codepoints in the in-parameters cannot be represented in the output encoding or any of the parameters are of the wrong type.
equal(BString1, BString2) -> bool()
equal(BString1, Encoding1, BString2, Encoding2) -> bool()
Types:
BString1 = BString2 = bstring()
Encoding1 = Encoding2 = encoding()
Tests whether two binary strings are equal. Returns true if they are, otherwise false.
Encoding1 is the encoding of BString1 and Encoding2 is the encoding of BString2.
Note that the strings can have different encoding and that it is the character values encoded in the strings that are compared. The binary strings are scanned as long as they are equal, meaning that if the function returns true, both strings are correctly encoded, while a return value of false does not guarantee correct encoding in both binary strings. An exception is raised if faulty encoding is determined while comparing the strings, not if parts of the string not inspected contain encoding errors.
If the encoding is not given, it is assumed to be latin1, implying that no interpretation is given to the bytes in the binary string.
Raises a badarg exception if wrongly encoded characters, according to the encoding parameters, are encountered during comparison, the encoding parameters have an invalid value or any of the parameters are of the wrong type.
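The character-wise comparison described for equal can be sketched as below. The module name equal_sketch and the helper next/2 are hypothetical; the sketch decodes one character at a time from each string and stops at the first difference, so encoding errors in parts that are never inspected go unnoticed, as described above.

```erlang
-module(equal_sketch).
-export([equal/4]).

%% Compare the character values of two binary strings that may have
%% different encodings, stopping at the first difference.
equal(B1, E1, B2, E2) ->
    case {next(B1, E1), next(B2, E2)} of
        {eos, eos} -> true;
        %% The same variable C in both tuples means "same character".
        {{C, R1}, {C, R2}} -> equal(R1, E1, R2, E2);
        _Different -> false
    end.

%% Decode the next character, or signal end of string.
next(<<>>, Enc) when Enc =:= latin1; Enc =:= unicode; Enc =:= utf8 ->
    eos;
next(<<C, Rest/binary>>, latin1) ->
    {C, Rest};
next(<<C/utf8, Rest/binary>>, Enc) when Enc =:= unicode; Enc =:= utf8 ->
    {C, Rest};
next(_, _) ->
    %% Faulty encoding, invalid encoding name or wrong type.
    erlang:error(badarg).
```

For example, equal_sketch:equal(<<"sm",246,"r">>, latin1, <<"sm",195,182,"r">>, utf8) returns true, since both binaries encode the same four characters.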
join(BStringList, Separator) -> Result
join(BStringList, BStringListEncoding, Separator, SeparatorEncoding, ResultEncoding) -> Result
Types:
BStringList = [bstring()]
BStringListEncoding = SeparatorEncoding = ResultEncoding = encoding()
Separator = bstring()
Result = bstring()
Returns a binary string with the elements of BStringList separated by the binary string in Separator.
All the binary strings in BStringList should have the same encoding (given as BStringListEncoding). The Separator can however have a different encoding (given as SeparatorEncoding), as can the Result (given as ResultEncoding).
Example:
> bstring:join([<<"one">>, <<"two">>, <<"three">>], latin1, <<", ">>, latin1, latin1).
<<"one, two, three">>
If the encoding is not given, it is assumed to be latin1, implying that no interpretation is given to the bytes in the binary string.
Raises a badarg exception if binary strings in BStringList or the Separator do not contain characters encoded according to the BStringListEncoding and SeparatorEncoding parameters respectively, the encoding parameters have an invalid value, the codepoints in the in-parameters cannot be represented in the output encoding ResultEncoding or any of the parameters are of the wrong type.
len(BString) -> Length
len(BString, Encoding) -> Length
Types:
BString = bstring()
Encoding = encoding()
Length = non_negative_integer()
Returns the number of characters in the binary string.
If the encoding is not given, it is assumed to be latin1, implying that no interpretation is given to the bytes in the binary string.
Raises a badarg exception if BString does not contain characters encoded according to the Encoding parameter, the Encoding has an invalid value or any of the parameters are of the wrong type.
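As a sketch of why the encoding matters even for something as simple as len, the following hypothetical module len_sketch counts characters using the existing unicode module. It is an illustration under the stated semantics, not the suggested implementation, which would presumably avoid building an intermediate list.

```erlang
-module(len_sketch).
-export([len/2]).

%% In ISO-Latin-1 every byte is one character.
len(Bin, latin1) when is_binary(Bin) ->
    byte_size(Bin);
%% In UTF-8 the data has to be decoded (and thereby checked).
len(Bin, Enc) when is_binary(Bin), (Enc =:= unicode orelse Enc =:= utf8) ->
    case unicode:characters_to_list(Bin, utf8) of
        Chars when is_list(Chars) -> length(Chars);
        _ErrorOrIncomplete -> erlang:error(badarg)
    end;
len(_, _) ->
    erlang:error(badarg).
```

The binary <<195,182,$a>> has length 3 when viewed as ISO-Latin-1, but length 2 when viewed as UTF-8.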
nth_token(BString, N, CharList) -> Result
nth_token(BString, Encoding, N, CharList) -> Result
Types:
BString = Result = bstring()
Encoding = encoding()
CharList = [ unicode_char() ]
N = non_negative_integer()
Returns the token number N of BString (zero-based). Tokens are separated by the characters in CharList.
The returned token will have the same encoding as BString.
For example:
> bstring:nth_token(<<" Hello old boy !">>,latin1,3,[$o, $ ]).
<<"y">>
CharList is to be viewed as a set of characters; order is not significant. It is not an error if codepoints given in CharList cannot be represented in the Encoding.
Values of N >= number of tokens in BString will result in the empty binary string <<>> being returned.
If the encoding is not given, it is assumed to be latin1, implying that no interpretation is given to the bytes in the binary string.
Raises a badarg exception if BString does not contain characters encoded according to the Encoding parameter, the Encoding has an invalid value, or any of the parameters are of the wrong type.
num_tokens(BString, CharList) -> Count
num_tokens(BString, Encoding, CharList) -> Count
Types:
BString = bstring()
Encoding = encoding()
CharList = [ unicode_char() ]
Count = non_negative_integer()
Returns the number of tokens in BString, separated by the characters in CharList.
The result is the same as for length(bstring:tokens(BString,Encoding,CharList)), but avoids building the result.
For example:
> bstring:num_tokens(<<" Hello old boy!">>, latin1, [$o, $ ]).
4
CharList is to be viewed as a set of characters; order is not significant. It is not an error if codepoints given in CharList cannot be represented in the Encoding.
If the encoding is not given, it is assumed to be latin1, implying that no interpretation is given to the bytes in the binary string.
Raises a badarg exception if BString does not contain characters encoded according to the Encoding parameter, the Encoding has an invalid value, or any of the parameters are of the wrong type.
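The relation between tokens, num_tokens and nth_token can be sketched for the ISO-Latin-1 case with binary:split/3. The module name tokens_sketch is hypothetical, the Encoding argument is left out, and the trim_all option requires OTP 18 or later. Unlike the suggested num_tokens, this sketch does build the token list.

```erlang
-module(tokens_sketch).
-export([tokens/2, num_tokens/2, nth_token/3]).

%% ISO-Latin-1 only: split at any separator character and drop the
%% empty parts that arise from adjacent separators.
tokens(Bin, CharList) when is_binary(Bin), is_list(CharList) ->
    %% Codepoints above 255 cannot occur in ISO-Latin-1 data; ignore them.
    case [<<C>> || C <- CharList, C >= 0, C =< 255] of
        [] when Bin =:= <<>> -> [];
        [] -> [Bin];
        Seps -> binary:split(Bin, Seps, [global, trim_all])
    end.

num_tokens(Bin, CharList) ->
    length(tokens(Bin, CharList)).

%% Zero-based token access; <<>> if N is out of range.
nth_token(Bin, N, CharList) when is_integer(N), N >= 0 ->
    Tokens = tokens(Bin, CharList),
    case N < length(Tokens) of
        true -> lists:nth(N + 1, Tokens);
        false -> <<>>
    end.
```

For the binary <<" Hello old boy!">> and the separators [$o, $ ], the sketch gives the tokens <<"Hell">>, <<"ld">>, <<"b">> and <<"y!">>, so num_tokens/2 returns 4.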
span(BString, Chars) -> Length
span(BString, Encoding, Chars) -> Length
rspan(BString, Chars) -> Length
rspan(BString, Encoding, Chars) -> Length
cspan(BString, Chars) -> Length
cspan(BString, Encoding, Chars) -> Length
rcspan(BString, Chars) -> Length
rcspan(BString, Encoding, Chars) -> Length
Types:
BString = bstring()
Encoding = encoding()
Chars = [ integer() ]
Length = non_negative_integer()
Returns the length (in characters) of the maximum initial (span and cspan) or trailing (rspan and rcspan) segment of BString, which consists entirely of characters from (span and rspan), or not from (cspan and rcspan) Chars.
Chars is to be viewed as a set of characters; order is not significant. It is not an error if codepoints given in Chars cannot be represented in the Encoding.
For example:
> bstring:span(<<"\t    abcdef">>,latin1," \t").
5
> bstring:cspan(<<"\t    abcdef">>,latin1, " \t").
0
Codepoints in Chars that cannot be represented in Encoding are not considered an error.
If the encoding is not given, it is assumed to be latin1, implying that no interpretation is given to the bytes in the binary string.
Raises a badarg exception if the searched part of BString does not contain characters encoded according to the Encoding parameter, the Encoding has an invalid value, or any of the parameters are of the wrong type.
spanb(BString, Chars) -> ByteLength
spanb(BString, Encoding, Chars) -> ByteLength
rspanb(BString, Chars) -> ByteLength
rspanb(BString, Encoding, Chars) -> ByteLength
cspanb(BString, Chars) -> ByteLength
cspanb(BString, Encoding, Chars) -> ByteLength
rcspanb(BString, Chars) -> ByteLength
rcspanb(BString, Encoding, Chars) -> ByteLength
Types:
BString = bstring()
Encoding = encoding()
Chars = [ integer() ]
ByteLength = non_negative_integer()
Work exactly as the functions span, rspan, cspan and rcspan respectively, but return the number of bytes rather than the number of characters.
split(BString, Separators, Where) -> Tokens
split(BString, Encoding, Separators, SepEncoding, Where, ReturnEncoding) -> Tokens
Types:
BString = bstring()
Encoding = SepEncoding = ReturnEncoding = encoding()
Separators = [ bstring() ]
Where = first | last | all
Tokens = [bstring()]
Returns a list of tokens in BString, separated by the binary strings in Separators.
The Tokens returned are encoded according to ReturnEncoding.
Example:
> bstring:split(<<"abc defxxghix jkl">>, latin1, [<<"x">>,<<" ">>], latin1, all, latin1).
[<<"abc">>, <<"def">>, <<"ghi">>, <<"jkl">>]
Separators is to be viewed as a set of binary strings; order is not significant. It is not an error if codepoints given in Separators cannot be represented in the Encoding.
The Where parameter specifies at which occurrence of any of the Separators the binary string is to be split: at the first occurrence, the last occurrence or at all occurrences, in which case Tokens may be an arbitrarily long list.
If the encoding is not given, it is assumed to be latin1, implying that no interpretation is given to the bytes in the binary string.
Raises a badarg exception if BString or Separators does not contain characters encoded according to the Encoding and SepEncoding parameters respectively, the resulting tokens cannot be encoded in the ReturnEncoding, the Encoding has an invalid value, or any of the parameters are of the wrong type.
str(BString, SubBStrings) -> Position
str(BString, Encoding, SubBStrings, SubEnc) -> Position
rstr(BString, SubBStrings) -> Position
rstr(BString, Encoding, SubBStrings, SubEnc) -> Position
Types:
BString = bstring()
SubBStrings = bstring() | [ bstring() ]
Encoding = SubEnc = encoding()
Position = integer()
Returns the (zero-based) character position where the first/last occurrence of any of the SubBStrings begins in BString. -1 is returned if none of the SubBStrings occurs in BString.
Note that the character position is not the same as the byte position. Use the strb and rstrb functions to get the byte positions.
The encoding need not be the same for BString and SubBStrings; however, all strings in SubBStrings need to have the same encoding.
If the codepoints in SubBStrings cannot be represented in the encoding of BString, that is not an error, but will always result in the return value -1.
Example:
> bstring:str(<<" Hello Hello World World ">>,latin1,<<"Hello World">>,latin1).
7
Note that if both encodings are the same and repeated searches with the same SubBStrings are to be performed, it is more efficient to use the binary:match/{2,3} functions with a precompiled pattern on the raw binary data.
If the encoding is not given, it is assumed to be latin1, implying that no interpretation is given to the bytes in the binary string.
Raises a badarg exception if the searched part of BString or SubBString does not contain characters encoded according to the Encoding and SubEnc parameters, the Encoding has an invalid value, or any of the parameters are of the wrong type.
strb(BString, SubBStrings) -> {BytePosition, ByteLength}
strb(BString, Encoding, SubBStrings, SubEnc) -> {BytePosition, ByteLength}
rstrb(BString, SubBStrings) -> {BytePosition, ByteLength}
rstrb(BString, Encoding, SubBStrings, SubEnc) -> {BytePosition, ByteLength}
Types:
BString = bstring()
SubBStrings = bstring() | [ bstring() ]
Encoding = SubEnc = encoding()
BytePosition = integer()
ByteLength = non_negative_integer()
Works as str and rstr respectively, but returns the byte position and byte length of the found substring.
Note that ByteLength is the length the found substring has in BString, regardless of the encoding of SubBStrings, so ByteLength may be either larger or smaller than the byte_size of the matching substring in SubBStrings, depending on the encodings involved.
If the substring is not found, {-1,0} is returned.
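To illustrate why ByteLength can differ from the byte size of the substring that was passed in, here is a small sketch under stated assumptions: BString is valid UTF-8, the substring is non-empty ISO-Latin-1, and only the first occurrence is wanted. The module and function names are hypothetical and are not part of the proposal.

```erlang
-module(bstring_strb_sketch).
-export([strb_utf8_latin1/2]).

%% Byte position and byte length, within the UTF-8 binary BString, of
%% the first occurrence of the ISO-Latin-1 encoded SubLatin1.
%% Returns {-1, 0} when the substring is not found.
strb_utf8_latin1(BString, SubLatin1)
  when is_binary(BString), is_binary(SubLatin1) ->
    %% Re-encode the substring so both binaries use the same encoding.
    Sub = unicode:characters_to_binary(SubLatin1, latin1, utf8),
    case binary:match(BString, Sub) of
        nomatch -> {-1, 0};
        {BytePos, ByteLen} -> {BytePos, ByteLen}
    end.
```

Searching a UTF-8 string for the one-byte ISO-Latin-1 character 246 (ö) reports a byte length of 2, since that character occupies two bytes in UTF-8.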
strip(BString, Which, CharList) -> Result
strip(BString, Encoding, Which, CharList) -> Result
Types:
BString = Result = bstring()
Encoding = encoding()
Which = leading | trailing | both
CharList = [ unicode_char() ]
Removes leading (Which = leading), trailing (Which = trailing) or both leading and trailing (Which = both) characters belonging to the set indicated by CharList from the binary string BString.
This is essentially the same as using spanb and/or rspanb in combination with bit syntax to remove the characters.
Example:
> bstring:strip(<<"...He.llo.....">>, latin1, both, [$.]).
<<"He.llo">>
If the encoding is not given, it is assumed to be latin1, implying that no interpretation is given to the bytes in the binary string.
Raises a badarg exception if the scanned part of BString does not contain characters encoded according to the Encoding parameter, Encoding or Which has an invalid value, or any of the parameters are of the wrong type.
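For the fixed-width latin1 case, the span-then-slice approach described above could look roughly as follows. This is only a sketch: the module bstring_strip_sketch and its helper functions are hypothetical, and invalid arguments fail with a function_clause error here rather than the badarg that the proposal specifies.

```erlang
-module(bstring_strip_sketch).
-export([strip_latin1/3]).

%% Remove leading and/or trailing bytes that are members of CharList.
strip_latin1(BString, Which, CharList)
  when is_binary(BString), is_list(CharList),
       (Which =:= leading orelse Which =:= trailing orelse Which =:= both) ->
    Size = byte_size(BString),
    Start = case Which of
                trailing -> 0;
                _ -> span(BString, CharList, 0, Size)
            end,
    End = case Which of
              leading -> Size;
              _ -> rspan(BString, CharList, Size, Start)
          end,
    binary:part(BString, Start, End - Start).

%% Index of the first byte at or after Pos that is not in CharList.
span(BString, CharList, Pos, Size) when Pos < Size ->
    case lists:member(binary:at(BString, Pos), CharList) of
        true -> span(BString, CharList, Pos + 1, Size);
        false -> Pos
    end;
span(_BString, _CharList, Pos, _Size) ->
    Pos.

%% Index just past the last byte that is not in CharList, scanning
%% backwards from End but never going below Min.
rspan(BString, CharList, End, Min) when End > Min ->
    case lists:member(binary:at(BString, End - 1), CharList) of
        true -> rspan(BString, CharList, End - 1, Min);
        false -> End
    end;
rspan(_BString, _CharList, End, _Min) ->
    End.
```

The result is a sub-binary of the input, so no character data is copied when stripping.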
replace(BString, Separators, Replacement, Where) -> Result
replace(BString, Encoding, Separators, SeparatorsEncoding, Replacement, ReplacementEncoding, Where, ResultEncoding) -> Result
Types:
BString = bstring()
Encoding = SeparatorsEncoding = ReplacementEncoding = ResultEncoding = encoding()
Separators = [ bstring() ]
Replacement = bstring()
Where = first | last | all
Result = bstring()
Produces the same result as
bstring:join(bstring:split(BString,Encoding,Separators,SeparatorsEncoding,Where,
unicode),
unicode,Replacement,ReplacementEncoding,ResultEncoding)
but with less overhead.
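When all arguments share one byte-oriented encoding (for example latin1), much of this behaviour is already available through the existing binary module. The sketch below is an approximation under that assumption, not the proposed implementation: the module and function names are hypothetical, Separators must be a non-empty list of non-empty binaries, and the last case relies on the left-to-right, non-overlapping matches returned by binary:matches/2, which may differ from a true right-to-left search when separators overlap.

```erlang
-module(bstring_replace_sketch).
-export([replace_raw/4]).

%% Replace the first, last, or all occurrences of any separator,
%% operating on raw bytes without any encoding conversion.
replace_raw(BString, Separators, Replacement, first) ->
    binary:replace(BString, Separators, Replacement, []);
replace_raw(BString, Separators, Replacement, all) ->
    binary:replace(BString, Separators, Replacement, [global]);
replace_raw(BString, Separators, Replacement, last) ->
    case binary:matches(BString, Separators) of
        [] ->
            BString;
        Matches ->
            {Pos, Len} = lists:last(Matches),
            Before = binary:part(BString, 0, Pos),
            After = binary:part(BString, Pos + Len,
                                byte_size(BString) - Pos - Len),
            <<Before/binary, Replacement/binary, After/binary>>
    end.
```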
substr(BString, Start, Length) -> SubBString
substr(BString, Encoding, Start, Length) -> SubBString
Types:
BString = SubBString = bstring()
Encoding = encoding()
Start = integer()
Length = non_negative_integer() | infinity
Returns a substring of BString, starting at the zero-based character position Start, and ending at the end of the binary string (if Length is infinity) or up to, but not including, the character position Start+Length (if Length is a non-negative integer).
The returned SubBString will have the same encoding as BString.
Example:
> bstring:substr(<<"Hello World">>, latin1, 3, 5).
<<"lo Wo">>
A negative value of Start denotes abs(Start) characters from the end of BString, so that -1 is the last character position in the binary string.
Example:
> bstring:substr(<<"Hello World">>, latin1, -3, 3).
<<"rld">>
As the true length of a UTF-8 encoded binary string is quite costly to determine (O(N), where N is the number of bytes in the binary), the function is very forgiving about positions given outside of the string, for both Start and Length. Any part of the requested range that lies outside of the string, in either direction, is ignored; if nothing remains, the result is the empty binary string.
Examples:
> bstring:substr(<<"01234">>, latin1, 5, 5).
<<>>
> bstring:substr(<<"01234">>, latin1, 4, 5).
<<"4">>
> bstring:substr(<<"01234">>, latin1, -5, 100).
<<"01234">>
> bstring:substr(<<"01234">>, latin1, -6, 1).
<<>>
> bstring:substr(<<"01234">>, latin1, -6, 2).
<<"0">>
If the encoding is not given, it is assumed to be latin1, implying that no interpretation is given to the bytes in the binary string.
Raises a badarg exception if the searched part of BString does not contain characters encoded according to the Encoding parameter, the Encoding has an invalid value, or any of the parameters are of the wrong type.
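The forgiving treatment of out-of-range positions amounts to clamping both ends of the requested range to the string. For latin1, where one character is one byte, this can be sketched as below. The module name and the clamp/2 helper are hypothetical, and wrongly typed arguments fail with a function_clause or case_clause error here rather than badarg.

```erlang
-module(bstring_substr_sketch).
-export([substr_latin1/3]).

%% Substring of a latin1 binary with out-of-range positions clamped.
substr_latin1(BString, Start, Length)
  when is_binary(BString), is_integer(Start) ->
    Size = byte_size(BString),
    %% A negative Start counts from the end: -1 is the last character.
    From0 = if
                Start < 0 -> Size + Start;
                true -> Start
            end,
    To0 = case Length of
              infinity -> Size;
              L when is_integer(L), L >= 0 -> From0 + L
          end,
    %% Clamp both ends into 0..Size; whatever lies outside is dropped.
    From = clamp(From0, Size),
    To = clamp(To0, Size),
    binary:part(BString, From, To - From).

clamp(N, Size) ->
    max(0, min(N, Size)).
```

Since the end position is computed before clamping, a range that begins before the start of the string loses the characters that fall outside it, which matches the -6 examples above.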
tokens(BString, SeparatorList) -> Tokens
tokens(BString, Encoding, SeparatorList) -> Tokens
Types:
BString = bstring()
Encoding = encoding()
SeparatorList = [ non_negative_integer() ]
Tokens = [bstring()]
Returns a list of tokens in BString, separated by the characters in SeparatorList.
The Tokens returned are encoded in the same character encoding as BString.
Example:
> bstring:tokens(<<"abc defxxghix jkl">>, latin1, [$x,$ ]).
[<<"abc">>, <<"def">>, <<"ghi">>, <<"jkl">>]
SeparatorList is to be viewed as a set of characters; order is not significant. It is not an error for SeparatorList to contain code points that cannot be represented in Encoding.
If the encoding is not given, it is assumed to be latin1, implying that no interpretation is given to the bytes in the binary string.
Raises a badarg exception if the searched part of BString does not contain characters encoded according to the Encoding parameter, the Encoding has an invalid value, or any of the parameters are of the wrong type.
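For latin1 input, similar token splitting can be approximated today with binary:split/3. The sketch below assumes single-byte characters and the trim_all option of binary:split/3 (available in recent OTP releases); the module and function names are hypothetical. Separators above 255 cannot occur in a latin1 string and are simply dropped, in line with the rule that unrepresentable code points are not an error.

```erlang
-module(bstring_tokens_sketch).
-export([tokens_latin1/2]).

%% Split a latin1 binary into non-empty tokens at any separator byte.
tokens_latin1(BString, SeparatorList)
  when is_binary(BString), is_list(SeparatorList) ->
    %% Keep only separators representable as a single latin1 byte.
    case [<<C>> || C <- SeparatorList, is_integer(C), C >= 0, C =< 255] of
        [] when BString =:= <<>> ->
            [];
        [] ->
            %% No usable separators: the whole string is one token.
            [BString];
        Seps ->
            %% trim_all drops the empty parts between adjacent separators.
            binary:split(BString, Seps, [global, trim_all])
    end.
```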
This module can, and probably should, be implemented entirely in
Erlang; no BIFs or NIFs are needed. Both the binary and
unicode modules can be utilized to speed up conversion and input
checking. The Unicode versions will definitely be slower than the
ISO-Latin-1 versions, as character encoding, decoding and checking are
bound to produce overhead.
The suggested wrapper ubstring should not impose any significant
cost compared to calling bstring with all encoding arguments set
to unicode.
The idea is to make string manipulation using binaries convenient, as it has a great positive impact on a system's memory usage. Increased speed compared to list-oriented strings is not the goal, although it may well be a side effect.
No specific reference implementation exists yet; the code will, however, be made available on GitHub during any development.
This document is licensed under the Creative Commons license.