Internal encoding and string type #96
In case you are currently not using UTF-8, I would like to draw your attention to http://utf8everywhere.org/, where arguments for choosing UTF-8 instead of other encodings are presented.
@Krzmbrzl yes, we use UTF-8 "internally". The decision to use `std::wstring` was deliberate. Clearly you are concerned about having to use wide strings everywhere.
I wouldn't exactly call it a deal-breaker. From my experience, it's just a little unusual to be required to use wide strings and wide characters for everything, especially since in SeQuant most strings are tensor and index names, which generally tend to be rather short. Thus, the additional `L` prefix required on every string literal feels like disproportionate overhead.
Yes and no. Based on my research on wide character / wide string support in C++, the size of a wide character is not the same across platforms. E.g. on Windows it is only 2 bytes wide, which means that BMP characters can be represented in one code unit (i.e. one wide character), but any non-BMP character still needs two code units (a surrogate pair) per character.
As long as the things that you are looking for are ASCII characters, I think it should work just fine, shouldn't it? Afaik, the UTF encodings guarantee that code units in the ASCII range always encode the corresponding ASCII character. So one shouldn't have to worry about accidentally matching a code unit that looks like the searched-for ASCII character but is really part of a multi-unit sequence.
Side-note: a quick test (https://godbolt.org/z/TbnGb3Gx6) seems to indicate that as soon as one uses wide strings / wide characters, one is no longer using UTF-8, but rather UTF-16 or UTF-32 (see above). See also https://www.boost.org/doc/libs/1_82_0/libs/locale/doc/html/recommendations_and_myths.html Overall, I tend to think that in order to deal with all facets of Unicode properly (on all platforms), use of a dedicated Unicode library makes things easier. That'd be ICU or Boost.Locale.
@Krzmbrzl thanks for making me revisit the issue. Yes, unfortunately wide strings everywhere are unusual/uncomfortable. Despite the drawbacks, I still think using `std::wstring` is the right choice. Indeed (and I was not aware of this) the default wide character encoding is platform- and compiler-option dependent: https://godbolt.org/z/eY1TocTPf In my opinion ASCII chars are not enough: Greek characters at a minimum are needed, and many others (subscripts) are also very useful for making the code resemble math as far as possible. Going beyond the BMP, though, is not as relevant.
Until things break, because someone used a non-BMP character :)
Index labels and tensor names are the two instances in which I have mostly encountered the wstring API so far. So yeah, providing overloads for these APIs to also accept regular strings would mitigate the effect somewhat.
I would highly recommend sticking to a single string encoding in your code; UTF-8 everywhere is probably the easiest way to go. Converting between encodings gets very complicated, to say the least. The standard C++ library string conversion implementations are a bit buggy from what I have read (though I can't find the reference anymore). Complications arise because there are multiple ways some characters can be encoded. This also causes problems for string comparison, as two equivalent characters can have different byte patterns. Using wide characters does not avoid this problem either. To understand this better, here is a brief explanation of Unicode character encoding.
Unicode text is layered: bytes encode code units, one or more code units encode a code point, and one or more code points form a grapheme cluster, i.e. a user-perceived character. The most common cases where you will see grapheme clusters with more than one code point are characters with accents or other marks, and emojis. This hierarchy makes it difficult to find character boundaries. The other half of the difficulty is that a single character can have multiple encodings: the order of code points in a character can differ (typically for characters with multiple accents or marks), and so can their number (a letter with a single accent or mark can be encoded with one precomposed code point or two code points). Which encoding is used depends on the specific implementation, the OS, and other factors. Similarly, conversion between encodings can be implemented in multiple ways, because some characters have multiple valid encodings. Using wide characters can simplify iterating over single characters, assuming all the characters you want to represent consist of a single code unit, but it addresses neither grapheme clusters that consist of multiple code points nor the conversion issues. If you want to be very robust, I recommend using the ICU character break iterator to split strings on grapheme clusters: https://unicode-org.github.io/icu/userguide/boundaryanalysis/ If you are willing to assume everything is a single code point (i.e. no accents or marks on characters), then simply splitting on code points may be sufficient. You can write a relatively short function with ICU to do this:

```cpp
#include <string_view>

#include <unicode/uiter.h>  // uiter_setUTF8, uiter_next32

// Use ICU to iterate over each code point in a UTF-8 string and
// invoke a callback for each one.
template <typename CallbackT>
void foreach_char(std::string_view str, CallbackT callback) {
  UCharIterator iter;
  uiter_setUTF8(&iter, str.data(), static_cast<int32_t>(str.size()));
  UChar32 c;
  while ((c = uiter_next32(&iter)) >= 0) {
    callback(c);
  }
}
```
Another, seemingly lightweight option for an external UTF-8 string library is https://github.com/sheredom/utf8.h
From e.g. SeQuant/examples/srcc/srcc.cpp (lines 97 to 98 at 1dd6257) I gather that SeQuant wants to use UTF-8. However, in the code you chose to use wide strings (`std::wstring`), which would instead indicate that you really want to use UTF-16 or UTF-32. This opens up two questions for me: