- String Encoding
- Strings in Windows
- Literals and String Data Types
- C string
- C++ String
- String to Number Conversion
- Number to String Conversion
- Splitting String By Delimiter
- string_view
- Find String Case Insensitive
- The small string optimization
ASCII uses 7 bits to represent a character, which gives 2^7 (= 128) distinct combinations, so at most 128 characters can be represented. The eighth bit of a byte was often used as a parity bit for error detection. Most ASCII characters are printable: letters, digits, and punctuation such as abc, ABC, 123, ?&!. The others are control characters such as carriage return, line feed, and tab. ASCII was meant for English only.
The main difference between ANSI and ASCII is the number of characters they can represent. ASCII was the first to be developed and when its limitations were reached, ANSI was one of the ways created to expand the number of characters that can be represented in an encoding. In ANSI, 8 bits are used; increasing the maximum number of characters to be represented up to 256. This is expanded even further because of how ANSI uses code pages with different character sets. There are a number of ANSI code pages that are meant for other languages like Japanese, Chinese, and many others. The application processing the file just needs to know which code page is in use in order to decipher the files properly.
There are many variants of Extended ASCII (8-bit systems) that cover regional characters and symbols. One example includes the letters needed for writing the languages of Western Europe plus certain special characters. This encoding is called ISO Latin-1 or ISO 8859-1 (ISO refers to the International Organization for Standardization), and it was long the default character set in many browsers. The ISO 8859-1 character set includes the original ASCII character set (values 0 to 127), plus an extended character set (codes 160 to 255) which contains the characters used in Western European countries and some commonly used special characters. Many Windows systems use another related 8-bit encoding, and this Microsoft-specific encoding is referred to as ANSI, or Windows-1252. It is similar to ISO 8859-1 except that character codes 128-159 in ISO 8859-1 are reserved for controls whereas ANSI uses most of them for printable characters. ANSI stands for American National Standards Institute. The ANSI character set includes the standard ASCII character set (values 0 to 127), plus an extended character set (values 128 to 255).
Some clever people started using the 8th bit (the bit used for parity) to encode more characters to support their language (to support é in French, for example). Just using one extra bit doubled the size of the original ASCII table to map up to 256 characters (2^8 = 256 characters).
Extended ASCII solves the problem for languages based on the Latin alphabet, but not for languages that are completely different (Greek, Russian, Chinese). Those needed an entirely new character set, which is what Unicode provides. Even so, Unicode doesn't contain every character from every language.
Unicode assigns each character an abstract number (a code point), and you need to "encode" this abstract representation. That's where an encoding comes into play. A character encoding maps every character to a series of bits, numbers, or electrical pulses (Morse code, ASCII, the Unicode encodings).
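`UTF-8` and `UTF-16` are variable-length encodings, while `UTF-32` is a fixed-length encoding:
- In `UTF-8`, a character occupies a minimum of 8 bits (1 to 4 bytes).
- In `UTF-16`, a character occupies a minimum of 16 bits (one or two 16-bit units).
- In `UTF-32`, every character occupies exactly 32 bits.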
- `char`: 8-bit character - underlying C/C++ data type
- `CHAR`: alias of `char` - Windows data type
- `LPSTR`: null-terminated string of `CHAR` (Long Pointer)
- `LPCSTR`: constant null-terminated string of `CHAR` (Long Pointer)
`char` is supposed to hold a character, usually an 8-bit character.
`wchar_t` is supposed to hold a wide character, and then things get tricky: on Linux a `wchar_t` is 4 bytes, while on Windows it is 2 bytes.
Neither `char` nor `wchar_t` is directly tied to Unicode.
- `wchar_t`: wide character (16-bit on Windows) - underlying C/C++ data type
- `WCHAR`: alias of `wchar_t` - Windows data type
- `LPWSTR`: null-terminated string of `WCHAR` (Long Pointer)
- `LPCWSTR`: constant null-terminated string of `WCHAR` (Long Pointer)
- `TCHAR`: alias of `WCHAR` if `UNICODE` is defined; otherwise `CHAR`
- `LPTSTR`: null-terminated string of `TCHAR` (Long Pointer)
- `LPCTSTR`: constant null-terminated string of `TCHAR` (Long Pointer)
Item | 8-bit | 16-bit | Varies |
---|---|---|---|
character | CHAR | WCHAR | TCHAR |
string | LPSTR | LPWSTR | LPTSTR |
string (const) | LPCSTR | LPCWSTR | LPCTSTR |
`LPCTSTR` = Long Pointer to a Const TCHAR STRing. Today a long pointer is the same as a pointer; the name survives from 16-bit Windows, which had two flavors of pointers (near and far).
LPSTR = char*
LPCSTR = const char*
LPWSTR = wchar_t*
LPCWSTR = const wchar_t* (This type is declared in WinNT.h as follows: typedef CONST WCHAR *LPCWSTR;)
LPTSTR = char* or wchar_t* depending on UNICODE
LPCTSTR = const char* or const wchar_t* depending on UNICODE
This type is declared in WinNT.h as follows:
#ifdef UNICODE
typedef LPCWSTR LPCTSTR;
#else
typedef LPCSTR LPCTSTR;
#endif
`std::string` is a `basic_string` templated on `char`, and `std::wstring` on `wchar_t`.
Type | Definition |
---|---|
std::string | std::basic_string<char> |
std::wstring | std::basic_string<wchar_t> |
std::u8string (C++20) | std::basic_string<char8_t> |
std::u16string (C++11) | std::basic_string<char16_t> |
std::u32string (C++11) | std::basic_string<char32_t> |
- `_T()`, and its Win32 equivalent `TEXT()`, are preprocessor macros that prepend `L` to the string literal if `_UNICODE` or `UNICODE` is defined, respectively. The `_T()` macro was added when you needed to support both Windows NT and later (which support Unicode) and Windows 9x/ME (which do not). These days any code using these macros is obsolete, since all modern Windows versions are Unicode-based. For example, `_T("Hello")` becomes `L"Hello"` if `_UNICODE` is defined; otherwise it stays the ANSI literal `"Hello"`.
  The plain versions without the underscore affect the character set the Windows header files treat as default. So if you define `UNICODE`, then `GetWindowText` will map to `GetWindowTextW` instead of `GetWindowTextA`, for example. Similarly, the `TEXT` macro will map to `L"..."` instead of `"..."`.
  The versions with the underscore affect the character set the C runtime header files treat as default. So if you define `_UNICODE`, then `_tcslen` will map to `wcslen` instead of `strlen`, for example. Similarly, the `_TEXT` macro will map to `L"..."` instead of `"..."`.
  In short, `UNICODE` is used by Windows headers, whereas `_UNICODE` is used by C-runtime/MFC headers.
- Use `_TCHAR` and `_T()` with C functions. Use `TCHAR` and `TEXT()` with the Win32 API. `CString` is based on the `TCHAR` data type, so use `TEXT()` with it.
- Use `LPTSTR` and `LPCTSTR` instead of `char *` and `const char *`:
  - `LPCSTR` is a pointer to a const string
  - `LPCTSTR` is a pointer to a const `TCHAR` string (`TCHAR` being either a wide char or char depending on whether `UNICODE` is defined in your project)
  - `LPTSTR` is a pointer to a (non-const) `TCHAR` string
For C++ strings, use `std::wstring` instead of `std::string`.
A C-style string is commonly declared as `const char *`: the string occupies fixed-size memory, so its length is not meant to grow or shrink. Just because something is a `char *` does not mean it lives on the heap, so we don't call `delete` on it. A C string is terminated by the null character `\0`. A `char` is initialized with single quotes `''`; if you use double quotes `""` you get a string literal, which is used through a `char *`.
A string literal or anonymous string is a type of literal for the representation of a string value. For example, in `x = "foo"`, `"foo"` is a string literal with value `foo`. String literals are arrays of `char` in C but arrays of `const char` in C++.
auto c0 = 'A'; // char
auto c1 = u8'A'; // char (char8_t since C++20)
auto c2 = L'A'; // wchar_t
auto c3 = u'A'; // char16_t
auto c4 = U'A'; // char32_t
// Multicharacter literals
auto m0 = 'abcd'; // int, implementation-defined value (commonly 0x61626364)
// String literals
auto s0 = "hello"; // const char*
auto s1 = u8"hello"; // const char* (const char8_t* since C++20), encoded as UTF-8
auto s2 = L"hello"; // const wchar_t*
auto s3 = u"hello"; // const char16_t*, encoded as UTF-16
auto s4 = U"hello"; // const char32_t*, encoded as UTF-32
const char* multiline = R"(line1
line2
line3)";
C strings are one-dimensional arrays of characters terminated by a null character `'\0'`.
Here `name` is a stack variable:
char name[10] = { 'b','e','h','n','a','m','\0' };
compiler output:
char name[10] = {'b', 'e', 'h', 'n', 'a', 'm', '\0', '\0', '\0', '\0'};
stack variable,
char name[] = { 'b','e','h','n','a','m','\0' };
compiler output:
char name[7] = {'b', 'e', 'h', 'n', 'a', 'm', '\0'}
stack variable,
char name[] = "behnam";
compiler output:
char name[7] = "behnam";
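Here the literal lives in a read-only section of memory (the code section), so this form of definition is not recommended. Since C++11 the conversion from a string literal to `char*` is no longer valid, and compilers accept it only as an extension, usually with a warning: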
char* name = "behnam";
and it is better to use
const char* name = "behnam";
With the non-const `char*` form, this will compile:
name[0] = 'C';
but modifying a string literal is undefined behavior and typically causes a segmentation fault, because the literal is in the code section, which is read-only (`"behnam"` is a string literal and `name` holds its starting address).
This is allowed (the value of `name` can be changed):
name = "Margarethe";
This code is okay since there is a `'\0'` at the end of the string:
std::cout << "name: " << name << std::endl;
this is also okay:
char name[7] = { 'b','e', 'h', 'n', 'a','m','\0' }; //or = { 'b','e', 'h', 'n', 'a','m', 0 };
std::cout << "name: " << name << std::endl;
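but printing the following array is undefined behavior: there is no room for a terminator, so the output typically continues with strange characters until a null character `0` happens to be found in memory: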
char name[6] = { 'b','e', 'h', 'n', 'a','m' };
Declares a pointer whose data cannot be changed through the pointer:
const int *p = &someInt;
or
int const *p;
Declares a pointer that cannot be changed to point to something else:
int * const p = &someInt;
To make a declaration easy to read, remove the variable type and look at what `const` is attached to:
`const int *p;` ==> `const *p` ==> `*p` is fixed, which is the data.
`int const *p;` ==> `const *p` ==> `*p` is fixed, which is the data.
`int * const p;` ==> `* const p` ==> `p` is fixed, which is the address.
Consider below two statements:
char a1[] = "Behnam";
char *p1 = "Behnam";
- `a1` is an array while `p1` is a pointer.
- `a1` is stored on the stack, but the literal that `p1` points to resides in the code section of memory.
- `a1++` is invalid but `p1++` is valid.
- `sizeof(a1)` will return `7` (six chars + `'\0'`) but `sizeof(p1)` will return `8` (pointer size on a 64-bit platform).
- `a1` and `&a1` yield the same address but `p1` and `&p1` do not.
- `a1[1]='n'` is okay but `p1[1]='n'` will cause a segmentation fault.
`sizeof(string)` tells you the size of the pointer when `string` is a `char*`, so it should NOT be used to measure a C string. `strlen("my string")` can be used for C strings:
char my_str[100]="my string";
std::cout<<"size of string is: "<<strlen( my_str )<<" bytes and string is: "<< my_str <<std::endl;
`str.size()` returns the size of C++ strings.
The `strdup` function (a POSIX function, standardized in C23) is available to C++ code on most platforms. It duplicates a string by allocating memory and copying the content of an existing string to that new memory. You would use `strdup` when you want a copy of a string that you plan to manage manually (for instance, when working with raw pointers and dynamic memory allocation).
Here is an example of when you might use `strdup`:
#include <iostream>
#include <cstdlib> // for free
#include <cstring> // for strdup
int main() {
// Original string
const char* original = "Hello, World!";
// Duplicate the string using strdup
char* duplicate = strdup(original);
// Print both strings
std::cout << "Original: " << original << std::endl;
std::cout << "Duplicate: " << duplicate << std::endl;
// Free the memory allocated by strdup
free(duplicate);
return 0;
}
When to use `strdup`:
- C-style strings: If you're working with raw C-style strings (`char*`) and need to make a copy of the string that requires its own memory management.
- Manual memory management: When the duplicated string will be used independently and may need to be freed later.
Important notes:
- `strdup` allocates memory using `malloc`, so you must free it with `free` when you're done using the duplicated string.
- In C++, you generally don't need to use `strdup` if you are working with `std::string`, which manages memory automatically. Instead, use `std::string`'s copy constructor or assignment operator, which is safer and more idiomatic.
C++ alternative using `std::string`:
#include <iostream>
#include <string>
int main() {
// Original string
std::string original = "Hello, World!";
// Duplicate the string using std::string
std::string duplicate = original;
// Print both strings
std::cout << "Original: " << original << std::endl;
std::cout << "Duplicate: " << duplicate << std::endl;
// No need to free memory, std::string handles it automatically
return 0;
}
In modern C++ code, using `std::string` is preferable to avoid manual memory management.
A substring search in the C style can be done with the `strstr` function, which is part of the C standard library (`<string.h>`). `strstr` searches for the first occurrence of a substring in a string and returns a pointer to the beginning of the substring if found. Otherwise, it returns `NULL`.
char haystack[] = "Hello, World!"; // modifiable array, so the char* overload of strstr is used
const char *needle = "World";
// Use strstr to find the first occurrence of the needle in the haystack
char *result = strstr(haystack, needle);
if (result) {
printf("Found substring: %s\n", result);
} else {
printf("Substring not found.\n");
}
`std::string` usually allocates memory dynamically, and must copy the C-style string literal into it at run time.
std::string str = "initializer syntax";
std::string str("converting constructor syntax");
std::string str = std::string("explicit constructor syntax");
std::string str{"uniform initializer syntax"};
char* p1 = "John";
The problem is that string literals such as `"this is a string literal"` are arrays of `char` in C but arrays of `const char` in C++.
With a compiler that accepts the line above as an extension, this will compile but cause a segmentation fault at run time:
p1[0] = 'C';
However this will not compile:
const char* p1 = "John";
p1[0] = 'C';
To read the declaration, remove the data type `char`, which leaves `const * p1`: the place in memory that `p1` points to is const and cannot be changed.
Another option is to cast the literal from `const char *` to `char *`, which removes the warning but is not safe, as you can still write `p1[0] = 'C'`:
char * p1 = (char *)"John";
You can also use a string object instead
std::string p1 = "John";
`std::string::compare()` returns an int:
- equal to zero if `str1` and `str2` are equal,
- less than zero if `str1` is less than `str2`,
- greater than zero if `str1` is greater than `str2`.
if (!str1.compare(str2))
{
// 'str1' and 'str2' are equal.
}
`std::string::compare()` is most useful for quick sort and binary search algorithms. Natural sorts and dichotomic searches can be implemented with only `std::less`.
You can use `str1 == str2` as well: it is more readable and gives the same answer for equality, but `compare()` provides more information about how the strings differ.
std::string strNumber=std::to_string(10.3);
char c = 'A';
std::string s(1, c);
or
char c = 'A';
std::string s;
s.push_back(c);
or
std::stringstream ss;
ss << c;
ss >> s; // or, use `s = ss.str()`
int i = std::stoi(strNumber); // 10 (parsing stops at the decimal point)
double a = std::atof(strNumber.c_str()); // 10.3 (atof takes a C string and returns double)
float f = std::stof(strNumber); // 10.3
double d = std::stod(strNumber); // 10.3
std::vector<char> charVec(str.begin(),str.end() );
std::vector<std::string> splitString;
std::string s = "scott>=tiger>=mushroom";
std::string delimiter = ">=";
size_t pos = 0;
std::string token;
while ((pos = s.find(delimiter)) != std::string::npos)
{
token = s.substr(0, pos);
splitString.push_back(token);
s.erase(0, pos + delimiter.length());
}
// What remains after the last delimiter is the final token ("mushroom").
splitString.push_back(s);
`std::tolower()` returns an `int`, not a `std::string`, and you cannot directly cast the result to a `std::string`. Instead, convert the result of `std::tolower()` to a `char`, and then construct a `std::string` from that character.
unsigned char c = 'A';
char lower_c = static_cast<char>(std::tolower(c));
std::string a(1, lower_c); // Create a string with one character
To correctly use `std::tolower` on a `std::string`, you need to iterate over each character of the string and apply `std::tolower` to it. Since `std::tolower` works on single characters (and returns an `int`), you should also cast the result back to `char`. Here's how you can do this:
std::string input = "Hello, World!";
std::string result;
// Use std::transform to apply std::tolower to each character
std::transform(input.begin(), input.end(), std::back_inserter(result),
[](unsigned char c){ return static_cast<char>(std::tolower(c)); });
std::cout << "Original: " << input << std::endl;
std::cout << "Lowercase: " << result << std::endl;
- `std::transform` is used to apply the transformation (lowercasing in this case) to each character.
- `std::tolower` is applied to each character of the string. Its argument must be representable as `unsigned char`, so the lambda takes each character as `unsigned char` to avoid undefined behavior with non-ASCII characters.
- The result is accumulated in the `result` string using `std::back_inserter`.
This will output:
Original: Hello, World!
Lowercase: hello, world!
The function `std::isalnum` in C++ checks whether a given character is alphanumeric, i.e., a letter (A-Z, a-z) or a digit (0-9):
- digits (0123456789)
- uppercase letters (ABCDEFGHIJKLMNOPQRSTUVWXYZ)
- lowercase letters (abcdefghijklmnopqrstuvwxyz)
Return Value:
- It returns a non-zero value (treated as `true`) if the character is alphanumeric.
- It returns 0 (treated as `false`) if the character is not alphanumeric.
char c = 'A';
if (std::isalnum(static_cast<unsigned char>(c))) {
std::cout << c << " is alphanumeric." << std::endl;
} else {
std::cout << c << " is not alphanumeric." << std::endl;
}
In this case, if `c` is 'A' (an alphanumeric character), `std::isalnum(c)` will return a non-zero value, and the output will be:
A is alphanumeric.
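// Requires <algorithm>, <cctype> and <string>. The function name is illustrative.
bool findCaseInsensitive(const std::string& sentence, const std::string& word)
{
auto it = std::search(
sentence.begin(), sentence.end(),
word.begin(), word.end(),
// unsigned char avoids undefined behavior in std::toupper for non-ASCII bytes
[](unsigned char ch1, unsigned char ch2) { return std::toupper(ch1) == std::toupper(ch2); }
);
return (it != sentence.end());
}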
Short strings are stored inside the `std::string` object itself (on the stack, for a local variable) instead of on the heap. Only once the string grows beyond a small, implementation-defined capacity is heap memory allocated. This is called the small string optimization (SSO).