Bug Reports
Jsonifyer
Jsonifyer
  n/a
  All
  Apr 24, 2015 @ 08:53am
There's a bug in wwJsonSerializer: it assumes its input is Windows-1252 (or ISO-8859-1).
However, if the input is already a UTF-8 encoded string, the escaping goes wrong: the lead byte that starts a multi-byte UTF-8 sequence is itself escaped. That should not happen.

Suppose this string:

ë

When encoded in Windows-1252 this is the single byte 235 (0xEB), and the serialiser serialises it as "\u00EB".

However, when encoded in UTF-8 this ë is represented in a VFP string as

Ã«

that is, bytes 195 + 171 (0xC3 0xAB),

which the serialiser transforms into "\u00C3\u00AB"

This is incorrect. For Western character sets you can easily transcode the UTF-8 text to Windows-1252 and then pass it into wwJsonSerializer.
However, for non-Western characters you can't do that, because they have no Windows-1252 representation.

Basically, the serialiser needs to be encoding-aware.
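To make the byte math concrete, the double escaping can be sketched in Python (Python is used here only as an easy-to-run illustration; the serialiser in question is VFP):

```python
# 'ë' is U+00EB. In Windows-1252 it is the single byte 0xEB,
# but in UTF-8 it is the two-byte sequence 0xC3 0xAB.
utf8_bytes = "ë".encode("utf-8")
print(list(utf8_bytes))  # [195, 171]

# A serializer that treats each byte as a Windows-1252 character
# escapes the two bytes separately, producing mojibake:
escaped = "".join("\\u%04X" % b for b in utf8_bytes)
print(escaped)  # \u00C3\u00AB  (two escapes instead of the correct \u00EB)
```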

Re: Jsonifyer
  Rick Strahl
  Robert v.G.
  Apr 24, 2015 @ 10:16am
Robert,

I think you're confusing character encoding with the job of a text parser.

This is not a bug. It's not the serializer's job to deal with text encoding. Text encoding has to be handled at the point of reading input (from HTTP or a File or a database etc.) typically.

Serialization and deserialization should always happen on unencoded strings, because there is no way for the serializer to know what the encoding of a string is; the string has to be a native string. How could the parser possibly know, when you pass it a string, whether that string is UTF-8 encoded, Unicode or ANSI text? It can't, and it shouldn't. That's the job of the calling application, which knows where the string came from and what format it is in.

If you are working with UTF-8 strings anywhere other than at the seams where text comes into the system (i.e. streams/APIs), there's probably something wrong with your overall text handling process. The best practice is to decode encoded text immediately when it comes into the system, and to convert it back to whatever encoding is required for storage when you write it back out (to a file, HTTP etc.).

When serializing, the serializer serializes the string in native format. This is true in FoxPro, .NET and JavaScript. The difference with Fox is that it has to deal with ANSI codepages, whereas just about everything else uses Unicode, which can represent all characters. Hence the Windows-1252: that's the native charset, and that IS the correct behavior. In Fox it's possible to get bogus data when the data comes from other sources that use Unicode, since Fox is limited to a 256-character charset at any given time.

Don't believe me? Go ahead and try JSON.parse() in JavaScript with a string that contains UTF-8 encoded text (i.e. the two characters for the ë) - you'll get the UTF-8 markup characters right back in the string as Unicode text.
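The same check is easy to run in Python, which follows the identical rule: a compliant parser hands the two individually escaped bytes back as two characters, not as ë (Python used purely as an illustration):

```python
import json

# JSON that (incorrectly) escaped the two UTF-8 bytes of 'ë' individually:
mojibake = json.loads('"\\u00C3\\u00AB"')
print(mojibake)   # Ã«  -- the markup characters come back, not ë
print(mojibake == "\u00C3\u00AB")  # True
```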

If you actually have JSON that includes JSON *encoded* UTF-8 characters, then the JSON was invalidly generated. If the entire JSON is encoded, then it has to be decoded prior to use in FoxPro - this is true whether you're dealing with JSON decoding or anything else.

It also looks like you may be using an old version of the serializer. This is what you should see for serialization and deserialization:

loSer = CREATEOBJECT("wwJsonSerializer")
lcJson = loSer.Serialize("Orë")
? lcJson && prints "Orë"
? loSer.DeserializeJson(lcJson) && Orë

This is the same as what JavaScript's JSON.stringify() produces. Note that the extended character is not encoded. Again, this is as it should be, and it REQUIRES that the string be in native format. Any encoding/decoding needs to happen before you serialize or deserialize.

This means: if you get UTF-8 encoded content from HTTP, it should be turned into a plain string as soon as you receive it from the HTTP stream. If you open a UTF-8 encoded file, it should be UTF-8 decoded. A parser deals with strings; it has no idea what the encoding of a string is.
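As a sketch of that pattern (illustrated in Python with hypothetical names, since the principle is language-independent):

```python
import json

def read_json_from_wire(wire_bytes: bytes) -> dict:
    # Decode exactly once, at the point where bytes enter the system...
    text = wire_bytes.decode("utf-8")
    # ...so the parser only ever sees a plain, already-decoded string.
    return json.loads(text)

payload = '{"name": "Orë"}'.encode("utf-8")  # what HTTP actually delivers
data = read_json_from_wire(payload)
print(data["name"])  # Orë
```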

+++ Rick ---



Rick Strahl
West Wind Technologies

Making waves on the Web
from Maui

Re: Jsonifyer
  n/a
  Rick Strahl
  Apr 24, 2015 @ 10:59am
Rick,

Thanks for your extensive reply. I assume I was using an older version of the serializer then, because I remember having \uXXXX sequences in my output. I'm not even sure it's still in use, but if it is, I'll let you know whether this solves the problem.

Robert



Re: Jsonifyer
  n/a
  Rick Strahl
  Apr 24, 2015 @ 12:32pm
Rick,

Just as a follow-up: I updated wwIpStuff.dll and that seems to have solved my problem.

Thanks for the great help.

Robert



Re: Jsonifyer
  Rick Strahl
  Robert v.G.
  Apr 25, 2015 @ 10:54am

The JSON serializer doesn't use wwIpStuff.dll, and wwHttp doesn't either, except for URL decoding - so you must have updated something else as well to see a change?

+++ Rick ---




Re: Jsonifyer
  n/a
  Rick Strahl
  Apr 26, 2015 @ 07:11am
Rick,

I see in your code:

DECLARE INTEGER JsonEncodeString IN wwipstuff.dll string json,string@ output

This call has changed over time; maybe I've been using an archaic version. As a test I used this input:

ру́сский язы́к, russky yazyk
한국어 / 조선말,

That version JSONified that as:

"\u00D1\u0080\u00D1\u0083\u00CC\u0081\u00D1\u0081\u00D1\u0081\u00D0\u00BA\u00D0\u00B8\u00D0\u00B9 \u00D1\u008F\u00D0\u00B7\u00D1\u008B\u00CC\u0081\u00D0\u00BA, russky yazyk\r\n\u00ED\u0095\u009C\u00EA\u00B5\u00AD\u00EC\u0096\u00B4 / \u00EC\u00A1\u00B0\u00EC\u0084\u00A0\u00EB\u00A7\u0090,"

The new output contains the characters themselves, without Unicode escape sequences:

"ру́сский язы́к, russky yazyk\r\n한국어 / 조선말,"

which is correct.
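As an aside, data produced by the old byte-wise escaping isn't lost: since every escape is below \u0100, a Latin-1 round trip recovers the original text (sketched in Python as an illustration):

```python
import json

# First two letters of the old escaped output for "ру":
old_output = '"\\u00D1\\u0080\\u00D1\\u0083"'
decoded = json.loads(old_output)              # four mojibake characters
# Each code point below 0x100 maps 1:1 back to a byte via Latin-1,
# and those bytes are the original UTF-8:
repaired = decoded.encode("latin-1").decode("utf-8")
print(repaired)  # ру
```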

Anyway thanks.

Robert

© 1996-2024