UTF-8 output with an XmlWriter (or even HtmlTextWriter for that matter) can sometimes be tricky if you’re sending output anywhere but to a file. If you write data to a string, or to a stream that gets fed immediately into the output stream of a Web application or into the POST buffer of an HTTP request, you might find that the generated XML blows up on the receiving end.
Typically you have code like this:
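MemoryStream ms = new MemoryStream();
XmlTextWriter writer = new XmlTextWriter(ms, Encoding.UTF8);
writer.Formatting = Formatting.Indented;
writer.WriteStartDocument();
writer.WriteStartElement("OFX");
// ... write the document content here ...
writer.WriteEndElement(); // OFX
writer.Flush(); // push buffered output into the MemoryStream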
If you now want to turn this XML into a string, you can for example do:
this.RequestXml = Encoding.Default.GetString(ms.ToArray());
to get a string that contains the UTF-8 encoded text (i.e. it has funky characters for any extended characters with values above 127). But the generated string has a problem, and one that you might easily miss: it contains a Byte Order Mark (BOM) at the beginning:
ï»¿<?xml version="1.0" encoding="utf-8"?>
Byte order marks are commonly used for UTF-8 encoded files that are stored on disk, but if you send an XML response back from a Web request or you store an XML document as text somewhere, you typically don’t want this byte order mark at the front. The same issue applies if you feed the stream directly into the HTTP output stream in ASP.NET or use it as the POST buffer in a WebRequest POST request. If output goes anywhere but to a file, you typically don't want that BOM at the beginning of the output.
It’s not really obvious how to get rid of the BOM either. You'd figure the XmlWriter would have an option for this, but the byte order mark usage is determined by the Encoding instance.
The default Encoding.UTF8 encoding has the byte order mark enabled and you can’t turn it off. Instead, if you want to generate XML without the BOM, you have to create a new encoding and pass it into the XmlTextWriter like this:
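To make the BOM a little more concrete, here is a minimal sketch that prints the three preamble bytes and checks a buffer for them. The class name BomDemo and the helper StartsWithUtf8Bom are made up for illustration; only Encoding.UTF8.GetPreamble() is a framework call.

```csharp
using System;
using System.Text;

public static class BomDemo
{
    // Hypothetical helper: true if the buffer starts with the UTF-8 preamble.
    public static bool StartsWithUtf8Bom(byte[] data)
    {
        byte[] bom = Encoding.UTF8.GetPreamble();
        if (data == null || data.Length < bom.Length)
            return false;

        for (int i = 0; i < bom.Length; i++)
        {
            if (data[i] != bom[i])
                return false;
        }
        return true;
    }

    public static void Main()
    {
        byte[] bom = Encoding.UTF8.GetPreamble();
        Console.WriteLine(BitConverter.ToString(bom)); // EF-BB-BF

        // Sample payload without a BOM (the XML text is just an example)
        byte[] plain = Encoding.ASCII.GetBytes("<?xml version=\"1.0\"?>");

        // The same payload with the preamble glued to the front
        byte[] withBom = new byte[bom.Length + plain.Length];
        bom.CopyTo(withBom, 0);
        plain.CopyTo(withBom, bom.Length);

        Console.WriteLine(StartsWithUtf8Bom(plain));   // False
        Console.WriteLine(StartsWithUtf8Bom(withBom)); // True
    }
}
```

A check like this can be handy right before handing a buffer to anything that isn't a file.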
// *** Create encoding manually in order not to
// *** create leading Byte order marks
Encoding Utf8 = new UTF8Encoding(false);
MemoryStream ms = new MemoryStream();
XmlTextWriter writer = new XmlTextWriter(ms,Utf8);
The BOM setting can only be specified in the UTF8Encoding constructor, and the default Encoding.UTF8 is set to include the BOM, so your only option is to create a new encoding instance.
Now, converting XML to string is usually not a good idea and should be avoided whenever possible. Keeping XML in stream or byte format and then loading it back into an XmlReader or XmlDocument is preferable, but sometimes string storage is required, such as in this older application I’m using.
The problem with strings is the encoding, of course. XML is usually UTF-8 encoded, so I have to decide how I want to retrieve the data: as a real Unicode string (use Encoding.UTF8 to decode and get the original text back, which effectively turns the XML document into UTF-16) or as an 'encoded' string that pretends to be UTF-8 (use Encoding.Default to retrieve the funky UTF-8 markup characters). It gets confusing quickly even without the byte order marks involved. String encodings are no fun to deal with, and if you can help it, avoid encoding and recoding and pretzeling your brain <s>.
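A quick way to see the difference is to write the same tiny document twice and compare the first bytes. This is only a sketch: the class name XmlBomComparison and the helper WriteSample are invented for the example, and the empty OFX element stands in for a real document.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class XmlBomComparison
{
    // Hypothetical helper: writes a minimal document and returns the raw bytes.
    public static byte[] WriteSample(Encoding encoding)
    {
        MemoryStream ms = new MemoryStream();
        XmlTextWriter writer = new XmlTextWriter(ms, encoding);
        writer.WriteStartDocument();
        writer.WriteStartElement("OFX");
        writer.WriteEndElement(); // OFX
        writer.Flush(); // push buffered output into the MemoryStream
        return ms.ToArray();
    }

    public static void Main()
    {
        byte[] withBom = WriteSample(Encoding.UTF8);
        byte[] withoutBom = WriteSample(new UTF8Encoding(false));

        // Default encoding: output starts with the BOM
        Console.WriteLine(BitConverter.ToString(withBom, 0, 3));    // EF-BB-BF
        // Custom encoding: output starts with "<?x"
        Console.WriteLine(BitConverter.ToString(withoutBom, 0, 3)); // 3C-3F-78
    }
}
```

Apart from the three leading bytes the two buffers should be identical, which is exactly why the problem is so easy to miss.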
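The two ways of decoding can be sketched as follows. One assumption to flag loudly: Encoding.Default is the system ANSI code page on the .NET Framework but UTF-8 on newer .NET versions, so the sketch uses ISO-8859-1 as a stand-in single-byte encoding to get predictable results. The class and method names are made up for the example.

```csharp
using System;
using System.Text;

public static class DecodeDemo
{
    // Assumption: ISO-8859-1 stands in for Encoding.Default here.
    private static readonly Encoding SingleByte = Encoding.GetEncoding("iso-8859-1");

    // Real decoding: UTF-8 bytes become the original characters.
    public static string DecodeAsUtf8(byte[] data)
    {
        return Encoding.UTF8.GetString(data);
    }

    // 'Pretend' decoding: every byte becomes one character.
    public static string DecodeAsSingleByte(byte[] data)
    {
        return SingleByte.GetString(data);
    }

    public static void Main()
    {
        // \u00E9 (e with acute accent) takes two bytes in UTF-8: C3 A9
        byte[] utf8Bytes = new UTF8Encoding(false).GetBytes("<name>\u00E9</name>");

        string decoded = DecodeAsUtf8(utf8Bytes);
        string raw = DecodeAsSingleByte(utf8Bytes);

        Console.WriteLine(decoded.Length); // 14
        Console.WriteLine(raw.Length);     // 15

        // The single-byte string converts back to the original UTF-8 bytes
        byte[] roundTrip = SingleByte.GetBytes(raw);
        Console.WriteLine(roundTrip.Length == utf8Bytes.Length); // True
    }
}
```

The second string is the one with the funky characters in it: it is only useful if whoever reads it converts it back to bytes with the same single-byte encoding.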
Looking at how the data in the existing application's database is already structured, the stored content includes the UTF-8 encoding. The app takes that data and fires it off via HTTP to a background service application that processes it at a later point in time. <shrug> I’m stuck with this, but ideally this should probably be stored as binary and then later just sent off into the WebRequest POST input stream. But using the string with this UTF-8 encoding works as well, although it feels wrong <s>… so it goes with legacy code…
Incidentally it took me a while to figure out why the server I was eventually POSTing the data to was failing. It kept erroring out with Bad Request errors. When I picked up the log data, the data looked fine. I even went as far as using Beyond Compare to check two responses and they were identical. Not until I hooked up Fiddler to look at the raw HTTP response did I notice the damn Byte Order Marks. <s>
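For what it's worth, the binary approach might look roughly like this. It's a sketch, not what the app actually does: PayloadSketch and SendPayload are hypothetical names, the "<OFX />" content is a placeholder, and a MemoryStream stands in for the request stream so the code runs without a server. With an HttpWebRequest, the target would be the stream returned by GetRequestStream().

```csharp
using System;
using System.IO;
using System.Text;

public static class PayloadSketch
{
    // Hypothetical helper: copy the stored XML bytes to an output stream untouched,
    // so no string decoding or re-encoding happens along the way.
    public static void SendPayload(byte[] payload, Stream output)
    {
        output.Write(payload, 0, payload.Length);
        output.Flush();
    }

    public static void Main()
    {
        // Placeholder for bytes that would come out of binary storage
        byte[] stored = new UTF8Encoding(false).GetBytes("<OFX />");

        using (MemoryStream target = new MemoryStream())
        {
            SendPayload(stored, target);
            Console.WriteLine(target.Length == stored.Length); // True
        }
    }
}
```

Because the bytes never pass through a string, the question of which encoding to decode with simply doesn't come up.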