Steven Black turned me on to Html Tidy and its COM counterpart TidyCom, a very useful utility for cleaning up hideous HTML, specifically the hideous HTML that Microsoft’s editors generate.

 

By the way, why is it that none of Microsoft’s shipping tools (Whidbey MAY be a first) can create decent HTML? The mess that is MS Word output, including the new ‘compact’ Html format, is quite laughable. Sure it works, but man, if you ever have to stick that output into another document with anything other than FrontPage, good luck.


Well, Html Tidy will tidy up that mess nicely.

 

Another place where Microsoft’s HTML creation is atrocious is the HTML Edit control. Holy shit does that create a load of crap. You can pass in pretty, perfectly formatted HTML and the editor will promptly hack it to pieces. It’ll do such wonderful things as stripping the quotes off attributes and running as much markup onto a single line as it possibly can.

 

I of course have been using this control in Help Builder, because there’s really no decent alternative for complex HTML editing (XStandard is not bad, but it renders quite differently than most browsers, which is not ideal for an Html Help environment).

 

So, TidyCom makes a nice addition to clean up the hideous HTML for source editing if desired.
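
Calling it from FoxPro takes only a few lines. What follows is a minimal sketch, not Help Builder's actual code: the wrapper name TidyHtml is made up for illustration, and the ProgID TidyCOM.TidyObject and the TidyMemToMem method are written from memory of the TidyCom interface, so check them against the TidyCom documentation before relying on them.

```foxpro
*** Hypothetical wrapper: run an HTML string through TidyCom and
*** return the cleaned-up markup, or the original text on failure.
*** ASSUMPTION: ProgID and method name are recalled from the TidyCom
*** docs and are not verified here.
FUNCTION TidyHtml(lcHtml)
LOCAL loTidy, lcResult

lcResult = lcHtml

TRY
   loTidy = CREATEOBJECT("TidyCOM.TidyObject")
   lcResult = loTidy.TidyMemToMem(lcHtml)
CATCH
   *** TidyCom is not registered or the call failed:
   *** fall back to the untouched HTML
   lcResult = lcHtml
ENDTRY

RETURN lcResult
ENDFUNC
```

The result is assigned inside the TRY block and returned afterwards because FoxPro does not allow RETURN inside TRY/CATCH.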

 

Unfortunately, using this makes it a bit more difficult to find the current selection point. Help Builder allows swapping between HTML Edit view and Html Source view and tries to highlight the cursor position in source view when you swap. The reformatting makes this very difficult, because the tidied source no longer matches the text and markup the control reports for the selection. It didn’t work well before using TidyCom either, because Help Builder does a few conversions on the text (like link embedding cleanup and fixing up base paths that the control blindly injects).

 

In case you ever need to do something similar, here’s the function that attempts it:

 

FUNCTION SetTextModeSelection()
LOCAL loEdit, loRange, loParent, lnAt, lcSelectedHtml, llError

*** Clear whatever previous selections there were first
THIS.SELLENGTH = 0
THIS.SelStart = 0

loEdit = THIS.PARENT.oHTMLedit
loRange = loEdit.DOCUMENT.SELECTION.CreateRange()

IF VARTYPE(loRange) # "O"
   RETURN
ENDIF

*** Text ranges expose ParentElement; control ranges (such as a
*** selected image) expose commonParentElement instead
loParent = NULL
TRY
   loParent = loRange.ParentElement
CATCH
   llError = .T.
ENDTRY

IF ISNULL(loParent)
   TRY
      loParent = loRange.commonParentElement
   CATCH
      llError = .T.
   ENDTRY
ENDIF

*** Nothing we can do here
IF ISNULL(loParent)
   RETURN
ENDIF

lnAt = 0
lcSelectedHtml = ""

DO CASE
   CASE loParent.nodeName == "IMG"
      *** Images: look for the quoted src attribute in the source
      lcSelectedHtml = ["] + FixBasePath(loParent.src,loParent) + ["]
      lnAt = ATC(lcSelectedHtml,THIS.VALUE)
   CASE loParent.nodeName == "A"
      *** Links: look for the quoted href attribute in the source
      lcSelectedHtml = ["] + FixBasePath(loParent.href,loParent) + ["]
      lnAt = ATC(lcSelectedHtml,THIS.VALUE)
   OTHERWISE
      *** It's text - try working with the plain text
      lcSelectedHtml = loRange.TEXT
      lnAt = 0
      IF !EMPTY(lcSelectedHtml)
         lnAt = ATC(lcSelectedHtml,THIS.VALUE)
      ENDIF
      IF lnAt = 0

         *** Widen the selection to the surrounding sentence and retry
         IF loRange.Expand("sentence")
            lcSelectedHtml = loRange.TEXT
            lnAt = ATC(lcSelectedHtml,THIS.VALUE)
            IF lnAt = 0
               *** Last Straw - read inner Html and try to pickup inner block
               IF !ISNULL(loParent)
                  lcSelectedHtml = loParent.innerText
                  lnAt = ATC(lcSelectedHtml,THIS.VALUE)

                  IF lnAt = 0
                     lcSelectedHtml = loParent.innerHtml
                     lnAt = ATC(lcSelectedHtml,THIS.VALUE)
                  ENDIF
                  IF lnAt = 0
                     *** Try just the first text run between two tags
                     lcSelectedHtml = STREXTRACT(lcSelectedHtml,">","<",1)
                     IF !EMPTY(lcSelectedHtml)
                        lnAt = ATC(lcSelectedHtml,THIS.VALUE)
                     ENDIF
                  ENDIF
               ENDIF
            ENDIF
         ENDIF
      ENDIF

ENDCASE

*** Highlight the match in the source view (SelStart is 0 based)
IF lnAt > 0
   THIS.SELSTART = lnAt-1
   THIS.SELLENGTH = LEN(lcSelectedHtml)
   THIS.SETFOCUS()
ENDIF
ENDFUNC

 

It’s a brute force approach and it works most of the time, but certainly not all of the time. Text selections are nearly impossible to find if they mix plain text with embedded objects like links or markup tags in a block.
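
One thing that might raise the hit rate once the source has been re-wrapped is to make the text lookups ignore whitespace differences, since the ATC() calls above need an exact match including line breaks. Here is a rough, untested sketch of that idea. The helper name FindIgnoringWhitespace is hypothetical, and it assumes the VBScript.RegExp COM object that ships with Windows Script is available on the machine. It doesn't help with selections that span tags, and it doesn't deal with HTML entities.

```foxpro
*** Hypothetical helper: case-insensitive search that treats any run
*** of whitespace in lcFind as matching any run of whitespace in
*** lcSource. Returns the 1-based start position (0 if not found).
*** Pass lnLength by reference (@) to receive the matched length.
*** ASSUMPTION: VBScript.RegExp is installed and can be created.
FUNCTION FindIgnoringWhitespace(lcFind, lcSource, lnLength)
LOCAL loRegEx, loMatches, lcPattern, lcChar, lnX, lnAt

lnAt = 0
lnLength = 0

*** Turn CR, LF and TAB into spaces, then collapse runs of spaces
lcFind = ALLTRIM(CHRTRAN(lcFind, CHR(13) + CHR(10) + CHR(9), SPACE(3)))
DO WHILE SPACE(2) $ lcFind
   lcFind = STRTRAN(lcFind, SPACE(2), " ")
ENDDO
IF EMPTY(lcFind)
   RETURN 0
ENDIF

*** Escape regular expression metacharacters one character at a time
lcPattern = ""
FOR lnX = 1 TO LEN(lcFind)
   lcChar = SUBSTR(lcFind, lnX, 1)
   IF lcChar $ "\^$.|?*+()[]{}"
      lcChar = "\" + lcChar
   ENDIF
   lcPattern = lcPattern + lcChar
ENDFOR

*** Each remaining single space may match any whitespace run
lcPattern = STRTRAN(lcPattern, " ", "\s+")

loRegEx = CREATEOBJECT("VBScript.RegExp")
loRegEx.IgnoreCase = .T.
loRegEx.Global = .F.
loRegEx.Pattern = lcPattern

loMatches = loRegEx.Execute(lcSource)
IF loMatches.Count > 0
   *** FirstIndex is 0 based, FoxPro string positions are 1 based
   lnAt = loMatches.Item(0).FirstIndex + 1
   lnLength = loMatches.Item(0).Length
ENDIF

RETURN lnAt
ENDFUNC
```

In the function above, this would stand in for an ATC() call on plain text, for example lnAt = FindIgnoringWhitespace(loRange.TEXT, THIS.VALUE, @lnLen), with SelLength then set from lnLen rather than LEN(lcSelectedHtml), because the matched span in the source can be longer than the search text.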

 

If only the control would leave the markup intact, this would actually be reasonably easy.

 

Another major bummer: I can’t figure out how to keep track of the edit position in the browser control. The control has Range objects to handle selections, but nowhere can you capture the position of the selection as something you can persist beyond a single load of the document. Because I have to refresh the document with the changes made in edit mode, the document reloads and the original range becomes invalid. There’s no way that I can see to ‘remember’ the range position, like a numeric text insertion point that would at least get me close…
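
The closest workaround I can think of is to compute such a number by hand: measure how much text lies between the start of the body and the start of the selection, and after the reload walk a fresh range forward by that many characters. The sketch below is untested and the two function names are made up. It relies only on the documented text range members createTextRange, setEndPoint, collapse, move and select. Expect it to be approximate, because the length of the text property and the character count used by move() don't always agree around block boundaries.

```foxpro
*** Hypothetical helper: approximate caret position as a character
*** count from the start of the body. Returns -1 if unavailable.
FUNCTION GetCaretOffset(loDoc)
LOCAL loSel, loPre, lnOffset

lnOffset = -1
TRY
   loSel = loDoc.selection.createRange()
   loPre = loDoc.body.createTextRange()
   *** Pull the end of the body range back to the selection start
   loPre.setEndPoint("EndToStart", loSel)
   lnOffset = LEN(loPre.text)
CATCH
   *** Control selections (a selected image, for example) are not
   *** text ranges, so setEndPoint fails: report no position
   lnOffset = -1
ENDTRY

RETURN lnOffset
ENDFUNC

*** Hypothetical helper: put the caret back after the document has
*** been reloaded. Returns .T. if the range calls went through.
FUNCTION SetCaretOffset(loDoc, lnOffset)
LOCAL loRange, llOk

llOk = .F.
IF lnOffset < 0
   RETURN .F.
ENDIF

TRY
   loRange = loDoc.body.createTextRange()
   loRange.collapse(.T.)                && collapse to the start
   loRange.move("character", lnOffset)  && walk forward
   loRange.select()
   llOk = .T.
CATCH
   *** Leave llOk at .F. if any of the range calls fail
ENDTRY

RETURN llOk
ENDFUNC
```

The idea would be to call GetCaretOffset() with the control's document before refreshing, store the number, and call SetCaretOffset() once the reloaded document is ready.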

 

If anybody knows how to do this, drop me a line.