Shooting the Trouble

Yesterday, Conductor users began reporting errors with the Paste functionality of the CKEditor, the rich text editor used by Conductor.  The problem manifested on some browsers, but not all – Chrome 17 appeared immune, but there were reports of problems from Firefox 3.6, IE 9, IE 8, Chrome 16.

<Quasi-Religious-Diatribe>

loathe, despise, and detest Paste from Word for HTML content.  It just doesn’t work well.

You would never rely on Google Translate or Babelfish to convert your English brochure to Spanish and then give that brochure to native Spanish speakers.  Did you translate that to  Castillian, Cuban, Mexican, or some other Spanish dialect?  So why would you write your content in Word then paste it into a browser and expect a high fidelity copy?

This is a complicated problem that I wish were resolved.  I understand that HTML can be intimidating…especially when you bring CSS and browser compatibility into the mix…but even the “best translation service” is fairly terrible to a native speaker.

So roll up your sleeves and get comfortable with the HTML.  If you are writing anything for the web, you need to understand it.

And hats off to those few people around the world who maintain the Rosetta Stone for Paste from Word functions.  It is a thankless job that requires working with some truly horrific ever shifting data and attempting as best you can to map that to an ever shifting landscape of browser implementations of HTML.

</Quasi-Religious-Diatribe>

deep breaths…count down from 10, 9, 8, 7, 6, 5, 4, 3, 2, 1…

Okay, I’m back.

The Initial Problem

The problem I was trying to solve was a pernicious Chrome and Safari paste from Word bug. First, lets follow the steps below to see the initial problem.

Step 1: Copy some simple text from Word

Nothing fancy, just a paragraph with a bulleted list.

Copy from Microsoft Word 2008 for Mac

Copy from Microsoft Word 2008 for Mac

Step 2: Paste Text into Chrome

Things look reasonable, though I don’t like the copied bullet. At this point, most people save their page and go about their business.  Its a reasonable thing to do. But step 3 reveals the problem.

Paste to Chrome Step 1

Paste to Chrome Step 1

Step 3: Toggle into Source View Mode

In the source view, there is a <p>&nbsp;</p> that “just appears”.  As it turns out, that little bit of HTML is present in Step 2, but is for some reason invisible in the CKEditor.

Paste to Chrome Step 2

Paste to Chrome Step 2

Step 4: Toggle out of Source View Mode

And the paragraph appears.

Paste to Chrome Step 3

Paste to Chrome Step 3

The Initial Yet Problematic Solution

As always, go to Google when you encounter a coding problem.  And as it turns out the CKEditor paste error is a known problem…without an elegant solution.  What was happening is the Paste action in CKEditor was wrapping the content in a P-tag.

If I were to copy from Word “Simple Paragraph”, the pasted content would be something like  “<p><p>Simple Paragraph</p></p>”.  Chrome would resolve that by created “<p>&nbsp;</p><p>Simple Paragraph</p>” but not before the CKEditor had rendered Step 2 from above.

CKEditor provides a hook for the Paste event.  So I implemented a solution, but unfortunately it wasn’t adequate for all browsers.  So, I spent a good portion of yesterday investigating what was going on.

Painful Discovery

While I was working on the solution, I discovered something truly horrific.  Each browser receives different pasted values from Word.  And likewise handles this paste differently.

Chrome on OSX

…60+ lines of XML declarations then…
<!--StartFragment-->

<p class="MsoNormal">This is a paragraph</p>

<p class="MsoListParagraphCxSpFirst" style="text-indent:-.25in;mso-list:l0 level1 lfo1"><!--[if !supportLists]--><span style="font-family:Symbol;mso-fareast-font-family:Symbol;mso-bidi-font-family:
Symbol">·<span style="font-family: Times New Roman; font-size: 7pt; ">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
</span></span><!--[endif]-->Item one</p>

<p class="MsoListParagraphCxSpLast" style="text-indent:-.25in;mso-list:l0 level1 lfo1"><!--[if !supportLists]--><span style="font-family:Symbol;mso-fareast-font-family:Symbol;mso-bidi-font-family:
Symbol">·<span style="font-family: Times New Roman; font-size: 7pt; ">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
</span></span><!--[endif]-->Item two</p>

<!--EndFragment-->

Firefox on OSX

…170 lines of style declarations then…
<p class="MsoNormal">
  This is a paragraph
</p>
<p class="MsoListParagraphCxSpFirst" style="text-indent:-.25in; mso-list:l0 level1 lfo1">
  <span style="font-family:Symbol; mso-fareast-font-family:Symbol; mso-bidi-font-family: Symbol">
    <span style="mso-list:Ignore">·
      <span style="font:7.0pt &quot; Times New Roman&quot;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span>
    </span>
  </span>
  Item one
</p>
<p class="MsoListParagraphCxSpLast" style="text-indent:-.25in; mso-list:l0 level1 lfo1">
  <span style="font-family:Symbol; mso-fareast-font-family:Symbol; mso-bidi-font-family: Symbol">
    <span style="mso-list:Ignore">·
      <span style="font:7.0pt &quot; Times New Roman&quot;">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;</span>
    </span>
  </span>
  Item two
</p>

And all of the other browsers receive comparably different information from Microsoft Word.

Final Pasted Value That Is Saved in Conductor

And here we have the “final” pasted value of several different browsers.

Chrome 17 on OS X

<p>This is a paragraph</p><p>·&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span>Item one</p><p>·&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span>Item two</p>

Firefox 10 on OS X

<p>This is a paragraph</p><ul><li >Item one</li><li >Item two</li></ul>

Safari 5.1.1 on OS X

<p>This is a paragraph</p><ul><li   >Item one</li><li  >Item two</li></ul>

Opera 11.6 on OS X

<p>
  &nbsp;</p>
<p class="MsoNormal">
  &nbsp;</p>
<p class="MsoNormal">
  This is a paragraph</p>
<p class="MsoListParagraphCxSpFirst" style="text-indent:-.25in;mso-list:l0 level1 lfo1">
  <span style="font-family:Symbol;mso-fareast-font-family:Symbol;mso-bidi-font-family:
Symbol"><span style="mso-list:Ignore">&middot;<span style="font:7.0pt &quot;Times New Roman&quot;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span></span>Item one</p>
<p class="MsoListParagraphCxSpLast" style="text-indent:-.25in;mso-list:l0 level1 lfo1">
<span style="font-family:Symbol;mso-fareast-font-family:Symbol;mso-bidi-font-family:
Symbol"><span style="mso-list:Ignore">&middot;<span style="font:7.0pt &quot;Times New Roman&quot;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span></span>Item two</p>

Chrome on Windows

<div>This is a paragraph</div><div>•<span class="Apple-tab-span" >  </span>Item one</div><div>•<span class="Apple-tab-span" > </span>Item two</div><div><br></div>

IE8 on Windows

This is a paragraph<BR>•&nbsp;Item one<BR>•&nbsp;Item two<BR>

As you can see, the above content varies wildly.  It is the result of three different programs negotiating the Copy/Paste behavior: Microsoft Word, the browser, and CKEditor.  And in many cases, the pasted HTML is quite terrible – I’d say only Firefox and Safari are proper HTML given the source Word document.

The Current Solution to the Paste Issue

Below is the code that I have settled on for fixing the original paste from Word problems for Chrome and Safari. Below is the code that I tried to use but failed. But keep reading after the solution:

CKEDITOR.on("instanceReady", function (ev) {
  ev.editor.on("paste", function (e) {
    if (e.data["html"]) {
      // Strip lang, style, size, face, and bizarro Word tags
      var input = e.data["html"].replace(/<([^>]*)(?:lang|style|size|face|[ovwxp]:\w+)=(?:[^]*|""[^""]*""|[^\s>]+)([^>]*)>/gi, "<$1$2>");
      var output = ;

      // The Paste action in CKEditor was wrapping the content in a p-tag;
      // By only using the innerHTML of the first element, the auto wrapping
      // of a p-tag instead wraps the first element in a p-tag.
      // So pasting: <p>Hello</p><p>World</p>
      //   Was Pasted as <p><p>Hello</p><p>World</p></p>
      //   Resolves as <p>&nbsp;</p><p>Hello</p><p>World</p> AND
      //     the <p>&nbsp;</p> was invisible
      //   Paste as Hello<p>World</p>
      //   Resolves as <p>Hello</p><p>World</p>
      // I have trepidations about this, but it appears to work in a
      // relatively general case.

      // Internet Explorer may not paste well-formed HTML, but instead
      // paste innerHTML
      if ($(input).html() == "" ) {
        output = input;
      } else {

        // Iterate over the top-level DOM elements
        $(input).each(function(key,value){

          // For the first top-level DOM element, we want the innerHTML, so
          // that it can be wrapped by a P-tag…either in the Browser or in
          // the CKEditor
          if (key == 0) {
            output += value.innerHTML;
          } else {
            // outerHTML exists in some browsers as a native property
            // It is likely more reliable than the html method (in fact
            // in Chrome, $(<div>Bob</div>).html() returned Bob)
            if (value.outerHTML == undefined) {
              output += $(value).html();
            } else {
              // Likely more reliably than html(), as it is a native browser
              // method in some "modern" browser
              output += value.outerHTML;
            }
          }
        });
      };
      e.data["html"] = output;
    }
  });
});

Eventually, I opted to clean the HTML on the server side using the following regular expression:

text.sub!(/\A(\<p[^\>]*\>[\t\s]*\&nbsp\;[\t\s]*\<\/p\>)*/m,"")

Comments are closed.