Filtering the HTML entity NBSP with JavaScript RegExp()

2023-02-14 19:20 HTML JavaScript

Summary
Notes
About the HTML entity &nbsp;
Code for filtering whitespace
References

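Summary

While recently implementing a "remove whitespace from a string" feature, I found that after running str.replaceAll(/[\u0020\u3000]+/g, ''), the data still contained whitespace characters that had not been removed. After some digging, the cause turned out to be that the HTML entity &nbsp; was not being filtered. This post briefly notes what &nbsp; is and provides the code that finally solved the problem, for reference.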
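To make the symptom concrete, here is a minimal sketch that reproduces it. The sample string is hypothetical, and replace() with the g flag is used in place of replaceAll() only so the snippet also runs on older targets; the behavior is the same for a global regex.

```typescript
// Hypothetical sample input: "a", a regular space, "b", an NBSP (U+00A0), "c".
const sample: string = 'a\u0020b\u00A0c';

// The original pattern only covers the half-width (U+0020)
// and full-width (U+3000) spaces.
const halfAndFullWidth: RegExp = /[\u0020\u3000]+/g;

// With the `g` flag, replace() substitutes every match, like replaceAll().
const cleaned: string = sample.replace(halfAndFullWidth, '');

// The regular space is gone, but the NBSP is still in the string.
console.log(cleaned.length); // 4
console.log(cleaned.indexOf('\u00A0') !== -1); // true
```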

Notes

About the HTML entity `&nbsp;`

An HTML entity is a piece of text that always begins with & and ends with ;. It is typically used to represent reserved characters or invisible characters.

MDN: An HTML entity is a piece of text (“string”) that begins with an ampersand (&) and ends with a semicolon (;). Entities are frequently used to display reserved characters (which would otherwise be interpreted as HTML code), and invisible characters (like non-breaking spaces).

&nbsp; is a non-breaking space. Unlike the space typed with the space bar, this character prevents a line break from occurring at its position.

Stack Overflow: One is non-breaking space and the other is a regular space. A non-breaking space means that the line should not be wrapped at that point, just like it wouldn’t be wrapped in the middle of a word.

Wikipedia explains it as follows:

Wikipedia: In word processing and digital typesetting, a non-breaking space — also called NBSP, required space, hard space, or fixed space (though it is not of fixed width) — is a space character that prevents an automatic line break at its position.

There is another difference at the code level: the character code of &nbsp; is 160 (U+00A0), whereas the space produced by the space bar is 32 (see the ASCII table).
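The difference in character codes can be checked directly. The following is a small sketch; the variable names are only illustrative, and '\u00A0' stands for the character a browser produces when it decodes &nbsp;.

```typescript
// A regular space typed with the space bar.
const regularSpace: string = ' ';
// The non-breaking space, i.e. the decoded form of `&nbsp;`.
const nbsp: string = '\u00A0';

console.log(regularSpace.charCodeAt(0)); // 32
console.log(nbsp.charCodeAt(0)); // 160

// The two look alike when rendered, but they are different characters.
console.log(regularSpace === nbsp); // false

// String.fromCharCode(160) yields the same NBSP character.
console.log(String.fromCharCode(160) === nbsp); // true
```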

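Code for filtering whitespace

The code is shown below. HTML_ENTITY_SPACE exists to give this special character a meaningful name; new RegExp() then combines it into a pattern that covers half-width, full-width, and NBSP spaces.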

// Give `&nbsp;` a meaningful variable name
const HTML_ENTITY_SPACE = String.fromCharCode(160);

// Use new RegExp() to build a pattern that matches half-width, full-width, and HTML entity spaces; anything matching it is removed
function removeFullAndHalfSpace(rawString: string): string {
  const regexpSpace = new RegExp(`[\u0020\u3000${HTML_ENTITY_SPACE}]`, 'g');
  return rawString.replaceAll(regexpSpace, '');
}
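
As a usage illustration, the sketch below shows an alternative way to write the same character class as a regex literal, with NBSP spelled as the escape \u00A0. The names SPACE_PATTERN and stripSpaces and the sample input are hypothetical, not part of the code above, and replace() with the g flag is used instead of replaceAll() so the snippet does not depend on an ES2021 target.

```typescript
// Hypothetical variant: the same character class as a regex literal.
// \u0020 = half-width space, \u3000 = full-width space, \u00A0 = NBSP.
const SPACE_PATTERN: RegExp = /[\u0020\u3000\u00A0]/g;

function stripSpaces(rawString: string): string {
  // With the `g` flag, replace() removes every match, like replaceAll().
  return rawString.replace(SPACE_PATTERN, '');
}

// Sample input containing all three kinds of space.
const input: string = 'foo\u0020bar\u3000baz\u00A0qux';
console.log(stripSpaces(input)); // "foobarbazqux"
```

Both versions match the decoded character U+00A0, not the literal six-character text "&nbsp;"; a string that still contains the undecoded entity would need a separate step.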

普通文組 2.0
