Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Many packages does not distinguish halfwidth and fullwidth characters #310

Open
4 tasks done
simnalamburt opened this issue Mar 12, 2024 · 23 comments
Open
4 tasks done
Labels
🤞 phase/open Post is being triaged manually

Comments

@simnalamburt
Copy link

Initial checklist

Affected packages and versions

[email protected]

Link to runnable example

No response

Steps to reproduce

Open below with browser:

<!doctype html>
<script type="module">
import {reporter} from 'https://esm.sh/[email protected]?bundle'
import {remark} from 'https://esm.sh/[email protected]?bundle'
import remarkGfm from 'https://esm.sh/[email protected]?bundle'
import remarkLint from 'https://esm.sh/[email protected]?build'
import remarkLintTablePipeAlignment from 'https://esm.sh/[email protected]?bundle'

const file = await remark()
  .use(remarkGfm)
  .use(remarkLint)
  .use(remarkLintTablePipeAlignment)
  .process(`
False positive
--------
Below should be OK but linter will report it as error.

| Alpha | 한글 |
|-------|------|
| a     | b    |

False negative
--------
Below should be error but linter will report it as OK.

| Alpha | 한글 |
|-------|------|
| a     | b  |
`)

document.body.innerHTML = `<pre>${reporter(file)}</pre>`
</script>

Expected behavior

The first one should be fine and the second one should be an error.

16:16-16:17 warning Misaligned table fence table-pipe-alignment remark-lint

⚠ 1 warning

Actual behavior

However, the first one gives an error and the second one does not.

8:16-8:17 warning Misaligned table fence table-pipe-alignment remark-lint

⚠ 1 warning

Runtime

Node v17, Node v16, Node v14, Node v12, Deno, Electron, Other (please specify in steps to reproduce)

Package manager

npm 8, npm 7, npm 6, yarn 2, yarn 1, pnpm, Other (please specify in steps to reproduce)

OS

Windows, Linux, macOS, Other (please specify in steps to reproduce)

Build and bundle tools

Webpack, Rollup, esbuild, Parcel, Create React App, Gatsby, Next.js, Remix, Docusaurus, Snowpack, Vite, Other (please specify in steps to reproduce)

@github-actions github-actions bot added 👋 phase/new Post is being triaged automatically 🤞 phase/open Post is being triaged manually and removed 👋 phase/new Post is being triaged automatically labels Mar 12, 2024
@wooorm
Copy link
Member

wooorm commented Mar 13, 2024

Hey!

This rule measures characters.
The size of displayed characters is different in different places. Characters will be different in a terminal, different here on GitHub, different in your editor, different in my editor.

There isn’t a good solution. I recommend turning this rule off instead. See https://github.com/remarkjs/remark-lint/tree/main/packages/remark-lint-table-pipe-alignment#fix for more info

@simnalamburt
Copy link
Author

simnalamburt commented Mar 13, 2024

Majority of programmers codes with fixed sized font and width of CJK characters in such situation are well defined and documented in here: https://unicode.org/Public/UNIDATA/EastAsianWidth.txt

We at least need to create a new option to respect the size of full width character. Something like "fixed width font mode"

@wooorm
Copy link
Member

wooorm commented Mar 13, 2024

Have you tested your ideas in different code editors?

@simnalamburt
Copy link
Author

simnalamburt commented Mar 13, 2024

I'm Korean. Yes I have been used CJK letters more than 20 years in many editors including vi, nvim, Vim, nano, micro, ed, terminal, emacs, VSCode, atom, zed, notepad.exe, notepad++, Visual Studio, sublime. In these editors, CJK letters will be displayed as fullwidth character (which have double width of halfwidth character) unless you explicitly change your font into something other than fixed width font.

@simnalamburt
Copy link
Author

simnalamburt commented Mar 13, 2024

Just in case you might not familiar with CJK: https://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms

image

image

@wooorm
Copy link
Member

wooorm commented Mar 13, 2024

I’m looking at your code example here on GH and they’re both displayed “misaligned”

@simnalamburt
Copy link
Author

I don't think it's a good idea to ignore this issue to satisfy GitHub's misrepresentation of CJK characters.

@simnalamburt
Copy link
Author

I at least want to solve this problem by creating a linter option that respects the size of full-width characters. Since it's an option, it shouldn't affect existing behavior. I strongly believe that this should be the default behavior, not an option to opt-in, and this is what any CJK developer would have done. But if you disagree, I'd be happy to see it fixed as an option. What do you think?

@wooorm
Copy link
Member

wooorm commented Mar 13, 2024

GH is using the web. CSS. A monospaced font. Many editors use the web/css/monospace.

My point is that all tools are going to display it differently.
What is the point in having an option to this rule which will look fine for some users and broken for other users.

That's what the rule currently does: broken for some, good for some. So I suggest not using this rule.

You say that all tools behave one way. I show GH doing it differently. I know editors display emoji differently. Can you provide screenshots for different tools displaying your examples correctly?

@simnalamburt
Copy link
Author

This it it

image

Emoji containing version if you need:

image

@simnalamburt
Copy link
Author

"remark-lint-table-pipe-alignment" have been existed for a long time because it is useful to some users even if it's fine for some non-CJK users and broken for CJK users.

And so the new fullwidth-respecting option will be the same. It will be pretty much useful for CJK users.

@wooorm
Copy link
Member

wooorm commented Mar 13, 2024

Everyone uses emoji. Not just Koreans.

different tools

Can you please show different editors? And label which editors you show?

@simnalamburt
Copy link
Author

neovim:
image

iTerm2:
image

VSCode:
image

zed:
image

I'll upload windows examples tomorrow since I cannot access my windows devices for right now.

@xnuk
Copy link

xnuk commented Mar 13, 2024

GH is using the web. CSS. A monospaced font. Many editors use the web/css/monospace.

And many editors try to vertically align a CJK character with 2 latin alphabets, regardless of font configs. Terminal-based editors already do this - every Unicode-aware terminal emulators try to treat a CJK character width as 2 latin chars by default.

Ace, a web-based editor (which is used by Rust Playground), tries to vertically align intentionally. They measures font size and manually put twice size of it, like: <span style="width: 14.3984px;" class="ace_cjk">한</span>.

image

GitHub's align isn't right because, in my computer, they don't use a monospaced font, but they use two fonts: 'Liberation Mono' and 'Noto Sans Mono CJK KR' (because Liberation Mono doesn't have any CJK chars). Using only 'Noto Sans Mono CJK KR' mitigates the problem:

image

Using popular alternative fonts like Neo둥근모 or D2Coding just fits right:

Neo둥근모 D2Coding
image image

@xnuk
Copy link

xnuk commented Mar 14, 2024

Yes, character width can be different by editors, and by fonts. But the point is, you should Unicode-aware if you do "align" something, and 'treat CJK character width as same as two latin alphabets' is a good practice and sensible one. wcwidth-like functions can help this.

simnalamburt added a commit to simnalamburt/remark-lint that referenced this issue Mar 19, 2024
@simnalamburt
Copy link
Author

I've created a patch that addresses this issue as follows: 1e659c2

What do you think? (I didn't know how to make it an opt-in option, so I included it in the default behavior for now)

simnalamburt added a commit to simnalamburt/remark-lint that referenced this issue Mar 19, 2024
simnalamburt added a commit to portone-io/developers.portone.io that referenced this issue Mar 19, 2024
simnalamburt added a commit to portone-io/developers.portone.io that referenced this issue Mar 19, 2024
simnalamburt added a commit to portone-io/developers.portone.io that referenced this issue Mar 19, 2024
simnalamburt added a commit to portone-io/developers.portone.io that referenced this issue Mar 19, 2024
simnalamburt added a commit to simnalamburt/remark-lint that referenced this issue Mar 19, 2024
@simnalamburt
Copy link
Author

Similar issue found at 'remark-lint-table-cell-padding'. While code below is compact enough, remark-lint-table-cell-padding reports linter errors.

image

Plaintext:

|Enum Value  |Description   |
|------------|--------------|
|ORDER       |주문 정산     |
|ORDER_CANCEL|주문 취소 정산|
|MANUAL      |수동 정산     |

@simnalamburt
Copy link
Author

simnalamburt commented Mar 19, 2024

Similar issue found at 'remark-lint-maximum-line-length'. While code below is obviously crossed 80-width column, remark-lint-maximum-line-length does not report error.

image

Plaintext:

Mercury mercury mercury mercury mercury mercury mercury mercury mercury mercury
mercury.

한국어~ 한국어~ 한국어~ 한국어~ 한국어~ 한국어~ 한국어~ 한국어~ 한국어~ 한국어~ 한국어~ 한국어~ 한국어~ 한국어~ 한국어~ 한국어~

Most formatters like Prettier, rustfmt, ... properly handles fullwidth characters in this situations:

image

@simnalamburt simnalamburt changed the title 'remark-lint-table-pipe-alignment' does not distinguish halfwidth and fullwidth characters Many packages does not distinguish halfwidth and fullwidth characters Mar 19, 2024
@wooorm
Copy link
Member

wooorm commented Apr 2, 2024

GitHub's align isn't right because, in my computer, they don't use a monospaced font, but they use two fonts: 'Liberation Mono' and 'Noto Sans Mono CJK KR' (because Liberation Mono doesn't have any CJK chars). Using only 'Noto Sans Mono CJK KR' mitigates the problem:

Every tool (that is reasonably good) does that, your editor too.
There are many Unicode characters. Fonts don’t have glyphs for all Unicode characters.
So they use a “font stack”. A preferred font renders a glyph, if not found, then the next font is used, and so forth.

Using popular alternative fonts like Neo둥근모 or D2Coding just fits right:

Right, so this comes back to my original point. How things show, depends on the final user’s computer, and what that user chooses.
Of course, CJK users have configured their editor to display CJK well.
Then, some other user comes in, who uses GitHub, and there the alignment is different.

Different users will see things differently. It is not known to remark-lint how things will display.

But the point is, you should Unicode-aware if you do "align" something, and 'treat CJK character width as same as two latin alphabets' is a good practice and sensible one. wcwidth-like functions can help this.

That is a strong statements while the topic remains vague.

What is “Unicode-aware”? Please explain or link to a specification or algorithm.
What about Emoji?
What about ANSI colors?
Control characters?
What do you want to do about ambiguous characters?
Which Unicode version do you want to go with?

Why should remark-lint align differently than GitHub?

wcwidth is indeed interesting. There is also https://www.npmjs.com/package/wcwidth, which is a JavaScript version, but not maintained. I have also used string-width in the past.

We do support configurable functions in remark-gfm: https://github.com/remarkjs/remark-gfm#example-stringlength.
I am open to such functionality of course.
But it needs to work well.

And, could there be a good default that works for every user?

@simnalamburt
Copy link
Author

Please explain or link to a specification or algorithm.

OK.

Specification

Algorithm

In order of decreasing precedence:

  1. In the following cases, the width of a string differs from the sum of the widths of its constituent characters:
    • The sequence "\r\n" has width 1.
    • Emoji-specific ligatures:
    • Script-specific ligatures:
      • For all the following ligatures, the insertion of any number of default-ignorable combining marks anywhere in the sequence will not change the total width. In addition, for all non-Arabic ligatures, the insertion of any number of '\u{200D}' ZERO WIDTH JOINERs will not affect the width.
      • Arabic: A character sequence consisting of one character with Joining_Group=Lam, followed by any number of characters with Joining_Type=Transparent, followed by one character with Joining_Group=Alef, has total width 1. For example: لا‎, لآ‎, ڸا‎, لٟٞأ
      • Buginese: "\u{1A15}\u{1A17}\u{200D}\u{1A10}" (<a, -i> ya, ᨕᨗ‍ᨐ) has total width 1.
      • Hebrew: "א\u{200D}ל" (Alef-Lamed, א‍ל) has total width 1.
      • Khmer: Coeng signs consisting of '\u{17D2}' followed by a character in '\u{1780}'..='\u{1782}' | '\u{1784}'..='\u{1787}' | '\u{1789}'..='\u{178C}' | '\u{178E}'..='\u{1793}' | '\u{1795}'..='\u{1798}' | '\u{179B}'..='\u{179D}' | '\u{17A0}' | '\u{17A2}' | '\u{17A7}' | '\u{17AB}'..='\u{17AC}' | '\u{17AF}' have width 0.
      • Lisu: Tone letter combinations consisting of a character in the range '\u{A4F8}'..='\u{A4FB}' followed by a character in the range '\u{A4FC}'..='\u{A4FD}' have width 1. For example: ꓹꓼ
      • Old Turkic: "\u{10C32}\u{200D}\u{10C03}" (𐰲‍𐰃) has total width 1.
      • Tifinagh: A sequence of a Tifinagh consonant in the range '\u{2D31}'..='\u{2D65}' | '\u{2D6F}', followed by either '\u{2D7F}' TIFINAGH CONSONANT JOINER or '\u{200D}', followed by another Tifinangh consonant, has total width 1. For example: ⵏ⵿ⴾ
    • In an East Asian context only, <, =, or > have width 2 when followed by '\u{0338}' COMBINING LONG SOLIDUS OVERLAY. The two characters may be separated by any number of characters whose canonical decompositions consist only of characters meeting one of the following requirements:
  2. In all other cases, the width of the string equals the sum of its character widths:
    1. '\u{2D7F}' TIFINAGH CONSONANT JOINER has width 1 (outside of the ligatures described previously).
    2. '\u{115F}' HANGUL CHOSEONG FILLER and '\u{17A4}' KHMER INDEPENDENT VOWEL QAA have width 2.
    3. '\u{17D8}' KHMER SIGN BEYYAL has width 3.
    4. The following have width 0:
    5. Characters with an East_Asian_Width of Fullwidth or Wide have width 2.
    6. Characters fulfilling all of the following conditions have width 2 in an East Asian context, and width 1 otherwise:
    7. All other characters have width 1.

@wooorm
Copy link
Member

wooorm commented Nov 1, 2024

That‘s 1 of the 30 questions I asked. What about the others?

@simnalamburt
Copy link
Author

simnalamburt commented Nov 2, 2024

The discussions in this issue tracker are becoming increasingly frustrating.

I mentioned that I had already created a patch because I thought you might find it cumbersome to fix this issue: 1e659c2.

I believe I have sufficiently demonstrated that full-width and half-width characters are standardized concepts in Unicode and how they should operate. I think I have clearly explained in my response on September 27 how to handle Emojis, ANSI colors, Control characters, and ambiguous characters, as well as which Unicode version to follow.

I believe that correctly handling full-width characters is a good default, but I have already stated on March that I agree not to change the default because doing so would break backward compatibility.

remark-gfm already has the functionality to customize stringLength with a function of your choice: https://github.com/remarkjs/remark-gfm#example-stringlength. I simply want the same option to be available in remark-lint. (This way, the output generated by remark-gfm will automatically pass the checks by remark-lint.) By doing so, existing users who are not using the stringLength option will not experience any changes in behavior. Defining stringLength becomes the responsibility of the user rather than the library, so the maintainers of remark-lint do not need to follow-up the complex Unicode standards.

Please provide me a rough guide on how to add the stringLength option to remark-lint. Then I will create a patch.

@finalchild

This comment has been minimized.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🤞 phase/open Post is being triaged manually
Development

No branches or pull requests

4 participants