Markdown, Part 2: Standardization Considerations

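Does Markdown need to be standardized? This is a rather contentious question. Some oppose it (such as Markdown's inventor), while others are working toward it (CommonMark). The fact is that no Markdown standard has been agreed on, and Markdown source files from different organizations render to HTML with all sorts of differences (flavors).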

For a writer, though, the cost of adapting to a particular rendering tool is low, and adjusting is easy.

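  1. John Gruber: the official Markdown site of its inventor
  2. Markdown wiki: Wikipedia (Chinese)
  3. Markdown Guide: everything about Markdown
  4. Markdown Community: the Markdown community on GitHub
  5. Markdown introduction: an introductory article on Jianshu, somewhat verbose
  6. CommonMark: the Markdown standardization specification
  7. GFM: GitHub Flavored Markdown Spec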

Standardization

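Over time, many Markdown implementations appeared. Ambiguities in the informal specification drew attention and prompted some developers of Markdown parsers to work toward standardization.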

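Starting in 2012, a group of Markdown enthusiasts proposed a standard, unambiguous syntax specification for Markdown, together with a comprehensive test suite for validating Markdown implementations against it. Because John Gruber strongly opposed standardization, the effort could not use the Markdown name and was renamed CommonMark.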

Gruber: “Because different sites (and people) have different needs. No one syntax would make all happy.”

Overview

CommonMark laid the foundation for Markdown standardization.

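In March 2016, the IETF (Internet Engineering Task Force) published RFC 7763 and RFC 7764 (RFC: Request for Comments). RFC 7763 introduced the MIME (Multipurpose Internet Mail Extensions) type text/markdown, based on the original variant. RFC 7764 discusses and registers variants including MultiMarkdown, GitHub Flavored Markdown (GFM), Pandoc, CommonMark, and other Markdown variants. Inclusion in the RFC series gave Markdown formal recognition within the Internet standards process, although both RFCs are Informational rather than standards-track documents.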
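As a rough illustration of how the text/markdown media type might appear in practice, the sketch below builds and parses a Content-Type value carrying the optional `variant` parameter that RFC 7763 defines. The helper names `markdown_content_type` and `parse_variant` are hypothetical, and the variant identifier `GFM` is assumed to be the registered spelling.

```python
from email.message import Message


def markdown_content_type(variant=None, charset="UTF-8"):
    """Build a Content-Type value for a Markdown document (hypothetical helper)."""
    value = f"text/markdown; charset={charset}"
    if variant:
        # The variant parameter names the Markdown flavor of the content.
        value += f"; variant={variant}"
    return value


def parse_variant(content_type):
    """Return (media type, variant or None) parsed from a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_type(), msg.get_param("variant")


header = markdown_content_type(variant="GFM")
print(header)
print(parse_variant(header))
```

A receiver that does not recognize the variant can still treat the body as plain Markdown text, which is why the parameter is modeled here as optional.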

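In 2017, GitHub released a formal specification of GitHub Flavored Markdown (GFM) based on CommonMark, and it uses Markdown as its default formatting language for writing. GFM follows the CommonMark specification, with tables, strikethrough, autolinks, and task lists added as extensions. The specification reinforced Markdown's role and spurred its adoption, and it is currently the most popular Markdown standard. Because GitHub has a huge user base, most platforms and tools follow its lead and have adopted GFM.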
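To make the four extensions concrete, here is a minimal sketch that guesses which of them a document uses. The regular expressions are rough heuristics chosen for illustration, not the grammar from the GFM spec, and `gfm_extensions_used` is a hypothetical helper name.

```python
import re

# Heuristic patterns, one per GFM extension named above.
# These are assumptions for illustration and will misjudge edge cases.
GFM_EXTENSION_PATTERNS = {
    # A table delimiter row such as |---|:--:|
    "table": re.compile(r"^\s*\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)+\|?\s*$", re.MULTILINE),
    # Text wrapped in double tildes
    "strikethrough": re.compile(r"~~[^~\n]+~~"),
    # A bare URL that is not already inside <...>, (...) or [...]
    "autolink": re.compile(r"(?<![<(\[])\b(?:https?://|www\.)\S+"),
    # A list item that starts with [ ] or [x]
    "task list": re.compile(r"^\s*[-*+]\s+\[[ xX]\]\s+\S", re.MULTILINE),
}


def gfm_extensions_used(text):
    """Return the sorted names of GFM extensions that appear to be used in text."""
    return sorted(
        name for name, pattern in GFM_EXTENSION_PATTERNS.items() if pattern.search(text)
    )


sample = (
    "- [x] done\n- [ ] todo\n\n"
    "| a | b |\n|---|---|\n| 1 | 2 |\n\n"
    "~~old~~ see www.example.com\n"
)
print(gfm_extensions_used(sample))
```

A document for which this returns an empty list is likely to render the same way under a plain CommonMark renderer, though a real check would need an actual parser.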

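One confusing thing about Markdown is that practically every Markdown application implements a slightly different version of it (because Markdown has never had an official standard). These different Markdown styles are called flavors.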

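Only after CommonMark and GFM were published did most applications start treating these two specifications as the standard. So before learning Markdown, find out which flavor your application supports, usually by checking its help documentation or user manual. For example, Typora's documentation states that it supports GFM; the Markdown in Youdao Cloud Notes is probably also based on GFM.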

Pandoc grew out of the need to target many output formats (not HTML alone); its Markdown is not based on CommonMark or GFM.

Variant 1: CommonMark

We propose a standard, unambiguous syntax specification for Markdown, along with a suite of comprehensive tests to validate Markdown implementations against this specification. We believe this is necessary, even essential, for the future of Markdown.

The current version of the CommonMark spec (https://spec.commonmark.org/current) is quite robust after many years of public feedback. (Latest version at the time of writing: 0.30, dated 2021-06-19.)

The following sites and projects have adopted CommonMark:

  • Discourse: Discourse is the 100% open source discussion platform built for the next decade of the Internet. Use it as a mailing list, discussion forum, long-form chat room, and more!

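    Discourse is a new open-source forum project launched by Jeff Atwood. It abandons the topic-discussion format of traditional forums, has a self-learning system, and is a full web application that works on both desktop and mobile devices. It is built with Ruby on Rails and Ember.js, and uses PostgreSQL and Redis for data storage. Discourse is a from-scratch reboot, an attempt to reimagine what a modern Internet discussion forum should be today, in a world of ubiquitous smartphones, tablets, Facebook, and Twitter.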

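  • GitHub: github.com

  • GitLab | JiHu (极狐): gitlab.com | gitlab.cn

    GitLab is a web-based Git repository management tool released under the MIT license, with wiki and issue-tracking features; it is a web service built on top of Git as the code management tool. In February 2022, JiHu (GitLab) officially announced JiHu GitLab SaaS (JihuLab.com), offering users in China a full-stack, integrated DevOps SaaS platform, from source code hosting to development and operations, along with enterprise-grade expert consulting.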

  • Reddit: reddit.com

    Reddit is home to thousands of communities, endless conversation, and authentic human connection. Whether you're into breaking news, sports, TV fan theories, or a never-ending stream of the internet's cutest animals, there's a community on Reddit for you.

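  • Qt: qt.io. Qt is a cross-platform C++ application development framework, widely used for building GUI programs and also usable for console tools and servers. Qt uses standard C++ plus a special code-generation extension (the Meta Object Compiler, moc) and a number of macros.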

    The Productivity Platform for the Future
    • Next generation user experience with limitless scalability.
    • Qt is designed for producing cutting-edge software experiences in record-breaking times.
    Design, Develop & Deploy User Interfaces and Applications
    • Qt is the fastest and smartest way to produce industry-leading software that users love.
    • Target embedded, desktop, and mobile platforms with the same code base for all.
  • Stack Overflow / Stack Exchange: stackoverflow.com

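    Stack Overflow is a question-and-answer site for programming. Stack Exchange is a network of Q&A sites, each covering a different field; these sites are modeled on Stack Overflow, which was also the first member of Stack Exchange.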

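  • Swift: swift.org. A compiled, multi-paradigm programming language used to write software for macOS/OS X, iOS, iPadOS, watchOS, and tvOS. Apple released Swift in 2014, and it coexists with Objective-C on Apple's operating systems.

    Apple describes Swift as fast, modern, safe, and interactive, and as clearly superior to Objective-C.

    Swift removes Objective-C's pointers and other unsafe accesses, drops the Smalltalk-style syntax that early Objective-C adopted, and switches entirely to dot notation. Swift has type inference. It also provides namespaces, generics, and operator overloading similar to those in C++ and C#. Swift has been summed up as “Objective-C without the C”.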

    Swift is a general-purpose programming language built using a modern approach to safety, performance, and software design patterns. Swift is intended as a replacement for C-based languages (C, C++, and Objective-C).

Variant 2: GFM

GitHub Flavored Markdown, often shortened as GFM, is the dialect of Markdown that is currently supported for user content on GitHub.com and GitHub Enterprise.

This formal specification, based on the CommonMark Spec, defines the syntax and semantics of this dialect.

GFM is a strict superset of CommonMark. All the features which are supported in GitHub user content and that are not specified on the original CommonMark Spec are hence known as extensions, and highlighted as such.

Why Markdown Needs a Standard Specification

The following is taken from the GFM documentation.

John Gruber’s canonical description of Markdown’s syntax does not specify the syntax unambiguously. Here are some examples of questions it does not answer:

  1. How much indentation is needed for a sublist? The spec says that continuation paragraphs need to be indented four spaces, but is not fully explicit about sublists. It is natural to think that they, too, must be indented four spaces, but Markdown.pl does not require that. This is hardly a “corner case”, and divergences between implementations on this issue often lead to surprises for users in real documents.

  2. Is a blank line needed before a block quote or heading? Most implementations do not require the blank line. However, this can lead to unexpected results in hard-wrapped text, and also to ambiguities in parsing (note that some implementations put the heading inside the blockquote, while others do not).

  3. Is a blank line needed before an indented code block? (Markdown.pl requires it, but this is not mentioned in the documentation, and some implementations do not require it.)

  4. What is the exact rule for determining when list items get wrapped in <p> tags? Can a list be partially “loose” and partially “tight”? What should we do with a list like this?

    1. one
    
    2. two
    3. three

    Or this?

    1.  one
        - a
    
        - b
    2.  two
  5. Can list markers be indented? Can ordered list markers be right-aligned?

     8. item 1
     9. item 2
    10. item 2a
  6. Is this one list with a thematic break in its second item, or two lists separated by a thematic break?

    * a
    * * * * *
    * b
  7. When list markers change from numbers to bullets, do we have two lists or one? (The Markdown syntax description suggests two, but the perl scripts and many other implementations produce one.)

    1. fee
    2. fie
    -  foe
    -  fum
  8. What are the precedence rules for the markers of inline structure? For example, is the following a valid link, or does the code span take precedence?

    [a backtick (`)](/url) and [another backtick (`)](/url).
  9. What are the precedence rules for markers of emphasis and strong emphasis? For example, how should the following be parsed?

    *foo *bar* baz*
  10. What are the precedence rules between block-level and inline-level structure? For example, how should the following be parsed?

    - `a long code span can contain a hyphen like this
      - and it can screw things up`
  11. Can list items include section headings? (Markdown.pl does not allow this, but does allow blockquotes to include headings.)

    - # Heading
  12. Can list items be empty?

    * a 
    * 
    * b 
  13. Can link references be defined inside block quotes or list items?

    > Blockquote [foo].
    >
    > [foo]: /url
  14. If there are multiple definitions for the same reference, which takes precedence?

    [foo]: /url1
    [foo]: /url2
    
    [foo][]

Because there is no unambiguous spec, implementations have diverged considerably.

As a result, users are often surprised to find that a document that renders one way on one system (say, a GitHub wiki) renders differently on another (say, converting to docbook using pandoc).
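Question 14 is one that CommonMark settles explicitly: when several definitions match the same label, the first one takes precedence, and labels are compared case-insensitively. The sketch below shows that rule on a deliberately simplified notion of a link reference definition. The pattern `DEFINITION` and the helper `collect_link_definitions` are hypothetical and ignore titles, angle-bracket destinations, and multi-line definitions.

```python
import re

# Simplified assumption: "[label]: destination" at the start of a line,
# indented by at most three spaces.
DEFINITION = re.compile(r"^ {0,3}\[([^\]]+)\]:\s*(\S+)", re.MULTILINE)


def collect_link_definitions(text):
    """Map each normalized label to its destination; the first definition wins."""
    definitions = {}
    for label, destination in DEFINITION.findall(text):
        # Collapse internal whitespace and fold case so labels compare equal.
        key = " ".join(label.split()).casefold()
        # setdefault keeps an existing entry, so later duplicates are ignored.
        definitions.setdefault(key, destination)
    return definitions


document = "[foo]: /url1\n[foo]: /url2\n\n[foo][]\n"
print(collect_link_definitions(document))
```

With the example from question 14, the reference `[foo][]` therefore resolves to `/url1` under this rule.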

GFM Preliminaries

Characters and lines

Any sequence of characters is a valid CommonMark document. A character is a Unicode code point.

This spec does not specify an encoding; it thinks of lines as composed of characters rather than bytes. A conforming parser may be limited to a certain encoding.

A line is a sequence of zero or more characters other than newline (U+000A) or carriage return (U+000D), followed by a line ending or by the end of file.

A line ending is a newline (U+000A), a carriage return (U+000D) not followed by a newline, or a carriage return and a following newline.

A line containing no characters, or a line containing only spaces (U+0020) or tabs (U+0009), is called a blank line.
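The definitions of line, line ending, and blank line can be sketched in a few lines of Python. The helper names `split_lines` and `is_blank_line` are hypothetical; the sketch only mirrors the wording above and is not taken from any reference implementation.

```python
import re

# CRLF is listed first so that "\r\n" counts as a single line ending.
LINE_ENDING = re.compile(r"\r\n|\r|\n")


def split_lines(text):
    """Split text into lines on newline, carriage return, or CR followed by LF."""
    lines = LINE_ENDING.split(text)
    # A trailing line ending terminates the last line; it does not start a new one.
    if lines and lines[-1] == "":
        lines.pop()
    return lines


def is_blank_line(line):
    """A blank line has no characters, or only spaces (U+0020) and tabs (U+0009)."""
    return all(ch in " \t" for ch in line)


lines = split_lines("a\r\nb\rc\n\n \t\nd")
print(lines)
print([is_blank_line(line) for line in lines])
```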

A whitespace character is a space (U+0020), tab (U+0009), newline (U+000A), line tabulation (U+000B), form feed (U+000C), or carriage return (U+000D). Note: tab (U+0009) is also known as common tab, horizontal tabulation (HT), or character tabulation; line tabulation (U+000B) is also known as vertical tabulation (VT).

Whitespace is a sequence of one or more whitespace characters.

A Unicode whitespace character is any code point in the Unicode Zs (Separator, Space) general category, or a tab (U+0009), carriage return (U+000D), newline (U+000A), or form feed (U+000C).

Unicode whitespace is a sequence of one or more Unicode whitespace characters.

A space is U+0020.

A non-whitespace character is any character that is not a whitespace character.
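The two whitespace definitions differ in a subtle way, which a short sketch makes visible. The predicate names below are hypothetical; the sets simply transcribe the code points listed above, and `unicodedata.category` from the Python standard library supplies the general category.

```python
import unicodedata

# The six code points listed in the definition of a whitespace character.
WHITESPACE_CHARS = {"\u0020", "\u0009", "\u000A", "\u000B", "\u000C", "\u000D"}


def is_whitespace_char(ch):
    """True for space, tab, newline, line tabulation, form feed, carriage return."""
    return ch in WHITESPACE_CHARS


def is_unicode_whitespace_char(ch):
    """True for any Zs code point, or tab, carriage return, newline, form feed."""
    return unicodedata.category(ch) == "Zs" or ch in "\t\r\n\f"


for ch in ["\u0020", "\u000B", "\u00A0", "\u3000"]:
    print(f"U+{ord(ch):04X}", is_whitespace_char(ch), is_unicode_whitespace_char(ch))
```

Read literally, line tabulation (U+000B) is a whitespace character but not a Unicode whitespace character, while no-break space (U+00A0) and ideographic space (U+3000) are the reverse.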

An ASCII punctuation character is !, ", #, $, %, &, ', (, ), *, +, ,, -, ., / (U+0021–2F), :, ;, <, =, >, ?, @ (U+003A–0040), [, \, ], ^, _, ` (U+005B–0060), {, |, }, or ~ (U+007B–007E).

A punctuation character is an ASCII punctuation character or anything in the general Unicode categories Pc (Punctuation, Connector), Pd (Punctuation, Dash), Pe (Punctuation, Close), Pf (Punctuation, Final quote), Pi (Punctuation, Initial quote), Po (Punctuation, Other), or Ps (Punctuation, Open).
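A minimal sketch of the punctuation test, assuming single-character inputs. Python's `string.punctuation` happens to contain exactly the 32 ASCII punctuation characters listed above; the function names are hypothetical.

```python
import string
import unicodedata

# The seven Unicode general categories named in the definition.
UNICODE_PUNCT_CATEGORIES = {"Pc", "Pd", "Pe", "Pf", "Pi", "Po", "Ps"}


def is_ascii_punctuation(ch):
    """True if ch is one of the 32 ASCII punctuation characters."""
    return ch in string.punctuation


def is_punctuation(ch):
    """True for ASCII punctuation or any code point in a P* general category."""
    return (
        is_ascii_punctuation(ch)
        or unicodedata.category(ch) in UNICODE_PUNCT_CATEGORIES
    )


for ch in ["`", "$", "。", "「", "€", "a"]:
    print(ch, is_punctuation(ch))
```

Note that `$` counts as punctuation only because it is in the ASCII list (its Unicode category is Sc, a symbol), whereas `€` is in neither set.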

Tabs

Tabs in lines are not expanded to spaces. However, in contexts where whitespace helps to define block structure, tabs behave as if they were replaced by spaces with a tab stop of 4 characters.

Thus, for example, a tab can be used instead of four spaces in an indented code block. (Note, however, that internal tabs are passed through as literal tabs, not expanded to spaces.)

Normally the > that begins a block quote may be followed optionally by a space, which is not considered part of the content. In the following case > is followed by a tab, which is treated as if it were expanded into three spaces. Since one of these spaces is considered part of the delimiter, foo is considered to be indented six spaces inside the block quote context, so we get an indented code block starting with two spaces. (Note: the two tabs after > are equivalent to 7 spaces, 3 + 4. The first space counts as part of the > delimiter, the next four open an indented code block, and the remaining two are left at the start of the code block's content.)

  foo
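The arithmetic in the block quote example can be checked with a small sketch. It assumes the input line is `>` followed by two tabs and `foo`, and it expands tabs only to measure indentation; a real parser would not rewrite the tabs in the content. The function name `expand_tabs` is hypothetical.

```python
def expand_tabs(line, tab_stop=4):
    """Replace each tab with spaces up to the next tab stop (for measuring only)."""
    out = []
    column = 0
    for ch in line:
        if ch == "\t":
            # A tab advances to the next multiple of tab_stop.
            width = tab_stop - (column % tab_stop)
            out.append(" " * width)
            column += width
        else:
            out.append(ch)
            column += 1
    return "".join(out)


expanded = expand_tabs(">\t\tfoo")   # '>' at column 0, then tabs to columns 4 and 8
after_marker = expanded[1:]          # 7 spaces + "foo"
content = after_marker[1:]           # drop the space that belongs to the delimiter
code = content[4:]                   # drop the 4 spaces that open the code block
print(repr(expanded))
print(repr(code))
```

The first tab is worth three spaces because `>` already occupies column 0; the second is worth four. The result agrees with Python's built-in `str.expandtabs(4)`.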

Insecure characters

For security reasons, the Unicode character U+0000 must be replaced with the REPLACEMENT CHARACTER (U+FFFD).
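In Python this rule amounts to a single replacement applied before parsing; a minimal sketch, with `replace_insecure_characters` as a hypothetical helper name:

```python
def replace_insecure_characters(text):
    """Replace U+0000 with the REPLACEMENT CHARACTER (U+FFFD)."""
    return text.replace("\u0000", "\uFFFD")


print(repr(replace_insecure_characters("a\u0000b")))
```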