Page MenuHomePhorge

Remarkup Code-block: parse language specifier in markdown
ClosedPublic

Authored by valerio.bozzolan on Jun 15 2023, 20:55.

Details

Summary

We add support to code blocks with the language expressed as GitLab/GitHub/StackOverflow/...
"flavored markdown".

So we support this syntax: (to avoid confusion see it online on the Diff)

```php
$asd = 1;
```

Before this change, this was the only supposed syntax in Remarkup, with an explicit "lang=":

```lang=php
$asd = 1;
```

This change introduces a minor risk to eat legitimate Remarkup content, since Remarkup allows
to do a multi-line in this way:

```$asd = 1;
$asd = 2;```

The above example still works, but, there is a chance that hardcore Remarkup people
have a problem when doing a code block to mention programming languages.

In short, this can be problematic since "cpp" will be eaten from this list:

```cpp
php
python
```

Using the above example is not socially nice because it is not usable in GitLab, GitHub and Stack Overflow.

If your first line is eaten:

Just *add* a newline on the top to reach a valid raw Markdown list (suggested, valid in Remarkup + Markdown):

```
cpp
php
python
```

Or, just add "text" to specify that as language (suggested, valid in Remarkup + Markdown):

```text
cpp
php
python
```

Or, just *remove* a newline from the bottom to reach a valid raw Remarkup list (Remarkup-only):

```cpp
php
python```

Or, just specify that you are writing in the language "text" (Remarkup-only):

```lang=text
cpp
php
python```

To reduce impact and help you, the logic of this strict implementation is:

  • must have backticks
  • must not have any valid remarkup option, like lang=, counterexample, etc.
  • must not have content in the same line of the last backticks
  • must have a known language in our proposed subset

If everything is OK, we remove that language from the content since it would be otherwise displayed.

Interestingly, this could improve performance when rendering README files or snippets from external
websites, since - in case - we do not need to guess the language using our deep dark magic.

Closes T15481

Test Plan

We added some nice unit tests. Ensure that this test passes:

PhutilRemarkupEngineTestCase::testEngine

Optionally, take vision of these, before and after:

https://we.phorge.it/P16

Change the test plan slightly every time, to make sure it is not in your cache.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

I didn't read the whole novel in the description, but keep in mind that remarkup is very performance sensitive, so make sure not to add any complex algorithms.

src/infrastructure/markup/blockrule/PhutilRemarkupCodeBlockRule.php
48

In short I just added a (.*)

valerio.bozzolan retitled this revision from Remarkup: add support to code blocks in "GitHub Flavored Markdown" to Remarkup: add support to code blocks in "GitHub/GitLab Flavored Markdown".Jun 16 2023, 10:30
valerio.bozzolan edited the summary of this revision. (Show Details)

(Premising that I liked to write a wall of text against Microsoft, randomly)

Thanks avivey, now the change itself is minimal and operative.

Just an additional regex, and very only if there was no language.

src/infrastructure/markup/blockrule/PhutilRemarkupCodeBlockRule.php
67

Note that head($lines) was not adopted since it contains whatever first line of content after heavy trim, so it's not necessarily the thing after the backticks.

valerio.bozzolan edited the test plan for this revision. (Show Details)
  1. Please move the "what is GHFM" chapters into the task.
  2. I'm updating the title to be more descriptive of the change itself.
  3. Instead of manually testing, add test elements into src/infrastructure/markup/remarkup/__tests__/remarkup/
src/infrastructure/markup/blockrule/PhutilRemarkupCodeBlockRule.php
47–48

$header_line is probably better description.

avivey retitled this revision from Remarkup: add support to code blocks in "GitHub/GitLab Flavored Markdown" to Remarkup Code-block: Parse "GitHub-Flavored-Markdown" language specifier.Jun 24 2023, 09:31

rename a variable, add a unit test file

Interestingly, whatever fancy stuff I put inside the file tick-block-multi-flavored-markdown.txt, the unit test from arc unit is always green to me :(

Running unit tests...
   PASS    2ms★  PhabricatorAnchorTestCase::testAnchors
   PASS   <1ms★  PhabricatorMarkupEngineTestCase::testRemarkupSentenceSummmaries
   PASS   <1ms★  PhutilMarkupTestCase::testAppendHTML
   PASS   <1ms★  PhutilMarkupTestCase::testURIPathComponentEscape
   PASS   <1ms★  PhutilMarkupTestCase::testTagBasics
   PASS   <1ms★  PhutilMarkupTestCase::testTagEmpty
   PASS   <1ms★  PhutilMarkupTestCase::testTagAttributes
   PASS   <1ms★  PhutilMarkupTestCase::testArrayEscaping
   PASS   <1ms★  PhutilMarkupTestCase::testTagDefaults
   PASS   <1ms★  PhutilMarkupTestCase::testTagEscapes
   PASS   <1ms★  PhutilMarkupTestCase::testTagJavascriptProtocolRejection
   PASS   <1ms★  PhutilMarkupTestCase::testHsprintf
   PASS   <1ms★  PhutilMarkupTestCase::testTagNullAttribute
   PASS   <1ms★  PhutilMarkupTestCase::testURIEscape
   SKIP  PhutilSafeHTMLTestCase::testOperator
Operator extension not available.
   PASS   <1ms★  PhutilTranslatedHTMLTestCase::testHTMLTranslations
   PASS  670ms   PhabricatorLibraryTestCase::testMethodVisibility
   PASS    3ms★  PhabricatorLibraryTestCase::testEverythingImplemented
   PASS  178ms   PhabricatorLibraryTestCase::testLibraryMap
   PASS   82ms   PhabricatorCelerityTestCase::testCelerityMaps
   PASS   10ms★  PhabricatorConduitTestCase::testConduitMethods
   PASS   <1ms★  PhabricatorInfrastructureTestCase::testApplicationsInstalled
   PASS   <1ms★  PhabricatorInfrastructureTestCase::testRejectMySQLNonUTF8Queries
 UNIT OKAY  No unit test failures.

add unit tests, and force execution

  • update the test plan
src/infrastructure/markup/blockrule/PhutilRemarkupCodeBlockRule.php
48

Declare $matches = null.

src/infrastructure/markup/remarkup/__tests__/PhutilRemarkupEngineTestCase.php
5

File a ticket? there's also something about celerity test not being invoked correctly.

valerio.bozzolan added inline comments.
src/infrastructure/markup/remarkup/__tests__/remarkup/tick-block-multi.txt
1
IMPORTANT: Having a single word "code" here is NOT compatible with StackOverflow, GitLab and GitHub, so it should not be encouraged anymore to avoid social impacts. That should be parsed as an heading, not as content.

So I changed this unit test to have "code code" (two words). So, not ambiguous.

src/infrastructure/markup/remarkup/__tests__/remarkup/tick-block-multi.txt
1

Please keep the old behavior and test. You can copy it to a new one if needed.

Compatibility with Markdown is an explicit non-goal for Remarkup, so we shouldn't change existing behavior.

src/infrastructure/markup/remarkup/__tests__/remarkup/tick-block-multi.txt
1

The patch should be abandoned in that case since code is a syntactically valid language code in that position for some billion people in Stack Overflow, GitHub, GitLab, and so on :(

The current blocking is: the lack of a function like is_a_recognized_language($lang). Without that, we cannot pass this test.

src/infrastructure/markup/remarkup/__tests__/remarkup/tick-block-multi.txt
1
  1. In the context of this input:


```foo
bar
```

Is foo being displayed when it's not on a list of known languages?

  1. is code a name of a known language?
  1. Can we limit the list of supported languages to the top 10, or to the list the install actually knows (we have such a list somewhere in the highlighting code)? And only consume the line if it matches this list?
valerio.bozzolan added inline comments.
src/infrastructure/markup/remarkup/__tests__/remarkup/tick-block-multi.txt
1

Uhm. Talking strictly for GitHub, GitLab and StackOverflow:

  1. Nope, whatever thing there is never shown
  2. Nope foo is not a known language.
  3. Yep we can do a semantic check and it would be a lovely compromise

To check the semantic, we have something here:

https://we.phorge.it/source/phorge/browse/master/src/infrastructure/markup/syntax/highlighter/PhutilPygmentsSyntaxHighlighter.php$56

A sublist can maybe created from that.

valerio.bozzolan marked an inline comment as done and an inline comment as not done.Jun 25 2023, 19:44
src/infrastructure/markup/remarkup/__tests__/remarkup/tick-block-multi.txt
1

My question was strictly about this Remarkup + this change.

valerio.bozzolan added inline comments.
src/infrastructure/markup/remarkup/__tests__/remarkup/tick-block-multi.txt
1

Talking strictly for this change as-is:

  1. No the word foo is not shown
  2. Still nope
  3. Still yep

be 70% more strict, and implement an allow-list of very important languages

I have a better feeling now. Thanks for your suggestions

valerio.bozzolan retitled this revision from Remarkup Code-block: Parse "GitHub-Flavored-Markdown" language specifier to Remarkup Code-block: Parse "Flavored Markdown" language specifier.Jun 26 2023, 08:16
valerio.bozzolan edited the summary of this revision. (Show Details)
valerio.bozzolan edited the summary of this revision. (Show Details)
valerio.bozzolan marked an inline comment as done.

declare $matches = null;

  1. It's not called "Flavored Markdown" - that's not a complete name. It's either "XXX-flavored markdown" or just "markdown".
  2. Fix the test plan in the description.
  3. what's the deal with the footer? What does it do?
src/infrastructure/markup/blockrule/PhutilRemarkupCodeBlockRule.php
92

this can just be !$valid_options.
Actually, I think checking === false should fail in the (three-ticks)php case?

95

you use if ($lines) in line 68, and if (count($lines)) here - is there a difference?

289

array(

and even better,

array(
  'xx' => true,
);

to make search faster.

Making this a static function variable might improve performance as well - see https://we.phorge.it/source/phorge/browse/master/src/infrastructure/markup/syntax/highlighter/PhutilPygmentsSyntaxHighlighter.php$56

src/infrastructure/markup/blockrule/PhutilRemarkupCodeBlockRule.php
289

OK thanks. What about shorter 'xx' => 1, ?

So also parsing is faster, saving 3*37 = 111 bytes :D

Since PHP is heavily built over if(1) and if(0) it seems OK to me

src/infrastructure/markup/blockrule/PhutilRemarkupCodeBlockRule.php
289

🤷‍♀️

valerio.bozzolan marked 3 inline comments as done.

add comments to answer questions

valerio.bozzolan added inline comments.
src/infrastructure/markup/blockrule/PhutilRemarkupCodeBlockRule.php
95

Yep there is difference

The check here is "please 2 lines or more".

src/infrastructure/markup/blockrule/PhutilRemarkupCodeBlockRule.php
92

Since we give high-priority to legitimate Remarkup, probably it's nice here to proceed only if we have parsed the options (not null) and if it was not valid (not true). That is why that === false

  1. It's not called "Flavored Markdown" - that's not a complete name. It's either "XXX-flavored markdown" or just "markdown".
  1. what's the deal with the footer? What does it do?
valerio.bozzolan retitled this revision from Remarkup Code-block: Parse "Flavored Markdown" language specifier to Remarkup Code-block: parse language specifier in markdown.Jun 26 2023, 19:57
valerio.bozzolan edited the summary of this revision. (Show Details)
In D25299#9134, @avivey wrote:
  1. It's not called "Flavored Markdown" - that's not a complete name. It's either "XXX-flavored markdown" or just "markdown".
  1. what's the deal with the footer? What does it do?
  1. I tried to improve the title. Feel free to change :(
  2. See src/infrastructure/markup/blockrule/PhutilRemarkupCodeBlockRule.php line 105. The footer is parsed, since the footer must be empty. When it's not empty, it's surely Remarkup and not Markdown. Too risky to proceed.

Sorry, I still don't understand the footer part.

In which cases must it be empty? Who said that?

src/infrastructure/markup/blockrule/PhutilRemarkupCodeBlockRule.php
92

What does the parser returns in these cases:

  1. ```
  2. ```php
  3. this thing:


```
lang=php
...

?

In D25299#9166, @avivey wrote:

Sorry, I still don't understand the footer part.

In which cases must it be empty? Who said that?

Premising the goal is to keep full-support to Remarkup and be strict in giving support to xxx-markdown,

This is the only accepted xxx-flavored-markdown where the backticks are on their dedicated lines:

```language_header
stuff
stuff
```

And having legitimate content "in the footer" is not allowed:


```HEADER
stuff
stuff
extra_stuff_not_supported```

The backticks should be on their own line in xxx-flavored-markdown. Using the counter example will cause the last backticks to be printed on the video in most platforms.

That is why, in the case we have content before the last backticks ("footer"), as extra-check, we can safely consider that as Remarkup, skipping considerations related to xxx-flavored-markdown.

valerio.bozzolan marked an inline comment as done.

OK finally the Diff description is somehow readable from Phorge, without fancy '%%%' or saying (backticks) etc.

This revision is now accepted and ready to land.Jul 4 2023, 09:10