home chevron_right
Markdown‑aware Translation

Markdown-aware Translationlink

Docara’s translator (in core) extracts text from Markdown, sends it to Azure, and writes translations back without breaking markup.


Where it happenslink

  • Class: core translator (see Simai\Docara\Translation)
  • Entry: generateTranslateContent(string $file, string $lang): string
  • Pre-step: front matter is handled separately (frontMatterParser(), translateFromMatter()).

Parser setuplink

CommonMark environment with Custom Tags + core extensions, using a MarkdownParser (not converter):

private function initParser(): void
{
    $environment = new Environment([]);
    $environment->addExtension(new CustomTagsExtension($this->registry));
    $environment->addExtension(new CommonMarkCoreExtension());
    $environment->addExtension(new FrontMatterExtension());
    $this->parser = new MarkdownParser($environment);
}

Collecting text nodeslink

Parse to AST, walk, and collect Text nodes only; code blocks/inline code are separate node types and skipped.

$document  = $this->parser->parse($file);
$textNodes = [];
$walker = $document->walker();
while ($event = $walker->next()) {
    $node = $event->getNode();
    if ($event->isEntering() && ($node instanceof Text)) {
        $text = trim($node->getLiteral());
        if ($text !== '') {
            $textNodes[] = $node;
        }
    }
}

Line ranges of a text segmentlink

Bubble up to the nearest AbstractBlock to get start/end lines:

private function getNodeLines(Node $node): array
{
    $parent = $node;
    $range  = ['start' => 0, 'end' => 0];
    while ($parent !== null && !$parent instanceof AbstractBlock) {
        $parent = $parent->parent();
    }
    if ($parent !== null) {
        if (method_exists($parent, 'getStartLine')) $range['start'] = $parent->getStartLine();
        if (method_exists($parent, 'getEndLine'))   $range['end']   = $parent->getEndLine();
    }
    return $range;
}

Filtering non-linguistic stringslink

Skip strings without letters:

if (!preg_match('/\p{L}/u', $text)) continue; // skip numbers, symbols, etc.

Build the candidate list:

$textsToTranslateArray[] = [
  'text'  => $text,
  'start' => $lines['start'],
  'end'   => $lines['end'],
];

Cache passlink

Replace any strings found in cache; send only misses:

$flatten = array_map(fn($x) => $x['text'], $textsToTranslateArray);
[$cachedIdx, $flatten] = $this->checkCached($flatten, $lang);
$keys      = array_keys($textsToTranslateArray);
$keysAssoc = array_flip($cachedIdx);
$extracted = array_intersect_key($textsToTranslateArray, $keysAssoc);

foreach ($extracted as $k => $val) {
    $extracted[$k]['translated'] = $flatten[$k];
}

$textsToTranslateArray = array_values(array_diff_key($textsToTranslateArray, $keysAssoc));

Cache keys are SHA-1 over normalized source strings (normalize() strips CRLF and collapses whitespace).


Batching & sendinglink

Split remaining items into ~9,000-char chunks, call Azure, then throttle by characters-per-minute:

$chunks = $this->chunkTextArray($textsToTranslateArray);
$finalTranslated = [];
foreach ($chunks as $chunk) {
    $translatedChunk   = $this->translateText($chunk, $lang); // uses curlRequest()
    $finalTranslated   = array_merge($finalTranslated, $translatedChunk);
    $chars = array_sum(array_map(fn($c) => mb_strlen($c['text']), $chunk));
    $this->throttleByCharsPerMinute($chars);
}

translateText() maps responses back by index and updates cache:

foreach ($textsToTranslate as $i => &$original) {
    $original['translated'] = $translateData[$i]['translations'][0]['text'] ?? $original['text'];
    $this->setCached($toLang, $original['translated'], $original['text']);
}

Re-assembling results in original orderlink

Merge cached hits and fresh translations, aligned to original indices:

$finalBlock = $finalTranslated; // API results
$i = 0;
foreach ($keys as $k) {
    if (array_key_exists($k, $extracted)) {
        $finalTranslated[$k] = $extracted[$k];
    } else {
        $finalTranslated[$k] = $finalBlock[$i++];
    }
}

Bottom-up replacement by line rangeslink

Normalize EOLs, split into lines, then apply edits from bottom to top:

$normalized = str_replace("\r\n", "\n", $file);
$lines = preg_split('/\R/u', $normalized);

foreach (array_reverse($finalTranslated) as $block) {
    $start = $block['start'];
    $end   = $block['end'];
    $slice = implode("\n", array_slice($lines, $start - 1, $end - $start + 1));

    $replaced = $this->replace_last_literal($slice, $block['text'], $block['translated']);
    $replacedLines = explode("\n", $replaced);

    array_splice($lines, $start - 1, $end - $start + 1, $replacedLines);
}

return implode("\n", $lines);

Helper:

private function replace_last_literal(string $haystack, string $search, string $replace): string {
    $pos = mb_strrpos($haystack, $search);
    if ($pos === false) return $haystack;
    return mb_substr($haystack, 0, $pos)
         . $replace
         . mb_substr($haystack, $pos + mb_strlen($search));
}

Using the last occurrence reduces the chance of touching earlier duplicates within the same block when multiple Text nodes share identical content.


What remains untouchedlink

  • Code blocks (FencedCode, IndentedCode) and inline code (Code).
  • URLs and link/image destinations; only human-readable labels/alt text are translated.
  • Custom tag attributes; only inner text content is processed.

Edge cases & noteslink

  • Start/end lines = 0: if a node’s ancestor doesn’t expose line info, start/end may be 0. Guard against negative indices when slicing; CommonMark block nodes usually provide line numbers.
  • Duplicate phrases in one range: we target the last match in the block. For precise targeting of multiple identical phrases, add column offsets.
  • CRLF: input is normalized to LF; output is joined with \n.

Safety checklistlink

  • Gather only Text nodes (instanceof Text).
  • Skip non-linguistic strings (/\p{L}/u).
  • De-dupe via cache before sending to the provider.
  • Batch by size and throttle by CPM.
  • Replace bottom-up using captured line ranges.
  • Persist caches after the run.

Related code pathslink

  • Front matter: frontMatterParser(), translateFromMatter()
  • PHP arrays: translateLangFiles(), generateSettingsTranslate(), makeContent()
  • Azure calls: curlRequest(), translateText()