Markdown-aware Translation
Docara’s translator (in core) extracts text from Markdown, sends it to Azure, and writes translations back without breaking markup.
Where it happens
- Class: core translator (see Simai\Docara\Translation)
- Entry: generateTranslateContent(string $file, string $lang): string
- Pre-step: front matter is handled separately (frontMatterParser(), translateFromMatter()).
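A hypothetical call site, for orientation only — how the translator is constructed and where files come from are assumptions; only the generateTranslateContent() signature above is taken from the code:
use Simai\Docara\Translation;

// Hypothetical wiring – constructor dependencies are not shown in this section.
$translator = new Translation(/* e.g. custom-tag registry, Azure credentials */);

$markdown = file_get_contents('docs/en/getting-started.md');
$translated = $translator->generateTranslateContent($markdown, 'de');

file_put_contents('docs/de/getting-started.md', $translated);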
Parser setup
A CommonMark environment with the Custom Tags extension plus the core extensions, using a MarkdownParser (not a converter):
private function initParser(): void
{
    $environment = new Environment([]);
    $environment->addExtension(new CustomTagsExtension($this->registry));
    $environment->addExtension(new CommonMarkCoreExtension());
    $environment->addExtension(new FrontMatterExtension());
    $this->parser = new MarkdownParser($environment);
}
Collecting text nodes
Parse to an AST, walk it, and collect Text nodes only; code blocks and inline code are separate node types and are skipped.
$document = $this->parser->parse($file);
$textNodes = [];
$walker = $document->walker();
while ($event = $walker->next()) {
    $node = $event->getNode();
    if ($event->isEntering() && ($node instanceof Text)) {
        $text = trim($node->getLiteral());
        if ($text !== '') {
            $textNodes[] = $node;
        }
    }
}
Line ranges of a text segment
Bubble up to the nearest AbstractBlock to get start/end lines:
private function getNodeLines(Node $node): array
{
    $parent = $node;
    $range = ['start' => 0, 'end' => 0];
    while ($parent !== null && !$parent instanceof AbstractBlock) {
        $parent = $parent->parent();
    }
    if ($parent !== null) {
        if (method_exists($parent, 'getStartLine')) {
            $range['start'] = $parent->getStartLine();
        }
        if (method_exists($parent, 'getEndLine')) {
            $range['end'] = $parent->getEndLine();
        }
    }
    return $range;
}
Filtering non-linguistic strings
Skip strings without letters:
if (!preg_match('/\p{L}/u', $text)) continue; // skip numbers, symbols, etc.
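A few illustrative inputs for the filter (example strings only):
preg_match('/\p{L}/u', '1.2.3');    // 0 → skipped
preg_match('/\p{L}/u', '-->');      // 0 → skipped
preg_match('/\p{L}/u', 'Héllo 42'); // 1 → kept: any Unicode letter counts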
Build the candidate list:
$textsToTranslateArray[] = [
    'text' => $text,
    'start' => $lines['start'],
    'end' => $lines['end'],
];
Cache pass
Replace any strings found in the cache; send only the misses:
$flatten = array_map(fn($x) => $x['text'], $textsToTranslateArray);
[$cachedIdx, $flatten] = $this->checkCached($flatten, $lang);
$keys = array_keys($textsToTranslateArray);
$keysAssoc = array_flip($cachedIdx);
$extracted = array_intersect_key($textsToTranslateArray, $keysAssoc);
foreach ($extracted as $k => $val) {
    $extracted[$k]['translated'] = $flatten[$k];
}
$textsToTranslateArray = array_values(array_diff_key($textsToTranslateArray, $keysAssoc));
Cache keys are SHA-1 hashes of the normalized source strings (normalize() strips CRLF and collapses whitespace).
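normalize() and the cache helpers are not reproduced in this section; the sketch below shows one shape they could take, given the SHA-1-over-normalized-text description above (the in-memory $cache property and the exact signatures are assumptions):
private array $cache = []; // assumed shape: [$lang][sha1(normalized source)] => translation

private function normalize(string $text): string
{
    $text = str_replace(["\r\n", "\r"], "\n", $text); // strip CRLF
    return trim(preg_replace('/\s+/u', ' ', $text));  // collapse whitespace
}

private function checkCached(array $texts, string $lang): array
{
    $cachedIdx = [];
    foreach ($texts as $i => $text) {
        $key = sha1($this->normalize($text));
        if (isset($this->cache[$lang][$key])) {
            $cachedIdx[] = $i;                      // remember which indices were hits
            $texts[$i] = $this->cache[$lang][$key]; // swap the source for the cached translation
        }
    }
    return [$cachedIdx, $texts];
}

private function setCached(string $lang, string $translated, string $source): void
{
    $this->cache[$lang][sha1($this->normalize($source))] = $translated;
}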
Batching & sending
Split the remaining items into ~9,000-character chunks, call Azure per chunk, then throttle by characters per minute:
$chunks = $this->chunkTextArray($textsToTranslateArray);
$finalTranslated = [];
foreach ($chunks as $chunk) {
    $translatedChunk = $this->translateText($chunk, $lang); // uses curlRequest()
    $finalTranslated = array_merge($finalTranslated, $translatedChunk);
    $chars = array_sum(array_map(fn($c) => mb_strlen($c['text']), $chunk));
    $this->throttleByCharsPerMinute($chars);
}
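Neither helper's body is shown above; a minimal sketch, assuming the ~9,000-character limit mentioned in the text and a configurable characters-per-minute budget (the default values below are assumptions):
private function chunkTextArray(array $items, int $maxChars = 9000): array
{
    $chunks = [];
    $current = [];
    $size = 0;
    foreach ($items as $item) {
        $len = mb_strlen($item['text']);
        if ($current !== [] && $size + $len > $maxChars) {
            $chunks[] = $current; // close the current chunk and start a new one
            $current = [];
            $size = 0;
        }
        $current[] = $item;
        $size += $len;
    }
    if ($current !== []) {
        $chunks[] = $current;
    }
    return $chunks;
}

private function throttleByCharsPerMinute(int $chars, int $charsPerMinute = 30000): void
{
    // Sleep long enough that the characters just sent fit the per-minute budget.
    $seconds = (int) ceil($chars / $charsPerMinute * 60);
    if ($seconds > 0) {
        sleep($seconds);
    }
}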
translateText() maps responses back by index and updates cache:
foreach ($textsToTranslate as $i => &$original) {
    $original['translated'] = $translateData[$i]['translations'][0]['text'] ?? $original['text'];
    $this->setCached($toLang, $original['translated'], $original['text']);
}
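curlRequest() itself is not reproduced here; a hypothetical sketch of the Azure Translator v3 call it wraps could look like this (the signature and the $azureKey/$azureRegion properties are assumptions):
private function curlRequest(array $texts, string $toLang): array
{
    $url = 'https://api.cognitive.microsofttranslator.com/translate'
        . '?api-version=3.0&to=' . urlencode($toLang);
    $body = json_encode(array_map(fn($t) => ['Text' => $t], $texts));

    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST => true,
        CURLOPT_POSTFIELDS => $body,
        CURLOPT_HTTPHEADER => [
            'Content-Type: application/json',
            'Ocp-Apim-Subscription-Key: ' . $this->azureKey,       // assumed property
            'Ocp-Apim-Subscription-Region: ' . $this->azureRegion, // assumed property
        ],
    ]);
    $response = curl_exec($ch);
    curl_close($ch);

    // Azure returns one entry per input text:
    // [['translations' => [['text' => '...', 'to' => 'de']]], ...]
    return json_decode($response, true) ?: [];
}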
Re-assembling results in original order
Merge cached hits and fresh translations, aligned to original indices:
$finalBlock = $finalTranslated; // API results
$i = 0;
foreach ($keys as $k) {
    if (array_key_exists($k, $extracted)) {
        $finalTranslated[$k] = $extracted[$k];
    } else {
        $finalTranslated[$k] = $finalBlock[$i++];
    }
}
Bottom-up replacement by line ranges
Normalize EOLs, split into lines, then apply edits from bottom to top so that splicing a block never shifts the line numbers of blocks still waiting to be replaced:
$normalized = str_replace("\r\n", "\n", $file);
$lines = preg_split('/\R/u', $normalized);
foreach (array_reverse($finalTranslated) as $block) {
    $start = $block['start'];
    $end = $block['end'];
    $slice = implode("\n", array_slice($lines, $start - 1, $end - $start + 1));
    $replaced = $this->replace_last_literal($slice, $block['text'], $block['translated']);
    $replacedLines = explode("\n", $replaced);
    array_splice($lines, $start - 1, $end - $start + 1, $replacedLines);
}
return implode("\n", $lines);
Helper:
private function replace_last_literal(string $haystack, string $search, string $replace): string
{
    $pos = mb_strrpos($haystack, $search);
    if ($pos === false) {
        return $haystack;
    }
    return mb_substr($haystack, 0, $pos)
        . $replace
        . mb_substr($haystack, $pos + mb_strlen($search));
}
Using the last occurrence reduces the chance of touching earlier duplicates within the same block when multiple Text nodes share identical content.
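For example (illustrative strings only):
$this->replace_last_literal('docs and more docs', 'docs', 'Dokumente');
// → 'docs and more Dokumente' – only the final occurrence is replaced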
What remains untouched
- Code blocks (FencedCode, IndentedCode) and inline code (Code).
- URLs and link/image destinations; only human-readable labels and alt text are translated.
- Custom tag attributes; only inner text content is processed.
Edge cases & notes
- Start/end lines = 0: if a node’s ancestor doesn’t expose line info, start/end may be 0. Guard against negative indices when slicing (see the sketch after this list); CommonMark block nodes usually provide line numbers.
- Duplicate phrases in one range: we target the last match in the block. For precise targeting of multiple identical phrases, add column offsets.
- CRLF: input is normalized to LF; output is joined with \n.
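A minimal guard for the first note, placed at the top of the replacement loop shown earlier (the exact condition is an assumption):
if ($block['start'] < 1 || $block['end'] < $block['start']) {
    continue; // no reliable line info – leave this block untouched
}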
Safety checklist
- Gather only Text nodes (instanceof Text).
- Skip non-linguistic strings (/\p{L}/u).
- De-dupe via cache before sending to the provider.
- Batch by size and throttle by CPM.
- Replace bottom-up using captured line ranges.
- Persist caches after the run.
Related code paths
- Front matter: frontMatterParser(), translateFromMatter()
- PHP arrays: translateLangFiles(), generateSettingsTranslate(), makeContent()
- Azure calls: curlRequest(), translateText()