<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Postgresql on Guillaume Delré</title><link>https://guillaumedelre.github.io/tags/postgresql/</link><description>Recent content in Postgresql on Guillaume Delré</description><generator>Hugo</generator><language>en</language><lastBuildDate>Mon, 10 Feb 2025 00:00:00 +0000</lastBuildDate><atom:link href="https://guillaumedelre.github.io/tags/postgresql/index.xml" rel="self" type="application/rss+xml"/><item><title>PostgreSQL full-text search through Doctrine, without a line of raw SQL</title><link>https://guillaumedelre.github.io/2025/02/10/postgresql-full-text-search-through-doctrine-without-a-line-of-raw-sql/</link><pubDate>Mon, 10 Feb 2025 00:00:00 +0000</pubDate><guid>https://guillaumedelre.github.io/2025/02/10/postgresql-full-text-search-through-doctrine-without-a-line-of-raw-sql/</guid><description>How we layered custom DBAL types and DQL wrappers on top of postgresql-for-doctrine to bring PostgreSQL full-text search to a Symfony API Platform project.</description><content:encoded><![CDATA[<p>The search box on the media library returned results in 800 milliseconds on staging. Production had forty times more rows. The query plan showed a sequential scan: no index involved, no way to fix it with a standard B-tree. The product team also wanted multi-word search: type &ldquo;interview president&rdquo;, get results containing both words. A <code>LIKE</code> query with wildcards has no clean way to express that without multiple independent conditions, each requiring its own scan.</p>
<p>PostgreSQL has had built-in full-text search for over fifteen years. The platform was already on PostgreSQL. The catch: the project uses Doctrine ORM, and Doctrine doesn&rsquo;t natively know what a <code>tsvector</code> is.</p>
<p>A community library, <a href="https://github.com/martin-georgiev/postgresql-for-doctrine" target="_blank" rel="noopener noreferrer">postgresql-for-doctrine</a>, covers part of that gap. It registers basic DQL functions like <code>TO_TSQUERY</code>, <code>TO_TSVECTOR</code>, and the <code>@@</code> match operator as separate atomic pieces. The foundation was there. Three things still had to be built on top.</p>
<h2 id="the-type-doctrine-has-never-seen">The type Doctrine has never seen</h2>
<p><a href="https://www.postgresql.org/docs/current/datatype-textsearch.html" target="_blank" rel="noopener noreferrer">PostgreSQL&rsquo;s full-text search</a> is built around two types: <code>tsvector</code> (a pre-processed list of normalized tokens) and <code>tsquery</code> (a search expression). You maintain a <code>tsvector</code> column, index it with GIN, and query with the <code>@@</code> match operator.</p>
<p>Doctrine&rsquo;s DBAL ships no <code>tsvector</code> type. Declaring <code>#[ORM\Column(type: 'tsvector')]</code> without registering it first throws a <code>UnknownColumnTypeException</code>. The fix is a custom DBAL type:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-php" data-lang="php"><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">TsVector</span> <span style="color:#66d9ef">extends</span> <span style="color:#a6e22e">Type</span>
</span></span><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">final</span> <span style="color:#66d9ef">public</span> <span style="color:#66d9ef">const</span> <span style="color:#66d9ef">string</span> <span style="color:#a6e22e">DBAL_TYPE</span> <span style="color:#f92672">=</span> <span style="color:#e6db74">&#39;tsvector&#39;</span>;
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">public</span> <span style="color:#66d9ef">function</span> <span style="color:#a6e22e">getSQLDeclaration</span>(<span style="color:#66d9ef">array</span> $column, <span style="color:#a6e22e">AbstractPlatform</span> $platform)<span style="color:#f92672">:</span> <span style="color:#a6e22e">string</span>
</span></span><span style="display:flex;"><span>    {
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> <span style="color:#a6e22e">self</span><span style="color:#f92672">::</span><span style="color:#a6e22e">DBAL_TYPE</span>;
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">public</span> <span style="color:#66d9ef">function</span> <span style="color:#a6e22e">getName</span>()<span style="color:#f92672">:</span> <span style="color:#a6e22e">string</span>
</span></span><span style="display:flex;"><span>    {
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> <span style="color:#a6e22e">self</span><span style="color:#f92672">::</span><span style="color:#a6e22e">DBAL_TYPE</span>;
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">public</span> <span style="color:#66d9ef">function</span> <span style="color:#a6e22e">convertToDatabaseValueSQL</span>(<span style="color:#a6e22e">string</span> $sqlExpr, <span style="color:#a6e22e">AbstractPlatform</span> $platform)<span style="color:#f92672">:</span> <span style="color:#a6e22e">string</span>
</span></span><span style="display:flex;"><span>    {
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> <span style="color:#a6e22e">sprintf</span>(<span style="color:#e6db74">&#34;to_tsvector(&#39;simple&#39;, %s)&#34;</span>, $sqlExpr);
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">public</span> <span style="color:#66d9ef">function</span> <span style="color:#a6e22e">convertToDatabaseValue</span>(<span style="color:#a6e22e">mixed</span> $value, <span style="color:#a6e22e">AbstractPlatform</span> $platform)<span style="color:#f92672">:</span> <span style="color:#a6e22e">mixed</span>
</span></span><span style="display:flex;"><span>    {
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">if</span> (<span style="color:#a6e22e">is_array</span>($value) <span style="color:#f92672">&amp;&amp;</span> <span style="color:#a6e22e">isset</span>($value[<span style="color:#e6db74">&#39;data&#39;</span>])) {
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">return</span> $value[<span style="color:#e6db74">&#39;data&#39;</span>];
</span></span><span style="display:flex;"><span>        }
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> <span style="color:#a6e22e">is_string</span>($value) <span style="color:#f92672">?</span> $value <span style="color:#f92672">:</span> <span style="color:#66d9ef">null</span>;
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">public</span> <span style="color:#66d9ef">function</span> <span style="color:#a6e22e">getMappedDatabaseTypes</span>(<span style="color:#a6e22e">AbstractPlatform</span> $platform)<span style="color:#f92672">:</span> <span style="color:#66d9ef">array</span>
</span></span><span style="display:flex;"><span>    {
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> [<span style="color:#a6e22e">self</span><span style="color:#f92672">::</span><span style="color:#a6e22e">DBAL_TYPE</span>];
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p>The interesting method is <code>convertToDatabaseValueSQL()</code>. Doctrine calls it to wrap the SQL placeholder before the value reaches the database. The written value automatically becomes <code>to_tsvector('simple', ?)</code> at the DBAL boundary with no extra step needed on the calling side.</p>
<p>Register the type in <code>doctrine.yaml</code>, then map the column on the entity:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span><span style="color:#f92672">doctrine</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">dbal</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">types</span>:
</span></span><span style="display:flex;"><span>            <span style="color:#f92672">tsvector</span>: <span style="color:#ae81ff">App\Doctrine\DBAL\Types\TsVector</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-php" data-lang="php"><span style="display:flex;"><span><span style="color:#75715e">#[ORM\Column(type: &#39;tsvector&#39;, nullable: true)]
</span></span></span><span style="display:flex;"><span><span style="color:#66d9ef">protected</span> <span style="color:#f92672">?</span><span style="color:#a6e22e">string</span> $textSearch <span style="color:#f92672">=</span> <span style="color:#66d9ef">null</span>;
</span></span></code></pre></div><p>PHP-side, the value is a plain string. The conversion to a proper <code>tsvector</code> happens invisibly at the DBAL layer.</p>
<p>We used the <code>'simple'</code> dictionary, which tokenizes on whitespace and punctuation without language-specific stemming. The platform handles multiple languages, and French stemming rules would break Spanish. Simple is good enough for phonetics.</p>
<h2 id="keeping-the-column-current">Keeping the column current</h2>
<p>A <code>tsvector</code> column is derived data: it has to stay in sync with the source fields whenever the entity changes. A Doctrine event listener handles that:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-php" data-lang="php"><span style="display:flex;"><span><span style="color:#75715e">#[AsDoctrineListener(event: Events::prePersist)]
</span></span></span><span style="display:flex;"><span><span style="color:#75715e">#[AsDoctrineListener(event: Events::preUpdate)]
</span></span></span><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">MediaTsVectorSubscriber</span>
</span></span><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">public</span> <span style="color:#66d9ef">function</span> <span style="color:#a6e22e">prePersist</span>(<span style="color:#a6e22e">PrePersistEventArgs</span> $event)<span style="color:#f92672">:</span> <span style="color:#a6e22e">void</span>
</span></span><span style="display:flex;"><span>    {
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">if</span> (<span style="color:#f92672">!</span>$event<span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">getObject</span>() <span style="color:#a6e22e">instanceof</span> <span style="color:#a6e22e">Media</span>) {
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">return</span>;
</span></span><span style="display:flex;"><span>        }
</span></span><span style="display:flex;"><span>        $this<span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">updateTextSearch</span>($event<span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">getObject</span>());
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">public</span> <span style="color:#66d9ef">function</span> <span style="color:#a6e22e">preUpdate</span>(<span style="color:#a6e22e">PreUpdateEventArgs</span> $event)<span style="color:#f92672">:</span> <span style="color:#a6e22e">void</span>
</span></span><span style="display:flex;"><span>    {
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">if</span> (<span style="color:#f92672">!</span>$event<span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">getObject</span>() <span style="color:#a6e22e">instanceof</span> <span style="color:#a6e22e">Media</span>) {
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">return</span>;
</span></span><span style="display:flex;"><span>        }
</span></span><span style="display:flex;"><span>        $this<span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">updateTextSearch</span>($event<span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">getObject</span>());
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">private</span> <span style="color:#66d9ef">function</span> <span style="color:#a6e22e">updateTextSearch</span>(<span style="color:#a6e22e">Media</span> $entity)<span style="color:#f92672">:</span> <span style="color:#a6e22e">void</span>
</span></span><span style="display:flex;"><span>    {
</span></span><span style="display:flex;"><span>        $entity<span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">setTextSearch</span>(
</span></span><span style="display:flex;"><span>            <span style="color:#a6e22e">sprintf</span>(<span style="color:#e6db74">&#39;%s %s&#39;</span>, $entity<span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">getTitle</span>(), $entity<span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">getCaption</span>())
</span></span><span style="display:flex;"><span>        );
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p>Before every persist and update, the subscriber concatenates the fields that should be searchable into <code>textSearch</code>. Doctrine flushes the combined string, the DBAL type wraps it in <code>to_tsvector('simple', ...)</code>, and PostgreSQL stores the tokenized form.</p>
<p>One subtlety: the PHP-side value is <code>&quot;title caption&quot;</code>, not the actual tsvector output. The database shows <code>'caption' 'title'</code> (sorted tokens), but the entity holds a plain string. That&rsquo;s expected: the conversion is a DBAL responsibility, not a PHP one. It can be confusing to debug until you remember where the boundary is.</p>
<h2 id="extending-dql-with-fts-operators">Extending DQL with FTS operators</h2>
<p>Doctrine&rsquo;s DQL covers common SQL operations, but anything PostgreSQL-specific is out of scope. That&rsquo;s where <code>postgresql-for-doctrine</code> starts: it registers <code>TO_TSQUERY</code>, <code>TO_TSVECTOR</code>, and <code>TSMATCH</code> as individual DQL functions. Writing a full-text query in DQL without it would mean dropping to native SQL entirely.</p>
<p>The library&rsquo;s functions are atomic, though. Each maps to one SQL call. Expressing a full match check in DQL looks like <code>TSMATCH(o.textSearch, TO_TSQUERY(:term))</code>. Readable enough, but the team wanted something more compact: a single DQL function that encodes both the match operator and the query type, including <code>websearch_to_tsquery</code>, which <code>postgresql-for-doctrine</code> didn&rsquo;t ship.</p>
<p>The solution is <a href="https://www.doctrine-project.org/projects/doctrine-orm/en/latest/cookbook/dql-user-defined-functions.html" target="_blank" rel="noopener noreferrer">custom DQL functions</a> via <code>FunctionNode</code>. You parse the DQL syntax, then emit SQL. All FTS functions share the same two-argument signature, so an abstract base class handles parsing:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-php" data-lang="php"><span style="display:flex;"><span><span style="color:#66d9ef">abstract</span> <span style="color:#66d9ef">class</span> <span style="color:#a6e22e">TsFunction</span> <span style="color:#66d9ef">extends</span> <span style="color:#a6e22e">FunctionNode</span>
</span></span><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">public</span> <span style="color:#a6e22e">PathExpression</span><span style="color:#f92672">|</span><span style="color:#a6e22e">Node</span><span style="color:#f92672">|</span><span style="color:#66d9ef">null</span> $ftsField <span style="color:#f92672">=</span> <span style="color:#66d9ef">null</span>;
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">public</span> <span style="color:#a6e22e">PathExpression</span><span style="color:#f92672">|</span><span style="color:#a6e22e">Node</span><span style="color:#f92672">|</span><span style="color:#66d9ef">null</span> $queryString <span style="color:#f92672">=</span> <span style="color:#66d9ef">null</span>;
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">public</span> <span style="color:#66d9ef">function</span> <span style="color:#a6e22e">parse</span>(<span style="color:#a6e22e">Parser</span> $parser)<span style="color:#f92672">:</span> <span style="color:#a6e22e">void</span>
</span></span><span style="display:flex;"><span>    {
</span></span><span style="display:flex;"><span>        $parser<span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">match</span>(<span style="color:#a6e22e">TokenType</span><span style="color:#f92672">::</span><span style="color:#a6e22e">T_IDENTIFIER</span>);
</span></span><span style="display:flex;"><span>        $parser<span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">match</span>(<span style="color:#a6e22e">TokenType</span><span style="color:#f92672">::</span><span style="color:#a6e22e">T_OPEN_PARENTHESIS</span>);
</span></span><span style="display:flex;"><span>        $this<span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">ftsField</span> <span style="color:#f92672">=</span> $parser<span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">StringPrimary</span>();
</span></span><span style="display:flex;"><span>        $parser<span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">match</span>(<span style="color:#a6e22e">TokenType</span><span style="color:#f92672">::</span><span style="color:#a6e22e">T_COMMA</span>);
</span></span><span style="display:flex;"><span>        $this<span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">queryString</span> <span style="color:#f92672">=</span> $parser<span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">StringPrimary</span>();
</span></span><span style="display:flex;"><span>        $parser<span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">match</span>(<span style="color:#a6e22e">TokenType</span><span style="color:#f92672">::</span><span style="color:#a6e22e">T_CLOSE_PARENTHESIS</span>);
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p>Each concrete class implements <code>getSql()</code> to emit its PostgreSQL expression:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-php" data-lang="php"><span style="display:flex;"><span><span style="color:#75715e">// e.textSearch @@ websearch_to_tsquery(&#39;simple&#39;, :term)
</span></span></span><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">TsWebsearchQueryFunction</span> <span style="color:#66d9ef">extends</span> <span style="color:#a6e22e">TsFunction</span>
</span></span><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">public</span> <span style="color:#66d9ef">function</span> <span style="color:#a6e22e">getSql</span>(<span style="color:#a6e22e">SqlWalker</span> $sqlWalker)<span style="color:#f92672">:</span> <span style="color:#a6e22e">string</span>
</span></span><span style="display:flex;"><span>    {
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> $this<span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">ftsField</span><span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">dispatch</span>($sqlWalker)
</span></span><span style="display:flex;"><span>            <span style="color:#f92672">.</span><span style="color:#e6db74">&#34; @@ websearch_to_tsquery(&#39;simple&#39;, &#34;</span>
</span></span><span style="display:flex;"><span>            <span style="color:#f92672">.</span>$this<span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">queryString</span><span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">dispatch</span>($sqlWalker)<span style="color:#f92672">.</span><span style="color:#e6db74">&#39;)&#39;</span>;
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">// ts_rank(e.textSearch, to_tsquery(:term)) for relevance ordering
</span></span></span><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">TsRankFunction</span> <span style="color:#66d9ef">extends</span> <span style="color:#a6e22e">TsFunction</span>
</span></span><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">public</span> <span style="color:#66d9ef">function</span> <span style="color:#a6e22e">getSql</span>(<span style="color:#a6e22e">SqlWalker</span> $sqlWalker)<span style="color:#f92672">:</span> <span style="color:#a6e22e">string</span>
</span></span><span style="display:flex;"><span>    {
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> <span style="color:#e6db74">&#39;ts_rank(&#39;</span>
</span></span><span style="display:flex;"><span>            <span style="color:#f92672">.</span>$this<span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">ftsField</span><span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">dispatch</span>($sqlWalker)
</span></span><span style="display:flex;"><span>            <span style="color:#f92672">.</span><span style="color:#e6db74">&#39;, to_tsquery(&#39;</span><span style="color:#f92672">.</span>$this<span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">queryString</span><span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">dispatch</span>($sqlWalker)<span style="color:#f92672">.</span><span style="color:#e6db74">&#39;))&#39;</span>;
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span><span style="color:#f92672">doctrine</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">orm</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">entity_managers</span>:
</span></span><span style="display:flex;"><span>            <span style="color:#f92672">default</span>:
</span></span><span style="display:flex;"><span>                <span style="color:#f92672">dql</span>:
</span></span><span style="display:flex;"><span>                    <span style="color:#f92672">string_functions</span>:
</span></span><span style="display:flex;"><span>                        <span style="color:#f92672">tswebsearchquery</span>: <span style="color:#ae81ff">App\Doctrine\ORM\Query\AST\Functions\TsWebsearchQueryFunction</span>
</span></span><span style="display:flex;"><span>                        <span style="color:#f92672">tsrank</span>: <span style="color:#ae81ff">App\Doctrine\ORM\Query\AST\Functions\TsRankFunction</span>
</span></span><span style="display:flex;"><span>                        <span style="color:#f92672">tsquery</span>: <span style="color:#ae81ff">App\Doctrine\ORM\Query\AST\Functions\TsQueryFunction</span>
</span></span><span style="display:flex;"><span>                        <span style="color:#f92672">tsplainquery</span>: <span style="color:#ae81ff">App\Doctrine\ORM\Query\AST\Functions\TsPlainQueryFunction</span>
</span></span></code></pre></div><p><code>websearch_to_tsquery</code> is the right choice for user-facing search: spaces become AND, quoted strings become phrases, <code>-word</code> excludes a term. No need to teach users to type <code>interview &amp; president</code>. It was added in PostgreSQL 11. On older versions, <code>plainto_tsquery</code> is the closest equivalent.</p>
<h2 id="the-api-platform-filter-and-the-gin-index">The API Platform filter and the GIN index</h2>
<p>With the DQL functions registered, the API Platform filter is straightforward. A custom <code>AbstractFilter</code> calls the DQL function directly in the <code>QueryBuilder</code>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-php" data-lang="php"><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">TextSearchFilter</span> <span style="color:#66d9ef">extends</span> <span style="color:#a6e22e">AbstractFilter</span>
</span></span><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">protected</span> <span style="color:#66d9ef">function</span> <span style="color:#a6e22e">filterProperty</span>(
</span></span><span style="display:flex;"><span>        <span style="color:#a6e22e">string</span> $property,
</span></span><span style="display:flex;"><span>        $value,
</span></span><span style="display:flex;"><span>        <span style="color:#a6e22e">QueryBuilder</span> $queryBuilder,
</span></span><span style="display:flex;"><span>        <span style="color:#a6e22e">QueryNameGeneratorInterface</span> $queryNameGenerator,
</span></span><span style="display:flex;"><span>        <span style="color:#a6e22e">string</span> $resourceClass,
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">?</span><span style="color:#a6e22e">Operation</span> $operation <span style="color:#f92672">=</span> <span style="color:#66d9ef">null</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">array</span> $context <span style="color:#f92672">=</span> []
</span></span><span style="display:flex;"><span>    )<span style="color:#f92672">:</span> <span style="color:#a6e22e">void</span> {
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">if</span> (<span style="color:#e6db74">&#39;textSearch&#39;</span> <span style="color:#f92672">!==</span> $property <span style="color:#f92672">||</span> <span style="color:#66d9ef">empty</span>($value)) {
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">return</span>;
</span></span><span style="display:flex;"><span>        }
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        $queryBuilder
</span></span><span style="display:flex;"><span>            <span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">andWhere</span>(<span style="color:#e6db74">&#39;tswebsearchquery(o.textSearch, :value) = true&#39;</span>)
</span></span><span style="display:flex;"><span>            <span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">setParameter</span>(<span style="color:#e6db74">&#39;:value&#39;</span>, $value);
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">public</span> <span style="color:#66d9ef">function</span> <span style="color:#a6e22e">getDescription</span>(<span style="color:#a6e22e">string</span> $resourceClass)<span style="color:#f92672">:</span> <span style="color:#66d9ef">array</span>
</span></span><span style="display:flex;"><span>    {
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> [];
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p>Apply it on the entity alongside the index declaration:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-php" data-lang="php"><span style="display:flex;"><span><span style="color:#75715e">#[ORM\Index(
</span></span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">columns</span><span style="color:#f92672">:</span> [<span style="color:#e6db74">&#39;text_search&#39;</span>],
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">name</span><span style="color:#f92672">:</span> <span style="color:#e6db74">&#39;media_text_search_idx_gin&#39;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">options</span><span style="color:#f92672">:</span> [<span style="color:#e6db74">&#39;USING&#39;</span> <span style="color:#f92672">=&gt;</span> <span style="color:#e6db74">&#39;gin (text_search)&#39;</span>]
</span></span><span style="display:flex;"><span>)]
</span></span><span style="display:flex;"><span><span style="color:#75715e">#[ApiFilter(TextSearchFilter::class, properties: [&#39;textSearch&#39; =&gt; &#39;partial&#39;])]
</span></span></span><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">Media</span>
</span></span><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span>    <span style="color:#75715e">// ...
</span></span></span><span style="display:flex;"><span>    <span style="color:#75715e">#[ORM\Column(type: &#39;tsvector&#39;, nullable: true)]
</span></span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">protected</span> <span style="color:#f92672">?</span><span style="color:#a6e22e">string</span> $textSearch <span style="color:#f92672">=</span> <span style="color:#66d9ef">null</span>;
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p>The <code>USING gin</code> option is non-negotiable. A standard B-tree index on a <code>tsvector</code> column is useless: PostgreSQL can&rsquo;t use it for <code>@@</code> queries. GIN (Generalized Inverted Index) works differently: it indexes each token individually, so lookups by any token are <code>O(log n)</code> rather than <code>O(n)</code>. Without it, you&rsquo;ve built a fast-looking system that still does a full table scan.</p>
<p>A <code>GET /media?textSearch=interview+president</code> now hits the GIN index and returns in single-digit milliseconds regardless of table size.</p>
<h2 id="what-the-split-actually-looked-like">What the split actually looked like</h2>
<p>The library covered the low-level atomic functions. The custom code covered the gaps: a <code>tsvector</code> DBAL type the library didn&rsquo;t provide, convenience DQL wrappers that combined <code>@@</code> and <code>websearch_to_tsquery</code> into a single call, and the application-specific glue connecting it all to Doctrine&rsquo;s event system and API Platform. Nothing needed to drop to a native query.</p>
<p>The split is worth noting in general: <code>postgresql-for-doctrine</code> gives you the atomic PostgreSQL building blocks, but you still need to compose them into something the rest of the codebase can use without thinking about it. The <code>FunctionNode</code> pattern and the <code>convertToDatabaseValueSQL()</code> hook are the two extension points that make that composition clean. Both are worth knowing about, regardless of what library you start from.</p>
]]></content:encoded></item><item><title>Revision pruning with window functions and logarithms, when DQL wasn't enough</title><link>https://guillaumedelre.github.io/2020/09/27/revision-pruning-with-window-functions-and-logarithms-when-dql-wasnt-enough/</link><pubDate>Sun, 27 Sep 2020 00:00:00 +0000</pubDate><guid>https://guillaumedelre.github.io/2020/09/27/revision-pruning-with-window-functions-and-logarithms-when-dql-wasnt-enough/</guid><description>How a logarithmic score and ROW_NUMBER() OVER PARTITION BY solved runaway revision table growth after DQL hit its limits.</description><content:encoded><![CDATA[<p>Every content update on the platform creates a revision. That&rsquo;s by design: editors need a history they can roll back to, and the platform needs an audit trail. What nobody anticipated was the rate. Some articles go through forty saves in a single afternoon. A high-traffic piece accumulates hundreds of revisions over its lifetime. After a few months, the revision table had several million rows.</p>
<p>Deleting them naively wasn&rsquo;t an option. &ldquo;Keep the last 50&rdquo; loses all historical context for articles that haven&rsquo;t been touched in a year. &ldquo;Keep one per day&rdquo; loses all the detail for content that&rsquo;s actively being edited. What we needed was a distribution that matched how revisions are actually used: dense coverage for recent history, sparse coverage for old history.</p>
<p>That&rsquo;s a logarithmic distribution. And building it required raw SQL.</p>
<h2 id="why-simple-strategies-fail">Why simple strategies fail</h2>
<p>The appeal of a fixed window is obvious: keep the N most recent revisions and delete the rest. It&rsquo;s one line of SQL and zero math. The problem is that it treats a revision from yesterday and a revision from three years ago as equally valuable, which they aren&rsquo;t. An editor who opens an article from 2017 doesn&rsquo;t need its last 50 versions; they might need one per quarter. An article that shipped this morning might need every save from the past hour.</p>
<p>A time-based strategy (one revision per calendar day) has the opposite problem: it&rsquo;s too aggressive for active content. If an article gets 30 saves between 09:00 and 10:00, all of them except one disappear. That&rsquo;s not history, that&rsquo;s erasure.</p>
<p>Neither strategy can express &ldquo;keep more detail for recent content, less for old content.&rdquo; That relationship is logarithmic.</p>
<h2 id="the-scoring-idea">The scoring idea</h2>
<p>The algorithm assigns each revision a score based on its age, then keeps only one revision per score bucket. The score formula produces high, widely-spaced values for recent revisions and small, clustered values for old ones.</p>
<p>The core expression, simplified, looks like this:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-sql" data-lang="sql"><span style="display:flex;"><span>(
</span></span><span style="display:flex;"><span>  ln( <span style="color:#66d9ef">EXTRACT</span>(epoch <span style="color:#66d9ef">FROM</span> (now() <span style="color:#f92672">-</span> created_at)) )
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">/</span>
</span></span><span style="display:flex;"><span>  ( <span style="color:#66d9ef">EXTRACT</span>(epoch <span style="color:#66d9ef">FROM</span> (now() <span style="color:#f92672">-</span> created_at)) <span style="color:#f92672">/</span> <span style="color:#ae81ff">6000</span> )
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span><span style="color:#f92672">*</span> ( <span style="color:#ae81ff">1</span> <span style="color:#f92672">/</span> (<span style="color:#66d9ef">EXTRACT</span>(epoch <span style="color:#66d9ef">FROM</span> (now() <span style="color:#f92672">-</span> created_at)) <span style="color:#f92672">/</span> <span style="color:#ae81ff">60</span> <span style="color:#f92672">/</span> <span style="color:#ae81ff">1440</span>) )
</span></span><span style="display:flex;"><span><span style="color:#f92672">*</span> <span style="color:#ae81ff">1000</span>
</span></span></code></pre></div><p>Let <code>s</code> be the age in seconds. The formula is roughly <code>ln(s) / s * C</code>, where both the logarithm in the numerator and <code>s</code> in the denominator make the result decrease rapidly as <code>s</code> grows.</p>
<p>Cast to an integer, the effect is this: a revision saved 10 minutes ago might score 8432, one saved 11 minutes ago scores 8431. They&rsquo;re in different buckets. A revision from six months ago scores 2, one from eight months ago also scores 2. Same bucket. The window function then picks the most recent revision from each bucket and discards the rest.</p>
<p>The result is automatic: recent saves are all kept because each has a distinct score; old saves are thinned because many share the same score.</p>
<h2 id="the-dql-attempt-that-didnt-ship">The DQL attempt that didn&rsquo;t ship</h2>
<p>Window functions aren&rsquo;t part of DQL. Doctrine&rsquo;s query language has no syntax for <code>OVER</code>, <code>PARTITION BY</code>, or <code>ROW_NUMBER()</code>. Before going to raw SQL, the team tried to add them.</p>
<p>The <code>FunctionNode</code> approach works for single SQL functions, as we&rsquo;d already seen with FTS. A <code>RowNumber</code> node emitting <code>ROW_NUMBER()</code> is trivial:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-php" data-lang="php"><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">RowNumber</span> <span style="color:#66d9ef">extends</span> <span style="color:#a6e22e">FunctionNode</span>
</span></span><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">public</span> <span style="color:#66d9ef">function</span> <span style="color:#a6e22e">getSql</span>(<span style="color:#a6e22e">SqlWalker</span> $sqlWalker)<span style="color:#f92672">:</span> <span style="color:#a6e22e">string</span>
</span></span><span style="display:flex;"><span>    {
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> <span style="color:#e6db74">&#39;ROW_NUMBER()&#39;</span>;
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p>The harder part is <code>OVER(PARTITION BY ... ORDER BY ...)</code>. An <code>Over</code> function node was drafted, with a custom <code>PartitionByClause</code> AST node to handle the <code>PARTITION BY</code> clause:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-php" data-lang="php"><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">Over</span> <span style="color:#66d9ef">extends</span> <span style="color:#a6e22e">FunctionNode</span>
</span></span><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">protected</span> <span style="color:#f92672">?</span><span style="color:#a6e22e">PartitionByClause</span> $partitionByClause <span style="color:#f92672">=</span> <span style="color:#66d9ef">null</span>;
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">protected</span> <span style="color:#f92672">?</span><span style="color:#a6e22e">OrderByClause</span> $orderByClause <span style="color:#f92672">=</span> <span style="color:#66d9ef">null</span>;
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">public</span> <span style="color:#66d9ef">function</span> <span style="color:#a6e22e">getSql</span>(<span style="color:#a6e22e">SqlWalker</span> $sqlWalker)<span style="color:#f92672">:</span> <span style="color:#a6e22e">string</span>
</span></span><span style="display:flex;"><span>    {
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> <span style="color:#e6db74">&#39;OVER(&#39;</span>
</span></span><span style="display:flex;"><span>            <span style="color:#f92672">.</span>($this<span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">partitionByClause</span>
</span></span><span style="display:flex;"><span>                <span style="color:#f92672">?</span> $this<span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">partitionByClause</span><span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">dispatch</span>($sqlWalker)
</span></span><span style="display:flex;"><span>                <span style="color:#f92672">:</span> ($this<span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">orderByClause</span>
</span></span><span style="display:flex;"><span>                    <span style="color:#f92672">?</span> $this<span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">orderByClause</span><span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">dispatch</span>($sqlWalker)
</span></span><span style="display:flex;"><span>                    <span style="color:#f92672">:</span> <span style="color:#e6db74">&#39;&#39;</span>))
</span></span><span style="display:flex;"><span>            <span style="color:#f92672">.</span><span style="color:#e6db74">&#39;)&#39;</span>;
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p>It was never finished. The classes shipped marked <code>@deprecated</code> and &ldquo;NOT TESTED YET&rdquo;. The issue is composability: DQL&rsquo;s <code>FunctionNode</code> works cleanly for functions that appear in WHERE clauses or SELECT expressions. A window function like <code>ROW_NUMBER() OVER (PARTITION BY ...)</code> is a different structure: it appears in a SELECT position, modifies the surrounding query semantics, and requires the parser to handle <code>PARTITION BY</code> as an extension to DQL&rsquo;s grammar. Making that robust enough to trust in production is a significant investment. Going to DBAL and writing the SQL directly took an afternoon.</p>
<h2 id="the-query-layer-by-layer">The query, layer by layer</h2>
<p>The final implementation is three nested queries:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-sql" data-lang="sql"><span style="display:flex;"><span><span style="color:#66d9ef">DELETE</span> <span style="color:#66d9ef">FROM</span> revision
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">WHERE</span> iri <span style="color:#f92672">=</span> <span style="color:#f92672">?</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">AND</span> id <span style="color:#66d9ef">NOT</span> <span style="color:#66d9ef">IN</span> (
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">SELECT</span> id <span style="color:#66d9ef">FROM</span> (
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">SELECT</span>
</span></span><span style="display:flex;"><span>            row_number() OVER (
</span></span><span style="display:flex;"><span>                PARTITION <span style="color:#66d9ef">BY</span> num, iri
</span></span><span style="display:flex;"><span>                <span style="color:#66d9ef">ORDER</span> <span style="color:#66d9ef">BY</span> num <span style="color:#66d9ef">DESC</span>, created_at <span style="color:#66d9ef">DESC</span>
</span></span><span style="display:flex;"><span>            ) <span style="color:#66d9ef">AS</span> lines,
</span></span><span style="display:flex;"><span>            <span style="color:#f92672">*</span>
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">FROM</span> (
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">SELECT</span>
</span></span><span style="display:flex;"><span>                (
</span></span><span style="display:flex;"><span>                    ( ln( <span style="color:#66d9ef">EXTRACT</span>(epoch <span style="color:#66d9ef">FROM</span> (now() <span style="color:#f92672">-</span> created_at)) )
</span></span><span style="display:flex;"><span>                      <span style="color:#f92672">/</span> ( <span style="color:#66d9ef">EXTRACT</span>(epoch <span style="color:#66d9ef">FROM</span> (now() <span style="color:#f92672">-</span> created_at)) <span style="color:#f92672">/</span> <span style="color:#ae81ff">6000</span> ) )
</span></span><span style="display:flex;"><span>                    <span style="color:#f92672">*</span> ( <span style="color:#ae81ff">1</span> <span style="color:#f92672">/</span> (<span style="color:#66d9ef">EXTRACT</span>(epoch <span style="color:#66d9ef">FROM</span> (now() <span style="color:#f92672">-</span> created_at)) <span style="color:#f92672">/</span> <span style="color:#ae81ff">60</span> <span style="color:#f92672">/</span> <span style="color:#ae81ff">1440</span>) )
</span></span><span style="display:flex;"><span>                    <span style="color:#f92672">*</span> <span style="color:#ae81ff">1000</span>
</span></span><span style="display:flex;"><span>                )::numeric::integer <span style="color:#66d9ef">AS</span> num,
</span></span><span style="display:flex;"><span>                <span style="color:#f92672">*</span>
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">FROM</span> revision
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">WHERE</span> iri <span style="color:#f92672">=</span> <span style="color:#f92672">?</span>
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">ORDER</span> <span style="color:#66d9ef">BY</span> created_at <span style="color:#66d9ef">DESC</span>
</span></span><span style="display:flex;"><span>        ) <span style="color:#66d9ef">AS</span> lst
</span></span><span style="display:flex;"><span>    ) <span style="color:#66d9ef">AS</span> rst
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">WHERE</span> lines <span style="color:#f92672">=</span> <span style="color:#ae81ff">1</span>
</span></span><span style="display:flex;"><span>);
</span></span></code></pre></div><p><strong>Inner query:</strong> computes <code>num</code>, the integer score, for every revision of the given IRI. Rows are sorted by <code>created_at DESC</code> at this stage.</p>
<p><strong>Middle query:</strong> runs <code>ROW_NUMBER() OVER (PARTITION BY num, iri ORDER BY num DESC, created_at DESC)</code>. Within each score bucket (<code>num</code>), revisions are numbered starting from 1 in descending age order. The most recent revision in each bucket gets <code>lines = 1</code>.</p>
<p><strong>Outer filter:</strong> keeps only the <code>lines = 1</code> rows, one revision per score bucket.</p>
<p><strong>DELETE:</strong> removes every revision for this IRI that isn&rsquo;t in the kept set.</p>
<p>The <code>PARTITION BY num, iri</code> is redundant on the IRI (the whole query is already filtered to one IRI), but makes the intent explicit and keeps the logic correct if the query is ever reused in a broader context.</p>
<p>The method is called from a companion query that identifies which IRIs have accumulated more than a threshold of revisions:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-php" data-lang="php"><span style="display:flex;"><span><span style="color:#66d9ef">public</span> <span style="color:#66d9ef">function</span> <span style="color:#a6e22e">getIrisWithMoreRevisionThan</span>(<span style="color:#a6e22e">int</span> $maxRevisionsCount, <span style="color:#a6e22e">int</span> $limit <span style="color:#f92672">=</span> <span style="color:#ae81ff">0</span>, <span style="color:#f92672">?</span><span style="color:#a6e22e">int</span> $retencyDay <span style="color:#f92672">=</span> <span style="color:#66d9ef">null</span>)<span style="color:#f92672">:</span> <span style="color:#66d9ef">array</span>
</span></span><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span>    $queryBuilder <span style="color:#f92672">=</span> $this
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">createQueryBuilder</span>(<span style="color:#e6db74">&#39;revision&#39;</span>)
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">select</span>(<span style="color:#e6db74">&#39;revision.iri&#39;</span>)
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">groupBy</span>(<span style="color:#e6db74">&#39;revision.iri&#39;</span>)
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">having</span>(<span style="color:#e6db74">&#39;COUNT(1) &gt; :maxRevisions&#39;</span>)
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">orderBy</span>(<span style="color:#e6db74">&#39;COUNT(1)&#39;</span>, <span style="color:#a6e22e">Order</span><span style="color:#f92672">::</span><span style="color:#a6e22e">Descending</span><span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">value</span>)
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">setParameter</span>(<span style="color:#e6db74">&#39;maxRevisions&#39;</span>, $maxRevisionsCount);
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e">// ...
</span></span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> <span style="color:#a6e22e">array_column</span>($queryBuilder<span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">getQuery</span>()<span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">getResult</span>(), <span style="color:#e6db74">&#39;iri&#39;</span>);
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p>The two methods run together in a scheduled cleanup: find the IRIs over the threshold, prune each one.</p>
<h2 id="wiring-it-to-a-scheduled-command">Wiring it to a scheduled command</h2>
<p>The pruning query doesn&rsquo;t run in a request. It runs behind a Symfony command, called on a schedule.</p>
<p>The command takes a few options to control how aggressively it runs:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-php" data-lang="php"><span style="display:flex;"><span><span style="color:#75715e">#[AsCommand(&#39;app:purge:revision&#39;, &#39;Remove useless revisions&#39;)]
</span></span></span><span style="display:flex;"><span><span style="color:#66d9ef">final</span> <span style="color:#66d9ef">class</span> <span style="color:#a6e22e">PurgeRevisionCommand</span> <span style="color:#66d9ef">extends</span> <span style="color:#a6e22e">Command</span>
</span></span><span style="display:flex;"><span>{
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">protected</span> <span style="color:#66d9ef">function</span> <span style="color:#a6e22e">configure</span>()<span style="color:#f92672">:</span> <span style="color:#a6e22e">void</span>
</span></span><span style="display:flex;"><span>    {
</span></span><span style="display:flex;"><span>        $this
</span></span><span style="display:flex;"><span>            <span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">addOption</span>(<span style="color:#e6db74">&#39;max-revisions&#39;</span>, <span style="color:#e6db74">&#39;m&#39;</span>, <span style="color:#a6e22e">InputOption</span><span style="color:#f92672">::</span><span style="color:#a6e22e">VALUE_REQUIRED</span>,
</span></span><span style="display:flex;"><span>                <span style="color:#e6db74">&#39;Revision threshold above which an IRI gets pruned&#39;</span>, <span style="color:#ae81ff">30</span>)
</span></span><span style="display:flex;"><span>            <span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">addOption</span>(<span style="color:#e6db74">&#39;limit&#39;</span>, <span style="color:#e6db74">&#39;l&#39;</span>, <span style="color:#a6e22e">InputOption</span><span style="color:#f92672">::</span><span style="color:#a6e22e">VALUE_REQUIRED</span>,
</span></span><span style="display:flex;"><span>                <span style="color:#e6db74">&#39;Max number of IRIs to process per run&#39;</span>)
</span></span><span style="display:flex;"><span>            <span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">addOption</span>(<span style="color:#e6db74">&#39;delay&#39;</span>, <span style="color:#e6db74">&#39;w&#39;</span>, <span style="color:#a6e22e">InputOption</span><span style="color:#f92672">::</span><span style="color:#a6e22e">VALUE_REQUIRED</span>,
</span></span><span style="display:flex;"><span>                <span style="color:#e6db74">&#39;Delay in seconds between each IRI&#39;</span>)
</span></span><span style="display:flex;"><span>            <span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">addOption</span>(<span style="color:#e6db74">&#39;retencyDay&#39;</span>, <span style="color:#e6db74">&#39;r&#39;</span>, <span style="color:#a6e22e">InputOption</span><span style="color:#f92672">::</span><span style="color:#a6e22e">VALUE_OPTIONAL</span>,
</span></span><span style="display:flex;"><span>                <span style="color:#e6db74">&#39;Only process IRIs whose last revision is older than N days&#39;</span>);
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">protected</span> <span style="color:#66d9ef">function</span> <span style="color:#a6e22e">execute</span>(<span style="color:#a6e22e">InputInterface</span> $input, <span style="color:#a6e22e">OutputInterface</span> $output)<span style="color:#f92672">:</span> <span style="color:#a6e22e">int</span>
</span></span><span style="display:flex;"><span>    {
</span></span><span style="display:flex;"><span>        $iris <span style="color:#f92672">=</span> $this<span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">revisionRepository</span><span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">getIrisWithMoreRevisionThan</span>(
</span></span><span style="display:flex;"><span>            (<span style="color:#a6e22e">int</span>) $input<span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">getOption</span>(<span style="color:#e6db74">&#39;max-revisions&#39;</span>),
</span></span><span style="display:flex;"><span>            (<span style="color:#a6e22e">int</span>) $input<span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">getOption</span>(<span style="color:#e6db74">&#39;limit&#39;</span>),
</span></span><span style="display:flex;"><span>            (<span style="color:#a6e22e">int</span>) $input<span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">getOption</span>(<span style="color:#e6db74">&#39;retencyDay&#39;</span>),
</span></span><span style="display:flex;"><span>        );
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">foreach</span> ($iris <span style="color:#66d9ef">as</span> $iri) {
</span></span><span style="display:flex;"><span>            $totalDeleted <span style="color:#f92672">+=</span> $this<span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">revisionRepository</span><span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">deleteOldRevisionForIri</span>($iri);
</span></span><span style="display:flex;"><span>            <span style="color:#a6e22e">usleep</span>((<span style="color:#a6e22e">int</span>) $input<span style="color:#f92672">-&gt;</span><span style="color:#a6e22e">getOption</span>(<span style="color:#e6db74">&#39;delay&#39;</span>) <span style="color:#f92672">*</span> <span style="color:#ae81ff">1_000_000</span>);
</span></span><span style="display:flex;"><span>        }
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> <span style="color:#a6e22e">Command</span><span style="color:#f92672">::</span><span style="color:#a6e22e">SUCCESS</span>;
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p>The <code>--delay</code> option is worth noting: on a busy database, hammering a hundred <code>DELETE</code> statements back-to-back can cause lock contention. A small sleep between iterations keeps the purge from competing with production traffic.</p>
<p>The command runs behind two crontab entries with different thresholds:</p>
<pre tabindex="0"><code># Hourly: keep 30 revisions per IRI, process 100 IRIs per run
0 * * * * php bin/console app:purge:revision --max-revisions 30 --limit 100

# Nightly: for content untouched for a year, keep only 3
0 0 * * * php bin/console app:purge:revision --max-revisions 3 --limit 100 --retencyDay 365
</code></pre><p>The two-level strategy matters. The hourly job keeps 30 revisions per IRI, which is a reasonable ceiling for actively-edited content. The nightly job targets only IRIs not updated in over a year and keeps just 3. An article that hasn&rsquo;t moved in twelve months doesn&rsquo;t need thirty versions in its history.</p>
<h2 id="what-it-looks-like-in-practice">What it looks like in practice</h2>
<p>An article saved 200 times will typically keep 20 to 30 revisions after pruning: most of the recent saves, a handful from last month, one or two from each quarter of the previous year. The exact count depends on the age distribution of the saves, not on an arbitrary cap.</p>
<p>An article last edited two years ago might end up with 5 or 6 revisions. Recent edits are all there; the old history is compressed but not gone.</p>
<p>It&rsquo;s not a perfect history. It&rsquo;s a useful one.</p>
<h2 id="the-line-between-dql-and-raw-sql">The line between DQL and raw SQL</h2>
<p>The window function attempt isn&rsquo;t a failure worth hiding. It&rsquo;s a useful data point: <code>FunctionNode</code> works well for scalar functions in WHERE and SELECT positions, but composing a full <code>ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)</code> expression in DQL is harder than it looks. The grammar extension, the AST nodes, the SQL walker integration: it&rsquo;s a non-trivial amount of code for something that native SQL handles in three lines.</p>
<p>The practical boundary is roughly this: if a PostgreSQL feature maps to a function call with fixed arity, custom DQL works. If it requires new clause syntax (window frames, CTEs, lateral joins), native DBAL is usually the better trade-off.</p>
]]></content:encoded></item></channel></rss>