Auto-indexing inference configuration reference
This document details how a site administrator can supply a Lua script to customize the way Sourcegraph detects precise code intelligence indexing jobs from repository contents.
By default, Sourcegraph will attempt to infer index jobs for the following languages:
Go
Java
/Scala
/Kotlin
Python
Ruby
Rust
TypeScript
/JavaScript
Inference logic can be disabled or altered in the case when the target repositories do not conform to a pattern that the Sourcegraph default inference logic recognizes. Inference logic is controlled by a Lua override script that can be supplied in the UI under Admin > Code graph > Inference
.
Example
The Lua override script ultimately must return an auto-indexing config object. A configuration that neither disables or adds new recognizers does not change the default inference behavior.
LUAreturn require("sg.autoindex.config").new({ -- Empty configuration (see below for usage) })
To disable default behaviors, you can re-assign a recognizer value to false
. Each of the built-in recognizers are prefixed with sg.
(and are the only ones allowed to be).
LUAreturn require("sg.autoindex.config").new({ -- Disable default Python inference ["sg.python"] = false })
To add additional behaviors, you can create and register a new recognizer. A recognizer is an interface that requests some set of files from a repository, and returns a set of auto-indexing job configurations that could produce a precise code intelligence index.
A path recognizer is a concrete recognizer that advertises a set of path globs it is interested in, then invokes its generate
function with matching paths from a repository. In the following, all files matching Snek.module
(Snek.module
, proj/Snek.module
, proj/sub/Snek.module
, etc) are passed to a call to generate
(if non-empty). The generate function will then return a list of indexing job descriptions. The guide for auto-indexing jobs configuration gives detailed descriptions on the fields of this object.
The ordering of paths and limits are defined in the Ordering guarantees and limits section.
LUAlocal path = require("path") local pattern = require("sg.autoindex.patterns") local recognizer = require("sg.autoindex.recognizer") local snek_recognizer = recognizer.new_path_recognizer { patterns = { -- Look for Snek.module files -- (would match Snek.module; proj/Snek.module, proj/sub/Snek.module, etc) pattern.new_path_basename("Snek.module"), -- Ignore any files in test or vendor directories pattern.new_path_exclude( pattern.new_path_segment("test"), pattern.new_path_segment("vendor") ), }, -- Called with list of matching Snek.module files generate = function(_, paths) local jobs = {} for i = 1, #paths do -- Create indexing job description for each matching file table.insert(jobs, { indexer = "acme/snek:latest", -- Run this indexer... root = path.dirname(paths[i]), -- ...in this directory local_steps = {"snekpm install"}, -- Install dependencies indexer_args = {"snek", "index", ".", "--output", "index.scip"}, outfile = "index.scip", }) end return jobs end } return require("sg.autoindex.config").new({ -- Register new recognizer ["acme.snek"] = snek_recognizer, })
Available libraries
There are a number of specific and general-purpose Lua libraries made accessible via the built-in require
.
The type signatures for the functions below use the following syntax:
(A1, ..., An) -> R
: Function type with arguments of typeA1, ..., An
and return typeR
.array[A]
: Table with indexes 1 to N of elements of typeA
.table[K, V]
: Table with keys of typeK
and values of typeV
.A | B
: Union type (includes values of typeA
and typeB
).A...
: Variadic (0 or more values of A, without being wrapped in a table)."mystring"
: Literal string type with only"mystring"
as the allowed value.{K1: V1, K2: V2, ...}
: Heterogenous table (object) with a key of typeK1
mapping to a value of typeV1
etc.void
: no value returned from function
sg.autoindex.recognizer
This auto-indexing-specific library defines the following two functions.
-
new_path_recognizer
creates aRecognizer
from a config object containingpatterns
andgenerate
fields. See the example above for basic usage.- Type:
whereSHELL
({ -- List of patterns to match against paths in the repository "patterns": array[pattern], -- List of patterns to match against paths in the repository -- for getting contents (see contents_by_path below) "patterns_for_content": array[pattern], -- Callback function invoked with paths requested by patterns above -- for creating index jobs "generate": ( registration_api, -- List of paths obtained from 'patterns' and -- 'patterns_for_content' combined. paths: array[string], -- Table mapping paths to contents for paths matched by -- 'patterns_for_content' contents_by_path: table[string, string] ) -> array[index_job], }) -> recognizer
index_job
is an object with the following shape:For installing dependencies, if the indexer image contains the relevant package manager(s), then it is simpler to install dependencies usingSHELLindex_job = { -- Docker image for the indexer "indexer": string, -- Working directory for invoking the indexer "root": string, -- Preparatory steps to run before invoking the indexer -- such as installing dependencies "steps": array[{ -- Working directory for this step "root": string, -- Docker image to use for this step "image": string, -- List of commands to run inside the Docker image "commands": array[string] }], -- List of commands to run inside the indexer image at "root" -- before invoking the indexer, such as installing dependencies. "local_steps": array[string], -- Command-line invocation for the indexer "indexer_args": array[string], -- Path to the index generated by the indexer "outfile": string, -- Names of necessary environment variables. These are -- made accessible to steps, local_steps, and the -- indexer_args command. -- -- These are generally used for passing secrets. "requested_envvars": array[string], }
local_steps
. Otherwise, thesteps
field allows more customizability.
- Type:
-
new_fallback_recognizer
creates arecognizer
from an ordered list ofrecognizer
s. Eachrecognizer
is called sequentially, until one of them emits non-empty results.- Type:
(array[recognizer]) -> recognizer
- Type:
The registration_api
object has the following API:
register
which queues arecognizer
to be run at a later stage. This makes it possible to add more recognizers dynamically, such as based on whether specific configuration files were found or not.- Type:
(recognizer) -> void
- Type:
sg.autoindex.patterns
This auto-indexing-specific library defines the following four path pattern constructors.
new_path_literal(fullpath)
creates apattern
that matches an exact filepath.- Type:
(string) -> pattern
- Type:
new_path_segment(segment)
creates apattern
that matches a directory name.- Type:
(string) -> pattern
- Type:
new_path_basename(basename)
creates apattern
that matches a basename exactly.- Type:
(string) -> pattern
- Type:
new_path_extension(ext_no_leading_dot)
creates apattern
that matches files with a given extension.- Type:
(string) -> pattern
- Type:
This library also defines the following two pattern collection constructors.
new_path_combine(patterns)
creates a pattern collection object (to be used with recognizers) from the given set of pathpattern
s.- Type:
((pattern | array[pattern])...) -> pattern
- Type:
new_path_exclude(patterns)
creates a new inverted pattern collection object. Paths matching thesepattern
s are filtered out from the set of matching filepaths given to a recognizer'sgenerate
function.- Type:
((pattern | array[pattern])...) -> pattern
- Type:
path
This library defines the following utility functions:
ancestors(path)
returns a list{dirname(path), dirname(dirname(path)), ...}
. The last element in the list will be an empty string.- Type:
(string) -> array[string]
- Type:
basename(path)
returns the basename of the given path as defined by Go's filepath.Base.- Type:
(string) -> string
- Type:
dirname(path)
returns the dirname of the given path as defined by Go's filepath.Dir, except that it (1) returns an empty path instead of"."
if the path is empty and (2) removes a leading/
if present.- Type:
string -> string
- Type:
join(path1, path2)
returns a filepath created by joining the given path segments via filepath separator.- Type:
(string, string) -> string
- Type:
split(path)
is a convenience function that returnsdirname(path), basename(path)
.- Type:
(string) -> string, string
- Type:
json
This library defines the following two JSON utility functions:
encode(val)
returns a JSON-ified version of the given Lua object.decode(json)
returns a Lua table representation of the given JSON text.
fun
Lua Functional is a high-performance functional programming library accessible via local fun = require("fun")
. This library has a number of functional utilities to help make recognizer code a bit more expressive.
Ordering guarantees and limits
Sourcegraph enforces several limits to avoid inference timeouts and ever-growing auto-indexing queues. These limits apply for a single round of inference for a single repository, combined across all recognizers, including any implicitly included Sourcegraph recognizers.
Limit | Default value |
---|---|
The number of auto-indexing jobs inferred | 100 |
The number of total paths passed to the inference script's generate functions as the second argument paths | 500 |
The number of total paths with contents passed to the inference script's generate functions as the third argument contents_by_paths | 100 |
Maximum size limit for file contents, in bytes | 1 MiB |
Auto-indexing jobs and paths are first ranked based on the criteria described below. If the number of jobs and/or paths exceeds the limits above, lower ranked items are discarded.
-
For auto-indexing jobs, ranking is done based on the following:
- Descending order of indexer frequency (total number of inferred jobs with the same
indexer
field). - Ascending lexicographic ordering of
indexer
. - Descending order of number of path components for
root
. Shallower roots are preferrred over deeper ones as they are more likely to cover more code. - Ascending lexicographic ordering of
root
paths.
- Descending order of indexer frequency (total number of inferred jobs with the same
-
For paths, ranking happens in the following order:
- Paths for which the contents are requested are ranked higher.
- Paths with fewer components are ranked higher.
- Otherwise, lexicographic ordering of paths is used.