Pachyderm Client¶

class pachypy.client.PachydermClient(host=None, port=None, add_image_digests=True, build_images=True, pipeline_spec_files=None, pipeline_spec_transformer=None, pachd_timezone='utc', user_timezone=None)[source]¶

Bases: object

Pachyderm client.

Parameters

host (Optional[str]) – Hostname or IP address to reach pachd. Attempts to get this from the environment variable PACHD_ADDRESS or ~/.pachyderm/config.json if not set.
port (Optional[int]) – Port on which pachd is listening. Defaults to 30650.
add_image_digests (bool) – Whether to add a digest to the image field in pipeline specs to to force Pachyderm to pull the latest version from the container registry.
build_images (bool) – Whether to build Docker images for pipelines and push them to the container registry. Only applies to pipelines that have the dockerfile or dockerfile_path directive set within the transform field.
pipeline_spec_files (Union[str, Path, Iterable[Union[str, Path]], None]) – Glob pattern(s) to pipeline spec files in YAML or JSON format.
pipeline_spec_transformer (Optional[Callable[[dict], dict]]) – Function that takes a pipeline spec as dictionary as the only argument and returns a transformed pipeline spec.
pachd_timezone (Union[str, tzinfo]) – Timezone of the system that pachd is running on. This determines when cron inputs trigger. Defaults to UTC.
user_timezone (Union[str, tzinfo, None]) – Timezone of the user which is used to localize timestamps. Attempts to get this from the environment variable TZ or system settings if not specified.

property logger¶

Return type: Logger

property pipeline_spec_files¶

Return type: List[str]

property docker_client¶

Return type: DockerClient

property docker_registry¶

Return type: DockerRegistryAdapter

property amazon_ecr¶

Return type: AmazonECRAdapter

property user_timezone¶

Return type: tzinfo

property pachd_timezone¶

property pachd_version¶

Return type: str

list_repos(repos='*')[source]¶

Get list of repos as pandas DataFrame.

Parameters: repos (Union[str, Iterable[str], None]) – Name pattern to filter repos returned. Supports shell-style wildcards.
Return type: DataFrame

list_commits(repos, n=10)[source]¶

Get list of commits as pandas DataFrame.

Parameters

repos (Union[str, Iterable[str], None]) – Name pattern to filter repos to return commits for. Supports shell-style wildcards.
n (int) – Maximum number of commits returned per repo.

Return type

DataFrame

list_files(repos, glob='**', branch='master', commit=None, files_only=True)[source]¶

Get list of files as pandas DataFrame.

Parameters

repos (Union[str, Iterable[str], None]) – Name pattern to filter repos to return files for. Supports shell-style wildcards.
glob (str) – Glob pattern to filter files returned.
branch (Optional[str]) – Branch to list files for. Defaults to ‘master’.
commit (Optional[str]) – Commit ID to return files for. Overrides branch if specified. If specified, the repos parameter must only match the repo this commit ID belongs to.
files_only (bool) – Whether to return only files or include directories.

Return type

DataFrame

list_pipelines(pipelines='*')[source]¶

Get list of pipelines as pandas DataFrame.

Parameters: pipelines (Union[str, Iterable[str], None]) – Name pattern to filter pipelines returned. Supports shell-style wildcards.
Return type: DataFrame

list_jobs(pipelines='*', n=20, hide_null_jobs=True)[source]¶

Get list of jobs as pandas DataFrame.

Parameters

pipelines (Union[str, Iterable[str], None]) – Pattern to filter jobs by pipeline name. Supports shell-style wildcards.
n (int) – Maximum number of jobs returned.
hide_null_jobs (bool) – If true, empty jobs with no data to process will be filtered out.

Return type

DataFrame

list_datums(job)[source]¶

Get list of datums for a job as pandas DataFrame.

Parameters: job (str) – Job ID to return datums for.
Return type: DataFrame

get_logs(pipelines='*', datum=None, last_job_only=True, user_only=False, master=False, tail=0)[source]¶

Get logs for jobs.

Parameters

pipelines (Union[str, Iterable[str], None]) – Pattern to filter logs by pipeline name. Supports shell-style wildcards.
datum (Optional[str]) – If specified logs are filtered to a datum ID.
last_job_only (bool) – Whether to only show/return logs for the last job of each pipeline. Ignored if tail is specified.
user_only (bool) – Whether to only return logs generated by user code.
master (bool) – Whether to include logs from the master process.
tail (int) – Lines of recent logs to retrieve. This is applied before filtering according to user_only, so less lines may be returned.

Return type

DataFrame

inspect_repo(repo)[source]¶

Returns info about a repo.

Parameters: repo (str) – Name of repo to get info for.
Return type: Dict[str, Any]

inspect_pipeline(pipeline)[source]¶

Returns info about a pipeline.

Parameters: pipeline (str) – Name of pipeline to get info for.
Return type: Dict[str, Any]

inspect_job(job)[source]¶

Returns info about a job.

Parameters: job (str) – ID of job to get info for.
Return type: Dict[str, Any]

inspect_datum(job, datum)[source]¶

Returns info about a datum.

Only works if stats tracking is enabled for the pipeline (enable_stats) and raises a PachydermError otherwise.

Parameters

job (str) – ID of job the datum belongs to.
datum (str) – ID of datum to get info for.

Return type

Dict[str, Any]

create_repos(repos)[source]¶

Create one or multiple new repositories in pfs.

Parameters: repos (Union[str, Iterable[str]]) – Name of new repository or iterable of names.
Return type: List[str]
Returns: Created repos.

delete_repos(repos)[source]¶

Delete repositories.

Parameters: repos (Union[str, Iterable[str], None]) – Pattern to filter repos to delete. Supports shell-style wildcards.
Return type: List[str]
Returns: Deleted repos.

commit(repo, branch='master', parent_commit=None, flush=False)[source]¶

Returns a context manager for a new commit.

The context manager automatically starts and finishes the commit. If an exception occurs, the commit is not finished, but deleted.

Parameters

repo (str) – Name of repository.
branch (Optional[str]) – Branch in repository. When the commit is started on a branch, the previous head of the branch is used as the parent of the commit. You may pass None in which case the new commit will have no parent (unless parent_commit is specified) and will initially appear empty.
parent_commit (Optional[str]) – ID of parent commit. Upon creation the new commit will appear identical to the parent commit. Data can safely be added to the new commit without affecting the contents of the parent commit.
flush (bool) – If true, blocks until all jobs triggered by this commit have finished when the context is exited (only when leaving the with statement).

Return type

PachydermCommit

Returns

Commit object allowing operations inside the commit.

delete_commit(repo, commit)[source]¶

Deletes a commit.

Parameters

repo (str) – Name of repository.
commit (str) – ID of commit to delete.

Return type

None

create_branch(repo, commit, branch)[source]¶

Sets a commit as a branch.

Parameters

repo (str) – Name of repository.
commit (str) – ID of commit to set as branch.
branch (str) – Name of the branch.

Return type

None

delete_branch(repo, branch)[source]¶

Deletes a branch, but leaves the commits intact.

The commits can still be accessed via their commit IDs.

Parameters

repo (str) – Name of repository.
branch (str) – Name of branch to delete.

Return type

None

get_file(repo, path, branch='master', commit=None, destination='.')[source]¶

Retrieves a file from a repository in PFS and writes it to destination.

Parameters

repo (str) – Repository to retrieve file from.
path (str) – Path within repository in PFS to retrieve file from.
branch (Optional[str]) – Branch to retrieve file from.
commit (Optional[str]) – Commit to retrieve file from. Overrides branch if specified.
destination (Union[str, Path, Binaryio]) – Local path or binary file object to write file to. If it is a directory the file’s basename will be appended.

Return type

None

get_file_content(repo, path, branch='master', commit=None, encoding=None)[source]¶

Retrieves a file from a repository in PFS and returns its content.

Parameters

repo (str) – Repository to retrieve file from.
path (str) – Path within repository in PFS to retrieve file from.
branch (Optional[str]) – Branch to retrieve file from.
commit (Optional[str]) – Commit to retrieve file from. Overrides branch if specified.
encoding (Optional[str]) – If specified, the file content will be returned as a decoded string. May be set to ‘auto’ to try to infer the encoding automatically.

Return type

Union[bytes, str]

Returns

File contents as bytes, or as string if encoding is set.

get_files(repo, glob='**', branch='master', commit=None, path='/', destination='.', ignore_existing=False, verbose=False)[source]¶

Retrieves multiple files from a repository in PFS and writes them to a local directory.

Parameters

repo (str) – Repository to retrieve files from.
glob (str) – Glob pattern to filter files retrieved from path. Is interpreted relative to path.
branch (Optional[str]) – Branch to retrieve files from.
commit (Optional[str]) – Commit to retrieve files from. Overrides branch if specified.
path (str) – Path within repository in PFS to retrieve files from.
destination (Union[str, Path]) – Local path to write files to. Must be a directory. Will be created if it doesn’t exist.
ignore_existing (bool) – Whether to ignore or overwrite files that already exist locally.
verbose (bool) – Whether to log which files were downloaded.

Return type

None

create_pipelines(pipelines='*', pipeline_specs=None, recreate=False, build_options=None)[source]¶

Creates or recreates pipelines.

Parameters

pipelines (Union[str, Iterable[str], None]) – Pattern to filter pipeline specs by pipeline name. Supports shell-style wildcards.
pipeline_specs (Optional[List[dict]]) – Pipeline specifications. These are read from files (see property pipeline_spec_files) if not specified.
recreate (bool) – Whether to delete existing pipelines before recreating them.
build_options (Optional[dict]) – Keyword arguments to pass into the Docker build() method when building images. A useful example is dict(nocache=True).

Return type

PipelineChanges

Returns

Created and deleted pipeline names.

update_pipelines(pipelines='*', pipeline_specs=None, recreate=False, reprocess=False, build_options=None)[source]¶

Updates or recreates pipelines.

Non-existing pipelines will be created.

Parameters

pipelines (Union[str, Iterable[str], None]) – Pattern to filter pipeline specs by pipeline name. Supports shell-style wildcards.
pipeline_specs (Optional[List[dict]]) – Pipeline specifications. These are read from files (see property pipeline_spec_files) if not specified.
recreate (bool) – Whether to delete existing pipelines before recreating them.
reprocess (bool) – Whether to reprocess datums with updated pipeline.
build_options (Optional[dict]) – Keyword arguments to pass into the Docker build() method when building images. A useful example is dict(nocache=True).

Return type

PipelineChanges

Returns

Updated, created and deleted pipeline names.

delete_pipelines(pipelines)[source]¶

Deletes existing pipelines.

Parameters: pipelines (Union[str, Iterable[str], None]) – Pattern to filter pipelines by name. Supports shell-style wildcards.
Return type: List[str]
Returns: Names of deleted pipelines.

start_pipelines(pipelines)[source]¶

Restarts stopped pipelines.

Parameters: pipelines (Union[str, Iterable[str], None]) – Pattern to filter pipelines by name. Supports shell-style wildcards.
Return type: List[str]
Returns: Names of started pipelines.

stop_pipelines(pipelines)[source]¶

Stops running pipelines.

Parameters: pipelines (Union[str, Iterable[str], None]) – Pattern to filter pipelines by name. Supports shell-style wildcards.
Return type: List[str]
Returns: Names of stopped pipelines.

run_pipelines(pipelines)[source]¶

Rerun the latest jobs for pipelines.

Parameters: pipelines (Union[str, Iterable[str], None]) – Pattern to filter pipelines by name. Supports shell-style wildcards.
Return type: List[str]
Returns: Names of ran pipelines.

trigger_pipeline(pipeline, flush=False)[source]¶

Triggers a pipeline with a cron input by committing a timestamp file into its cron input repository.

Parameters

pipeline (str) – Name of pipeline to trigger.
flush (bool) – If true, blocks until all triggered jobs have finished.

Return type

None

delete_job(job)[source]¶

Deletes a job.

Parameters: job (str) – ID of job to delete.
Return type: None

read_pipeline_specs(pipelines='*')[source]¶

Read pipelines specifications from YAML or JSON files.

The spec files are defined through the pipeline_spec_files property, which can be a list of file paths or glob patterns.

File names are expected to be a prefix of pipeline names defined in them or to also match the given pipelines pattern.

Parameters: pipelines (Union[str, Iterable[str], None]) – Pattern to filter pipeline specs by pipeline name. Supports shell-style wildcards.
Return type: List[dict]

clear_cache()[source]¶

Return type: None

class pachypy.client.PachydermCommit(client, repo, branch='master', parent_commit=None, flush=False)[source]¶

Bases: pachypy.adapter.PachydermCommitAdapter

Represents a commit in Pachyderm.

Objects of this class are typically created via commit() and used as a context manager, which automatically starts and finishes a commit.

property commit¶

Commit ID.

Return type: Optional[str]

create_branch(*args, **kwargs)¶

delete(*args, **kwargs)¶

delete_file(*args, **kwargs)¶

finish(*args, **kwargs)¶

property finished¶

Whether the commit is finished.

Return type: bool

flush(*args, **kwargs)¶

put_file_bytes(*args, **kwargs)¶

put_file_url(*args, **kwargs)¶

start(*args, **kwargs)¶

put_file(file, path=None, delimiter='none', target_file_datums=0, target_file_bytes=0)[source]¶

Uploads a file or the content of a file-like to the given path in PFS.

Parameters

file (Union[str, Path, Io[AnyStr]]) – A local file path or a file-like object.
path (Optional[str]) – PFS path to upload file to. Defaults to the root directory. If path ends with a slash (/), the basename of file will be appended. If file is a file-like object, path needs to be the full PFS path including filename.
delimiter (str) – Causes data to be broken up into separate files with path as a prefix. Possible values are ‘none’ (default), ‘json’, ‘line’, ‘sql’ and ‘csv’.
target_file_datum – Specifies the target number of datums in each written file. It may be lower if data does not split evenly, but will never be higher, unless the value is 0 (default).
target_file_bytes (int) – Specifies the target number of bytes in each written file. Files may have more or fewer bytes than the target.

Return type

None

put_files(files, path='/', keep_structure=False, base_path='.')[source]¶

Uploads one or multiple files defined by files to the given path in PFS.

Parameters

files (Union[str, Path, Iterable[Union[str, Path]], None]) – Glob pattern(s) to files that should be uploaded.
path (str) – PFS path to upload files to. Must be a directory. Will be created if it doesn’t exist. Defaults to the root directory.
keep_structure (bool) – If true, the local directory structure is recreated in PFS. Use in conjunction with base_path.
base_path (Union[str, Path]) – If keep_structure is true, PFS paths will be constructed relative to path and using the local file structure relative to base_path. This defaults to the current working directory.

Return type

List[str]

Returns

Local paths of uploaded files.

put_timestamp_file(overwrite=False)[source]¶

Uploads a timestamp file to simulate a cron tick.

Parameters: overwrite (bool) – Whether to overwrite existing timestamp files (True) or to just add a new timestamp file (False). Only applies to pachd >=1.8.6.
Return type: None

delete_files(glob)[source]¶

Deletes all files matching glob in the current commit.

Parameters: glob (str) – Glob pattern to filter files to delete.
Return type: List[str]
Returns: PFS paths of deleted files.

Container Registry Adapters¶

class pachypy.registry.DockerRegistryAdapter(docker_client)[source]¶

Bases: object

Docker registry adapter.

get_image_digest(image)[source]¶

Return type: str

push_image(image)[source]¶

Return type: str

class pachypy.registry.AmazonECRAdapter(docker_client)[source]¶

Bases: pachypy.registry.DockerRegistryAdapter

Amazon Elastic Container Registry (ECR) adapter using boto3.

Getting image digests via boto3 is faster than using the DockerRegistryAdapter, which otherwise works fine for ECR too. When pushing images, this class also handles logging in to ECR by getting an authorization token using boto3 and passing it to Docker to log in.

Parameters

aws_access_key_id – AWS access key ID.
aws_secret_access_key – AWS secret access key.

property ecr_client¶

set_credentials(aws_access_key_id, aws_secret_access_key)[source]¶

login(registry)[source]¶

Return type: None

get_image_digest(image)[source]¶

Return type: str

push_image(image)[source]¶

Return type: str