Pachyderm Client¶
-
class
pachypy.client.PachydermClient(host=None, port=None, add_image_digests=True, build_images=True, pipeline_spec_files=None, pipeline_spec_transformer=None, pachd_timezone='utc', user_timezone=None)[source]¶ Bases:
objectPachyderm client.
- Parameters
host (
Optional[str]) – Hostname or IP address to reach pachd. Attempts to get this from the environment variable PACHD_ADDRESS or~/.pachyderm/config.jsonif not set.port (
Optional[int]) – Port on which pachd is listening. Defaults to 30650.add_image_digests (
bool) – Whether to add a digest to the image field in pipeline specs to to force Pachyderm to pull the latest version from the container registry.build_images (
bool) – Whether to build Docker images for pipelines and push them to the container registry. Only applies to pipelines that have the dockerfile or dockerfile_path directive set within the transform field.pipeline_spec_files (
Union[str,Path,Iterable[Union[str,Path]],None]) – Glob pattern(s) to pipeline spec files in YAML or JSON format.pipeline_spec_transformer (
Optional[Callable[[dict],dict]]) – Function that takes a pipeline spec as dictionary as the only argument and returns a transformed pipeline spec.pachd_timezone (
Union[str,tzinfo]) – Timezone of the system that pachd is running on. This determines when cron inputs trigger. Defaults to UTC.user_timezone (
Union[str,tzinfo,None]) – Timezone of the user which is used to localize timestamps. Attempts to get this from the environment variable TZ or system settings if not specified.
-
property
docker_client¶ - Return type
DockerClient
-
property
docker_registry¶ - Return type
-
property
amazon_ecr¶ - Return type
-
property
pachd_timezone¶
-
list_files(repos, glob='**', branch='master', commit=None, files_only=True)[source]¶ Get list of files as pandas DataFrame.
- Parameters
repos (
Union[str,Iterable[str],None]) – Name pattern to filter repos to return files for. Supports shell-style wildcards.glob (
str) – Glob pattern to filter files returned.branch (
Optional[str]) – Branch to list files for. Defaults to ‘master’.commit (
Optional[str]) – Commit ID to return files for. Overrides branch if specified. If specified, the repos parameter must only match the repo this commit ID belongs to.files_only (
bool) – Whether to return only files or include directories.
- Return type
DataFrame
-
list_datums(job)[source]¶ Get list of datums for a job as pandas DataFrame.
- Parameters
job (
str) – Job ID to return datums for.- Return type
DataFrame
-
get_logs(pipelines='*', datum=None, last_job_only=True, user_only=False, master=False, tail=0)[source]¶ Get logs for jobs.
- Parameters
pipelines (
Union[str,Iterable[str],None]) – Pattern to filter logs by pipeline name. Supports shell-style wildcards.datum (
Optional[str]) – If specified logs are filtered to a datum ID.last_job_only (
bool) – Whether to only show/return logs for the last job of each pipeline. Ignored if tail is specified.user_only (
bool) – Whether to only return logs generated by user code.master (
bool) – Whether to include logs from the master process.tail (
int) – Lines of recent logs to retrieve. This is applied before filtering according to user_only, so less lines may be returned.
- Return type
DataFrame
-
inspect_datum(job, datum)[source]¶ Returns info about a datum.
Only works if stats tracking is enabled for the pipeline (enable_stats) and raises a PachydermError otherwise.
-
commit(repo, branch='master', parent_commit=None, flush=False)[source]¶ Returns a context manager for a new commit.
The context manager automatically starts and finishes the commit. If an exception occurs, the commit is not finished, but deleted.
- Parameters
repo (
str) – Name of repository.branch (
Optional[str]) – Branch in repository. When the commit is started on a branch, the previous head of the branch is used as the parent of the commit. You may pass None in which case the new commit will have no parent (unless parent_commit is specified) and will initially appear empty.parent_commit (
Optional[str]) – ID of parent commit. Upon creation the new commit will appear identical to the parent commit. Data can safely be added to the new commit without affecting the contents of the parent commit.flush (
bool) – If true, blocks until all jobs triggered by this commit have finished when the context is exited (only when leaving the with statement).
- Return type
- Returns
Commit object allowing operations inside the commit.
-
delete_branch(repo, branch)[source]¶ Deletes a branch, but leaves the commits intact.
The commits can still be accessed via their commit IDs.
-
get_file(repo, path, branch='master', commit=None, destination='.')[source]¶ Retrieves a file from a repository in PFS and writes it to destination.
- Parameters
repo (
str) – Repository to retrieve file from.path (
str) – Path within repository in PFS to retrieve file from.commit (
Optional[str]) – Commit to retrieve file from. Overrides branch if specified.destination (
Union[str,Path,Binaryio]) – Local path or binary file object to write file to. If it is a directory the file’s basename will be appended.
- Return type
None
-
get_file_content(repo, path, branch='master', commit=None, encoding=None)[source]¶ Retrieves a file from a repository in PFS and returns its content.
- Parameters
repo (
str) – Repository to retrieve file from.path (
str) – Path within repository in PFS to retrieve file from.commit (
Optional[str]) – Commit to retrieve file from. Overrides branch if specified.encoding (
Optional[str]) – If specified, the file content will be returned as a decoded string. May be set to ‘auto’ to try to infer the encoding automatically.
- Return type
- Returns
File contents as bytes, or as string if encoding is set.
-
get_files(repo, glob='**', branch='master', commit=None, path='/', destination='.', ignore_existing=False, verbose=False)[source]¶ Retrieves multiple files from a repository in PFS and writes them to a local directory.
- Parameters
repo (
str) – Repository to retrieve files from.glob (
str) – Glob pattern to filter files retrieved from path. Is interpreted relative to path.commit (
Optional[str]) – Commit to retrieve files from. Overrides branch if specified.path (
str) – Path within repository in PFS to retrieve files from.destination (
Union[str,Path]) – Local path to write files to. Must be a directory. Will be created if it doesn’t exist.ignore_existing (
bool) – Whether to ignore or overwrite files that already exist locally.verbose (
bool) – Whether to log which files were downloaded.
- Return type
None
-
create_pipelines(pipelines='*', pipeline_specs=None, recreate=False, build_options=None)[source]¶ Creates or recreates pipelines.
- Parameters
pipelines (
Union[str,Iterable[str],None]) – Pattern to filter pipeline specs by pipeline name. Supports shell-style wildcards.pipeline_specs (
Optional[List[dict]]) – Pipeline specifications. These are read from files (see property pipeline_spec_files) if not specified.recreate (
bool) – Whether to delete existing pipelines before recreating them.build_options (
Optional[dict]) – Keyword arguments to pass into the Docker build() method when building images. A useful example isdict(nocache=True).
- Return type
PipelineChanges- Returns
Created and deleted pipeline names.
-
update_pipelines(pipelines='*', pipeline_specs=None, recreate=False, reprocess=False, build_options=None)[source]¶ Updates or recreates pipelines.
Non-existing pipelines will be created.
- Parameters
pipelines (
Union[str,Iterable[str],None]) – Pattern to filter pipeline specs by pipeline name. Supports shell-style wildcards.pipeline_specs (
Optional[List[dict]]) – Pipeline specifications. These are read from files (see property pipeline_spec_files) if not specified.recreate (
bool) – Whether to delete existing pipelines before recreating them.reprocess (
bool) – Whether to reprocess datums with updated pipeline.build_options (
Optional[dict]) – Keyword arguments to pass into the Docker build() method when building images. A useful example isdict(nocache=True).
- Return type
PipelineChanges- Returns
Updated, created and deleted pipeline names.
-
trigger_pipeline(pipeline, flush=False)[source]¶ Triggers a pipeline with a cron input by committing a timestamp file into its cron input repository.
-
delete_job(job)[source]¶ Deletes a job.
- Parameters
job (
str) – ID of job to delete.- Return type
None
-
read_pipeline_specs(pipelines='*')[source]¶ Read pipelines specifications from YAML or JSON files.
The spec files are defined through the pipeline_spec_files property, which can be a list of file paths or glob patterns.
File names are expected to be a prefix of pipeline names defined in them or to also match the given pipelines pattern.
-
class
pachypy.client.PachydermCommit(client, repo, branch='master', parent_commit=None, flush=False)[source]¶ Bases:
pachypy.adapter.PachydermCommitAdapterRepresents a commit in Pachyderm.
Objects of this class are typically created via
commit()and used as a context manager, which automatically starts and finishes a commit.-
create_branch(*args, **kwargs)¶
-
delete(*args, **kwargs)¶
-
delete_file(*args, **kwargs)¶
-
finish(*args, **kwargs)¶
-
flush(*args, **kwargs)¶
-
put_file_bytes(*args, **kwargs)¶
-
put_file_url(*args, **kwargs)¶
-
start(*args, **kwargs)¶
-
put_file(file, path=None, delimiter='none', target_file_datums=0, target_file_bytes=0)[source]¶ Uploads a file or the content of a file-like to the given path in PFS.
- Parameters
file (
Union[str,Path,Io[AnyStr]]) – A local file path or a file-like object.path (
Optional[str]) – PFS path to upload file to. Defaults to the root directory. If path ends with a slash (/), the basename of file will be appended. If file is a file-like object, path needs to be the full PFS path including filename.delimiter (
str) – Causes data to be broken up into separate files with path as a prefix. Possible values are ‘none’ (default), ‘json’, ‘line’, ‘sql’ and ‘csv’.target_file_datum – Specifies the target number of datums in each written file. It may be lower if data does not split evenly, but will never be higher, unless the value is 0 (default).
target_file_bytes (
int) – Specifies the target number of bytes in each written file. Files may have more or fewer bytes than the target.
- Return type
None
-
put_files(files, path='/', keep_structure=False, base_path='.')[source]¶ Uploads one or multiple files defined by files to the given path in PFS.
- Parameters
files (
Union[str,Path,Iterable[Union[str,Path]],None]) – Glob pattern(s) to files that should be uploaded.path (
str) – PFS path to upload files to. Must be a directory. Will be created if it doesn’t exist. Defaults to the root directory.keep_structure (
bool) – If true, the local directory structure is recreated in PFS. Use in conjunction with base_path.base_path (
Union[str,Path]) – If keep_structure is true, PFS paths will be constructed relative to path and using the local file structure relative to base_path. This defaults to the current working directory.
- Return type
- Returns
Local paths of uploaded files.
-
Container Registry Adapters¶
-
class
pachypy.registry.DockerRegistryAdapter(docker_client)[source]¶ Bases:
objectDocker registry adapter.
-
class
pachypy.registry.AmazonECRAdapter(docker_client)[source]¶ Bases:
pachypy.registry.DockerRegistryAdapterAmazon Elastic Container Registry (ECR) adapter using boto3.
Getting image digests via boto3 is faster than using the DockerRegistryAdapter, which otherwise works fine for ECR too. When pushing images, this class also handles logging in to ECR by getting an authorization token using boto3 and passing it to Docker to log in.
- Parameters
aws_access_key_id – AWS access key ID.
aws_secret_access_key – AWS secret access key.
-
property
ecr_client¶