[PATCH 0/9] Add python-spacy-curated-transformers

  • Open
  • quality assurance status badge
Details
One participant
  • Nicolas Graves
Owner
unassigned
Submitted by
Nicolas Graves
Severity
normal
Blocked by
N
N
Nicolas Graves wrote on 15 Sep 10:11 +0200
(address . guix-patches@gnu.org)(address . ngraves@ngraves.fr)
20240915081449.5244-1-ngraves@ngraves.fr
This patch series builds upon 73094, 73106, 73109, 73115 to add all
necessary packages to be able to create packages from spacy models,
which I will probably propose in a dedicated channel.

Nicolas Graves (9):
gnu: Add python-azure-storage-file-datalake.
gnu: Add python-cloudpathlib.
gnu: Add python-weasel.
gnu: python-thinc: Update to 8.2.2.
gnu: python-spacy: Update to 3.7.5.
gnu: Add python-cutlery.
gnu: Add python-curated-transformers.
gnu: Add python-curated-tokenizers.
gnu: Add python-spacy-curated-transformers.

gnu/packages/machine-learning.scm | 186 ++++++++++++++++++++++++++++--
gnu/packages/python-web.scm | 46 ++++++++
gnu/packages/python-xyz.scm | 47 ++++++++
3 files changed, 267 insertions(+), 12 deletions(-)

--
2.46.0
N
N
Nicolas Graves wrote on 15 Sep 10:57 +0200
[PATCH 1/9] gnu: Add python-azure-storage-file-datalake.
(address . 73266@debbugs.gnu.org)(address . ngraves@ngraves.fr)
20240915085720.13323-1-ngraves@ngraves.fr
* gnu/packages/python-web.scm (python-azure-storage-file-datalake): New variable.

Change-Id: Iba59fc31822b95361558f1ef62a2a40be0865080
---
gnu/packages/python-web.scm | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)

Toggle diff (35 lines)
diff --git a/gnu/packages/python-web.scm b/gnu/packages/python-web.scm
index 8b29f1cd93..a8510dbcc1 100644
--- a/gnu/packages/python-web.scm
+++ b/gnu/packages/python-web.scm
@@ -8072,6 +8072,28 @@ (define-public python-azure-core
Python.")
(license license:expat)))
+(define-public python-azure-storage-file-datalake
+ (package
+ (name "python-azure-storage-file-datalake")
+ (version "12.16.0")
+ (source
+ (origin
+ (method url-fetch)
+ (uri (pypi-uri "azure-storage-file-datalake" version))
+ (sha256
+ (base32 "0cq302vpnhb2fwlm6zxhpypqs97gn93cp27vhkpn50a39q75i19i"))))
+ (build-system pyproject-build-system)
+ (propagated-inputs (list python-azure-core
+ python-azure-storage-blob
+ python-isodate
+ python-typing-extensions))
+ (home-page "https://github.com/Azure/azure-sdk-for-python")
+ (synopsis "Microsoft Azure File DataLake Storage Client Library for Python")
+ (description
+ "This package provides the Microsoft Azure File @code{DataLake} Storage
+Client Library for Python.")
+ (license license:expat)))
+
(define-public python-azure-storage-blob
(package
(name "python-azure-storage-blob")
--
2.46.0
N
N
Nicolas Graves wrote on 15 Sep 10:57 +0200
[PATCH 3/9] gnu: Add python-weasel.
(address . 73266@debbugs.gnu.org)(address . ngraves@ngraves.fr)
20240915085720.13323-3-ngraves@ngraves.fr
* gnu/packages/python-xyz.scm (python-weasel): New variable.

Change-Id: If9be639d55503ec5dd1c0867708ea63746a5887b
---
gnu/packages/python-xyz.scm | 47 +++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)

Toggle diff (60 lines)
diff --git a/gnu/packages/python-xyz.scm b/gnu/packages/python-xyz.scm
index 42a678ceba..38c877d481 100644
--- a/gnu/packages/python-xyz.scm
+++ b/gnu/packages/python-xyz.scm
@@ -30718,6 +30718,53 @@ (define-public python-watchgod
operating systems and an elegant approach to concurrency using threading.")
(license license:expat)))
+(define-public python-weasel
+ (package
+ (name "python-weasel")
+ (version "0.4.1")
+ (source
+ (origin
+ (method url-fetch)
+ (uri (pypi-uri "weasel" version))
+ (sha256
+ (base32 "1aas113r29y6yxrmdlsw80rj8w4kgw1jhfjw9rsgc4rf0w7j3g5a"))))
+ (build-system pyproject-build-system)
+ (arguments ; These tests require network.
+ (list #:test-flags
+ `(list "-k" ,(string-append
+ "not test_project_git_dir_asset"
+ " and not test_project_git_file_asset"
+ " and not test_project_assets"
+ " and not test_project_clone"
+ " and not test_remote"))))
+ (propagated-inputs (list python-cloudpathlib
+ python-confection
+ python-packaging
+ python-pydantic
+ python-requests
+ python-smart-open
+ python-srsly
+ python-typer
+ python-wasabi))
+ (native-inputs (list git-minimal
+ python-pytest))
+ (home-page "https://github.com/explosion/weasel")
+ (synopsis "Small workflow system in Python")
+ (description "This package provides management and sharing end-to-end
+workflows for different use cases and domains, and orchestrate training,
+packaging and serving custom pipelines. It provides the following
+functionality:
+@itemize
+@item clone a pre-defined project template,
+@item adjust it to fit particular needs,
+@item load in data,
+@item train a pipeline,
+@item export it as a Python package,
+@item upload outputs to a remote storage,
+@item share results with a team.
+@end itemize")
+ (license license:expat)))
+
(define-public python-wget
(package
(name "python-wget")
--
2.46.0
N
N
Nicolas Graves wrote on 15 Sep 10:57 +0200
[PATCH 2/9] gnu: Add python-cloudpathlib.
(address . 73266@debbugs.gnu.org)(address . ngraves@ngraves.fr)
20240915085720.13323-2-ngraves@ngraves.fr
* gnu/packages/python-web.scm (python-cloudpathlib): New variable.

Change-Id: I492abd6bea422faee1b5054edcf8f9e46c286fcf
---
gnu/packages/python-web.scm | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)

Toggle diff (37 lines)
diff --git a/gnu/packages/python-web.scm b/gnu/packages/python-web.scm
index a8510dbcc1..0d0e76e0bd 100644
--- a/gnu/packages/python-web.scm
+++ b/gnu/packages/python-web.scm
@@ -7569,6 +7569,30 @@ (define-public python-cloud-init
;; Either license can be chosen
(license (list license:asl2.0 license:gpl3))))
+(define-public python-cloudpathlib
+ (package
+ (name "python-cloudpathlib")
+ (version "0.19.0")
+ (source
+ (origin
+ (method url-fetch)
+ (uri (pypi-uri "cloudpathlib" version))
+ (sha256
+ (base32 "1s2gcv89ybpsvyh6d0rpdpnb177q7la0n8fs4cj5v4sfkbyxp7li"))))
+ (build-system pyproject-build-system)
+ (arguments (list #:tests? #f)) ; No tests bundled.
+ (propagated-inputs (list python-azure-storage-blob
+ python-azure-storage-file-datalake
+ python-boto3
+ python-google-cloud-storage
+ python-typing-extensions))
+ (native-inputs (list python-flit-core))
+ (home-page "https://cloudpathlib.drivendata.org/stable")
+ (synopsis "Python classes for cloud storage services")
+ (description "This package provides provides @code{pathlib.Path}-like
+classes for different cloud storage services.")
+ (license license:expat)))
+
(define-public python-cloudscraper
(package
(name "python-cloudscraper")
--
2.46.0
N
N
Nicolas Graves wrote on 15 Sep 10:57 +0200
[PATCH 5/9] gnu: python-spacy: Update to 3.7.5.
(address . 73266@debbugs.gnu.org)(address . ngraves@ngraves.fr)
20240915085720.13323-5-ngraves@ngraves.fr
* gnu/packages/machine-learning.scm (python-spacy): Update to 3.7.5.
[arguments]<#:test-flags>: Ignore test_pass_doc_to_pipeline.
[propagated-inputs]: Remove python-pathy, python-smart-open,
python-typing-extensions. Add python-weasel.

Change-Id: Ieae58f004d06323990e80859e3c4dce2166c447c
---
gnu/packages/machine-learning.scm | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)

Toggle diff (56 lines)
diff --git a/gnu/packages/machine-learning.scm b/gnu/packages/machine-learning.scm
index 4b834a847f..008bf2060a 100644
--- a/gnu/packages/machine-learning.scm
+++ b/gnu/packages/machine-learning.scm
@@ -1336,13 +1336,13 @@ (define-public python-spacy-loggers
(define-public python-spacy
(package
(name "python-spacy")
- (version "3.5.3")
+ (version "3.7.5")
(source (origin
(method url-fetch)
(uri (pypi-uri "spacy" version))
(sha256
(base32
- "13141hc966d8nxbnlwj01vhndgq0rq4nmii3qkb3hrap45kiv5rm"))))
+ "1lrd7k7hizygqldpln4aham6kprbyaj7z7pfd5dabixcyb5wcj56"))))
(build-system pyproject-build-system)
(arguments
(list
@@ -1356,7 +1356,9 @@ (define-public python-spacy
;; This tries to run the application with typer, which fails
;; with an unspecified error, possibly because the build
;; container doesn't have /bin/sh.
- " and not test_project_assets"))
+ " and not test_project_assets"
+ ;; Fails with DeprecationWarning
+ " and not test_pass_doc_to_pipeline"))
#:phases
'(modify-phases %standard-phases
(add-after 'build 'build-ext
@@ -1370,20 +1372,18 @@ (define-public python-spacy
python-murmurhash
python-numpy
python-packaging
- python-pathy
python-preshed
python-pydantic
python-requests
python-setuptools
- python-smart-open
python-spacy-legacy
python-spacy-loggers
python-srsly
python-thinc
python-tqdm
python-typer
- python-typing-extensions
- python-wasabi))
+ python-wasabi
+ python-weasel))
(native-inputs
(list python-cython python-pytest python-mock))
(home-page "https://spacy.io")
--
2.46.0
N
N
Nicolas Graves wrote on 15 Sep 10:57 +0200
[PATCH 4/9] gnu: python-thinc: Update to 8.2.2.
(address . 73266@debbugs.gnu.org)(address . ngraves@ngraves.fr)
20240915085720.13323-4-ngraves@ngraves.fr
* gnu/packages/machine-learning.scm (python-thinc): Update to 8.2.2.
[propagated-inputs]: Remove python-contextvars, python-dataclasses,
python-typing-extensions.

Change-Id: Ic9176f75b7a7fe075a72c7b6607bc1a1b39294f4
---
gnu/packages/machine-learning.scm | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)

Toggle diff (39 lines)
diff --git a/gnu/packages/machine-learning.scm b/gnu/packages/machine-learning.scm
index 107954577b..4b834a847f 100644
--- a/gnu/packages/machine-learning.scm
+++ b/gnu/packages/machine-learning.scm
@@ -2093,13 +2093,13 @@ (define-public python-mord
(define-public python-thinc
(package
(name "python-thinc")
- (version "8.1.10")
+ (version "8.2.2")
(source (origin
(method url-fetch)
(uri (pypi-uri "thinc" version))
(sha256
(base32
- "14drmwa2sh8fqszv1fm2jl4lky1j5yrbkjv89bl49q07vbblhjkc"))))
+ "11qmdpw4r7qm1l5p8iws8jf33az1hac7zxki38j9a3rccx2bk1bf"))))
(build-system pyproject-build-system)
(arguments
'(#:phases
@@ -2111,16 +2111,13 @@ (define-public python-thinc
(propagated-inputs (list python-blis-for-thinc
python-catalogue
python-confection
- python-contextvars
python-cymem
- python-dataclasses
python-murmurhash
python-numpy
python-packaging
python-preshed
python-pydantic
python-srsly
- python-typing-extensions
python-wasabi))
(native-inputs (list python-cython python-mock python-pytest))
(home-page "https://github.com/explosion/thinc")
--
2.46.0
N
N
Nicolas Graves wrote on 15 Sep 10:57 +0200
[PATCH 6/9] gnu: Add python-cutlery.
(address . 73266@debbugs.gnu.org)(address . ngraves@ngraves.fr)
20240915085720.13323-6-ngraves@ngraves.fr
* gnu/packages/machine-learning.scm (python-cutlery): New variable.

Change-Id: I5304205737330850ce84a49df814b96a4d605699
---
gnu/packages/machine-learning.scm | 38 +++++++++++++++++++++++++++++++
1 file changed, 38 insertions(+)

Toggle diff (51 lines)
diff --git a/gnu/packages/machine-learning.scm b/gnu/packages/machine-learning.scm
index 008bf2060a..89fcd3c1b7 100644
--- a/gnu/packages/machine-learning.scm
+++ b/gnu/packages/machine-learning.scm
@@ -2442,6 +2442,44 @@ (define-public python-cmaes
Covariance Matrix Adaptation Evolution Strategy (CMA-ES) for Python.")
(license license:expat)))
+(define-public python-cutlery
+ (package
+ (name "python-cutlery")
+ (version "0.0.6")
+ (source
+ (origin
+ (method url-fetch)
+ (uri (pypi-uri "cutlery" version))
+ (sha256
+ (base32 "1l5jv0mkmvzlmglz61py6f4inil2iwgh1ap8881cyb6k7hnnccc9"))))
+ (build-system pyproject-build-system)
+ (arguments
+ (list
+ #:phases
+ #~(modify-phases %standard-phases
+ ;; For some reason when both local and installed exist,
+ ;; local is chosen and is missing shared libraries.
+ ;; Use installed version to run tests instead.
+ (add-before 'check 'pre-check
+ (lambda* (#:key tests? inputs outputs #:allow-other-keys)
+ (when tests?
+ (copy-recursively "cutlery/tests" "tests")
+ (delete-file-recursively "cutlery")
+ (add-installed-pythonpath inputs outputs)))))))
+ (propagated-inputs (list python-regex))
+ (native-inputs (list python-cython python-pytest))
+ (home-page "https://github.com/explosion/curated-tokenizers")
+ (synopsis "Lightweight piece tokenization library")
+ (description "This package provides a lightweight wordpiece and
+sentencepiece tokenization library. It supports multiple tokenizers:
+@itemize
+@item BPE
+@item Byte BPE
+@item Unigram
+@item Wordpiece
+@end itemize")
+ (license license:expat)))
+
(define-public python-autograd
(let* ((commit "c6d81ce7eede6db801d4e9a92b27ec5d409d0eab")
(revision "0")
--
2.46.0
N
N
Nicolas Graves wrote on 15 Sep 10:57 +0200
[PATCH 7/9] gnu: Add python-curated-transformers.
(address . 73266@debbugs.gnu.org)(address . ngraves@ngraves.fr)
20240915085720.13323-7-ngraves@ngraves.fr
* gnu/packages/machine-learning.scm (python-curated-transformers): New variable.

Change-Id: I42cf780097456f5a8a9a9efc2a56e2c082d2a938
---
gnu/packages/machine-learning.scm | 55 +++++++++++++++++++++++++++++++
1 file changed, 55 insertions(+)

Toggle diff (68 lines)
diff --git a/gnu/packages/machine-learning.scm b/gnu/packages/machine-learning.scm
index 89fcd3c1b7..d1b282fea8 100644
--- a/gnu/packages/machine-learning.scm
+++ b/gnu/packages/machine-learning.scm
@@ -2480,6 +2480,61 @@ (define-public python-cutlery
@end itemize")
(license license:expat)))
+(define-public python-curated-transformers
+ (package
+ (name "python-curated-transformers")
+ (version "0.1.0")
+ (source
+ (origin
+ (method url-fetch)
+ (uri (pypi-uri "curated-transformers" version))
+ (sha256
+ (base32 "04k54r5cxjl3l7xs4kx4cfnqsjr7gdlr577sp7sl7qgrk3kfqjbm"))))
+ (build-system pyproject-build-system)
+ (arguments
+ (list
+ #:test-flags
+ '(list ; Most ignored tests require network.
+ "--ignore=curated_transformers/tests/tokenizers/test_auto_tokenizer.py"
+ "-k" (string-append "not test_special_pieces"
+ " and not test_auto_encoder"
+ " and not test_auto_decoder"
+ " and not test_auto_causal_lm"
+ " and not test_from_hf_hub_to_cache"
+ " and not test_from_hf_hub_to_cache_legacy"
+ " and not test_checkpoint_type_without_safetensors"
+ " and not test_hf_hub_failures"
+ ;; These have been added when downgrading curated_tokenizers.
+ " and not test_camembert_tokenizer_toy_tokenizer"
+ " and not test_roberta_tokenizer"
+ " and not test_xlmr_toy_tokenizer"))))
+ (propagated-inputs (list python-catalogue
+ python-cutlery
+ python-huggingface-hub
+ python-pytorch
+ python-tokenizers))
+ (native-inputs (list python-pytest))
+ (home-page "https://github.com/explosion/curated-transformers")
+ (synopsis "PyTorch library of transformer models and components")
+ (description
+ "This package provides a @code{PyTorch} library of transformer models and
+components. It helps to download state-of-the-art models that are composed
+from a set of reusable components. The stand-out features of Curated
+Transformer are:
+
+@itemize
+@item Supports state-of-the art transformer models, including LLMs such as
+Falcon, Llama, and Dolly v2.
+@item Each model is composed from a set of reusable building blocks, providing
+many benefits: implementing a feature or bugfix benefits all models ; Adding
+new models to the library is low-effort.
+@item Consistent type annotations of all public APIs, hence a great coding
+support from IDEs. Integrates well with your existing type-checked code.
+@item Great for education, because the building blocks are easy to study.
+@item Minimal dependencies.
+@end itemize")
+ (license license:expat)))
+
(define-public python-autograd
(let* ((commit "c6d81ce7eede6db801d4e9a92b27ec5d409d0eab")
(revision "0")
--
2.46.0
N
N
Nicolas Graves wrote on 15 Sep 10:57 +0200
[PATCH 8/9] gnu: Add python-curated-tokenizers.
(address . 73266@debbugs.gnu.org)(address . ngraves@ngraves.fr)
20240915085720.13323-8-ngraves@ngraves.fr
* gnu/packages/machine-learning.scm (python-curated-tokenizers): New variable.

Change-Id: I719d2ffd499c86e6bb2f9215ed979e47c0e32484
---
gnu/packages/machine-learning.scm | 41 +++++++++++++++++++++++++++++++
1 file changed, 41 insertions(+)

Toggle diff (54 lines)
diff --git a/gnu/packages/machine-learning.scm b/gnu/packages/machine-learning.scm
index d1b282fea8..e80412ed41 100644
--- a/gnu/packages/machine-learning.scm
+++ b/gnu/packages/machine-learning.scm
@@ -2480,6 +2480,47 @@ (define-public python-cutlery
@end itemize")
(license license:expat)))
+(define-public python-curated-tokenizers
+ (package
+ (name "python-curated-tokenizers")
+ (version "0.0.9")
+ ;; This source includes third_party protobuf, but a version that
+ ;; is not currently packaged in guix (3.6 < version <= 3.19.5).
+ ;; Try using guix's protobuf when updating.
+ (source
+ (origin
+ (method url-fetch)
+ (uri (pypi-uri "curated-tokenizers" version))
+ (sha256
+ (base32 "09ffs2qjlli35wnf8wf64s14xm75vi5ynvkrn9nqllmk9bjlfgf9"))))
+ (build-system pyproject-build-system)
+ (arguments
+ (list
+ #:phases
+ #~(modify-phases %standard-phases
+ ;; For some reason when both local and installed exist,
+ ;; local is chosen and is missing shared libraries.
+ ;; Use installed version to run tests instead.
+ (add-before 'check 'pre-check
+ (lambda* (#:key tests? inputs outputs #:allow-other-keys)
+ (when tests?
+ (copy-recursively "curated_tokenizers/tests" "tests")
+ (delete-file-recursively "curated_tokenizers")
+ (add-installed-pythonpath inputs outputs)))))))
+ (propagated-inputs (list python-regex))
+ (native-inputs (list python-cython python-pytest))
+ (home-page "https://github.com/explosion/curated-tokenizers")
+ (synopsis "Lightweight piece tokenization library")
+ (description "This package provides a lightweight wordpiece and
+sentencepiece tokenization library. It supports multiple tokenizers:
+@itemize
+@item BPE
+@item Byte BPE
+@item Unigram
+@item Wordpiece
+@end itemize")
+ (license license:expat)))
+
(define-public python-curated-transformers
(package
(name "python-curated-transformers")
--
2.46.0
N
N
Nicolas Graves wrote on 15 Sep 10:57 +0200
[PATCH 9/9] gnu: Add python-spacy-curated-transformers.
(address . 73266@debbugs.gnu.org)(address . ngraves@ngraves.fr)
20240915085720.13323-9-ngraves@ngraves.fr
* gnu/packages/machine-learning.scm (python-spacy-curated-transformers): New variable.

Change-Id: Id4b67b2ea2de4745831c3536124304860e9764d8
---
gnu/packages/machine-learning.scm | 31 +++++++++++++++++++++++++++++++
1 file changed, 31 insertions(+)

Toggle diff (44 lines)
diff --git a/gnu/packages/machine-learning.scm b/gnu/packages/machine-learning.scm
index e80412ed41..3afc224e7c 100644
--- a/gnu/packages/machine-learning.scm
+++ b/gnu/packages/machine-learning.scm
@@ -1292,6 +1292,37 @@ (define-public python-sentence-transformers
models, to achieve maximal performance on your specific task.")
(license license:asl2.0)))
+(define-public python-spacy-curated-transformers
+ (package
+ (name "python-spacy-curated-transformers")
+ (version "0.2.2")
+ (source
+ (origin
+ (method url-fetch)
+ (uri (pypi-uri "spacy-curated-transformers" version))
+ (sha256
+ (base32 "1hsqaai666yy9xzj14azli0hgipdkkc5x7xwszh58ndvxsij3dq3"))))
+ (build-system pyproject-build-system)
+ (arguments (list #:tests? #f)) ; Missing python-cupy dependency
+ (propagated-inputs (list python-curated-tokenizers
+ python-curated-transformers
+ python-cutlery
+ python-fsspec
+ python-pytorch
+ python-spacy
+ python-thinc))
+ (home-page "https://github.com/explosion/spacy-curated-transformers")
+ (synopsis "Curated transformer models for spaCy pipelines")
+ (description "This package provides transformer models for @code{spaCy}
+pipelines. It allows you to use pretrained models based on one of the
+following architectures to power your spaCy pipeline: ALBERT, BERT, CamemBERT,
+RoBERTa, XLM-RoBERTa. It provides all the features supported by
+spacy-transformers such as support for Hugging Face Hub, multi-task learning,
+the extensible config system and out-of-the-box serialization, as well as deep
+integration into spaCy, which lays the groundwork for deployment-focused
+features such as distillation and quantization.")
+ (license license:expat)))
+
(define-public python-spacy-legacy
(package
(name "python-spacy-legacy")
--
2.46.0
N
N
Nicolas Graves wrote on 16 Oct 18:55 +0200
control message for bug #73266
(address . control@debbugs.gnu.org)
8734kwvvgh.fsf@ngraves.fr
block 73266 by 73115
quit


--
Best regards,
Nicolas Graves
?
Your comment

Commenting via the web interface is currently disabled.

To comment on this conversation send an email to 73266@debbugs.gnu.org

To respond to this issue using the mumi CLI, first switch to it
mumi current 73266
Then, you may apply the latest patchset in this issue (with sign off)
mumi am -- -s
Or, compose a reply to this issue
mumi compose
Or, send patches to this issue
mumi send-email *.patch