[PATCH 00/10] Add python-tokenizers.

Status: Open
Submitted by: Nicolas Graves
Owner: unassigned
Severity: normal
Nicolas Graves wrote on 7 Sep 18:21 +0200
(address . guix-patches@gnu.org)(address . ngraves@ngraves.fr)
20240907162236.8570-1-ngraves@ngraves.fr
This patch series adds the package python-tokenizers, which is a
prerequisite for packaging python-transformers.

Nicolas Graves (10):
gnu: Add rust-esaxx-rs-0.1.
gnu: Add rust-spm-precompiled-0.1.
gnu: Add rust-macro-rules-attribute-proc-macro-0.2.
gnu: Add rust-macro-rules-attribute-0.2.
gnu: Add rust-hf-hub-0.3.
gnu: Add rust-monostate-impl-0.1.
gnu: Add rust-monostate-0.1.
gnu: Add rust-tokenizers.
gnu: Add rust-numpy-0.21.
gnu: Add python-tokenizers.

gnu/packages/crates-io.scm | 133 +++++++++++++++
gnu/packages/machine-learning.scm | 266 ++++++++++++++++++++++++++++++
2 files changed, 399 insertions(+)

--
2.45.2
Nicolas Graves wrote on 7 Sep 18:56 +0200
[PATCH 01/10] gnu: Add rust-esaxx-rs-0.1.
(address . 73106@debbugs.gnu.org)(address . ngraves@ngraves.fr)
20240907165626.22651-1-ngraves@ngraves.fr
* gnu/packages/machine-learning.scm (rust-esaxx-rs-0.1): New variable.

Change-Id: I38a666dd5b9f20dc721e0a28ad718ff5f227b708
---
gnu/packages/machine-learning.scm | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)

diff --git a/gnu/packages/machine-learning.scm b/gnu/packages/machine-learning.scm
index 12be1d7bf6..4385603a4a 100644
--- a/gnu/packages/machine-learning.scm
+++ b/gnu/packages/machine-learning.scm
@@ -5580,6 +5580,26 @@ (define-public python-torchfile
 Python.")
     (license license:bsd-3)))
 
+(define-public rust-esaxx-rs-0.1
+  (package
+    (name "rust-esaxx-rs")
+    (version "0.1.10")
+    (source
+     (origin
+       (method url-fetch)
+       (uri (crate-uri "esaxx-rs" version))
+       (file-name (string-append name "-" version ".tar.gz"))
+       (sha256
+        (base32 "1rm6vm5yr7s3n5ly7k9x9j6ra5p2l2ld151gnaya8x03qcwf05yq"))))
+    (build-system cargo-build-system)
+    (arguments
+     `(#:cargo-inputs (("rust-cc" ,rust-cc-1))))
+    (home-page "https://github.com/Narsil/esaxx-rs")
+    (synopsis "Wrapper for sentencepiece's esaxxx library")
+    (description
+     "This package provides a wrapper around sentencepiece's esaxxx library.")
+    (license license:asl2.0)))
+
 (define-public python-hmmlearn
   (package
     (name "python-hmmlearn")
--
2.45.2
Nicolas Graves wrote on 7 Sep 18:56 +0200
[PATCH 02/10] gnu: Add rust-spm-precompiled-0.1.
(address . 73106@debbugs.gnu.org)(address . ngraves@ngraves.fr)
20240907165626.22651-2-ngraves@ngraves.fr
* gnu/packages/machine-learning.scm (rust-spm-precompiled-0.1): New variable.

Change-Id: I622c1a875e10041703ef0a32e7c35074f534276b
---
gnu/packages/machine-learning.scm | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)

diff --git a/gnu/packages/machine-learning.scm b/gnu/packages/machine-learning.scm
index 4385603a4a..d3f76ebeba 100644
--- a/gnu/packages/machine-learning.scm
+++ b/gnu/packages/machine-learning.scm
@@ -5600,6 +5600,33 @@ (define-public rust-esaxx-rs-0.1
      "This package provides a wrapper around sentencepiece's esaxxx library.")
     (license license:asl2.0)))
 
+(define-public rust-spm-precompiled-0.1
+  (package
+    (name "rust-spm-precompiled")
+    (version "0.1.4")
+    (source
+     (origin
+       (method url-fetch)
+       (uri (crate-uri "spm_precompiled" version))
+       (file-name (string-append name "-" version ".tar.gz"))
+       (sha256
+        (base32 "09pkdk2abr8xf4pb9kq3rk80dgziq6vzfk7aywv3diik82f6jlaq"))))
+    (build-system cargo-build-system)
+    (arguments
+     `(#:cargo-inputs
+       (("rust-base64" ,rust-base64-0.13)
+        ("rust-nom" ,rust-nom-7)
+        ("rust-serde" ,rust-serde-1)
+        ("rust-unicode-segmentation" ,rust-unicode-segmentation-1))))
+    (home-page "https://github.com/huggingface/spm_precompiled")
+    (synopsis "Emulate sentencepiece's DoubleArray")
+    (description
+     "This crate aims to emulate
+@url{https://github.com/google/sentencepiece,sentencepiece}
+Dart::@code{DoubleArray} struct and its Normalizer. This crate is highly
+specialized and not intended for general use.")
+    (license license:asl2.0)))
+
 (define-public python-hmmlearn
   (package
     (name "python-hmmlearn")
--
2.45.2
Nicolas Graves wrote on 7 Sep 18:56 +0200
[PATCH 03/10] gnu: Add rust-macro-rules-attribute-proc-macro-0.2.
(address . 73106@debbugs.gnu.org)(address . ngraves@ngraves.fr)
20240907165626.22651-3-ngraves@ngraves.fr
* gnu/packages/crates-io.scm (rust-macro-rules-attribute-proc-macro-0.2): New variable.

Change-Id: I1fab6de81c897643cae52e733bd06bb00ea1bd7f
---
gnu/packages/crates-io.scm | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)

diff --git a/gnu/packages/crates-io.scm b/gnu/packages/crates-io.scm
index 36ecbe4430..d04f8723fd 100644
--- a/gnu/packages/crates-io.scm
+++ b/gnu/packages/crates-io.scm
@@ -41076,6 +41076,27 @@ (define-public rust-macaddr-1
     (description "This pakcage provides MAC address types.")
     (license (list license:asl2.0 license:expat))))
 
+(define-public rust-macro-rules-attribute-proc-macro-0.2
+  (package
+    (name "rust-macro-rules-attribute-proc-macro")
+    (version "0.2.0")
+    (source
+     (origin
+       (method url-fetch)
+       (uri (crate-uri "macro_rules_attribute-proc_macro" version))
+       (file-name (string-append name "-" version ".tar.gz"))
+       (sha256
+        (base32 "0s45j4zm0a5d041g3vcbanvr76p331dfjb7gw9qdmh0w8mnqbpdq"))))
+    (build-system cargo-build-system)
+    (home-page
+     "https://github.com/danielhenrymantilla/macro_rules_attribute-rs")
+    (synopsis "Use declarative macros in Rust")
+    (description
+     "This package provides the ability to use Rust declarative macros as
+proc_macro attributes or derives. This package provides implementation
+details to @code{rust-macro-rules-attribute}.")
+    (license license:expat)))
+
 (define-public rust-macrotest-1
   (package
     (name "rust-macrotest")
--
2.45.2
Nicolas Graves wrote on 7 Sep 18:56 +0200
[PATCH 04/10] gnu: Add rust-macro-rules-attribute-0.2.
(address . 73106@debbugs.gnu.org)(address . ngraves@ngraves.fr)
20240907165626.22651-4-ngraves@ngraves.fr
* gnu/packages/crates-io.scm (rust-macro-rules-attribute-0.2): New variable.

Change-Id: I62c9ba35a8a9f71f05f0f3c5307d7abe11f408c8
---
gnu/packages/crates-io.scm | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)

diff --git a/gnu/packages/crates-io.scm b/gnu/packages/crates-io.scm
index d04f8723fd..658721b123 100644
--- a/gnu/packages/crates-io.scm
+++ b/gnu/packages/crates-io.scm
@@ -41097,6 +41097,34 @@ (define-public rust-macro-rules-attribute-proc-macro-0.2
 details to @code{rust-macro-rules-attribute}.")
     (license license:expat)))
 
+(define-public rust-macro-rules-attribute-0.2
+  (package
+    (name "rust-macro-rules-attribute")
+    (version "0.2.0")
+    (source
+     (origin
+       (method url-fetch)
+       (uri (crate-uri "macro_rules_attribute" version))
+       (file-name (string-append name "-" version ".tar.gz"))
+       (sha256
+        (base32 "04waa4qm28adwnxsxhx9135ki68mwkikr6m5pi5xhcy0gcgjg0la"))))
+    (build-system cargo-build-system)
+    (arguments
+     `(#:cargo-inputs
+       (("rust-macro-rules-attribute-proc-macro"
+         ,rust-macro-rules-attribute-proc-macro-0.2)
+        ("rust-paste" ,rust-paste-1))
+       #:cargo-development-inputs
+       (("rust-once-cell" ,rust-once-cell-1)
+        ("rust-pin-project-lite" ,rust-pin-project-lite-0.2)
+        ("rust-serde" ,rust-serde-1))))
+    (home-page "https://crates.io/crates/macro_rules_attribute")
+    (synopsis "Use declarative macros in Rust")
+    (description
+     "This package provides the ability to use Rust declarative macros as
+proc_macro attributes or derives.")
+    (license license:expat)))
+
 (define-public rust-macrotest-1
   (package
     (name "rust-macrotest")
--
2.45.2
Nicolas Graves wrote on 7 Sep 18:56 +0200
[PATCH 05/10] gnu: Add rust-hf-hub-0.3.
(address . 73106@debbugs.gnu.org)(address . ngraves@ngraves.fr)
20240907165626.22651-5-ngraves@ngraves.fr
* gnu/packages/machine-learning.scm (rust-hf-hub-0.3): New variable.

Change-Id: I9e64c316dde8094e6142785af8549556953513e0
---
gnu/packages/machine-learning.scm | 48 +++++++++++++++++++++++++++++++
1 file changed, 48 insertions(+)

diff --git a/gnu/packages/machine-learning.scm b/gnu/packages/machine-learning.scm
index d3f76ebeba..27d7f0526b 100644
--- a/gnu/packages/machine-learning.scm
+++ b/gnu/packages/machine-learning.scm
@@ -78,7 +78,10 @@ (define-module (gnu packages machine-learning)
   #:use-module (gnu packages cmake)
   #:use-module (gnu packages cpp)
   #:use-module (gnu packages cran)
+  #:use-module (gnu packages crates-crypto)
   #:use-module (gnu packages crates-io)
+  #:use-module (gnu packages crates-tls)
+  #:use-module (gnu packages crates-web)
   #:use-module (gnu packages databases)
   #:use-module (gnu packages dejagnu)
   #:use-module (gnu packages documentation)
@@ -5627,6 +5630,51 @@ (define-public rust-spm-precompiled-0.1
 specialized and not intended for general use.")
     (license license:asl2.0)))
 
+(define-public rust-hf-hub-0.3
+  (package
+    (name "rust-hf-hub")
+    (version "0.3.2")
+    (source
+     (origin
+       (method url-fetch)
+       (uri (crate-uri "hf-hub" version))
+       (file-name (string-append name "-" version ".tar.gz"))
+       (sha256
+        (base32 "0cnpivy9fn62lm1fw85kmg3ryvrx8drq63c96vq94gabawshcy1b"))))
+    (build-system cargo-build-system)
+    (arguments
+     `(#:tests? #f ; require network connection
+       #:cargo-inputs
+       (("rust-dirs" ,rust-dirs-5)
+        ("rust-futures" ,rust-futures-0.3)
+        ("rust-indicatif" ,rust-indicatif-0.17)
+        ("rust-log" ,rust-log-0.4)
+        ("rust-native-tls" ,rust-native-tls-0.2)
+        ("rust-num-cpus" ,rust-num-cpus-1)
+        ("rust-rand" ,rust-rand-0.8)
+        ("rust-reqwest" ,rust-reqwest-0.11)
+        ("rust-serde" ,rust-serde-1)
+        ("rust-serde-json" ,rust-serde-json-1)
+        ("rust-thiserror" ,rust-thiserror-1)
+        ("rust-tokio" ,rust-tokio-1)
+        ("rust-ureq" ,rust-ureq-2))
+       #:cargo-development-inputs
+       (("rust-hex-literal" ,rust-hex-literal-0.4)
+        ("rust-sha2" ,rust-sha2-0.10)
+        ("rust-tokio-test" ,rust-tokio-test-0.4))))
+    (native-inputs
+     (list pkg-config))
+    (inputs
+     (list openssl))
+    (home-page "https://github.com/huggingface/hf-hub")
+    (synopsis "Interact with HuggingFace in Rust")
+    (description
+     "This crate aims to ease interaction with
+@url{https://huggingface.co/,huggingface}. It aims to be compatible with
+@url{https://github.com/huggingface/huggingface_hub/,huggingface_hub}
+python package, but only implements a smaller subset of functions.")
+    (license license:asl2.0)))
+
 (define-public python-hmmlearn
   (package
     (name "python-hmmlearn")
--
2.45.2
Nicolas Graves wrote on 7 Sep 18:56 +0200
[PATCH 07/10] gnu: Add rust-monostate-0.1.
(address . 73106@debbugs.gnu.org)(address . ngraves@ngraves.fr)
20240907165626.22651-7-ngraves@ngraves.fr
* gnu/packages/crates-io.scm (rust-monostate-0.1): New variable.

Change-Id: I53f1ebfaf98e785eedeb3293f211bffa6f44bc76
---
gnu/packages/crates-io.scm | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)

diff --git a/gnu/packages/crates-io.scm b/gnu/packages/crates-io.scm
index 28ff81c801..7a8f090fd9 100644
--- a/gnu/packages/crates-io.scm
+++ b/gnu/packages/crates-io.scm
@@ -43741,6 +43741,32 @@ (define-public rust-monostate-impl-0.1
      "This package provides Implementation detail of the monostate crate.")
     (license (list license:expat license:asl2.0))))
 
+(define-public rust-monostate-0.1
+  (package
+    (name "rust-monostate")
+    (version "0.1.11")
+    (source
+     (origin
+       (method url-fetch)
+       (uri (crate-uri "monostate" version))
+       (file-name (string-append name "-" version ".tar.gz"))
+       (sha256
+        (base32 "0xchz8cs990g7g5f8jjybjnyi9xnhykiq44gl97p5rbh3hgjm347"))))
+    (build-system cargo-build-system)
+    (arguments
+     `(#:cargo-inputs
+       (("rust-monostate-impl" ,rust-monostate-impl-0.1)
+        ("rust-serde" ,rust-serde-1))
+       #:cargo-development-inputs
+       (("rust-serde" ,rust-serde-1)
+        ("rust-serde-json" ,rust-serde-json-1))))
+    (home-page "https://github.com/dtolnay/monostate")
+    (synopsis "Type that deserializes only from one specific value")
+    (description
+     "This package provides a Rust type that deserializes only from one
+specific value.")
+    (license (list license:expat license:asl2.0))))
+
 (define-public rust-more-asserts-0.3
   (package
     (name "rust-more-asserts")
--
2.45.2
Nicolas Graves wrote on 7 Sep 18:56 +0200
[PATCH 06/10] gnu: Add rust-monostate-impl-0.1.
(address . 73106@debbugs.gnu.org)(address . ngraves@ngraves.fr)
20240907165626.22651-6-ngraves@ngraves.fr
* gnu/packages/crates-io.scm (rust-monostate-impl-0.1): New variable.

Change-Id: Ica72fb8bce3589ed1ee5b08c3d96dcc24aaee279
---
gnu/packages/crates-io.scm | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)

diff --git a/gnu/packages/crates-io.scm b/gnu/packages/crates-io.scm
index 658721b123..28ff81c801 100644
--- a/gnu/packages/crates-io.scm
+++ b/gnu/packages/crates-io.scm
@@ -43718,6 +43718,29 @@ (define-public rust-modifier-0.1
      "Chaining APIs for both self -> Self and &mut self methods.")
     (license license:expat)))
 
+(define-public rust-monostate-impl-0.1
+  (package
+    (name "rust-monostate-impl")
+    (version "0.1.11")
+    (source
+     (origin
+       (method url-fetch)
+       (uri (crate-uri "monostate-impl" version))
+       (file-name (string-append name "-" version ".tar.gz"))
+       (sha256
+        (base32 "1km6kc6yxvpsxciaj02zar8cx1sq142s6jn6saqn77h7165dd1pn"))))
+    (build-system cargo-build-system)
+    (arguments
+     `(#:cargo-inputs
+       (("rust-proc-macro2" ,rust-proc-macro2-1)
+        ("rust-quote" ,rust-quote-1)
+        ("rust-syn" ,rust-syn-2))))
+    (home-page "https://github.com/dtolnay/monostate")
+    (synopsis "Implementation detail of the monostate crate")
+    (description
+     "This package provides Implementation detail of the monostate crate.")
+    (license (list license:expat license:asl2.0))))
+
 (define-public rust-more-asserts-0.3
   (package
     (name "rust-more-asserts")
--
2.45.2
Nicolas Graves wrote on 7 Sep 18:56 +0200
[PATCH 08/10] gnu: Add rust-tokenizers.
(address . 73106@debbugs.gnu.org)(address . ngraves@ngraves.fr)
20240907165626.22651-8-ngraves@ngraves.fr
* gnu/packages/machine-learning.scm (rust-tokenizers): New variable.

Change-Id: I3189a2d826f072f65ad053d77eb39be39775f1c2
---
gnu/packages/machine-learning.scm | 60 +++++++++++++++++++++++++++++++
1 file changed, 60 insertions(+)

diff --git a/gnu/packages/machine-learning.scm b/gnu/packages/machine-learning.scm
index 27d7f0526b..3b601f6c91 100644
--- a/gnu/packages/machine-learning.scm
+++ b/gnu/packages/machine-learning.scm
@@ -5675,6 +5675,66 @@ (define-public rust-hf-hub-0.3
 python package, but only implements a smaller subset of functions.")
     (license license:asl2.0)))
 
+(define-public rust-tokenizers
+  (package
+    (name "rust-tokenizers")
+    (version "0.19.1")
+    (source
+     (origin
+       (method url-fetch)
+       (uri (crate-uri "tokenizers" version))
+       (file-name (string-append name "-" version ".tar.gz"))
+       (sha256
+        (base32 "1zg6ffpllygijb5bh227m9p4lrhf0pjkysky68kddwrsvp8zl075"))
+       (modules '((guix build utils)))
+       (snippet
+        #~(substitute* "Cargo.toml"
+            (("0.1.12") ; rust-monostate requires a rust-syn-2 update
+             "0.1.11")
+            (("version = \"6.4\"") ; rust-onig
+             "version = \"6.1.1\"")))))
+    (build-system cargo-build-system)
+    (arguments
+     (list
+      #:tests? #f ; tests are relying on missing data.
+      #:cargo-inputs
+      `(("rust-aho-corasick" ,rust-aho-corasick-1)
+        ("rust-derive-builder" ,rust-derive-builder-0.20)
+        ("rust-esaxx-rs" ,rust-esaxx-rs-0.1)
+        ("rust-fancy-regex" ,rust-fancy-regex-0.13)
+        ("rust-getrandom" ,rust-getrandom-0.2)
+        ("rust-hf-hub" ,rust-hf-hub-0.3)
+        ("rust-indicatif" ,rust-indicatif-0.17)
+        ("rust-itertools" ,rust-itertools-0.12)
+        ("rust-lazy-static" ,rust-lazy-static-1)
+        ("rust-log" ,rust-log-0.4)
+        ("rust-macro-rules-attribute" ,rust-macro-rules-attribute-0.2)
+        ("rust-monostate" ,rust-monostate-0.1)
+        ("rust-onig" ,rust-onig-6)
+        ("rust-paste" ,rust-paste-1)
+        ("rust-rand" ,rust-rand-0.8)
+        ("rust-rayon" ,rust-rayon-1)
+        ("rust-rayon-cond" ,rust-rayon-cond-0.3)
+        ("rust-regex" ,rust-regex-1)
+        ("rust-regex-syntax" ,rust-regex-syntax-0.8)
+        ("rust-serde" ,rust-serde-1)
+        ("rust-serde-json" ,rust-serde-json-1)
+        ("rust-spm-precompiled" ,rust-spm-precompiled-0.1)
+        ("rust-thiserror" ,rust-thiserror-1)
+        ("rust-unicode-normalization-alignments" ,rust-unicode-normalization-alignments-0.1)
+        ("rust-unicode-segmentation" ,rust-unicode-segmentation-1)
+        ("rust-unicode-categories" ,rust-unicode-categories-0.1))
+      #:cargo-development-inputs
+      `(("rust-assert-approx-eq" ,rust-assert-approx-eq-1)
+        ("rust-criterion" ,rust-criterion-0.5)
+        ("rust-tempfile" ,rust-tempfile-3))))
+    (home-page "https://github.com/huggingface/tokenizers")
+    (synopsis "Implementation of various popular tokenizers")
+    (description
+     "This package provides a Rust implementation of today's most used
+tokenizers, with a focus on performances and versatility.")
+    (license license:asl2.0)))
+
 (define-public python-hmmlearn
   (package
     (name "python-hmmlearn")
--
2.45.2
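
Editorial aside on the source snippet in this patch: as far as I understand `substitute*` from (guix build utils), its patterns are regular expressions and every match in the file is rewritten, so "0.1.12" would also rewrite any other dependency that happens to carry that version string. The sketch below mimics that behaviour in plain Guile with `regexp-substitute/global`. The Cargo.toml excerpt and the `replace-all` helper are hypothetical and exist only for illustration; they are not part of the patch.

```scheme
(use-modules (ice-9 regex))

;; Hypothetical Cargo.toml excerpt, not the real file shipped in the crate.
(define sample
  (string-append
   "[dependencies.monostate]\n"
   "version = \"0.1.12\"\n"
   "\n"
   "[dependencies.onig]\n"
   "version = \"6.4\"\n"))

;; Hypothetical helper: replace every match of PATTERN in TEXT.
(define (replace-all pattern replacement text)
  (regexp-substitute/global #f pattern text 'pre replacement 'post))

;; Same patterns as the snippet; note that an unescaped "." matches
;; any character, not only a literal dot.
(define patched
  (replace-all "version = \"6.4\"" "version = \"6.1.1\""
               (replace-all "0.1.12" "0.1.11" sample)))

(display patched)
```

Escaping the dots ("0\\.1\\.12") or anchoring the pattern on the dependency name would narrow the match, if that ever becomes a concern.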
Nicolas Graves wrote on 7 Sep 18:56 +0200
[PATCH 09/10] gnu: Add rust-numpy-0.21.
(address . 73106@debbugs.gnu.org)(address . ngraves@ngraves.fr)
20240907165626.22651-9-ngraves@ngraves.fr
* gnu/packages/crates-io.scm (rust-numpy-0.21): New variable.

Change-Id: Idae5915f3cefa47c16c4bf9a5679f55621e35da7
---
gnu/packages/crates-io.scm | 35 +++++++++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)

diff --git a/gnu/packages/crates-io.scm b/gnu/packages/crates-io.scm
index 7a8f090fd9..ba5cb75d2c 100644
--- a/gnu/packages/crates-io.scm
+++ b/gnu/packages/crates-io.scm
@@ -48734,6 +48734,41 @@ (define-public rust-number-prefix-0.3
 giga, kibi.")
     (license license:expat)))
 
+(define-public rust-numpy-0.21
+  (package
+    (name "rust-numpy")
+    (version "0.21.0")
+    (source
+     (origin
+       (method url-fetch)
+       (uri (crate-uri "numpy" version))
+       (file-name (string-append name "-" version ".tar.gz"))
+       (sha256
+        (base32 "1x1p5x7lwfc5nsccwj98sln5vx3g3n8sbgm5fmfmy5rpr8rhf5zc"))))
+    (build-system cargo-build-system)
+    (arguments
+     `(#:cargo-inputs
+       (("rust-half" ,rust-half-2)
+        ("rust-libc" ,rust-libc-0.2)
+        ("rust-nalgebra" ,rust-nalgebra-0.32)
+        ("rust-ndarray" ,rust-ndarray-0.13)
+        ("rust-num-complex" ,rust-num-complex-0.2)
+        ("rust-num-integer" ,rust-num-integer-0.1)
+        ("rust-num-traits" ,rust-num-traits-0.2)
+        ("rust-pyo3" ,rust-pyo3-0.21)
+        ("rust-rustc-hash" ,rust-rustc-hash-1))
+       #:cargo-development-inputs
+       (("rust-nalgebra" ,rust-nalgebra-0.32)
+        ("rust-pyo3" ,rust-pyo3-0.21))))
+    (native-inputs (list python-minimal
+                         (@ (gnu packages python-xyz) python-numpy)))
+    (home-page "https://github.com/PyO3/rust-numpy")
+    (synopsis "Rust bindings for the NumPy C-API")
+    (description
+     "This package provides @code{PyO3-based} Rust bindings of the
+@code{NumPy} C-API.")
+    (license license:bsd-2)))
+
 (define-public rust-numtoa-0.2
   (package
     (name "rust-numtoa")
--
2.45.2
Nicolas Graves wrote on 7 Sep 18:56 +0200
[PATCH 10/10] gnu: Add python-tokenizers.
(address . 73106@debbugs.gnu.org)(address . ngraves@ngraves.fr)
20240907165626.22651-10-ngraves@ngraves.fr
* gnu/packages/machine-learning.scm (python-tokenizers): New variable.

Change-Id: I5db95172255dc4635c2a417f3b7252454eea27d7
---
gnu/packages/machine-learning.scm | 111 ++++++++++++++++++++++++++++++
1 file changed, 111 insertions(+)

diff --git a/gnu/packages/machine-learning.scm b/gnu/packages/machine-learning.scm
index 3b601f6c91..412499d424 100644
--- a/gnu/packages/machine-learning.scm
+++ b/gnu/packages/machine-learning.scm
@@ -5735,6 +5735,117 @@ (define-public rust-tokenizers
 tokenizers, with a focus on performances and versatility.")
     (license license:asl2.0)))
 
+(define-public python-tokenizers
+  (package
+    (name "python-tokenizers")
+    (version "0.19.1")
+    (source
+     (origin
+       (method url-fetch)
+       (uri (pypi-uri "tokenizers" version))
+       (sha256
+        (base32 "1qw8mjp0q9w7j1raq1rvcbfw38000kbqpwscf9mvxzfh1rlfcngf"))
+       (modules '((guix build utils)
+                  (ice-9 ftw)))
+       (snippet
+        #~(begin ;; Only keeping bindings.
+            (for-each (lambda (file)
+                        (unless (member file '("." ".." "bindings" "PKG-INFO"))
+                          (delete-file-recursively file)))
+                      (scandir "."))
+            (for-each (lambda (file)
+                        (unless (member file '("." ".."))
+                          (rename-file (string-append "bindings/python/" file) file)))
+                      (scandir "bindings/python"))
+            (delete-file-recursively ".cargo")))))
+    (build-system cargo-build-system)
+    (arguments
+     (list
+      #:cargo-test-flags ''("--no-default-features")
+      #:imported-modules `(,@%cargo-build-system-modules
+                           ,@%pyproject-build-system-modules)
+      #:modules '((guix build cargo-build-system)
+                  ((guix build pyproject-build-system) #:prefix py:)
+                  (guix build utils)
+                  (ice-9 regex)
+                  (ice-9 textual-ports))
+      #:phases
+      #~(modify-phases %standard-phases
+          (add-after 'unpack-rust-crates 'inject-tokenizers
+            (lambda _
+              (substitute* "Cargo.toml"
+                (("\\[dependencies\\]")
+                 (format #f "
+[dev-dependencies]
+tempfile = ~s
+pyo3 = { version = ~s, features = [\"auto-initialize\"] }
+
+[dependencies]
+tokenizers = ~s"
+                         #$(package-version rust-tempfile-3)
+                         #$(package-version rust-pyo3-0.21)
+                         #$(package-version rust-tokenizers))))
+              (let ((file-path "Cargo.toml"))
+                (call-with-input-file file-path
+                  (lambda (port)
+                    (let* ((content (get-string-all port))
+                           (top-match (string-match
+                                       "\\[dependencies.tokenizers" content)))
+                      (call-with-output-file file-path
+                        (lambda (out)
+                          (format out "~a" (match:prefix top-match))))))))))
+          (add-after 'patch-cargo-checksums 'loosen-requirements
+            (lambda _
+              (substitute* "Cargo.toml"
+                (("version = \"6.4\"")
+                 (format #f "version = ~s"
+                         #$(package-version rust-onig-6))))))
+          (add-after 'check 'python-check
+            (lambda _
+              (copy-file "target/release/libtokenizers.so"
+                         "py_src/tokenizers/tokenizers.so")
+              (invoke "python3"
+                      "-c" (format #f
+                                   "import sys; sys.path.append(\"~a/py_src\")"
+                                   (getcwd))
+                      "-m" "pytest"
+                      "-s" "-v" "./tests/")))
+          (add-after 'install 'install-python
+            (lambda _
+              (let* ((pversion #$(version-major+minor (package-version python)))
+                     (lib (string-append #$output "/lib/python" pversion
+                                         "/site-packages/"))
+                     (info (string-append lib "tokenizers-"
+                                          #$(package-version this-package)
+                                          ".dist-info")))
+                (mkdir-p info)
+                (copy-file "PKG-INFO" (string-append info "/METADATA"))
+                (copy-recursively
+                 "py_src/tokenizers"
+                 (string-append lib "tokenizers"))))))
+      #:cargo-inputs
+      `(("rust-rayon" ,rust-rayon-1)
+        ("rust-serde" ,rust-serde-1)
+        ("rust-serde-json" ,rust-serde-json-1)
+        ("rust-libc" ,rust-libc-0.2)
+        ("rust-env-logger" ,rust-env-logger-0.11)
+        ("rust-pyo3" ,rust-pyo3-0.21)
+        ("rust-numpy" ,rust-numpy-0.21)
+        ("rust-ndarray" ,rust-ndarray-0.15)
+        ("rust-onig" ,rust-onig-6)
+        ("rust-itertools" ,rust-itertools-0.12)
+        ("rust-tokenizers" ,rust-tokenizers))
+      #:cargo-development-inputs
+      `(("rust-tempfile" ,rust-tempfile-3))))
+    (native-inputs
+     (list python-minimal python-pytest))
+    (home-page "https://huggingface.co/docs/tokenizers")
+    (synopsis "Implementation of various popular tokenizers")
+    (description
+     "This package provides bindings to a Rust implementation of the most used
+tokenizers, @code{rust-tokenizers}.")
+    (license license:asl2.0)))
+
 (define-public python-hmmlearn
   (package
     (name "python-hmmlearn")
--
2.45.2
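
For readers following the `inject-tokenizers` phase in this patch: after the substitution it rereads Cargo.toml and keeps only the text before the first `[dependencies.tokenizers` match, which is presumably why the dev-dependencies are declared again by hand. A minimal sketch of that truncation in plain Guile follows. The sample manifest and the `truncate-before` helper are made up for illustration, and unlike the phase, the helper falls back to the unchanged text when nothing matches.

```scheme
(use-modules (ice-9 regex))

;; Hypothetical manifest, standing in for bindings/python/Cargo.toml.
(define sample
  (string-append
   "[package]\n"
   "name = \"example\"\n"
   "\n"
   "[dependencies]\n"
   "rayon = \"1\"\n"
   "\n"
   "[dependencies.tokenizers]\n"
   "path = \"../../tokenizers\"\n"
   "\n"
   "[features]\n"
   "default = []\n"))

;; Hypothetical helper: keep only what precedes the first match of PATTERN.
(define (truncate-before pattern text)
  (let ((m (string-match pattern text)))
    (if m
        (match:prefix m)
        text)))

(define truncated
  (truncate-before "\\[dependencies.tokenizers" sample))

(display truncated)
```

Everything from the matched table onward is dropped, so any section that follows it in the real manifest disappears along with it.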
Nicolas Graves wrote on 7 Sep 19:07 +0200
control message for bug #73106
(address . control@debbugs.gnu.org)
87wmjnie8u.fsf@ngraves.fr
block 73106 by 73094
quit


--
Best regards,
Nicolas Graves
Nicolas Graves wrote on 7 Sep 19:08 +0200
control message for bug #73109
(address . control@debbugs.gnu.org)
87tterie8a.fsf@ngraves.fr
block 73109 by 73106
quit


--
Best regards,
Nicolas Graves