Bumblebee Docs

Bigger than memory data

In some scenarios the data you want to load may be larger than your system or cluster memory, which can slow down or crash the system, depending on the engine you have selected.

To handle this we recommend:

  1. Select the number of rows you want to load when previewing the file in Bumblebee. When saving your processed data, you can decide whether to process the whole dataset.

  2. Use Vaex as the Bumblebee engine. It can process larger-than-memory data.

  3. If you have access to a Dask/Dask-cuDF cluster, use it. Note that if your data is smaller than the total cluster memory, using the cluster can slow things down.

  4. Use an external service like Coiled to load as much data as you need. With Coiled you can get a Dask/Dask-cuDF cluster on demand and pay for what you use.
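To illustrate the idea behind the first recommendation, here is a minimal sketch in plain Python. It is not how Bumblebee implements its preview; the function name `preview_rows` and the sample data are assumptions made for this example. It only shows that reading a fixed number of rows never touches the rest of the file.

```python
import csv
import io
from itertools import islice


def preview_rows(source, n_rows):
    """Return the header and at most n_rows data rows from a CSV text stream.

    Hypothetical helper for illustration only, not part of Bumblebee.
    """
    reader = csv.reader(source)
    header = next(reader, None)  # None if the stream is empty
    # islice stops after n_rows, so the remaining rows are never read.
    rows = list(islice(reader, n_rows))
    return header, rows


# A small in-memory sample standing in for a large file on disk.
sample = io.StringIO("id,name\n1,a\n2,b\n3,c\n4,d\n")
header, rows = preview_rows(sample, 2)
print(header, rows)
```

With a real file you would pass an open file object, for example `open(path, newline="")`, in place of the in-memory sample.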

I raised an issue about automatically loading as much data as possible depending on the available memory. You can see it here.
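The general idea in that issue could look roughly like the sketch below. This is only an assumption about one possible approach, not what Bumblebee does or plans to do: `rows_that_fit`, `memory_budget_bytes`, and `sample_size` are hypothetical names. It estimates the average row size from the first lines of a text file and divides a memory budget by it.

```python
import io


def rows_that_fit(source, memory_budget_bytes, sample_size=100):
    """Estimate how many rows fit in a memory budget.

    Hypothetical helper: it samples up to sample_size lines and uses their
    average encoded size as the per-row cost.
    """
    sizes = []
    for i, line in enumerate(source):
        if i >= sample_size:
            break
        sizes.append(len(line.encode("utf-8")))
    if not sizes:
        return 0  # empty input, nothing to load
    average_row_bytes = sum(sizes) / len(sizes)
    return int(memory_budget_bytes // average_row_bytes)


# Ten 3-byte lines; a 30-byte budget fits all ten of them.
sample = io.StringIO("ab\n" * 10)
print(rows_that_fit(sample, memory_budget_bytes=30))
```

Rows usually take more space in memory than on disk, so in practice the budget should be well below the memory that is actually free.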

Please let me know if you have any other questions.

Last updated 4 years ago
